Large corpora — memory model and the path to streaming¶

A short design note on how pyaegean handles corpus size today, what's already memory-friendly, and what is deferred until a corpus that needs it exists.

Today: in-memory documents¶

Corpus holds its Documents in a list, and each Document holds its Tokens. This is the right trade-off for everything the package currently ships and fetches:

Corpus	Documents	Order of magnitude
Linear A (bundled)	~1,720	tens of thousands of tokens
DAMOS Linear B (`load("damos")`)	~5,900	~hundreds of thousands of tokens
SigLA Linear A (`load("sigla")`)	~780	sign-level
A single Greek work (`greek.load_work`)	1 work	the Iliad is ~127k tokens

All of these fit comfortably in memory, and the in-memory model keeps the API simple, random-access (corpus.get(id)), and analysable without a database.

Already memory-friendly¶

Streaming iterators. Corpus.iter_documents(), iter_tokens(), and iter_words() are generators — process a corpus token-by-token without building an all-tokens list. word_frequencies() itself streams through iter_words() into a Counter.
Lazy frequency input. find_morphological_clusters accepts any iterable of (word, count) pairs, so it never needs the corpus materialised twice.
The fetch-to-cache data layer streams downloads to disk (chunked sha256), so a 500 MB model asset never lands wholly in memory during fetch.
The opt-in analysis cache (aegean.cache) keeps repeated heavy analyses off the hot path entirely.

Deferred: streaming load (and why)¶

True streaming — not holding all Documents in memory at once — is not implemented, on purpose. It would mean:

a loader that yields Documents instead of returning a Corpus (so a multi-GB corpus is processed in a single pass without materialising);
analyses that accept a document iterator rather than a Corpus (most of the per-document statistics already could);
giving up O(1) random access (get(id)) and any analysis that needs two passes or the whole vocabulary at once (dispersion, keyness, clustering) unless it's restructured to spill to disk.

That's a real cost in API complexity, and no corpus pyaegean targets needs it yet — the largest openly-licensed Greek corpus, the full First1KGreek, is read one work at a time via greek.load_work, and each work fits in memory. The test case for streaming would be loading all of First1KGreek (or a comparable multi-million-token corpus) at once, which the package does not currently do.

When that need arrives, the iterator-first views above are the seam to build on: add a Corpus.stream(script_id) classmethod yielding Documents, and an analysis path that consumes the iterator. Until then, adding the machinery would be speculative complexity against the project's zero-ceremony, dependency-free principles.