Large corpora — memory model and the path to streaming¶
A short design note on how pyaegean handles corpus size today, what's already memory-friendly, and what is deferred until a corpus that needs it exists.
Today: in-memory documents¶
Corpus holds its Documents in a list, and each Document holds its Tokens.
This is the right trade-off for everything the package currently ships and
fetches:
| Corpus | Documents | Order of magnitude |
|---|---|---|
| Linear A (bundled) | ~1,720 | tens of thousands of tokens |
DAMOS Linear B (load("damos")) |
~5,900 | ~hundreds of thousands of tokens |
SigLA Linear A (load("sigla")) |
~780 | sign-level |
A single Greek work (greek.load_work) |
1 work | the Iliad is ~127k tokens |
All of these fit comfortably in memory, and the in-memory model keeps the API
simple, random-access (corpus.get(id)), and analysable without a database.
Already memory-friendly¶
- Streaming iterators.
Corpus.iter_documents(),iter_tokens(), anditer_words()are generators — process a corpus token-by-token without building an all-tokens list.word_frequencies()itself streams throughiter_words()into aCounter. - Lazy frequency input.
find_morphological_clustersaccepts any iterable of(word, count)pairs, so it never needs the corpus materialised twice. - The fetch-to-cache data layer streams downloads to disk (chunked sha256), so a 500 MB model asset never lands wholly in memory during fetch.
- The opt-in analysis cache (
aegean.cache) keeps repeated heavy analyses off the hot path entirely.
Deferred: streaming load (and why)¶
True streaming — not holding all Documents in memory at once — is not
implemented, on purpose. It would mean:
- a loader that yields
Documents instead of returning aCorpus(so a multi-GB corpus is processed in a single pass without materialising); - analyses that accept a document iterator rather than a
Corpus(most of the per-document statistics already could); - giving up O(1) random access (
get(id)) and any analysis that needs two passes or the whole vocabulary at once (dispersion, keyness, clustering) unless it's restructured to spill to disk.
That's a real cost in API complexity, and no corpus pyaegean targets needs it
yet — the largest openly-licensed Greek corpus, the full First1KGreek, is read
one work at a time via greek.load_work, and each work fits in memory. The test
case for streaming would be loading all of First1KGreek (or a comparable
multi-million-token corpus) at once, which the package does not currently do.
When that need arrives, the iterator-first views above are the seam to build on:
add a Corpus.stream(script_id) classmethod yielding Documents, and an
analysis path that consumes the iterator. Until then, adding the machinery would
be speculative complexity against the project's zero-ceremony, dependency-free
principles.