aegean.core¶
core ¶
Script-agnostic core: data model, script plugin contract, corpus, numerals.
Corpus ¶
Corpus(documents: list[Document], sign_inventory: SignInventory | None = None, provenance: Provenance | None = None, script_id: str = '')
A collection of Documents plus a shared sign inventory and provenance.
load classmethod ¶
Load a bundled corpus by script id, e.g. Corpus.load("lineara").
get ¶
get(doc_id: str) -> Document | None
The document with id doc_id, or None if there is no such document.
fingerprint ¶
A stable content hash of this corpus: its script, documents (ids and
token text), and any "subset:" provenance note. Cheap relative to the
analyses it keys: one pass over the tokens, no model build. Two corpora
with the same fingerprint have the same analysable content, so it is the
cache key for aegean.cache-memoised analyses.
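A minimal sketch of how such a content hash could be computed in one pass, using only the standard library. This is not the library's implementation: the function name `sketch_fingerprint`, the input shape (a list of `(doc_id, token_texts)` pairs), and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def sketch_fingerprint(script_id, documents, subset_note=""):
    """Hash the script id, an optional subset note, document ids and token text.

    Hypothetical stand-in: `documents` is a list of (doc_id, [token_text, ...]).
    """
    h = hashlib.sha256()
    # A separator byte after each field keeps ("ab", "c") distinct from ("a", "bc").
    for field in (script_id, subset_note):
        h.update(field.encode("utf-8") + b"\x00")
    for doc_id, tokens in documents:
        h.update(doc_id.encode("utf-8") + b"\x01")
        for text in tokens:
            h.update(text.encode("utf-8") + b"\x00")
    return h.hexdigest()


docs = [("X1", ["KU-RO", "10"]), ("X2", ["A-DU", "5"])]
same = sketch_fingerprint("lineara", docs)
again = sketch_fingerprint("lineara", list(docs))
subset = sketch_fingerprint("lineara", docs[:1], subset_note="subset: site=HT")
```

Equal content yields an equal digest, while a filtered subset (fewer documents, plus its note) yields a different one, which is the property a cache key needs.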
filter ¶
Return a new Corpus whose documents match all given metadata fields
(AND-combination), e.g. corpus.filter(site="HT", period="LMIB").
The subset's provenance records what was filtered (a "subset:" note),
so calling cite on the result cites the exact subset used.
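The AND-combination of metadata fields can be illustrated without the library. The sketch below is a stand-in, not the actual `filter` code: the `Meta` class, the `filter_and` helper, and the document ids are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meta:
    """Hypothetical stand-in for a document's metadata."""
    site: str = ""
    period: str = ""


def filter_and(docs, **fields):
    """Keep (doc_id, meta) pairs whose metadata equals every given field."""
    return [
        (doc_id, meta)
        for doc_id, meta in docs
        if all(getattr(meta, name) == value for name, value in fields.items())
    ]


docs = [
    ("D1", Meta(site="HT", period="LMIB")),
    ("D2", Meta(site="HT", period="LMIA")),
    ("D3", Meta(site="KH", period="LMIB")),
]
matched = filter_and(docs, site="HT", period="LMIB")
```

Only `D1` satisfies both conditions. Passing a single field widens the match, and passing none keeps every document.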
cite ¶
Cite this corpus — or the exact filtered subset — in one call.
style: "plain" (one line), "bibtex" (a @misc entry), or
"apa". Filtered subsets (see filter) carry a "subset:" note that
all three styles include, so the citation states exactly what was used.
iter_documents ¶
iter_documents() -> Iterator[Document]
Iterate documents (the explicit-name form of iter(corpus)).
iter_tokens ¶
iter_tokens() -> Iterator[Token]
Every Token, in document then in-document order — a memory-friendly
stream that never builds an all-tokens list (useful on a large corpus).
iter_words ¶
Every lexical (WORD) token's text, in order, lazily. The unit
word_frequencies counts — stream it to feed your own Counter or a
running statistic without materialising a list.
word_frequencies ¶
(word, count) for every lexical word, sorted by descending count.
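The relationship between a lazy word stream and a frequency table can be sketched with the standard library alone. The `(text, kind)` tuples and the kind strings `"word"` and `"numeral"` below are illustrative assumptions, not the library's Token type.

```python
from collections import Counter


def words(token_stream):
    """Yield the text of word tokens lazily; `token_stream` yields (text, kind)."""
    for text, kind in token_stream:
        if kind == "word":
            yield text


tokens = [("KU-RO", "word"), ("10", "numeral"), ("A-DU", "word"), ("KU-RO", "word")]

# Counter consumes the generator one item at a time; no intermediate list is built.
counts = Counter(words(iter(tokens)))

# most_common() returns (word, count) pairs sorted by descending count.
frequencies = counts.most_common()
```

The same generator could feed any running statistic in place of `Counter`.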
to_dataframe ¶
A pandas DataFrame at document, token, or word level.
pandas is an optional dependency — install with pip install 'pyaegean[data]'.
to_dict ¶
A compact, lossy export (_meta + per-document words/metadata) for quick
interop. For a complete, reversible serialization use to_json/from_json.
to_json ¶
Serialize the whole corpus to JSON losslessly — every token (with its kind,
signs, glyphs, line/position), the physical lines, full document metadata, the sign
inventory, and provenance all survive. from_json reverses it exactly.
Returns the JSON string, or writes it to path and returns None when path
is given. (Unlike to_dict, which is a compact lossy summary.)
from_json classmethod ¶
Reconstruct a Corpus from to_json output: a JSON string, a Path to a
.json file, or a path-like string (anything not beginning with {).
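The input dispatch described above (JSON string, Path, or path-like string) can be sketched as follows. The helper `load_json_source` and the payload keys are hypothetical. The real `to_json` schema is not shown here, so this only illustrates the "begins with {" rule.

```python
import json
import tempfile
from pathlib import Path


def load_json_source(source):
    """Return parsed JSON from a JSON string, a Path, or a path-like string."""
    if isinstance(source, Path):
        return json.loads(source.read_text(encoding="utf-8"))
    if source.startswith("{"):
        # Begins with "{": treat it as a JSON document and parse it directly.
        return json.loads(source)
    # Anything else is treated as a path to a file.
    return json.loads(Path(source).read_text(encoding="utf-8"))


# Illustrative payload only; these keys are assumptions.
payload = '{"script_id": "lineara", "documents": []}'
from_string = load_json_source(payload)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpus.json"
    path.write_text(payload, encoding="utf-8")
    from_path = load_json_source(path)
    from_str_path = load_json_source(str(path))
```

All three call forms produce the same parsed object.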
from_records classmethod ¶
from_records(records: Sequence[dict[str, Any]], *, script_id: str = 'custom', provenance: Provenance | None = None, sign_inventory: SignInventory | None = None) -> 'Corpus'
Build a corpus from plain dict records — your own inscriptions get the full API (filter, query, DataFrames, citation, export).
Each record needs an "id" and its text as one of:
"lines": a list of physical lines, each a list of tokens;
"words": a flat token list (treated as one line);
"text": a whitespace-tokenized string (one line).
A token is a string, or a dict {"text": …} with optional "kind"
(a TokenKind value; inferred when omitted — numerals by parseability,
the rest words), "status" (a ReadingStatus value), and "alt"
(alternate readings). Hyphenated tokens get their signs split.
Optional record keys: "meta" (site/period/scribe/support/findspot/name)
and "translations". Example:
    corpus = Corpus.from_records([
        {"id": "X1", "text": "KU-RO 10", "meta": {"site": "My site"}},
        {"id": "X2", "lines": [["A-DU", {"text": "5", "status": "unclear"}]]},
    ], script_id="lineara")
To make it loadable by name, register a loader:
aegean.core.corpus.register_loader("myfind", lambda: corpus).
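The record and token shapes accepted by from_records can be sketched as a small normalizer. This is not the library's parser: the helper names, the kind strings `"word"` and `"numeral"`, the use of `str.isdigit()` to approximate "parseable as a numeral", and the empty sign tuple for unhyphenated tokens are all assumptions.

```python
def normalize_token(token):
    """Return (text, kind, signs) for a string token or a {"text": ...} dict."""
    if isinstance(token, str):
        token = {"text": token}
    text = token["text"]
    kind = token.get("kind")
    if kind is None:
        # Assumption: numeral parseability is approximated by str.isdigit().
        kind = "numeral" if text.isdigit() else "word"
    # Hyphenated tokens are split into their component signs.
    signs = tuple(text.split("-")) if "-" in text else ()
    return text, kind, signs


def tokens_of(record):
    """Normalize a "lines", "words", or "text" record into lines of tokens."""
    if "lines" in record:
        lines = record["lines"]
    elif "words" in record:
        lines = [record["words"]]      # a flat list is one line
    else:
        lines = [record["text"].split()]  # whitespace tokenization, one line
    return [[normalize_token(t) for t in line] for line in lines]


from_text = tokens_of({"id": "X1", "text": "KU-RO 10"})
from_lines = tokens_of({"id": "X2", "lines": [["A-DU", {"text": "5", "kind": "numeral"}]]})
```

The three input forms converge on the same line-of-tokens structure, which is why a record only needs one of them.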
from_dict classmethod ¶
Reconstruct a Corpus from the dict that to_json serializes, i.e. the result of json.loads on its output.
query ¶
query(filters: Sequence[FilterRow], output: Output = 'inscriptions', *, annotated_ids: set[str] | None = None) -> QueryResults
Run the compound-query predicate engine over this corpus.
filters is a sequence of aegean.analysis.FilterRow rows (a field id, a
value, and optional connector/negate); output selects "inscriptions"
or "words". Returns aegean.analysis.QueryResults (.inscriptions and
.words) carrying this corpus's provenance and a summary of the filters,
so results.cite() cites the exact result set. The available fields are
in aegean.analysis.FIELDS. Unlike filter (exact metadata match), this
supports text/prefix/sign-pattern/co-occurrence predicates with AND/OR/NOT.
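To illustrate how rows with a connector and a negate flag can combine into one predicate, here is a standalone sketch. It does not use FilterRow and makes no claim about the engine's real precedence rules: strict left-to-right folding, the `(predicate, connector, negate)` tuple shape, and the helper name `evaluate` are assumptions.

```python
def evaluate(rows, item):
    """Fold rows left to right; each row is (predicate, connector, negate)."""
    result = None
    for predicate, connector, negate in rows:
        value = predicate(item)
        if negate:
            value = not value
        if result is None:
            result = value  # the first row's connector is ignored
        elif connector == "OR":
            result = result or value
        else:
            result = result and value
    return bool(result)


rows = [
    (lambda w: w.startswith("KU"), "AND", False),  # prefix predicate
    (lambda w: w.endswith("RO"), "OR", False),     # OR suffix predicate
    (lambda w: w == "KI-RO", "AND", True),         # AND NOT exact match
]
words = ["KU-RO", "KI-RO", "A-DU"]
matches = [w for w in words if evaluate(rows, w)]
```

Under these assumptions the rows read as "(starts with KU OR ends with RO) AND NOT equal to KI-RO".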
Document dataclass ¶
Document(id: str, script_id: str, tokens: list[Token], lines: list[list[int]], glyphs: str = '', transcription: str = '', translations: list[str] = list(), meta: DocumentMeta = DocumentMeta())
DocumentMeta dataclass ¶
DocumentMeta(site: str = '', support: str = '', scribe: str = '', findspot: str = '', period: str = '', name: str = '', images: tuple[str, ...] = (), notes: tuple[str, ...] = ())
Bibliographic / archaeological metadata for a document.
ReadingStatus ¶
Bases: str, Enum
Editorial certainty of a token's reading (Leiden / EpiDoc conventions).
CERTAIN is the default. The others mark the apparatus an epigraphic edition must
preserve — damaged, restored, or lost text. The bundled corpora are normalized
transcriptions (almost entirely CERTAIN; see the Linear A provenance note); a
bring-your-own EpiDoc corpus populates these from <unclear> / <supplied> /
<gap> markup, and the EpiDoc writer emits them back.
Sign dataclass ¶
Sign(label: str, glyph: str | None = None, codepoint: int | None = None, phonetic: str | None = None, script_id: str = '', attrs: dict[str, Any] = dict())
One graphic unit of a script (syllabogram, letter, or logogram).
SignInventory ¶
SignInventory(signs: list[Sign], script_id: str = '')
The set of signs for a script, indexed by label / glyph / codepoint.
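Indexing one list of signs by label, glyph, and codepoint can be sketched with three dictionaries. `MiniSign` and `MiniInventory` are hypothetical stand-ins, and the attribute names `by_label`, `by_glyph`, and `by_codepoint` are not taken from the library. The sample values are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MiniSign:
    """Hypothetical reduced form of a sign: label plus optional glyph/codepoint."""
    label: str
    glyph: Optional[str] = None
    codepoint: Optional[int] = None


class MiniInventory:
    """Build one lookup table per key; signs lacking a key are left out of it."""

    def __init__(self, signs):
        self.by_label = {s.label: s for s in signs}
        self.by_glyph = {s.glyph: s for s in signs if s.glyph is not None}
        self.by_codepoint = {s.codepoint: s for s in signs if s.codepoint is not None}


signs = [
    MiniSign("AB001", glyph=chr(0x10600), codepoint=0x10600),
    MiniSign("X"),  # a sign known only by its label
]
inventory = MiniInventory(signs)
```

All three tables point at the same sign objects, so a lookup by any key returns the identical record.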
Token dataclass ¶
Token(text: str, kind: TokenKind, signs: tuple[str, ...] = (), glyphs: str | None = None, line_no: int | None = None, position: int | None = None, status: ReadingStatus = CERTAIN, alt: tuple[str, ...] = ())
One unit in a document's transliterated text stream.
TokenKind ¶
Bases: str, Enum
The role a token plays in a document's text stream.
Provenance dataclass ¶
Provenance(source: str, license: str = '', citation: str = '', url: str = '', schema_version: int = SCHEMA_VERSION, notes: tuple[str, ...] = tuple(), data_version: str = '')
Where a corpus came from and how to cite it.
bibtex ¶
A BibTeX @misc entry for this source.
Best-effort formatting of the recorded free-text provenance: only fields
actually known are emitted; the first year found in the citation string
(if any) becomes year; the license and any provenance notes (e.g.
the subset note Corpus.filter records) go into note.
apa ¶
An APA-style reference line (n.d. when no year is recoverable).
Best-effort formatting of the recorded free-text provenance; notes
(e.g. the subset note Corpus.filter records) follow in brackets.
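The "first year found in the citation string" rule, with the "n.d." fallback, can be sketched with a regular expression. The pattern, the helper names, and the sample citation strings are assumptions for illustration. The library's actual matching rule is not documented here.

```python
import re


def first_year(citation):
    """Return the first standalone four-digit year (1000-2999), or None."""
    match = re.search(r"\b[12]\d{3}\b", citation)
    return match.group(0) if match else None


def apa_date(citation):
    """Use the recovered year, or "n.d." when none is found."""
    return first_year(citation) or "n.d."


# Made-up citation strings, used only to exercise the helpers.
dated = first_year("Example Author (2019). An example corpus. Revised 2021.")
undated = apa_date("Example corpus, no date recorded")
```

Only the first match is used, so a later revision year in the same string does not override the original one.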
Script ¶
Bases: ABC
A writing system the package can read and analyse.
register_loader ¶
Register a corpus loader so Corpus.load(script_id) / aegean.load(script_id) works.
get_script ¶
get_script(script_id: str) -> Script
Return the registered Script for script_id (raises KeyError if unknown).
register ¶
register(script: Script) -> None
Register a script plugin under its id (each built-in plugin calls this on import).
registered_scripts ¶
The sorted ids of all registered scripts, e.g. ['cypriot', 'cyprominoan', 'greek', 'lineara', 'linearb'].
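The register / look up / list pattern described in this section can be sketched as a small dictionary-backed registry. `MiniScript` and `MiniRegistry` are hypothetical names, and this is an illustration of the pattern rather than the package's code.

```python
class MiniScript:
    """Hypothetical stand-in for a script plugin; only its id matters here."""

    def __init__(self, script_id):
        self.id = script_id


class MiniRegistry:
    def __init__(self):
        self._scripts = {}

    def register(self, script):
        """Store the plugin under its own id."""
        self._scripts[script.id] = script

    def get(self, script_id):
        """Return the plugin; a plain dict lookup raises KeyError if unknown."""
        return self._scripts[script_id]

    def ids(self):
        """Return the registered ids in sorted order."""
        return sorted(self._scripts)


registry = MiniRegistry()
for script_id in ("linearb", "lineara"):
    registry.register(MiniScript(script_id))
listed = registry.ids()
```

Sorting on read means the listing is stable regardless of import or registration order.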