
aegean.core

core

Script-agnostic core: data model, script plugin contract, corpus, numerals.

Corpus

Corpus(documents: list[Document], sign_inventory: SignInventory | None = None, provenance: Provenance | None = None, script_id: str = '')

A collection of Documents plus a shared sign inventory and provenance.

load classmethod

load(script_id: str) -> 'Corpus'

Load a bundled corpus by script id, e.g. Corpus.load("lineara").

get

get(doc_id: str) -> Document | None

The document with id doc_id, or None if there is no such document.

fingerprint

fingerprint() -> str

A stable content hash of this corpus — its script, documents (ids and token text), and any subset: provenance note. Cheap relative to the analyses it keys: one pass over the tokens, no model build. Two corpora with the same fingerprint have the same analysable content, so it's the cache key for aegean.cache-memoised analyses.
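The idea of a content fingerprint can be sketched without the library. This is a minimal stand-alone illustration, assuming each document is reduced to an (id, token texts) pair; the helper name content_fingerprint and the choice of SHA-256 are assumptions, not the hashing scheme the package actually uses.

```python
import hashlib

def content_fingerprint(script_id, documents, subset_note=""):
    """Hash the script id, a subset note, and each document's id and token text.

    Hypothetical helper: `documents` is an iterable of (doc_id, [token_text, ...]).
    """
    h = hashlib.sha256()
    # A separator byte after each field keeps adjacent fields from running together.
    for field in (script_id, subset_note):
        h.update(field.encode("utf-8") + b"\x00")
    for doc_id, tokens in documents:
        h.update(doc_id.encode("utf-8") + b"\x01")
        for text in tokens:  # a single pass over the tokens, no model build
            h.update(text.encode("utf-8") + b"\x00")
    return h.hexdigest()

docs = [("X1", ["KU-RO", "10"]), ("X2", ["A-DU", "5"])]
same = content_fingerprint("lineara", docs) == content_fingerprint("lineara", list(docs))
changed = content_fingerprint("lineara", docs) != content_fingerprint("lineara", docs[:1])
```

Equal content gives an equal digest and any change to the documents gives a different one, which is the property a cache key needs.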

cache_key

cache_key() -> str

Alias for fingerprint, the protocol aegean.cache keys on.

filter

filter(**meta: Any) -> 'Corpus'

Return a new Corpus whose documents match all given metadata fields (AND-combination), e.g. corpus.filter(site="HT", period="LMIB").

The subset's provenance records what was filtered (a subset: note), so cite on the result cites the exact subset used.
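The AND-combination described above can be illustrated on plain dicts. This is a sketch only: filter_records and the record layout are hypothetical stand-ins, not the library's Corpus.filter.

```python
def filter_records(records, **meta):
    """Keep records whose metadata equals every given field (AND-combination)."""
    return [
        r for r in records
        if all(r.get("meta", {}).get(key) == value for key, value in meta.items())
    ]

records = [
    {"id": "A", "meta": {"site": "HT", "period": "LMIB"}},
    {"id": "B", "meta": {"site": "HT", "period": "LMIA"}},
    {"id": "C", "meta": {"site": "KH", "period": "LMIB"}},
]
# Only record A matches both fields; B fails on period, C fails on site.
matched = [r["id"] for r in filter_records(records, site="HT", period="LMIB")]
```

With no keyword arguments every record passes, since an empty set of conditions is trivially satisfied.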

cite

cite(style: str = 'plain') -> str

Cite this corpus — or the exact filtered subset — in one call.

style: "plain" (one line), "bibtex" (a @misc entry), or "apa". Filtered subsets (see filter) carry a subset: note that all three styles include, so the citation states exactly what was used.

iter_documents

iter_documents() -> Iterator[Document]

Iterate documents (the explicit-name form of iter(corpus)).

iter_tokens

iter_tokens() -> Iterator[Token]

Every Token, in document then in-document order — a memory-friendly stream that never builds an all-tokens list (useful on a large corpus).

iter_words

iter_words() -> Iterator[str]

Every lexical (WORD) token's text, in order, lazily. This is the unit that word_frequencies counts; stream it to feed your own Counter or a running statistic without materialising a list.

word_frequencies

word_frequencies() -> list[tuple[str, int]]

(word, count) for every lexical word, sorted by descending count.
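A lazy word stream feeds collections.Counter directly, and Counter.most_common returns (word, count) pairs sorted by descending count. The sketch below uses a hypothetical word_stream generator over plain lists in place of a real corpus.

```python
from collections import Counter

def word_stream(documents):
    """Yield each word in document order without building a combined list."""
    for words in documents:
        yield from words

documents = [["KU-RO", "A-DU"], ["KU-RO"]]
counts = Counter(word_stream(documents))
# most_common() orders by descending count.
freqs = counts.most_common()
```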

to_dataframe

to_dataframe(level: str = 'document')

A pandas DataFrame at document, token, or word level.

pandas is an optional dependency — install with pip install 'pyaegean[data]'.

to_dict

to_dict() -> dict[str, Any]

A compact, lossy export (_meta + per-document words/metadata) for quick interop. For a complete, reversible serialization use to_json/from_json.

to_json

to_json(path: str | Path | None = None, *, indent: int | None = 2) -> str | None

Serialize the whole corpus to JSON losslessly — every token (with its kind, signs, glyphs, line/position), the physical lines, full document metadata, the sign inventory, and provenance all survive. from_json reverses it exactly.

Returns the JSON string; when path is given, writes the JSON to path and returns None instead. Unlike to_dict, which is a compact lossy summary, this output is complete.
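The return-or-write convention can be sketched with the standard library. dump_json and the sample payload are hypothetical; only the path and indent behaviour mirrors the signature documented above.

```python
import json
import tempfile
from pathlib import Path

def dump_json(data, path=None, *, indent=2):
    """Return the JSON text, or write it to `path` and return None."""
    text = json.dumps(data, indent=indent, ensure_ascii=False)
    if path is None:
        return text
    Path(path).write_text(text, encoding="utf-8")
    return None

data = {"script_id": "lineara", "documents": []}  # placeholder payload
text = dump_json(data)
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "corpus.json"
    written = dump_json(data, target)
    round_trip = json.loads(target.read_text(encoding="utf-8"))
```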

from_json classmethod

from_json(source: str | Path) -> 'Corpus'

Reconstruct a Corpus from to_json output: a JSON string, a Path to a .json file, or a path-like string (anything not beginning with {).
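The three accepted source forms can be told apart with one check on the leading character. The helper below is a sketch under that assumption; read_json_source is a hypothetical name and it returns the parsed dict rather than a Corpus.

```python
import json
import tempfile
from pathlib import Path

def read_json_source(source):
    """Parse `source` as JSON text if it begins with '{', else read it as a file path."""
    if isinstance(source, str) and source.startswith("{"):
        return json.loads(source)
    return json.loads(Path(source).read_text(encoding="utf-8"))

inline = read_json_source('{"script_id": "custom"}')
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpus.json"
    path.write_text('{"script_id": "custom"}', encoding="utf-8")
    from_path = read_json_source(path)           # a Path object
    from_str_path = read_json_source(str(path))  # a path-like string
```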

from_records classmethod

from_records(records: Sequence[dict[str, Any]], *, script_id: str = 'custom', provenance: Provenance | None = None, sign_inventory: SignInventory | None = None) -> 'Corpus'

Build a corpus from plain dict records — your own inscriptions get the full API (filter, query, DataFrames, citation, export).

Each record needs an "id" and its text as one of:

  • "lines": a list of physical lines, each a list of tokens;
  • "words": a flat token list (treated as one line);
  • "text": a whitespace-tokenized string (one line).

A token is a string, or a dict {"text": …} with optional "kind" (a TokenKind value; when omitted it is inferred: numerals by parseability, everything else as words), "status" (a ReadingStatus value), and "alt" (alternate readings). Hyphenated tokens get their signs split. Optional record keys: "meta" (site/period/scribe/support/findspot/name), "translations". Example:

corpus = Corpus.from_records([
    {"id": "X1", "text": "KU-RO 10", "meta": {"site": "My site"}},
    {"id": "X2", "lines": [["A-DU", {"text": "5", "status": "unclear"}]]},
], script_id="lineara")

To make it loadable by name, register a loader: aegean.core.corpus.register_loader("myfind", lambda: corpus).
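The record rules above (three text forms, kind inference, sign splitting) can be sketched as small stand-alone helpers. All three function names are hypothetical, the kind strings "numeral" and "word" are placeholders for TokenKind values, and integer parsing is an assumed stand-in for whatever numeral parsing the library applies.

```python
def record_lines(record):
    """Normalize a record's text to a list of lines, each a list of tokens."""
    if "lines" in record:
        return [list(line) for line in record["lines"]]
    if "words" in record:
        return [list(record["words"])]      # a flat token list is one line
    if "text" in record:
        return [record["text"].split()]     # whitespace-tokenized, one line
    raise ValueError(f"record {record.get('id')!r} has no lines, words, or text")

def infer_kind(text):
    """Guess a kind: 'numeral' if the text parses as an integer, else 'word'."""
    try:
        int(text)
    except ValueError:
        return "word"
    return "numeral"

def split_signs(text):
    """Split a hyphenated token into its signs."""
    return tuple(text.split("-"))

lines = record_lines({"id": "X1", "text": "KU-RO 10"})
kinds = [infer_kind(tok) for tok in lines[0]]
signs = split_signs("KU-RO")
```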

from_dict classmethod

from_dict(data: dict[str, Any]) -> 'Corpus'

Reconstruct a Corpus from the dict form of to_json output (that is, the result of json.loads on the JSON string).

query

query(filters: Sequence[FilterRow], output: Output = 'inscriptions', *, annotated_ids: set[str] | None = None) -> QueryResults

Run the compound-query predicate engine over this corpus.

filters is a sequence of aegean.analysis.FilterRow rows (a field id, a value, and optional connector/negate); output selects "inscriptions" or "words". Returns aegean.analysis.QueryResults (.inscriptions and .words) carrying this corpus's provenance and a summary of the filters, so results.cite() cites the exact result set. The available fields are in aegean.analysis.FIELDS. Unlike filter (exact metadata match), this supports text/prefix/sign-pattern/co-occurrence predicates with AND/OR/NOT.

Document dataclass

Document(id: str, script_id: str, tokens: list[Token], lines: list[list[int]], glyphs: str = '', transcription: str = '', translations: list[str] = list(), meta: DocumentMeta = DocumentMeta())

One inscription / tablet / text.

line_tokens property

line_tokens: list[list[Token]]

Tokens regrouped by physical line.
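Given the lines: list[list[int]] field in the Document signature, one plausible reading is that each inner list holds indices into tokens. Under that assumption (not confirmed by this page), the regrouping is a nested comprehension; plain strings stand in for Token objects.

```python
tokens = ["KU-RO", "10", "A-DU", "5"]
lines = [[0, 1], [2, 3]]  # assumed: token indices per physical line

# Look up each index to rebuild the tokens of every physical line.
line_tokens = [[tokens[i] for i in line] for line in lines]
```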

DocumentMeta dataclass

DocumentMeta(site: str = '', support: str = '', scribe: str = '', findspot: str = '', period: str = '', name: str = '', images: tuple[str, ...] = (), notes: tuple[str, ...] = ())

Bibliographic / archaeological metadata for a document.

ReadingStatus

Bases: str, Enum

Editorial certainty of a token's reading (Leiden / EpiDoc conventions).

CERTAIN is the default. The others mark the apparatus an epigraphic edition must preserve — damaged, restored, or lost text. The bundled corpora are normalized transcriptions (almost entirely CERTAIN; see the Linear A provenance note); a bring-your-own EpiDoc corpus populates these from <unclear> / <supplied> / <gap> markup, and the EpiDoc writer emits them back.
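Deriving from both str and Enum means members compare equal to plain strings and can be looked up from a value such as "unclear". The enum below is a hypothetical two-member stand-in; the value "certain" in particular is assumed, and the real ReadingStatus has more members.

```python
from enum import Enum

class Status(str, Enum):
    # Hypothetical members for illustration only.
    CERTAIN = "certain"
    UNCLEAR = "unclear"

# Lookup by value returns the existing member.
parsed = Status("unclear")
# A str-based member compares equal to its plain string value.
equal_to_plain_string = Status.UNCLEAR == "unclear"
```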

Sign dataclass

Sign(label: str, glyph: str | None = None, codepoint: int | None = None, phonetic: str | None = None, script_id: str = '', attrs: dict[str, Any] = dict())

One graphic unit of a script (syllabogram, letter, or logogram).

SignInventory

SignInventory(signs: list[Sign], script_id: str = '')

The set of signs for a script, indexed by label / glyph / codepoint.

Token dataclass

Token(text: str, kind: TokenKind, signs: tuple[str, ...] = (), glyphs: str | None = None, line_no: int | None = None, position: int | None = None, status: ReadingStatus = CERTAIN, alt: tuple[str, ...] = ())

One unit in a document's transliterated text stream.

TokenKind

Bases: str, Enum

The role a token plays in a document's text stream.

Provenance dataclass

Provenance(source: str, license: str = '', citation: str = '', url: str = '', schema_version: int = SCHEMA_VERSION, notes: tuple[str, ...] = tuple(), data_version: str = '')

Where a corpus came from and how to cite it.

cite

cite() -> str

A one-line citation string for papers / logs.

bibtex

bibtex(key: str = 'aegean-corpus') -> str

A BibTeX @misc entry for this source.

Best-effort formatting of the recorded free-text provenance: only fields actually known are emitted; the first year found in the citation string (if any) becomes year; the license and any provenance notes (e.g. the subset note Corpus.filter records) go into note.
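Picking out "the first year found" in a free-text citation is typically a regular-expression search. The sketch below is one way to do it; first_year, the accepted range (1500 to 2099), and the sample string are assumptions, not the library's actual rule.

```python
import re

# Four-digit years from 1500 to 2099, bounded so longer numbers do not match.
YEAR = re.compile(r"\b(?:1[5-9]\d{2}|20\d{2})\b")

def first_year(citation):
    """Return the first plausible year in `citation`, or None if there is none."""
    match = YEAR.search(citation)
    return match.group(0) if match else None

sample = "Example Author (2019). Example corpus, revised 2021."  # made-up string
year = first_year(sample)
missing = first_year("Example corpus, undated")
```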

apa

apa() -> str

An APA-style reference line (n.d. when no year is recoverable).

Best-effort formatting of the recorded free-text provenance; notes (e.g. the subset note Corpus.filter records) follow in brackets.

Script

Bases: ABC

A writing system the package can read and analyse.

sign_inventory abstractmethod property

sign_inventory: SignInventory

The script's SignInventory.

tokenize abstractmethod

tokenize(raw: str) -> list[Token]

Split a raw transliteration string into typed tokens.

register_loader

register_loader(script_id: str, fn: Callable[[], 'Corpus']) -> None

Register a corpus loader so Corpus.load(script_id) / aegean.load(script_id) works.
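A loader registry of this kind is commonly a mapping from id to a zero-argument callable. The following is a minimal sketch of that pattern; the _sketch names are hypothetical, and raising KeyError for an unknown id is an assumption about behaviour this page does not state.

```python
_LOADERS = {}

def register_loader_sketch(script_id, fn):
    """Remember a zero-argument callable that builds the corpus for `script_id`."""
    _LOADERS[script_id] = fn

def load_sketch(script_id):
    """Call the registered loader, or fail clearly when none is registered."""
    if script_id not in _LOADERS:
        raise KeyError(f"no corpus loader registered for {script_id!r}")
    return _LOADERS[script_id]()

corpus = {"script_id": "myfind", "documents": []}  # placeholder for a Corpus
register_loader_sketch("myfind", lambda: corpus)
loaded = load_sketch("myfind")
```

Storing a callable rather than the corpus itself defers the loading work until the id is first requested.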

get_script

get_script(script_id: str) -> Script

Return the registered Script for script_id (raises KeyError if unknown).

register

register(script: Script) -> None

Register a script plugin under its id (each built-in plugin calls this on import).

registered_scripts

registered_scripts() -> list[str]

The sorted ids of all registered scripts, e.g. ['cypriot', 'cyprominoan', 'greek', 'lineara', 'linearb'].