Skip to content

aegean.io

io

I/O adapters — export the corpus model to interchange formats.

EpiDoc reading (bring-your-own Linear B corpora) lives in aegean.scripts.linearb and Corpus.load("linearb"); this package provides the EpiDoc writer plus CSV/Parquet exporters. For pyaegean's own lossless archive format, use Corpus.to_json / Corpus.from_json.

to_epidoc

to_epidoc(document: Document) -> str

Serialize a single Document to an EpiDoc TEI XML string.

The transliteration lives in a TEI <div type="edition"> (EpiDoc's required edition division), as <lb/>-delimited lines of <w>/<num>/<g> tokens. A token whose aegean.ReadingStatus is not CERTAIN is wrapped in the matching EpiDoc apparatus element (<unclear> or <supplied>), so editorial certainty survives the round trip through aegean.scripts.linearb.parse_epidoc. The output validates against the EpiDoc RelaxNG schema (see tests/test_io.py).

write_epidoc

write_epidoc(obj: Corpus | Document, path: str | Path) -> None

Write EpiDoc TEI XML to disk.

A single Document is written to the file path; a Corpus is written as one {id}.xml file per document into the directory path (created if needed) — the layout aegean.scripts.linearb.parse_epidoc reads back.

to_csv

to_csv(corpus: Corpus, path: str | Path, *, level: str = 'document') -> None

Write the corpus's level DataFrame ("document"/"token"/"word") to CSV.

to_parquet

to_parquet(corpus: Corpus, path: str | Path, *, level: str = 'document') -> None

Write the corpus's level DataFrame to Parquet (needs a parquet engine).

from_workbench_export

from_workbench_export(source: str | Path | dict[str, Any] | list[Any]) -> Corpus

Load a workbench corpus export into a Corpus.

source is a path to a JSON file, a JSON string, or already-parsed JSON. Both forms the workbench produces are accepted: the schema-v1 export object (records under "inscriptions", provenance under "_meta", per-record "derived" analyses — ignored here) and a plain array of inscription records.

Token kinds are inferred the Corpus.from_records way (numerals by parseability, everything else a word); glyphs, transcription, and image references are carried onto the documents. The export's own metadata (app version, generation time, scope) lands in the corpus provenance.

to_workbench

to_workbench(corpus: Corpus, path: str | Path | None = None) -> list[dict[str, Any]]

Emit workbench-shaped inscription records (optionally writing JSON).

Each document becomes one record with the fields the workbench renders: id/site/support/scribe/findspot/context (its name for the dating period)/name, the flat words list, per-line lines, translations, glyphs, transcription, and image references. Image files are never embedded — the workbench treats the references as paths under its own mirror, so corpora without one simply show no imagery.

With path, the records are also written as UTF-8 JSON — the file the app loads via ?corpus=<url> or its corpus file picker.