aegean.io¶
I/O adapters that export the corpus model to interchange formats.
EpiDoc reading (for bring-your-own Linear B corpora) lives in aegean.scripts.linearb and
Corpus.load("linearb"); this package provides the EpiDoc writer, the CSV and Parquet
exporters, and the workbench import/export functions.
For pyaegean's own lossless archive format, use Corpus.to_json / Corpus.from_json.
to_epidoc ¶
to_epidoc(document: Document) -> str
Serialize a single Document to an EpiDoc TEI XML string.
The transliteration lives in a TEI <div type="edition"> (EpiDoc's required edition
division), as <lb/>-delimited lines of <w>/<num>/<g> tokens. A token whose
aegean.ReadingStatus is not CERTAIN is wrapped in the matching EpiDoc apparatus
element (<unclear> or <supplied>), so editorial certainty survives the round trip
through aegean.scripts.linearb.parse_epidoc. The output validates against the EpiDoc
RelaxNG schema (see tests/test_io.py).
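To make the markup described above concrete, here is a minimal sketch that reads token certainty back out of an edition division using only the standard library. The SAMPLE fragment is hand-written for illustration and is not actual to_epidoc output; the helper name token_statuses, the sample tokens, and the assumption that the apparatus element is the direct parent of the token are all hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

# Hand-written fragment in the shape the docstring describes (an assumption,
# not captured library output): <lb/>-delimited lines of <w>/<num>/<g> tokens,
# with uncertain tokens wrapped in <unclear> or <supplied>.
SAMPLE = f"""<TEI xmlns="{TEI}">
  <text><body>
    <div type="edition">
      <ab>
        <lb n="1"/><w>ko-no-so</w> <unclear><w>pa-i-to</w></unclear>
        <lb n="2"/><g>OVIS</g> <num>100</num> <supplied reason="lost"><w>to-so</w></supplied>
      </ab>
    </div>
  </body></text>
</TEI>"""


def token_statuses(xml_text):
    """Return (token text, status) pairs from the edition division, in document order."""
    root = ET.fromstring(xml_text)
    edition = root.find(".//tei:div[@type='edition']", NS)
    if edition is None:
        raise ValueError("no <div type='edition'> found")

    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in edition.iter() for child in parent}
    token_tags = {f"{{{TEI}}}{name}" for name in ("w", "num", "g")}
    wrappers = {f"{{{TEI}}}unclear": "unclear", f"{{{TEI}}}supplied": "supplied"}

    statuses = []
    for element in edition.iter():
        if element.tag in token_tags:
            parent = parent_of.get(element)
            status = wrappers.get(parent.tag, "certain") if parent is not None else "certain"
            statuses.append((element.text or "", status))
    return statuses


for text, status in token_statuses(SAMPLE):
    print(f"{text}\t{status}")
```

The same helper could be pointed at the string returned by to_epidoc(document), provided the wrapping matches the assumption stated above.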
write_epidoc ¶
Write EpiDoc TEI XML to disk.
For a single Document, path is the output file. For a Corpus, path is a
directory (created if needed) that receives one {id}.xml file per document,
which is the layout aegean.scripts.linearb.parse_epidoc reads back.
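The corpus layout described above (one {id}.xml file per document, in a directory that is created on demand) can be sketched with the standard library alone. This is an illustration of the layout, not the library's implementation; the function name write_one_file_per_document and the document ids in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def write_one_file_per_document(xml_by_id, directory):
    """Write each XML string to <directory>/<id>.xml and return the paths written."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)  # "created if needed"
    written = []
    for doc_id, xml_text in xml_by_id.items():
        target = out_dir / f"{doc_id}.xml"
        target.write_text(xml_text, encoding="utf-8")
        written.append(target)
    return written


# Demo with placeholder ids and placeholder XML (both hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    paths = write_one_file_per_document(
        {"doc-1": "<TEI/>", "doc-2": "<TEI/>"}, Path(tmp) / "epidoc"
    )
    print(sorted(p.name for p in paths))
```

In practice the XML strings would come from to_epidoc, one call per document.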
to_csv ¶
to_csv(corpus: Corpus, path: str | Path, *, level: str = 'document') -> None
Write the corpus DataFrame for the chosen level ("document", "token", or "word") to CSV.
to_parquet ¶
to_parquet(corpus: Corpus, path: str | Path, *, level: str = 'document') -> None
Write the corpus DataFrame for the chosen level to Parquet. A Parquet engine must be installed.
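Because to_csv and to_parquet share the same signature, a small wrapper can export every level in one pass. The sketch below takes the exporter as an argument so it stays independent of the library; export_levels is a hypothetical helper, and it assumes to_parquet accepts the same three level names that to_csv documents.

```python
from pathlib import Path

# Level names as listed for to_csv; assumed (not confirmed) to apply to to_parquet too.
LEVELS = ("document", "token", "word")


def export_levels(corpus, out_dir, writer, suffix):
    """Call writer(corpus, path, level=...) once per level; return {level: path}."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {}
    for level in LEVELS:
        path = out / f"{level}{suffix}"
        writer(corpus, path, level=level)
        paths[level] = path
    return paths


# Intended use (requires pyaegean, so it is left commented out here):
# from aegean.io import to_csv, to_parquet
# export_levels(corpus, "exports", to_csv, ".csv")
# export_levels(corpus, "exports", to_parquet, ".parquet")
```

Passing the writer in keeps the loop identical for both formats and makes the helper easy to exercise with a stand-in function.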
from_workbench_export ¶
from_workbench_export(source: str | Path | dict[str, Any] | list[Any]) -> Corpus
Load a workbench corpus export into a Corpus.
source is a path to a JSON file, a JSON string, or already-parsed
JSON. Both forms the workbench produces are accepted: the schema-v1
export object (records under "inscriptions", provenance under
"_meta", and per-record "derived" analyses, which are ignored on
load) and a plain array of inscription records.
Token kinds are inferred the same way Corpus.from_records infers them: a
token that parses as a numeral is a numeral, and everything else is a word.
Glyphs, transcription, and image references are carried onto the documents.
The export's own metadata (app version, generation time, scope) lands in
the corpus provenance.
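The two accepted export shapes and the numeral-versus-word rule can be illustrated with a standalone sketch. This is not the library's code: split_export and infer_kind are hypothetical names, the rule for telling a JSON string from a file path is an assumption, and int() stands in for whatever numeral parsing Corpus.from_records actually performs.

```python
import json
from pathlib import Path


def split_export(source):
    """Return (records, meta) from a schema-v1 export object or a plain record array."""
    if isinstance(source, Path):
        data = json.loads(source.read_text(encoding="utf-8"))
    elif isinstance(source, str):
        text = source.lstrip()
        # Assumption: a string that opens with { or [ is JSON text, anything else is a path.
        if text.startswith(("{", "[")):
            data = json.loads(text)
        else:
            data = json.loads(Path(source).read_text(encoding="utf-8"))
    else:
        data = source  # already-parsed JSON

    if isinstance(data, list):
        return data, {}  # plain array: no export metadata
    if isinstance(data, dict) and "inscriptions" in data:
        return data["inscriptions"], data.get("_meta", {})
    raise ValueError("not a recognised workbench export shape")


def infer_kind(token):
    """Classify a token string by parseability (int() is a stand-in parser)."""
    try:
        int(token)
    except ValueError:
        return "word"
    return "numeral"


# Hypothetical schema-v1 export; the "_meta" key shown here is illustrative only.
export = {"_meta": {"scope": "all"}, "inscriptions": [{"id": "X1", "derived": {}}]}
records, meta = split_export(export)
print(len(records), meta)
print(infer_kind("12"), infer_kind("ko-no-so"))
```

With pyaegean installed, from_workbench_export(source) performs the real load and returns a Corpus; the sketch only shows which parts of the input it is documented to read.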
to_workbench ¶
to_workbench(corpus: Corpus, path: str | Path | None = None) -> list[dict[str, Any]]
Emit workbench-shaped inscription records (optionally writing JSON).
Each document becomes one record with the fields the workbench renders:
id, site, support, scribe, findspot, context (the workbench's name for the
dating period), and name; the flat words list; per-line lines;
translations; glyphs; transcription; and image references. Image files are
never embedded. The workbench treats the references as paths under its own
mirror, so corpora without a mirror simply show no imagery.
When path is given, the records are also written there as UTF-8 JSON, the
file the app loads via ?corpus=<url> or its corpus file picker.
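Since to_workbench returns plain dictionaries, the records can be inspected without the app. The sketch below tallies word counts per site using the site and words fields named above; words_per_site is a hypothetical helper, and it assumes site is a string (or missing) and words is a list.

```python
from collections import Counter


def words_per_site(records):
    """Sum the length of each record's flat words list, grouped by site."""
    counts = Counter()
    for record in records:
        site = record.get("site") or "unknown"
        counts[site] += len(record.get("words") or [])
    return dict(counts)


# Hand-written records in the documented shape (values are illustrative only).
sample_records = [
    {"id": "a", "site": "Knossos", "words": ["ko-no-so", "a-mi-ni-so"]},
    {"id": "b", "site": "Pylos", "words": ["pu-ro"]},
]
print(words_per_site(sample_records))

# Intended use (requires pyaegean, so it is left commented out here):
# from aegean.io import to_workbench
# records = to_workbench(corpus)                  # records only
# records = to_workbench(corpus, "corpus.json")   # records plus a UTF-8 JSON file
```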