aegean.greek¶
greek ¶
Greek NLP pipeline — composable, individually-callable stages.
The dependency-free core covers normalize (NFC/NFD + Beta Code ↔ Unicode, with
a lenient OCR-repair mode), tokenize (word/sentence), syllabify, accent
analysis (accentuation), prosody/meter scansion, phonology (IPA),
a seed lemmatize, baseline pos, and a rule-based morphology analyzer
(analyze). pipeline runs the whole stack over a text in one call.
Opt-in backends layer on richer data and models:
use_neural_pipeline (the [neural] extra) loads the joint neural model: one pass serving UPOS, full morphology (UD FEATS), UD dependency trees, and lemmas, state of the art on the UD Ancient Greek benchmarks (measured numbers in docs/benchmarks.md). Once active, pos_tag/pos_tags, lemmatize, parse, and pipeline all use it.
use_treebank (Perseus AGDT) supplies attested, correctly-accented lemmas and full features for known forms.
use_lsj (Perseus Liddell-Scott-Jones) provides glossing (gloss/lookup).
use_parser (parse; arc-eager + averaged perceptron, trained on the AGDT) is a projective dependency parser (~0.67 UAS / 0.57 LAS).
use_tagger is an averaged-perceptron POS tagger (~84% on unseen forms).
use_lemmatizer is an edit-tree lemmatizer (~40% on unseen forms).
use_neural_lemmatizer (the [neural] extra) is a GreTa T5 seq2seq model served as int8 ONNX without torch; it pairs a gold lookup with seq2seq decoding and reaches 76.3% on unseen forms.
lemmatize cascades neural pipeline -> treebank -> neural -> edit-tree -> seed.
Every stage is a plain function, so it can be used standalone::

    from aegean import greek

    greek.betacode_to_unicode("mh=nin")                 # 'μῆνιν'
    greek.syllabify("ἄνθρωπος")                         # ['ἄν', 'θρω', 'πος']
    greek.accentuation("λόγος").classification          # 'paroxytone'
    greek.pipeline("ἐν ἀρχῇ ἦν ὁ λόγος.")               # per-token records, one call

AccentInfo
dataclass
¶
AccentInfo(syllables: tuple[str, ...], accent_type: str | None, position_from_end: int | None, classification: str | None)
The accent analysis of one word.
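To make the four fields concrete, here is a minimal sketch of how an accent could be classified from syllables that are already split. It is not the library's implementation: `classify_accent` and its return tuple are hypothetical, only the standard-library `unicodedata` module is used, and labelling a grave as an oxytone is an assumption about naming.

```python
import unicodedata

# Combining marks that carry the accent once a syllable is NFD-decomposed.
ACCENTS = {"\u0301": "acute", "\u0300": "grave", "\u0342": "circumflex"}

# (accent type, position counted from the end) -> traditional name.
NAMES = {
    ("acute", 1): "oxytone",
    ("grave", 1): "oxytone",  # assumption: a grave is a lowered acute on the ultima
    ("acute", 2): "paroxytone",
    ("acute", 3): "proparoxytone",
    ("circumflex", 1): "perispomenon",
    ("circumflex", 2): "properispomenon",
}


def classify_accent(syllables):
    """Hypothetical helper: (accent_type, position_from_end, classification)."""
    # Walk the syllables from the end of the word, counting positions from 1.
    for position, syllable in enumerate(reversed(syllables), start=1):
        for ch in unicodedata.normalize("NFD", syllable):
            if ch in ACCENTS:
                kind = ACCENTS[ch]
                return kind, position, NAMES.get((kind, position))
    return None, None, None  # no accent mark found (e.g. a proclitic)
```

Called on `("λό", "γος")` the sketch reports an acute on the second syllable from the end, which matches the `'paroxytone'` classification shown in the module example.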
Analysis
dataclass
¶
Analysis(lemma: str, pos: str, case: str | None = None, number: str | None = None, gender: str | None = None, tense: str | None = None, voice: str | None = None, mood: str | None = None, person: str | None = None, degree: str | None = None, lemma_certain: bool = True)
One candidate morphological reading of a form.
TreebankLexicon ¶
An attested form→analyses lexicon built from the AGDT treebank.
load
classmethod
¶
Load a built lexicon JSON (defaults to the cached one).
analyze ¶
analyze(form: str) -> tuple[Analysis, ...]
Attested analyses for a form (frequency-ordered), or () if unknown.
lemmatize ¶
The most-attested lemma for a form, or None if unknown.
pos ¶
The most-attested part-of-speech tag for a form, or None if unknown.
LSJEntry
dataclass
¶
A Liddell-Scott-Jones entry.
LSJLexicon ¶
LexiconNotLoadedError ¶
Bases: RuntimeError
Raised when gloss/lookup is called before use_lsj.
DepToken
dataclass
¶
One token in a dependency tree (1-based id; head=0 is the root).
ParserNotLoadedError ¶
Bases: RuntimeError
Raised when parse is called before use_parser.
TaggerNotLoadedError ¶
Bases: RuntimeError
Raised when tag_pos is used before use_tagger.
LemmatizerNotLoadedError ¶
Bases: RuntimeError
Raised when the trained lemmatizer is used before use_lemmatizer.
NeuralLemmatizerNotLoadedError ¶
Bases: RuntimeError
Raised when the neural lemmatizer is used before use_neural_lemmatizer,
or when the [neural] extra (onnxruntime/tokenizers) is not installed.
NeuralPipelineNotLoadedError ¶
Bases: RuntimeError
Raised when the neural pipeline is used before use_neural_pipeline, or when
the [neural] extra (onnxruntime/tokenizers/numpy) is not installed.
SentenceAnalysis
dataclass
¶
SentenceAnalysis(tokens: tuple[str, ...], upos: tuple[str, ...], xpos: tuple[str, ...], feats: tuple[str, ...], head: tuple[int, ...], deprel: tuple[str, ...], lemma: tuple[str, ...])
The joint model's full analysis of one sentence (parallel, per-token lists).
NormalizationWarning ¶
Bases: UserWarning
Emitted by normalize(..., lenient=True) for each class of repair.
TokenRecord
dataclass
¶
TokenRecord(sentence: int, index: int, text: str, upos: str, lemma: str, lemma_known: bool, head: int | None = None, relation: str | None = None, xpos: str | None = None, feats: str | None = None)
One token's full analysis from pipeline.
head refers to the index of another record in the same sentence
(0 = sentence root, None = no parse). xpos/feats are filled
only by the neural pipeline; lemma_known is False when the lemma is a
fallback (the normalized form itself, from an unknown word).
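The head convention can be illustrated with a stand-in dataclass that mirrors the documented fields. This is a sketch, not the library class: `head_text` is a hypothetical helper, the two records are written by hand rather than produced by pipeline, and it assumes that index is 1-based so that 0 is free to mean the root.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TokenRecord:  # stand-in mirroring the documented fields, not the real class
    sentence: int
    index: int
    text: str
    upos: str
    lemma: str
    lemma_known: bool
    head: Optional[int] = None
    relation: Optional[str] = None
    xpos: Optional[str] = None
    feats: Optional[str] = None


def head_text(records, record):
    """Hypothetical helper: the head's text, 'ROOT' for head=0, None if unparsed."""
    if record.head is None:
        return None
    if record.head == 0:
        return "ROOT"
    # Heads only ever point inside the same sentence.
    by_index = {r.index: r for r in records if r.sentence == record.sentence}
    return by_index[record.head].text


# Hand-written records for illustration only (assumed 1-based index).
records = [
    TokenRecord(0, 1, "ὁ", "DET", "ὁ", True, head=2, relation="det"),
    TokenRecord(0, 2, "λόγος", "NOUN", "λόγος", True, head=0, relation="root"),
]
print(head_text(records, records[0]), head_text(records, records[1]))
```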
Foot
dataclass
¶
One metrical foot: its name and the syllables/quantities it spans.
LineScansion
dataclass
¶
LineScansion(line: str, meter: str, feet: tuple[Foot, ...], syllables: tuple[str, ...], quantities: tuple[str, ...], caesura: str | None, caesura_index: int | None, ambiguous: bool)
The scansion of one verse line.
ScansionError ¶
Bases: ValueError
Raised when a line cannot be fit to the requested meter.
lemmatize_verbose ¶
Return (lemma, known). known is False when the form wasn't found and
the (normalized) input is returned unchanged.
Backends are consulted in order of preference:
the AGDT treebank backend, when active (see aegean.greek.use_treebank), supplies its attested, correctly-accented lemma;
next, the neural backend, when active (see aegean.greek.use_neural_lemmatizer), supplies its GreTa seq2seq prediction, which generalizes well to unseen forms (76.3%);
next comes the trained edit-tree lemmatizer (see aegean.greek.use_lemmatizer);
otherwise the bundled seed table is consulted.
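The fallback order described here follows a common first-hit cascade pattern, sketched below with toy backends. Everything in the block is hypothetical (`cascade_lemmatize`, the `Backend` alias, the two dictionaries); it assumes each backend returns None on a miss and that the input is already normalized.

```python
from typing import Callable, Optional, Sequence, Tuple

# A backend maps a form to a lemma, or returns None when it has no answer.
Backend = Callable[[str], Optional[str]]


def cascade_lemmatize(form: str, backends: Sequence[Backend]) -> Tuple[str, bool]:
    """Hypothetical sketch: try each backend in order; the first hit wins."""
    for backend in backends:
        lemma = backend(form)
        if lemma is not None:
            return lemma, True
    # Nothing knew the form: hand the input back and flag it as unknown.
    return form, False


# Toy stand-ins for two backends; the entries are illustrative only.
toy_treebank = {"λόγου": "λόγος"}.get
toy_seed = {"ἦν": "εἰμί"}.get

print(cascade_lemmatize("λόγου", [toy_treebank, toy_seed]))
print(cascade_lemmatize("ξξξ", [toy_treebank, toy_seed]))
```

Keeping each backend behind the same callable shape is what lets backends be switched on and off without touching the caller.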
analyze ¶
analyze(word: str) -> tuple[Analysis, ...]
All candidate morphological analyses of word (possibly several, given
Greek's ambiguity; empty only for unanalysable tokens).
Closed-class words (article, prepositions, conjunctions, particles, pronouns,
the copula) resolve to a single high-confidence analysis; open-class words
yield the readings their ending permits. When the AGDT treebank backend is
active (see aegean.greek.use_treebank), an attested form's analyses —
correctly accented and covering irregular forms the rule engine can't — are
returned instead, with the rule engine as the fallback for unattested forms.
best_pos ¶
A single best part-of-speech guess from morphology, or None when the
form yields no analysis. Returns the most likely reading's tag (verbal and
closed-class readings, which are listed first, take precedence over the
nominal default), or ADJ when a degree is marked.
lemmas ¶
The distinct lemma candidates for a form (closed-class or rule-derived).
disable_treebank ¶
Deactivate the treebank lexicon; restore the default rule/seed behaviour.
use_treebank ¶
use_treebank(*, build: bool = True, force: bool = False) -> TreebankLexicon
Activate the AGDT lexicon for this session.
Downloads + builds it on first use (build=True); pass force=True to
rebuild. Once active, aegean.greek.lemmatize / analyze prefer
its attested analyses and fall back to the rule/seed engines on a miss.
gloss ¶
Concise LSJ gloss for a word; requires use_lsj. None if unknown.
lookup ¶
lookup(word: str) -> LSJEntry | None
Full LSJ entry for a word; requires use_lsj. None if unknown.
use_lsj ¶
use_lsj(*, build: bool = True, force: bool = False) -> LSJLexicon
Activate the LSJ lexicon for this session.
Downloads (~270 MB) + builds the index on first use (build=True); pass
force=True to rebuild. Then gloss / lookup resolve words
against it.
evaluate_parser ¶
evaluate_parser(*, source_dir: Path | str | None = None, holdout: float = 0.1, epochs: int = 5) -> dict[str, Any]
Train on a split, score the held-out trees, and return {"uas", "las", "tokens", "sentences"}.
Gold POS and lemmas are supplied, so the scores measure parsing in isolation. Exposed as greek.evaluate_parser.
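For readers unfamiliar with the two metrics, the sketch below shows how UAS and LAS are conventionally computed from per-token (head, relation) pairs. `attachment_scores` is a hypothetical helper, not the function used by evaluate_parser, and the sample trees are made up.

```python
def attachment_scores(gold, predicted):
    """Hypothetical helper: UAS/LAS over parallel lists of (head, relation)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    total = len(gold)
    if total == 0:
        return {"uas": 0.0, "las": 0.0, "tokens": 0}
    # UAS: the head alone is right. LAS: head and relation label are both right.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    labelled_hits = sum(g == p for g, p in zip(gold, predicted))
    return {"uas": head_hits / total, "las": labelled_hits / total, "tokens": total}


# Made-up four-token sentence: one wrong label, one wrong head.
gold = [(2, "det"), (0, "root"), (2, "obj"), (2, "punct")]
predicted = [(2, "det"), (0, "root"), (2, "nsubj"), (3, "punct")]
print(attachment_scores(gold, predicted))
```

LAS can never exceed UAS, since every labelled hit is also a head hit.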
parse ¶
parse(sentence: str | list[str]) -> DepTree
Parse a Greek sentence (a string or a list of tokens) into a DepTree.
Uses the neural pipeline when it is active (aegean.greek.use_neural_pipeline) —
relations are then UD (nsubj, obj, advcl, …) and postag carries the predicted
9-char tag. Otherwise requires use_parser (the arc-eager baseline, AGDT/Prague
relations), with POS/lemma from the (treebank-aware) pipeline.
use_parser ¶
Activate the dependency parser for this session — training the model on first
use (train=True; from the cached AGDT, a few minutes) or loading the cache.
disable_tagger ¶
Deactivate the POS tagger; restore the lookup/rule behaviour.
evaluate_tagger ¶
evaluate_tagger(*, source_dir: str | None = None, holdout: float = 0.1, epochs: int = 8) -> dict[str, float]
Train on the train split and score POS on the held-out split (overall + unseen),
via aegean.greek.heldout — the honest generalization number. Returns
pos_all/pos_unseen plus the token counts (this tagger predicts POS only, so the
lemma metrics are omitted).
use_tagger ¶
Activate the generalizing POS tagger. With train=True (default) it trains on
first use — from the cached AGDT, a few minutes — then caches the model; later calls
load the cache. train=False loads an existing cached model without training (raises
TaggerNotLoadedError if none exists). force=True retrains even if cached.
disable_lemmatizer ¶
Deactivate the lemmatizer; restore the lookup/seed/identity behaviour.
evaluate_lemmatizer ¶
evaluate_lemmatizer(*, source_dir: str | None = None, holdout: float = 0.1, epochs: int = 8) -> dict[str, float]
Train on the train split and score lemma accuracy on the held-out split (overall +
unseen), via aegean.greek.heldout — the honest generalization number. A POS
tagger is trained on the same split so the dev set is scored with predicted POS (the
realistic pipeline), not gold. Returns lemma_all/lemma_unseen plus token counts
(POS metrics are omitted).
use_lemmatizer ¶
Activate the generalizing lemmatizer. With train=True (default) it trains on
first use — from the cached AGDT, a few minutes — then caches the model; later calls
load the cache. train=False loads an existing cached model (raises
LemmatizerNotLoadedError if none exists). force=True retrains even if cached.
disable_neural_lemmatizer ¶
Deactivate the neural lemmatizer; the cascade falls back to the edit-tree/seed/identity.
use_neural_lemmatizer ¶
Activate the neural (GreTa seq2seq) lemmatizer.
Fetches the model bundle (ONNX encoder/decoder + tokenizer + gold lookup) to the cache on
first use — never bundled in the wheel — then loads it via onnxruntime. Requires the
[neural] extra (pip install 'pyaegean[neural]'). Best paired with
aegean.greek.use_treebank, whose attested lemmas take precedence for seen forms.
Raises aegean.data.DataNotAvailableError if the model URL is not yet pinned (set
PYAEGEAN_GRC_LEMMA_NEURAL_URL to fetch from your own mirror) or the download fails, and
NeuralLemmatizerNotLoadedError if the optional dependencies are missing.
analyze_sentence ¶
analyze_sentence(words: list[str]) -> SentenceAnalysis
The full joint analysis of one pre-tokenized sentence. Raises NeuralPipelineNotLoadedError if the neural pipeline is not active.
disable_neural_pipeline ¶
Deactivate the neural pipeline; every function falls back to its prior cascade.
use_neural_pipeline ¶
Activate the neural pipeline (tags + morphology + trees + lemmas, one model).
Fetches the model bundle to the cache on first use — never bundled in the wheel —
then loads it via onnxruntime. Requires the [neural] extra
(pip install 'pyaegean[neural]'). Once active, aegean.greek.pos_tags /
pos_tag, aegean.greek.parse (UD relations), and aegean.greek.lemmatize
all use it; analyze_sentence returns the full joint analysis in one call.
Raises aegean.data.DataNotAvailableError if the model URL is not yet pinned (set
PYAEGEAN_GRC_JOINT_URL to fetch from your own mirror) or the download fails, and
NeuralPipelineNotLoadedError if the optional dependencies are missing.
evaluate_on_proiel ¶
evaluate_on_proiel(tag_sentence: TagSentence | None = None, *, source_dir: Path | str | None = None, files: tuple[str, ...] = _GREEK_FILES) -> dict[str, float]
Score a tagger on PROIEL gold — the neutral, out-of-AGDT generalization number.
tag_sentence maps a sentence's forms to (lemma, pos) per token; it defaults to
pyaegean's current pipeline (lemmatize + pos_tag, honouring whichever backends
are active — enable use_treebank/use_neural_lemmatizer first to measure them).
Returns {"lemma", "pos", "n"}: lemma and POS accuracy over the scored tokens. Lemma
is the clean metric; POS is compared under a reconciled tagset (PROPN→NOUN, SCONJ→CCONJ).
The PROIEL files are fetched on first use unless source_dir points at local XML.
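The reconciled comparison can be pictured as mapping both sides through the same small table before counting matches. The sketch below assumes exactly the two mappings named above and nothing else; `reconciled_pos_accuracy` is a hypothetical helper and the tag lists are invented.

```python
# The two mappings named in the description; anything else is compared as-is.
RECONCILE = {"PROPN": "NOUN", "SCONJ": "CCONJ"}


def reconciled_pos_accuracy(gold_tags, predicted_tags):
    """Hypothetical helper: POS accuracy after collapsing the reconciled tags."""
    pairs = list(zip(gold_tags, predicted_tags))
    if not pairs:
        return 0.0
    hits = sum(RECONCILE.get(g, g) == RECONCILE.get(p, p) for g, p in pairs)
    return hits / len(pairs)


# Invented tags: PROPN/NOUN and SCONJ/CCONJ count as matches, ADJ/NOUN does not.
gold = ["PROPN", "VERB", "SCONJ", "ADJ"]
predicted = ["NOUN", "VERB", "CCONJ", "NOUN"]
print(reconciled_pos_accuracy(gold, predicted))
```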
load_proiel_gold ¶
load_proiel_gold(*, source_dir: Path | str | None = None, files: tuple[str, ...] = _GREEK_FILES) -> tuple[tuple[HeldoutToken, ...], ...]
Parse the PROIEL Greek treebank into gold sentences of (form, lemma, POS) tokens.
Fetches the pinned PROIEL files into the cache unless source_dir is given (tests
pass a local fixture for an offline run). Empty tokens are dropped, lemmas cleaned
(#N homograph suffix removed), and POS mapped to pyaegean's tagset convention.
Every token is flagged seen=False — PROIEL is wholly outside pyaegean's training.
proiel_dir ¶
The cache directory of PROIEL Greek XML files, fetching any missing on first use. The data is CC BY-NC-SA 3.0 — kept in the cache for evaluation only, never bundled.
agdt_ud_overlap ¶
agdt_ud_overlap(*, splits: tuple[str, ...] = ('dev', 'test'), source: Path | str | None = None, agdt_source: Path | str | None = None, verify: bool = True, write: bool = True) -> dict[str, Any]
Build the AGDT ↔ UD-Perseus leakage-exclusion manifest.
UD Perseus sentence ids are <agdt-file>@<sentence-id> — direct references into the
AGDT source pyaegean trains on. This collects every AGDT sentence appearing in the
given UD splits (default: dev + test, the folds that must stay unseen), verifies
the reference by comparing NFC form sequences against the actual AGDT files, caches
the manifest as JSON, and returns it. Every Stage A+ training split must exclude
these sentences — see docs/benchmarks.md.
source overrides the UD fold path(s) and agdt_source the AGDT directory (used
by offline tests); with defaults, both fetch to the cache on first use.
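The two steps described above, resolving an id into its AGDT file and sentence and then checking the forms under NFC, can be sketched as follows. Both helpers are hypothetical, the file names in the example are invented, and the real manifest layout is not shown here.

```python
import unicodedata
from collections import defaultdict


def group_sentence_ids(ud_sentence_ids):
    """Hypothetical helper: '<agdt-file>@<sentence-id>' ids grouped by file."""
    by_file = defaultdict(set)
    for sent_id in ud_sentence_ids:
        # Split on the last '@' so a file name containing '@' still resolves.
        agdt_file, _, sentence_id = sent_id.rpartition("@")
        if not agdt_file or not sentence_id:
            raise ValueError(f"not an <agdt-file>@<sentence-id> id: {sent_id!r}")
        by_file[agdt_file].add(sentence_id)
    return {name: sorted(ids) for name, ids in by_file.items()}


def same_forms(ud_forms, agdt_forms):
    """True when the two token sequences are identical after NFC normalization."""
    def nfc(forms):
        return [unicodedata.normalize("NFC", form) for form in forms]
    return nfc(ud_forms) == nfc(agdt_forms)


# Invented ids for illustration; sentence ids are kept as strings.
print(group_sentence_ids(["a.xml@7", "a.xml@12", "b.xml@3"]))
```

Comparing under NFC matters because the same polytonic word can be stored precomposed in one source and decomposed in the other.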
evaluate_on_ud ¶
evaluate_on_ud(treebank: str = 'perseus', split: str = 'test', *, source: Path | str | None = None, parse: bool | None = None) -> dict[str, Any]
Score the active pipeline on a UD Ancient Greek fold with the official evaluator.
Runs over the fold's gold tokens (gold-tokenization protocol), emits CoNLL-U, and
scores it against the gold file with conll18_ud_eval. Activate the backends you
want measured first (use_treebank, use_tagger, use_lemmatizer,
use_neural_lemmatizer, use_parser). parse defaults to whether the parser
is active; with parse=False UAS/LAS are returned as None.
Returns {"upos", "lemma", "uas", "las", "n_words", "n_sentences", "treebank",
"split", "parsed"} — accuracies in [0, 1]. Read the module docstring's leakage
caveat before quoting the Perseus fold for an AGDT-trained model.
betacode_to_unicode ¶
Convert a Beta Code string to precomposed (NFC) polytonic Greek.
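As an illustration of the idea rather than of this function's internals, the sketch below converts a small subset of Beta Code: lowercase letters and the common diacritic signs, composed with NFC. `betacode_sketch` is hypothetical; it ignores capitals (`*`) and assumes every sign character in the input really is a diacritic.

```python
import re
import unicodedata

# Beta Code letters and the Greek letters they stand for (lowercase only).
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritic signs -> Unicode combining marks.
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def betacode_sketch(text):
    """Hypothetical sketch: subset Beta Code -> precomposed (NFC) Greek."""
    decomposed = "".join(LETTERS.get(ch, MARKS.get(ch, ch)) for ch in text.lower())
    composed = unicodedata.normalize("NFC", decomposed)
    # A sigma not followed by another letter is word-final.
    return re.sub(r"σ(?!\w)", "ς", composed)


print(betacode_sketch("mh=nin"))
```

Emitting base letter plus combining marks and letting NFC compose them avoids a hand-written table of every precomposed polytonic character.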
strip_diacritics ¶
Remove all combining diacritics (accents, breathings, subscripts), keeping the base letters. Returns NFC.
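The usual way to do this with the standard library is decompose, filter, recompose. The sketch below shows that technique; `strip_diacritics_sketch` is a hypothetical name and the library's own implementation may differ in detail.

```python
import unicodedata


def strip_diacritics_sketch(text):
    """Hypothetical sketch: drop every combining mark, return NFC."""
    # NFD splits each precomposed letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for accents, breathings and the iota subscript.
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)


print(strip_diacritics_sketch("ἐν ἀρχῇ ἦν ὁ λόγος."))
```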
unicode_to_betacode ¶
Convert polytonic Greek to Beta Code (capitals as *; final sigma as
s). Round-trips with betacode_to_unicode for supported text.
load_work ¶
load_work(work: str, *, ref: str | None = None, source: str = 'auto', edition: str | None = None, force: bool = False) -> 'Corpus'
Load one Greek work from Perseus canonical-greekLit / First1KGreek.
work is the CTS-style id ("tlg0012.tlg001" = the Iliad). source
is "perseus", "first1k", or "auto" (try both, in that order);
edition picks a specific edition file when a work has several. The TEI
file is fetched once into the cache (network on first use only).
ref selects a sub-section instead of the whole work — a citation address
matching the work's structure: a textpart number ("1" = Iliad book 1),
a nested div path ("1.2" = book 1, chapter 2 of a prose work), or a verse
line-range ("1.1-1.50" = book 1, lines 1–50). Without it, the corpus is
one Document per top-level textpart. <note>/<bibl> ride along in
Document.meta.notes. Raises aegean.data.DataNotAvailableError when the
work can't be found/fetched, or ValueError when ref matches nothing.
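The three ref shapes listed above share one grammar: a dotted path, optionally followed by a hyphen and a second dotted path. A minimal sketch of parsing that grammar is given below. `parse_ref` is hypothetical and assumes purely numeric components; it says nothing about how load_work resolves the address against the TEI.

```python
def parse_ref(ref):
    """Hypothetical parser: (start, end) as int tuples; end is None if no range."""
    def path(part):
        pieces = part.split(".")
        if not all(piece.isdigit() for piece in pieces):
            raise ValueError(f"not a numeric citation path: {part!r}")
        return tuple(int(piece) for piece in pieces)

    # A hyphen separates the two ends of a line range.
    start, hyphen, end = ref.partition("-")
    return path(start), (path(end) if hyphen else None)


print(parse_ref("1"))         # a single textpart
print(parse_ref("1.2"))       # a nested div path
print(parse_ref("1.1-1.50"))  # a verse line-range
```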
syllable_quantities ¶
The metrical quantity of each syllable: "heavy" / "light" /
"common" (in syllable order).
scan_hexameter ¶
scan_hexameter(line: str) -> LineScansion
Scan a line of dactylic hexameter (six feet; feet 1–5 dactyl or
spondee, foot 6 — ×), resolving quantities and the main caesura.
Raises ScansionError if the line does not fit (e.g. it needs
synizesis, which is not inferred).
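The foot constraint stated above can be sketched as a small backtracking search over syllable quantities. This is an illustration only: `fit_hexameter` is a hypothetical helper, it takes quantities in the "heavy" / "light" / "common" vocabulary of syllable_quantities as given, and it does not locate the caesura or report ambiguity.

```python
def fit_hexameter(quantities):
    """Hypothetical helper: six foot names, or None when nothing fits."""
    def can_be(index, wanted):
        # A "common" syllable may count as either heavy or light.
        return quantities[index] in (wanted, "common")

    def fit(start, foot):
        remaining = len(quantities) - start
        if foot == 6:
            # Final foot: a heavy syllable plus an anceps of any quantity.
            return ["final"] if remaining == 2 and can_be(start, "heavy") else None
        if (remaining >= 3 and can_be(start, "heavy")
                and can_be(start + 1, "light") and can_be(start + 2, "light")):
            rest = fit(start + 3, foot + 1)
            if rest is not None:
                return ["dactyl"] + rest
        if remaining >= 2 and can_be(start, "heavy") and can_be(start + 1, "heavy"):
            rest = fit(start + 2, foot + 1)
            if rest is not None:
                return ["spondee"] + rest
        return None

    return fit(0, 1)


H, L = "heavy", "light"
# A made-up 16-syllable quantity pattern with a spondee in the third foot.
print(fit_hexameter([H, L, L, H, L, L, H, H, H, L, L, H, L, L, H, H]))
```

A line of fewer than 12 or more than 17 syllables can never fit, which is one way a scansion failure arises.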
scan_line ¶
scan_line(line: str, meter: str = 'hexameter') -> LineScansion
Scan line against meter ("hexameter" or "pentameter").
scan_pentameter ¶
scan_pentameter(line: str) -> LineScansion
Scan a line of elegiac pentameter: two dactyls-or-spondees, a longum,
the central diaeresis, then two obligatory dactyls and a final longum
(— ⏑⏑ — ⏑⏑ — ‖ — ⏑⏑ — ⏑⏑ —).
Raises ScansionError if the line does not fit.
scan_trimeter ¶
scan_trimeter(line: str) -> LineScansion
Scan a line of iambic trimeter — three metra of x – ⏑ – (the final
element anceps), with resolution of long elements into two shorts.
Raises ScansionError if the line does not fit (e.g. it needs synizesis on
a word not in the lexicon).
syllable_options ¶
Return (syllable, [possible quantities]) pairs across the whole line: the raw,
pre-metrical analysis, with cross-word position and correptio applied.
to_ipa ¶
Transcribe Greek text to reconstructed IPA. Whitespace-separated
words are transcribed independently and rejoined with spaces.
pos_tag ¶
Tag a single token. Closed classes come from the lexicon; when the treebank
backend is active (see aegean.greek.use_treebank), an attested form's
gold tag is used next; otherwise open-class words get a suffix heuristic (a few
verb endings, else NOUN). Non-letter tokens are NUM (numeric) or PUNCT.
pos_tags ¶
(token, tag) pairs for a text, in order (punctuation tagged PUNCT). When the
trained tagger is active it tags the whole sentence in context, with the
closed-class lexicon and the treebank lookup still taking precedence per token.
sentences ¶
Split into trimmed sentences on Greek sentence-final punctuation.
tokenize_words ¶
Just the word strings, in order (punctuation dropped).