aegean.greek

greek

Greek NLP pipeline — composable, individually-callable stages.

The dependency-free core covers normalize (NFC/NFD + Beta Code ↔ Unicode, with a lenient OCR-repair mode), tokenize (word/sentence), syllabify, accent analysis (accentuation), prosody/meter scansion, phonology (IPA), a seed lemmatize, baseline pos, and a rule-based morphology analyzer (analyze). pipeline runs the whole stack over a text in one call.

Opt-in backends layer on richer data and models:

  • use_neural_pipeline (the [neural] extra) loads the joint neural model — one pass serving UPOS, full morphology (UD FEATS), UD dependency trees, and lemmas, state of the art on the UD Ancient Greek benchmarks (measured numbers in docs/benchmarks.md). Once active, pos_tag/pos_tags, lemmatize, parse, and pipeline all use it.
  • use_treebank (Perseus AGDT) supplies attested, correctly-accented lemmas and full features for known forms.
  • use_lsj (Perseus Liddell-Scott-Jones) provides glossing (gloss/lookup).
  • use_parser activates a projective dependency parser behind parse (arc-eager transitions with an averaged perceptron, trained on the AGDT); it scores about 0.67 UAS / 0.57 LAS.
  • use_tagger activates an averaged-perceptron POS tagger (about 84% accuracy on unseen forms).
  • use_lemmatizer activates an edit-tree lemmatizer (about 40% accuracy on unseen forms).
  • use_neural_lemmatizer (the [neural] extra) is a GreTa T5 seq2seq model served as int8 ONNX without torch; it pairs a gold lookup with seq2seq decoding and reaches 76.3% on unseen forms. lemmatize cascades neural pipeline -> treebank -> neural -> edit-tree -> seed.

Every stage is a plain function, so it can be used standalone:

from aegean import greek
greek.betacode_to_unicode("mh=nin")      # 'μῆνιν'
greek.syllabify("ἄνθρωπος")              # ['ἄν', 'θρω', 'πος']
greek.accentuation("λόγος").classification  # 'paroxytone'
greek.pipeline("ἐν ἀρχῇ ἦν ὁ λόγος.")    # per-token records, one call

AccentInfo dataclass

AccentInfo(syllables: tuple[str, ...], accent_type: str | None, position_from_end: int | None, classification: str | None)

The accent analysis of one word.

Analysis dataclass

Analysis(lemma: str, pos: str, case: str | None = None, number: str | None = None, gender: str | None = None, tense: str | None = None, voice: str | None = None, mood: str | None = None, person: str | None = None, degree: str | None = None, lemma_certain: bool = True)

One candidate morphological reading of a form.

features

features() -> dict[str, str]

The non-empty morphological features, in a stable order.
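
As a rough illustration of what "non-empty, in a stable order" could mean, the sketch below rebuilds a stand-in dataclass with the fields from the Analysis signature above and filters them in a fixed order. AnalysisSketch, the choice of which fields count as features, and the example values ("nom", "sg", "masc") are assumptions for illustration, not the library's implementation or tagset.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: these eight fields are the "morphological features";
# lemma, pos and lemma_certain are left out.
FEATURE_ORDER = ("case", "number", "gender", "tense", "voice", "mood", "person", "degree")


@dataclass(frozen=True)
class AnalysisSketch:
    """Hypothetical stand-in that mirrors the Analysis signature."""

    lemma: str
    pos: str
    case: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None
    tense: Optional[str] = None
    voice: Optional[str] = None
    mood: Optional[str] = None
    person: Optional[str] = None
    degree: Optional[str] = None
    lemma_certain: bool = True

    def features(self) -> dict:
        # dicts preserve insertion order, so walking FEATURE_ORDER
        # yields the same key order on every call.
        out = {}
        for name in FEATURE_ORDER:
            value = getattr(self, name)
            if value:
                out[name] = value
        return out


reading = AnalysisSketch("λόγος", "NOUN", case="nom", number="sg", gender="masc")
print(reading.features())
```

A fixed key order makes two readings easy to compare or serialize without sorting first.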

TreebankLexicon

TreebankLexicon(data: dict[str, list[dict[str, str]]])

An attested form→analyses lexicon built from the AGDT treebank.

load classmethod

load(path: Path | str | None = None) -> 'TreebankLexicon'

Load a built lexicon JSON (defaults to the cached one).

analyze

analyze(form: str) -> tuple[Analysis, ...]

Attested analyses for a form (frequency-ordered), or () if unknown.

lemmatize

lemmatize(form: str) -> str | None

The most-attested lemma for a form, or None if unknown.

pos

pos(form: str) -> str | None

The most-attested part-of-speech tag for a form, or None if unknown.

LSJEntry dataclass

LSJEntry(headword: str, raw_key: str, lead: str, senses: tuple[Sense, ...], short: str)

A Liddell-Scott-Jones entry.

LSJLexicon

LSJLexicon(data: dict[str, dict[str, Any]])

A lemma→entry view of the Perseus LSJ, with lemmatize-on-miss lookup.

lookup

lookup(word: str) -> LSJEntry | None

The full LSJ entry for a word (form or lemma), or None if unknown.

gloss

gloss(word: str) -> str | None

A concise gloss — headword: <first sense> — or None if unknown.

LexiconNotLoadedError

Bases: RuntimeError

Raised when gloss/lookup is called before use_lsj.

DepToken dataclass

DepToken(id: int, form: str, lemma: str, upos: str, head: int, relation: str, postag: str = '')

One token in a dependency tree (1-based id; head=0 is the root).

DepTree dataclass

DepTree(tokens: tuple[DepToken, ...])

A dependency tree over a sentence's tokens (AGDT/Prague relation labels).

root

root() -> DepToken | None

The token whose head is the artificial root (0).

is_projective

is_projective() -> bool

Whether the tree has no crossing arcs (arc-eager can only build these).
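
The crossing-arcs test can be sketched independently of the library. The function below takes a plain list of heads (a hypothetical input shape, not DepTree itself) and treats the arc from the artificial root as an ordinary arc, which is one common convention.

```python
from itertools import combinations


def is_projective(heads):
    """heads[i] is the head of token i + 1 (1-based ids); 0 marks the root."""
    # Each arc becomes a (left, right) span over token positions.
    spans = [(min(head, dep), max(head, dep)) for dep, head in enumerate(heads, start=1)]
    for (l1, r1), (l2, r2) in combinations(spans, 2):
        # Two arcs cross when one starts strictly inside the other
        # and ends strictly outside it.
        if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
            return False
    return True


print(is_projective([2, 0, 2]))     # arcs nest or share endpoints
print(is_projective([3, 4, 0, 3]))  # arcs 1-3 and 2-4 cross
```

The pairwise check is quadratic in sentence length, which is usually acceptable for single sentences.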

ParserNotLoadedError

Bases: RuntimeError

Raised when parse is called before use_parser.

TaggerNotLoadedError

Bases: RuntimeError

Raised when tag_pos is used before use_tagger.

LemmatizerNotLoadedError

Bases: RuntimeError

Raised when the trained lemmatizer is used before use_lemmatizer.

NeuralLemmatizerNotLoadedError

Bases: RuntimeError

Raised when the neural lemmatizer is used before use_neural_lemmatizer, or when the [neural] extra (onnxruntime/tokenizers) is not installed.

NeuralPipelineNotLoadedError

Bases: RuntimeError

Raised when the neural pipeline is used before use_neural_pipeline, or when the [neural] extra (onnxruntime/tokenizers/numpy) is not installed.

SentenceAnalysis dataclass

SentenceAnalysis(tokens: tuple[str, ...], upos: tuple[str, ...], xpos: tuple[str, ...], feats: tuple[str, ...], head: tuple[int, ...], deprel: tuple[str, ...], lemma: tuple[str, ...])

The joint model's full analysis of one sentence (parallel, per-token lists).

NormalizationWarning

Bases: UserWarning

Emitted by normalize(..., lenient=True) for each class of repair.

TokenRecord dataclass

TokenRecord(sentence: int, index: int, text: str, upos: str, lemma: str, lemma_known: bool, head: int | None = None, relation: str | None = None, xpos: str | None = None, feats: str | None = None)

One token's full analysis from pipeline.

head refers to the index of another record in the same sentence (0 = sentence root, None = no parse). xpos/feats are filled only by the neural pipeline; lemma_known is False when the lemma is a fallback (the normalized form itself, from an unknown word).

Foot dataclass

Foot(name: str, syllables: tuple[str, ...], quantities: tuple[str, ...])

One metrical foot: its name and the syllables/quantities it spans.

LineScansion dataclass

LineScansion(line: str, meter: str, feet: tuple[Foot, ...], syllables: tuple[str, ...], quantities: tuple[str, ...], caesura: str | None, caesura_index: int | None, ambiguous: bool)

The scansion of one verse line.

pattern property

pattern: str

The classic glyph pattern, feet separated by |.
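
A minimal sketch of how such a pattern string could be assembled, assuming each foot carries quantities named "heavy", "light" or "common" (the names used by syllable_quantities) and that no spaces surround the | separator. Both points are assumptions; the glyphs are written as escapes for the longum, breve and anceps signs used on this page.

```python
# Longum, breve, anceps.
GLYPHS = {"heavy": "\u2014", "light": "\u23d1", "common": "\u00d7"}


def glyph_pattern(feet):
    """feet: a sequence of per-foot quantity tuples."""
    return "|".join("".join(GLYPHS[q] for q in foot) for foot in feet)


feet = [("heavy", "light", "light"), ("heavy", "heavy"), ("heavy", "common")]
print(glyph_pattern(feet))
```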

ScansionError

Bases: ValueError

Raised when a line cannot be fit to the requested meter.

accentuation

accentuation(word: str) -> AccentInfo

Analyse the accent of a single Greek word.
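
The relation between accent_type, position_from_end and classification follows the traditional naming of Greek accent classes. The lookup below is a sketch of that naming only: the string values "acute" and "circumflex" and the convention that position 1 is the final syllable are assumptions, and the grave is deliberately left unclassified.

```python
from typing import Optional

# (accent type, syllable position counted from the end) -> traditional name.
# Assumption: 1 = ultima, 2 = penult, 3 = antepenult.
ACCENT_CLASSES = {
    ("acute", 1): "oxytone",
    ("acute", 2): "paroxytone",
    ("acute", 3): "proparoxytone",
    ("circumflex", 1): "perispomenon",
    ("circumflex", 2): "properispomenon",
}


def classify(accent_type: Optional[str], position_from_end: Optional[int]) -> Optional[str]:
    """Return the traditional class name, or None when the pair is not in the table."""
    return ACCENT_CLASSES.get((accent_type, position_from_end))


print(classify("acute", 2))
```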

lemmatize_verbose

lemmatize_verbose(word: str) -> tuple[str, bool]

Return (lemma, known). known is False when the form wasn't found and the (normalized) input is returned unchanged.

Backends are consulted in this order:

  • the AGDT treebank backend, when active (see aegean.greek.use_treebank): its attested, correctly-accented lemma is preferred;
  • the neural backend, when active (see aegean.greek.use_neural_lemmatizer): its GreTa seq2seq prediction generalizes well to unseen forms (76.3%);
  • the trained edit-tree lemmatizer, when active (see aegean.greek.use_lemmatizer);
  • otherwise, the bundled seed table.
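
The fallback order described here is an instance of a simple cascade: ask each backend in turn and return the first answer, otherwise hand back the input with known=False. The sketch below shows that pattern with toy dictionaries standing in for the backends; cascade_lemmatize, TREEBANK and SEED are hypothetical names and data, not the library's internals.

```python
def cascade_lemmatize(word, backends):
    """Try each backend in order; fall back to the input itself with known=False."""
    for backend in backends:
        lemma = backend(word)
        if lemma is not None:
            return lemma, True
    return word, False


# Toy stand-ins: dict.get returns None on a miss, which is the signal to continue.
TREEBANK = {"λόγον": "λόγος"}
SEED = {"ἦν": "εἰμί"}
backends = [TREEBANK.get, SEED.get]

for form in ("λόγον", "ἦν", "ξένον"):
    print(form, cascade_lemmatize(form, backends))
```

Keeping every backend behind the same "form in, lemma or None out" shape is what lets the order be changed or a backend switched off without touching the caller.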

analyze

analyze(word: str) -> tuple[Analysis, ...]

All candidate morphological analyses of word (possibly several, given Greek's ambiguity; empty only for unanalysable tokens).

Closed-class words (article, prepositions, conjunctions, particles, pronouns, the copula) resolve to a single high-confidence analysis; open-class words yield the readings their ending permits. When the AGDT treebank backend is active (see aegean.greek.use_treebank), an attested form's analyses — correctly accented and covering irregular forms the rule engine can't — are returned instead, with the rule engine as the fallback for unattested forms.

best_pos

best_pos(word: str) -> str | None

A single best part-of-speech guess from morphology, or None when the form yields no analysis. Returns the most likely reading's tag (verbal and closed-class readings, which are listed first, take precedence over the nominal default), or ADJ when a degree is marked.

lemmas

lemmas(word: str) -> list[str]

The distinct lemma candidates for a form (closed-class or rule-derived).

disable_treebank

disable_treebank() -> None

Deactivate the treebank lexicon; restore the default rule/seed behaviour.

use_treebank

use_treebank(*, build: bool = True, force: bool = False) -> TreebankLexicon

Activate the AGDT lexicon for this session.

Downloads + builds it on first use (build=True); pass force=True to rebuild. Once active, aegean.greek.lemmatize / analyze prefer its attested analyses and fall back to the rule/seed engines on a miss.

disable_lsj

disable_lsj() -> None

Deactivate the LSJ lexicon.

gloss

gloss(word: str) -> str | None

Concise LSJ gloss for a word; requires use_lsj. None if unknown.

lookup

lookup(word: str) -> LSJEntry | None

Full LSJ entry for a word; requires use_lsj. None if unknown.

use_lsj

use_lsj(*, build: bool = True, force: bool = False) -> LSJLexicon

Activate the LSJ lexicon for this session.

Downloads (~270 MB) + builds the index on first use (build=True); pass force=True to rebuild. Then gloss / lookup resolve words against it.

disable_parser

disable_parser() -> None

Deactivate the dependency parser.

evaluate_parser

evaluate_parser(*, source_dir: Path | str | None = None, holdout: float = 0.1, epochs: int = 5) -> dict[str, Any]

Train on a split and score the held-out trees. Returns {"uas", "las", "tokens", "sentences"}. Gold POS tags and lemmas are supplied, so the score measures parsing in isolation. Exposed as greek.evaluate_parser.
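
For readers unfamiliar with the two metrics: UAS counts tokens whose predicted head is correct, and LAS additionally requires the relation label to match. A minimal sketch over parallel lists of (head, relation) pairs follows; the function name and input shape are assumptions for illustration, and the relation labels in the example are placeholders.

```python
def attachment_scores(gold, predicted):
    """gold / predicted: equal-length lists of (head, relation), one pair per token."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    total = len(gold)
    if total == 0:
        return {"uas": 0.0, "las": 0.0, "tokens": 0}
    # UAS: head matches. LAS: head and relation both match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return {"uas": uas_hits / total, "las": las_hits / total, "tokens": total}


gold = [(2, "SBJ"), (0, "PRED"), (2, "OBJ"), (2, "AuxK")]
pred = [(2, "SBJ"), (0, "PRED"), (2, "ADV"), (3, "AuxK")]
print(attachment_scores(gold, pred))
```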

parse

parse(sentence: str | list[str]) -> DepTree

Parse a Greek sentence (a string or a list of tokens) into a DepTree.

Uses the neural pipeline when it is active (aegean.greek.use_neural_pipeline) — relations are then UD (nsubj, obj, advcl, …) and postag carries the predicted 9-char tag. Otherwise requires use_parser (the arc-eager baseline, AGDT/Prague relations), with POS/lemma from the (treebank-aware) pipeline.

use_parser

use_parser(*, train: bool = True, force: bool = False) -> None

Activate the dependency parser for this session. On first use the model is trained (train=True; from the cached AGDT, a few minutes); otherwise the cached model is loaded.

disable_tagger

disable_tagger() -> None

Deactivate the POS tagger; restore the lookup/rule behaviour.

evaluate_tagger

evaluate_tagger(*, source_dir: str | None = None, holdout: float = 0.1, epochs: int = 8) -> dict[str, float]

Train on the train split and score POS on the held-out split (overall + unseen), via aegean.greek.heldout — the honest generalization number. Returns pos_all/pos_unseen plus the token counts (this tagger predicts POS only, so the lemma metrics are omitted).
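
The "overall + unseen" split can be sketched as follows: a dev token counts as unseen when its form never occurs in the training split, and accuracy is reported once over all dev tokens and once over the unseen subset. The function name, the input shape and the count keys n / n_unseen are assumptions; only pos_all and pos_unseen come from the description above.

```python
def heldout_accuracy(train_forms, dev):
    """dev: list of (form, gold_tag, predicted_tag) triples."""
    seen = set(train_forms)
    unseen = [row for row in dev if row[0] not in seen]

    def accuracy(rows):
        # Empty subsets score 0.0 rather than dividing by zero.
        return sum(gold == pred for _, gold, pred in rows) / len(rows) if rows else 0.0

    return {
        "pos_all": accuracy(dev),
        "pos_unseen": accuracy(unseen),
        "n": len(dev),
        "n_unseen": len(unseen),
    }


train = ["ὁ", "λόγος"]
dev = [("ὁ", "DET", "DET"), ("λόγος", "NOUN", "NOUN"), ("θεόν", "NOUN", "NOUN"), ("ἦν", "VERB", "NOUN")]
print(heldout_accuracy(train, dev))
```

Reporting the unseen subset separately matters because a lookup-heavy model can score well overall while generalizing poorly.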

use_tagger

use_tagger(*, train: bool = True, force: bool = False) -> None

Activate the generalizing POS tagger. With train=True (default) it trains on first use — from the cached AGDT, a few minutes — then caches the model; later calls load the cache. train=False loads an existing cached model without training (raises TaggerNotLoadedError if none exists). force=True retrains even if cached.

disable_lemmatizer

disable_lemmatizer() -> None

Deactivate the lemmatizer; restore the lookup/seed/identity behaviour.

evaluate_lemmatizer

evaluate_lemmatizer(*, source_dir: str | None = None, holdout: float = 0.1, epochs: int = 8) -> dict[str, float]

Train on the train split and score lemma accuracy on the held-out split (overall + unseen), via aegean.greek.heldout — the honest generalization number. A POS tagger is trained on the same split so the dev set is scored with predicted POS (the realistic pipeline), not gold. Returns lemma_all/lemma_unseen plus token counts (POS metrics are omitted).

use_lemmatizer

use_lemmatizer(*, train: bool = True, force: bool = False) -> None

Activate the generalizing lemmatizer. With train=True (default) it trains on first use — from the cached AGDT, a few minutes — then caches the model; later calls load the cache. train=False loads an existing cached model (raises LemmatizerNotLoadedError if none exists). force=True retrains even if cached.

disable_neural_lemmatizer

disable_neural_lemmatizer() -> None

Deactivate the neural lemmatizer; the cascade falls back to the edit-tree/seed/identity.

use_neural_lemmatizer

use_neural_lemmatizer(*, force: bool = False) -> None

Activate the neural (GreTa seq2seq) lemmatizer.

Fetches the model bundle (ONNX encoder/decoder + tokenizer + gold lookup) to the cache on first use — never bundled in the wheel — then loads it via onnxruntime. Requires the [neural] extra (pip install 'pyaegean[neural]'). Best paired with aegean.greek.use_treebank, whose attested lemmas take precedence for seen forms.

Raises aegean.data.DataNotAvailableError if the model URL is not yet pinned (set PYAEGEAN_GRC_LEMMA_NEURAL_URL to fetch from your own mirror) or the download fails, and NeuralLemmatizerNotLoadedError if the optional dependencies are missing.

analyze_sentence

analyze_sentence(words: list[str]) -> SentenceAnalysis

The full joint analysis of one pre-tokenized sentence (raises if not active).

disable_neural_pipeline

disable_neural_pipeline() -> None

Deactivate the neural pipeline; every function falls back to its prior cascade.

use_neural_pipeline

use_neural_pipeline(*, force: bool = False) -> None

Activate the neural pipeline (tags + morphology + trees + lemmas, one model).

Fetches the model bundle to the cache on first use — never bundled in the wheel — then loads it via onnxruntime. Requires the [neural] extra (pip install 'pyaegean[neural]'). Once active, aegean.greek.pos_tags / pos_tag, aegean.greek.parse (UD relations), and aegean.greek.lemmatize all use it; analyze_sentence returns the full joint analysis in one call.

Raises aegean.data.DataNotAvailableError if the model URL is not yet pinned (set PYAEGEAN_GRC_JOINT_URL to fetch from your own mirror) or the download fails, and NeuralPipelineNotLoadedError if the optional dependencies are missing.

evaluate_on_proiel

evaluate_on_proiel(tag_sentence: TagSentence | None = None, *, source_dir: Path | str | None = None, files: tuple[str, ...] = _GREEK_FILES) -> dict[str, float]

Score a tagger on PROIEL gold — the neutral, out-of-AGDT generalization number.

tag_sentence maps a sentence's forms to (lemma, pos) per token; it defaults to pyaegean's current pipeline (lemmatize + pos_tag, honouring whichever backends are active — enable use_treebank/use_neural_lemmatizer first to measure them). Returns {"lemma", "pos", "n"}: lemma and POS accuracy over the scored tokens. Lemma is the clean metric; POS is compared under a reconciled tagset (PROPN→NOUN, SCONJ→CCONJ). The PROIEL files are fetched on first use unless source_dir points at local XML.
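
A sketch of scoring under the reconciled tagset: both gold and predicted tags are folded through the same map (PROPN→NOUN, SCONJ→CCONJ) before comparison, so a disagreement that exists only because of tagset conventions is not counted as an error. The function names and input shape are assumptions for illustration.

```python
# The two merges named in the description above.
RECONCILE = {"PROPN": "NOUN", "SCONJ": "CCONJ"}


def reconcile(tag):
    return RECONCILE.get(tag, tag)


def score(gold, predicted):
    """gold / predicted: parallel lists of (lemma, pos), one pair per token."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("need two non-empty lists of equal length")
    n = len(gold)
    lemma_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    pos_hits = sum(reconcile(g[1]) == reconcile(p[1]) for g, p in zip(gold, predicted))
    return {"lemma": lemma_hits / n, "pos": pos_hits / n, "n": n}


gold = [("Ἰησοῦς", "PROPN"), ("ὅτι", "SCONJ"), ("λέγω", "VERB")]
pred = [("Ἰησοῦς", "NOUN"), ("ὅτι", "CCONJ"), ("λόγος", "NOUN")]
print(score(gold, pred))
```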

load_proiel_gold

load_proiel_gold(*, source_dir: Path | str | None = None, files: tuple[str, ...] = _GREEK_FILES) -> tuple[tuple[HeldoutToken, ...], ...]

Parse the PROIEL Greek treebank into gold sentences of (form, lemma, POS) tokens.

Fetches the pinned PROIEL files into the cache unless source_dir is given (tests pass a local fixture for an offline run). Empty tokens are dropped, lemmas cleaned (#N homograph suffix removed), and POS mapped to pyaegean's tagset convention. Every token is flagged seen=False — PROIEL is wholly outside pyaegean's training.
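
The lemma clean-up step can be pictured with a small sketch. It assumes the homograph suffix is a "#" followed by digits at the end of the lemma, and that an "empty token" is one whose form is blank; both are assumptions about the data, and the helper names are hypothetical.

```python
import re

# "#" plus digits at the very end of the lemma.
_HOMOGRAPH_SUFFIX = re.compile(r"#\d+$")


def clean_lemma(lemma):
    return _HOMOGRAPH_SUFFIX.sub("", lemma)


def clean_tokens(rows):
    """rows: (form, lemma, pos) triples; blank forms are dropped, lemmas cleaned."""
    return [(form, clean_lemma(lemma), pos) for form, lemma, pos in rows if form.strip()]


rows = [("ἦν", "εἰμί#1", "VERB"), ("", "", ""), ("λόγος", "λόγος", "NOUN")]
print(clean_tokens(rows))
```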

proiel_dir

proiel_dir(*, download: bool = True, files: tuple[str, ...] = _GREEK_FILES) -> Path

The cache directory of PROIEL Greek XML files, fetching any missing on first use. The data is CC BY-NC-SA 3.0 — kept in the cache for evaluation only, never bundled.

agdt_ud_overlap

agdt_ud_overlap(*, splits: tuple[str, ...] = ('dev', 'test'), source: Path | str | None = None, agdt_source: Path | str | None = None, verify: bool = True, write: bool = True) -> dict[str, Any]

Build the AGDT ↔ UD-Perseus leakage-exclusion manifest.

UD Perseus sentence ids are <agdt-file>@<sentence-id> — direct references into the AGDT source pyaegean trains on. This collects every AGDT sentence appearing in the given UD splits (default: dev + test, the folds that must stay unseen), verifies the reference by comparing NFC form sequences against the actual AGDT files, caches the manifest as JSON, and returns it. Every Stage A+ training split must exclude these sentences — see docs/benchmarks.md.

source overrides the UD fold path(s) and agdt_source the AGDT directory (used by offline tests); with defaults, both fetch to the cache on first use.
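
A sketch of the two core steps, assuming only the id format stated above: split each UD sentence id at the "@" into an AGDT file and sentence id, group the ids per file, and confirm a reference by comparing the token forms after NFC normalization. The helper names, the manifest layout and the example ids are hypothetical.

```python
import unicodedata
from collections import defaultdict


def split_sent_id(sent_id):
    """'<agdt-file>@<sentence-id>' -> (agdt_file, sentence_id)."""
    agdt_file, sep, sentence = sent_id.rpartition("@")
    if not sep or not agdt_file or not sentence:
        raise ValueError(f"not an AGDT reference: {sent_id!r}")
    return agdt_file, sentence


def build_manifest(sent_ids):
    """Group referenced sentence ids by AGDT file, sorted for a stable JSON dump."""
    by_file = defaultdict(set)
    for sent_id in sent_ids:
        agdt_file, sentence = split_sent_id(sent_id)
        by_file[agdt_file].add(sentence)
    return {name: sorted(ids) for name, ids in sorted(by_file.items())}


def same_forms(ud_forms, agdt_forms):
    """True when both token sequences are identical after NFC normalization."""
    def nfc(forms):
        return [unicodedata.normalize("NFC", form) for form in forms]
    return nfc(ud_forms) == nfc(agdt_forms)


# Hypothetical ids, for illustration only.
print(build_manifest(["a.xml@3", "b.xml@1", "a.xml@12"]))
```

Comparing NFC forms rather than raw strings avoids false mismatches when one source stores decomposed accents and the other precomposed ones.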

evaluate_on_ud

evaluate_on_ud(treebank: str = 'perseus', split: str = 'test', *, source: Path | str | None = None, parse: bool | None = None) -> dict[str, Any]

Score the active pipeline on a UD Ancient Greek fold with the official evaluator.

Runs over the fold's gold tokens (gold-tokenization protocol), emits CoNLL-U, and scores it against the gold file with conll18_ud_eval. Activate the backends you want measured first (use_treebank, use_tagger, use_lemmatizer, use_neural_lemmatizer, use_parser). parse defaults to whether the parser is active; with parse=False UAS/LAS are returned as None.

Returns {"upos", "lemma", "uas", "las", "n_words", "n_sentences", "treebank", "split", "parsed"} — accuracies in [0, 1]. Read the module docstring's leakage caveat before quoting the Perseus fold for an AGDT-trained model.

betacode_to_unicode

betacode_to_unicode(text: str) -> str

Convert a Beta Code string to precomposed (NFC) polytonic Greek.
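
To make the conversion concrete, here is a deliberately small sketch of the idea: map each Latin letter to its Greek base letter, map each diacritic sign to a combining mark, fix word-final sigma, and let NFC compose the result. It covers lowercase text only (the * capital marker and other Beta Code escapes are not handled) and is not the library's implementation; the function name is hypothetical.

```python
import re
import unicodedata

# Beta Code letter -> Greek base letter.
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritic -> combining mark.
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def betacode_to_unicode_sketch(text):
    out = "".join(LETTERS.get(ch) or MARKS.get(ch) or ch for ch in text.lower())
    out = re.sub(r"σ\b", "ς", out)  # sigma at a word boundary becomes final sigma
    return unicodedata.normalize("NFC", out)


print(betacode_to_unicode_sketch("mh=nin"))
print(betacode_to_unicode_sketch("a)/nqrwpos"))
```

Emitting combining marks and normalizing at the end keeps the table small: no entry is needed for each precomposed letter-plus-accent combination.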

strip_diacritics

strip_diacritics(text: str) -> str

Remove all combining diacritics (accents, breathings, subscripts), keeping the base letters. Returns NFC.
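
The standard-library recipe for this kind of stripping is to decompose, drop the combining marks, and recompose. The sketch below shows that recipe with unicodedata; it illustrates the technique and is not necessarily how the library implements it.

```python
import unicodedata


def strip_diacritics_sketch(text):
    # NFD splits each precomposed letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for accents, breathings and the iota subscript.
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)


print(strip_diacritics_sketch("ἐν ἀρχῇ ἦν ὁ λόγος"))
```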

unicode_to_betacode

unicode_to_betacode(text: str) -> str

Convert polytonic Greek to Beta Code (capitals as *; final sigma as s). Round-trips with betacode_to_unicode for supported text.

load_work

load_work(work: str, *, ref: str | None = None, source: str = 'auto', edition: str | None = None, force: bool = False) -> 'Corpus'

Load one Greek work from Perseus canonical-greekLit / First1KGreek.

work is the CTS-style id ("tlg0012.tlg001" = the Iliad). source is "perseus", "first1k", or "auto" (try both, in that order); edition picks a specific edition file when a work has several. The TEI file is fetched once into the cache (network on first use only).

ref selects a sub-section instead of the whole work — a citation address matching the work's structure: a textpart number ("1" = Iliad book 1), a nested div path ("1.2" = book 1, chapter 2 of a prose work), or a verse line-range ("1.1-1.50" = book 1, lines 1–50). Without it, the corpus is one Document per top-level textpart. <note>/<bibl> ride along in Document.meta.notes. Raises aegean.data.DataNotAvailableError when the work can't be found/fetched, or ValueError when ref matches nothing.
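
The three ref shapes share one grammar: dot-separated textpart numbers, optionally followed by a hyphen and a second address. A hypothetical parser for that grammar is sketched below; parse_ref is not part of the library, and how load_work resolves the parsed address against the TEI is not shown.

```python
def parse_ref(ref):
    """Return (start_path, end_path); a single address has end_path == start_path."""
    start, sep, end = ref.partition("-")
    start_path = tuple(start.split("."))
    end_path = tuple(end.split(".")) if sep else start_path
    # An empty component ("", "1..2", "1-") means the ref is malformed.
    if not all(start_path) or not all(end_path):
        raise ValueError(f"malformed ref: {ref!r}")
    return start_path, end_path


for ref in ("1", "1.2", "1.1-1.50"):
    print(ref, parse_ref(ref))
```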

scan

scan(word: str) -> list[tuple[str, str]]

(syllable, quantity) pairs for a word.

syllable_quantities

syllable_quantities(word: str) -> list[str]

The metrical quantity of each syllable: "heavy" / "light" / "common" (in syllable order).

scan_hexameter

scan_hexameter(line: str) -> LineScansion

Scan a line of dactylic hexameter: six feet, where feet 1–5 are each a dactyl or a spondee and foot 6 is — × (a longum followed by an anceps). Quantities and the main caesura are resolved.

Raises ScansionError if the line does not fit (e.g. it needs synizesis, which is not inferred).
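
Once every syllable has a fixed quantity, fitting the hexameter template reduces to a pattern match. The sketch below writes H for heavy and L for light and splits a quantity string into six feet, raising ValueError on a misfit. It skips the harder part a real scanner faces (choosing among "common" syllables and locating the caesura), and the function name is hypothetical.

```python
import re

# Feet 1-5: dactyl (HLL) or spondee (HH). Foot 6: heavy plus anceps (H or L).
_HEXAMETER = re.compile(r"(HLL|HH)(HLL|HH)(HLL|HH)(HLL|HH)(HLL|HH)(H[HL])")


def hexameter_feet(quantities):
    """Split a string of H/L quantities into six feet, or raise ValueError."""
    match = _HEXAMETER.fullmatch(quantities)
    if match is None:
        raise ValueError(f"does not fit a hexameter: {quantities!r}")
    return list(match.groups())


print(hexameter_feet("HLLHLLHHHLLHLLHH"))
```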

scan_line

scan_line(line: str, meter: str = 'hexameter') -> LineScansion

Scan line against meter ("hexameter" or "pentameter").

scan_pentameter

scan_pentameter(line: str) -> LineScansion

Scan a line of elegiac pentameter: two dactyls-or-spondees, a longum, the central diaeresis, then two obligatory dactyls and a final longum (— ⏑⏑ — ⏑⏑ — ‖ — ⏑⏑ — ⏑⏑ —).

Raises ScansionError if the line does not fit.

scan_trimeter

scan_trimeter(line: str) -> LineScansion

Scan a line of iambic trimeter: three metra of × — ⏑ — (the final element anceps), with resolution of long elements into two shorts.

Raises ScansionError if the line does not fit (e.g. it needs synizesis on a word not in the lexicon).

syllable_options

syllable_options(line: str) -> list[tuple[str, list[str]]]

(syllable, [possible quantities]) across the whole line — the raw, pre-metrical analysis, with cross-word position and correptio applied.

to_ipa

to_ipa(text: str, period: Period = 'attic') -> str

Transcribe Greek text to reconstructed IPA. Whitespace-separated words are transcribed independently and rejoined with spaces.

pos_tag

pos_tag(word: str) -> str

Tag a single token. Closed classes come from the lexicon; when the treebank backend is active (see aegean.greek.use_treebank), an attested form's gold tag is used next; otherwise open-class words get a suffix heuristic (a few verb endings, else NOUN). Non-letter tokens are NUM (numeric) or PUNCT.

pos_tags

pos_tags(text: str) -> list[tuple[str, str]]

(token, tag) pairs for a text, in order (punctuation tagged PUNCT). When the trained tagger is active it tags the whole sentence in context, with the closed-class lexicon and the treebank lookup still taking precedence per token.

sentences

sentences(text: str) -> list[str]

Split into trimmed sentences on Greek sentence-final punctuation.
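
A rough sketch of this kind of splitting, assuming the sentence-final set is the full stop plus the Greek question mark (written either as ";" or as U+037E) and that the punctuation stays attached to its sentence. The library's actual punctuation set and output shape may differ; the function name is hypothetical.

```python
import re

# Split on whitespace that follows ".", ";" or the Greek question mark U+037E.
_BOUNDARY = re.compile(r"(?<=[.;\u037e])\s+")


def split_sentences(text):
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


text = "ἐν ἀρχῇ ἦν ὁ λόγος. καὶ ὁ λόγος ἦν πρὸς τὸν θεόν."
for sentence in split_sentences(text):
    print(sentence)
```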

tokenize_words

tokenize_words(text: str) -> list[str]

Just the word strings, in order (punctuation dropped).