Greek NLP benchmarks — protocol and numbers
How pyaegean is scored on the standard Ancient Greek benchmarks: the protocol, the leakage controls that keep the comparison honest, the field's published numbers, and pyaegean's own measured results. The README and wiki carry only pyaegean's own numbers; the cross-tool tables live here, with citations.
Protocol
- Test sets: the Universal Dependencies Ancient Greek test folds, UD_Ancient_Greek-Perseus (commit 331ddef) and UD_Ancient_Greek-PROIEL (commit a4ab8d4), both CC BY-NC-SA 3.0, fetched to the cache for evaluation only (never bundled, never trained on).
- Scorer: the official CoNLL 2018 shared-task evaluator (conll18_ud_eval.py, MPL 2.0), fetched sha256-pinned (1072e02af00b1a56205b5e8216d51dee9b8944a104d80744afaccc78859fcb16) and imported from the cache. Reported figures are the evaluator's F1 per metric.
- Gold tokenization: pyaegean runs over each fold's gold FORM column, so its scores measure tagging/lemma/parsing quality, not tokenizer agreement. (The published numbers below let each pipeline tokenize raw text; their token accuracy on these folds is ≈100%, so the protocols are close but not identical, which is noted here for precision.) The neural pipeline is also measured end-to-end from raw text (below), and the numbers hold.
- No tagset reconciliation: UPOS and lemmas are scored exactly as emitted. Convention gaps (e.g. the AGDT scheme has no PROPN/SCONJ, which the PROIEL fold's conventions use) count against pyaegean here, unlike greek.evaluate_on_proiel, which reconciles tagsets to isolate real errors.
- Reproduce the shipped pipeline with:
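
```python
from aegean import greek

greek.use_neural_pipeline()  # switch to the shipped neural pipeline
greek.evaluate_on_ud("perseus", "test")  # score the UD Perseus test fold
greek.evaluate_on_ud("proiel", "test")  # score the UD PROIEL test fold
```
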
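The sha256 pinning described for the scorer can be illustrated with a minimal sketch. It assumes the downloaded file is already available as bytes; `sha256_hex` and `verify_pinned` are hypothetical helper names for illustration, not pyaegean's actual API, and the only value taken from the protocol above is the pinned digest.

```python
import hashlib

# Digest quoted in the protocol above for conll18_ud_eval.py.
SCORER_SHA256 = "1072e02af00b1a56205b5e8216d51dee9b8944a104d80744afaccc78859fcb16"


def sha256_hex(data: bytes) -> str:
    """Return the lowercase hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def verify_pinned(data: bytes, expected_hex: str) -> bytes:
    """Return the data unchanged if its digest matches the pin, else raise.

    Hypothetical helper: raising before the file is imported or executed is
    the point of pinning, so a changed upstream file cannot silently alter
    the reported scores.
    """
    actual = sha256_hex(data)
    if actual != expected_hex.lower():
        raise ValueError(f"sha256 mismatch: expected {expected_hex}, got {actual}")
    return data
```

In use, something like `verify_pinned(cached_path.read_bytes(), SCORER_SHA256)` would run before the cached evaluator is imported, where `cached_path` is whatever location the cache uses.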
Leakage controls
UD Perseus is converted from the AGDT — the treebank pyaegean's Greek backends are built from — so a naïve evaluation would leak the test set into training. Two controls keep the neural pipeline's numbers honest:
- The UD-Perseus exclusion manifest. greek.agdt_ud_overlap() resolves every UD-Perseus dev+test sentence to its AGDT source and verifies the reference by NFC form-sequence comparison: 2,443 sentences across 5 AGDT files, all form-identical. The neural model's training split excludes all of them (cached at ud-grc/agdt-ud-exclusion.json).
- PROIEL is held out entirely. No pyaegean model trains on PROIEL, so it is a genuine out-of-domain fold. The combined-corpus model adds the Gorman (CC0) and Pedalion (CC BY-SA) treebanks; the overlap audit excluded 1,591 Gorman + 155 Pedalion sentences matching either evaluation fold, and Gorman's Herodotus files (the same work as PROIEL's hdt.xml) are excluded at source.
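The idea behind an NFC form-sequence comparison can be sketched in a few lines. This is an illustration under assumptions, not the implementation of greek.agdt_ud_overlap(): sentences are modelled as plain lists of word forms, and `nfc_key` and `split_by_overlap` are hypothetical names. NFC normalization matters for polytonic Greek because the same accented letter can be encoded precomposed or as a base letter plus combining marks.

```python
import unicodedata
from typing import Iterable, Sequence


def nfc_key(forms: Sequence[str]) -> tuple[str, ...]:
    """Normalize every form to NFC so equivalent encodings compare equal."""
    return tuple(unicodedata.normalize("NFC", form) for form in forms)


def split_by_overlap(
    train: Iterable[Sequence[str]],
    heldout: Iterable[Sequence[str]],
) -> tuple[list[Sequence[str]], list[Sequence[str]]]:
    """Split training sentences into (kept, excluded).

    A training sentence is excluded when its NFC-normalized form sequence
    is identical to that of any held-out (dev or test) sentence.
    """
    heldout_keys = {nfc_key(sentence) for sentence in heldout}
    kept: list[Sequence[str]] = []
    excluded: list[Sequence[str]] = []
    for sentence in train:
        if nfc_key(sentence) in heldout_keys:
            excluded.append(sentence)
        else:
            kept.append(sentence)
    return kept, excluded
```

Matching on the whole form sequence rather than on individual words keeps common short sentences from being the only thing compared token by token; a sentence is excluded only when every form lines up.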
One caveat applies to the pure-Python baseline below, not the neural pipeline: its tagger, edit-tree lemmatizer, arc-eager parser, and treebank lookup are built from the full AGDT, which contains the UD-Perseus test sentences. Their Perseus-fold scores are therefore an in-training upper bound, reported for orientation; the PROIEL fold is their honest number.
The field's published numbers
From Kostkan, Kardos, Mortensen & Nielbo, “OdyCy — A general-purpose NLP pipeline for Ancient Greek”, LaTeCH-CLfL 2023 (https://aclanthology.org/2023.latechclfl-1.14.pdf), Tables 1–2 — each pipeline's own tokenization, spaCy evaluation scripts. Best per metric in bold.
UD Perseus test fold:
| Pipeline | POS | Morph | Lemma | UAS | LAS |
|---|---|---|---|---|---|
| odyCy (joint) | **95.39** | **92.56** | 83.20 | **78.80** | **73.09** |
| odyCy (perseus) | 95.00 | 91.98 | 82.56 | 76.71 | 70.31 |
| greCy (perseus) | 93.50 | 90.59 | 75.10 | 76.34 | 70.20 |
| Stanza (perseus) | 91.05 | 91.03 | **87.58** | 78.69 | 71.82 |
| UDPipe (perseus) | 80.95 | 85.70 | 82.73 | 63.97 | 55.81 |
| CLTK | 80.50 | 61.49 | 79.46 | 33.05 | 24.25 |
UD PROIEL test fold:
| Pipeline | POS | Morph | Lemma | UAS | LAS |
|---|---|---|---|---|---|
| greCy (proiel) | **98.23** | **94.05** | **98.06** | **85.74** | **82.28** |
| odyCy (joint) | 97.81 | 93.46 | 94.41 | 83.17 | 79.03 |
| Stanza (proiel) | 97.39 | 92.20 | 97.21 | 81.51 | 77.48 |
| CLTK | 96.95 | 90.76 | 96.50 | 57.61 | 54.57 |
| UDPipe (proiel) | 95.97 | 88.62 | 93.17 | 72.40 | 67.48 |
(The same paper shows every single-treebank model collapsing on the other treebank — e.g. Stanza-perseus scores 59.00 UAS on PROIEL — which is why pyaegean keeps out-of-domain and unseen-form measurement first-class.)
A newer baseline raises the parsing bar above that table: Riemenschneider & Frank 2024, “A State-of-the-Art Morphosyntactic Parser and Lemmatizer for Ancient Greek” (https://arxiv.org/abs/2410.12055; the GreBERTa/GreTa authors). On the UD Perseus test fold, with models trained on the UD train fold and gold tokenization, they report GreBERTa-based parsing at UAS 88.20 / LAS 83.98, POS 95.83 and XPOS 91.09. Their GreTa lemmatizer reaches 91.17 lemma accuracy, measured on their own (AGDT + Gorman + Pedalion, normalized) folds. Their main models train on AGDT + Gorman + Pedalion (~1.26 M tokens), the same license-clean data lever (Gorman CC0, Pedalion CC BY-SA) the pyaegean joint model uses.
pyaegean — the neural pipeline (shipped)
The shipped joint model (grc-joint-v1, activated by greek.use_neural_pipeline() and
installed with the [neural] extra) is one GreBerta-encoder checkpoint serving UPOS, XPOS,
UD FEATS, dependency trees (single-root Chu-Liu/Edmonds MST decoding, so non-projectivity
is handled natively), and lemmas. It is trained leakage-clean on the audited AGDT +
Gorman + Pedalion corpus (1.41 M tokens). The figures below are measured through the
package's own inference code on the fetched release asset (sha256-verified, onnxruntime CPU):
| Test fold | Lemma | UAS | LAS | UPOS | UFeats | XPOS |
|---|---|---|---|---|---|---|
| UD Perseus | 94.40 | 89.16 | 84.38 | 96.94 | 96.12 | 93.56 |
| UD PROIEL | 90.57 | 82.52 | 63.51 | 87.16 | 59.49 | n/a |
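As a reading aid for the UPOS, UAS and LAS columns, here is a minimal sketch of how such scores come out of aligned gold and system words. It is a simplification, not the CoNLL 2018 evaluator: it assumes identical (gold) tokenization, compares dependency labels verbatim, and uses a hypothetical `Word` record. Under gold tokenization the system and gold word counts are equal, so precision, recall and F1 all reduce to plain accuracy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str
    upos: str
    head: int  # 1-based index of the head word; 0 marks the root
    deprel: str


def f1(correct: int, system_total: int, gold_total: int) -> float:
    """Harmonic mean of precision (correct/system) and recall (correct/gold)."""
    if correct == 0 or system_total == 0 or gold_total == 0:
        return 0.0
    precision = correct / system_total
    recall = correct / gold_total
    return 2 * precision * recall / (precision + recall)


def score(gold: list[Word], system: list[Word]) -> dict[str, float]:
    """Score one aligned sentence; assumes both sides share the same tokens."""
    if [w.form for w in gold] != [w.form for w in system]:
        raise ValueError("tokenization differs; words must be aligned first")
    pairs = list(zip(gold, system))
    correct = {
        "UPOS": sum(g.upos == s.upos for g, s in pairs),
        # UAS: the predicted head is right, whatever the label.
        "UAS": sum(g.head == s.head for g, s in pairs),
        # LAS: both the head and the dependency label are right.
        "LAS": sum(g.head == s.head and g.deprel == s.deprel for g, s in pairs),
    }
    n = len(gold)
    return {name: 100 * f1(count, n, n) for name, count in correct.items()}
```

LAS can never exceed UAS, since a word counts toward LAS only if its head is already correct. That is why label-convention differences between treebanks show up as an LAS gap while UAS stays comparatively high.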
On UD Perseus test, every metric sits above the best published number I could find:
| Metric | pyaegean | best published | margin |
|---|---|---|---|
| UPOS | 96.94 | 95.83 (2024) | +1.11 |
| XPOS | 93.56 | 91.09 (2024) | +2.47 |
| UFeats | 96.12 | 92.56 (odyCy 2023) | +3.56 |
| Lemma | 94.40 | 87.58 (Stanza, same fold) | +6.82 |
| UAS | 89.16 | 88.20 (2024) | +0.96 |
| LAS | 84.38 | 83.98 (2024) | +0.40 |
Three things keep these honest:
- The LAS margin is thin — +0.40, about 4× the run-to-run spread (±0.09 across three seeds). It clears the bar, but barely; stated plainly.
- PROIEL is out of domain. The in-domain published systems train on the PROIEL fold itself; pyaegean never does. Against the Perseus-trained published systems — the like-for-like out-of-domain comparison — pyaegean leads by ~17 UAS. The remaining PROIEL LAS and UFeats gaps are largely deprel- and feature-convention divergence between the two treebanks' UD conversions (PROIEL annotates five feature types the Perseus scheme lacks, and PROIEL XPOS is a different tagset entirely).
- Raw text, end to end. Removing the gold-tokenization asterisk — from each sentence's raw text through pyaegean's own tokenizer (tokens F1 99.97) to the evaluator — UD Perseus holds at lemma 94.38 / UAS 89.15 / LAS 84.38 / UPOS 96.91 / UFeats 96.09. Throughput is ≈450 words/s on plain CPU (the whole Perseus fold in 46 s).
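The arithmetic behind the margins, the "about 4×" spread ratio and the throughput figure can be re-derived from numbers stated on this page. The sketch below only copies those figures into constants (the constant names are illustrative) and recomputes the derived values.

```python
# (pyaegean, best published) on the UD Perseus test fold, from the table above.
PERSEUS = {
    "UPOS": (96.94, 95.83),
    "XPOS": (93.56, 91.09),
    "UFeats": (96.12, 92.56),
    "Lemma": (94.40, 87.58),
    "UAS": (89.16, 88.20),
    "LAS": (84.38, 83.98),
}
SEED_SPREAD = 0.09    # run-to-run LAS spread across three seeds
FOLD_WORDS = 20_959   # words in the UD Perseus test fold
FOLD_SECONDS = 46     # wall time reported for the whole fold


def margins(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Difference between the two scores per metric, rounded to two decimals."""
    return {metric: round(ours - best, 2) for metric, (ours, best) in scores.items()}


perseus_margins = margins(PERSEUS)
las_margin_in_spreads = perseus_margins["LAS"] / SEED_SPREAD
words_per_second = FOLD_WORDS / FOLD_SECONDS

for metric, value in perseus_margins.items():
    print(f"{metric}: +{value:.2f}")
print(f"LAS margin / seed spread: {las_margin_in_spreads:.1f}")
print(f"throughput: {words_per_second:.0f} words/s")
```

The LAS margin works out to roughly 4.4 seed spreads and the throughput to roughly 456 words/s, consistent with the rounded "about 4×" and "≈450 words/s" quoted above.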
The model ships fp32 (~518 MB): int8 dynamic quantization broke it on the dev set (UPOS 97.97 → 16.75), so the quantization gate rejected it. Selective quantization is a known follow-up; correctness ships first.
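A quantization gate of the kind mentioned here can be as simple as a per-metric tolerance check on the dev set. The following is a sketch under assumptions: the function name and the 0.5-point default tolerance are hypothetical, since the actual threshold is not stated; only the two UPOS figures come from the text above.

```python
def passes_quantization_gate(
    fp32_scores: dict[str, float],
    quantized_scores: dict[str, float],
    max_drop: float = 0.5,  # assumed tolerance in points; the real gate's value is not documented here
) -> bool:
    """Accept a quantized model only if no dev-set metric drops too far.

    A metric missing from the quantized run is treated as a failure, so a
    partial evaluation cannot pass the gate by omission.
    """
    for metric, baseline in fp32_scores.items():
        if metric not in quantized_scores:
            return False
        if baseline - quantized_scores[metric] > max_drop:
            return False
    return True


# The int8 result reported above: dev UPOS 97.97 -> 16.75.
int8_accepted = passes_quantization_gate({"UPOS": 97.97}, {"UPOS": 16.75})
```

With the reported figures the drop is over 81 points, so any reasonable tolerance rejects the int8 model and the fp32 checkpoint ships instead.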
pyaegean — the pure-Python baseline
The zero-dependency stack (use_treebank() + use_tagger() + use_lemmatizer() +
use_parser()) is the offline, no-heavy-deps path. It is a baseline, and reads like one:
| Fold | UPOS | Lemma | UAS |
|---|---|---|---|
| Perseus test ⚠ | 87.05 | 97.65 ⚠ | 37.89 |
| PROIEL test | 75.03 | 85.26 (90.38 with the neural lemmatizer) | 33.51 |
⚠ = in-training upper bound (see Leakage controls); the 97.65 Perseus lemma is the lookup memorizing the fold. LAS is not comparable here — the arc-eager parser emits Prague labels, not UD relations. The baseline exists for the zero-install path; the neural pipeline carries the accuracy claims.
(Perseus: 1,306 sentences / 20,959 words; PROIEL: 1,047 / 13,314.)