aegean.analysis¶
analysis ¶
Script-agnostic + Aegean-specific analysis over the core model.
Ported faithfully from the Linear A Research Workbench (src/lib/*.ts) and
checked against shared golden fixtures so the Python port can't silently
diverge. Methods over the undeciphered Linear A material are exploratory —
see each function's docstring.
BalanceCheck
dataclass
¶
BalanceCheck(stated_total: float, computed_sum: float, item_count: int, difference: float, balances: bool, marker: str, total_line_index: int)
One total line reconciled against the item lines feeding it: the stated total, the computed
sum, their signed difference (computed − stated), whether they balance, the total marker
(e.g. KU-RO), and the index of the total line.
AlignCell
dataclass
¶
One aligned position: a character from a (or "" for an insertion), a character
from b (or "" for a deletion), and the operation that relates them.
PhoneticComparison
dataclass
¶
PhoneticComparison(word_a: str, word_b: str, script_a: str, script_b: str, phonemes_a: str, phonemes_b: str, distance: float, similarity: float, alignment: tuple[AlignCell, ...])
One cross-script comparison: the two words, their script ids, their
romanized phoneme strings, the normalized distance (0 = identical, 1 = wholly
different), its similarity complement, and the per-segment alignment.
PhoneticClasses
dataclass
¶
Concrete vowel set + consonant-class tables for the distance metric.
PhoneticScheme
dataclass
¶
PhoneticScheme(interdentals: str = 'dental', pharyngeal_h: str = 'velar', voiced_postalveolar: str = 'sibilant', strip_notation: bool = True)
The four typologically ambiguous decisions, exposed for tuning.
PhoneticWeights
dataclass
¶
Tunable substitution / indel costs, kept in [0,1] so the normalized distance stays in [0,1].
ClusterMember
dataclass
¶
A word in a cluster, with the signs it appends beyond the cluster stem
("" for the stem itself; "≠" flags a member that doesn't actually
extend the stem).
MorphCluster
dataclass
¶
MorphCluster(stem: str, members: tuple[ClusterMember, ...], total_count: int, suffixes: tuple[str, ...])
A stem and its productive-suffix derivations.
CompiledSignPattern
dataclass
¶
A parsed sign-pattern query: the normalized sign tokens (with * = one sign and ** =
zero-or-more wildcards) and whether the pattern contains a **.
FieldDef
dataclass
¶
A queryable field: its display label, scope (inscription/word), and value kind.
FilterRow
dataclass
¶
One query row. connector joins this row to the running result within
its scope (ignored on the first row); negate flips the row's own test.
QueryResults
dataclass
¶
QueryResults(inscriptions: list[Document], words: list[tuple[str, int]], provenance: Provenance | None = None, description: str = '')
A query's result set: the matching inscriptions and/or (word, count) pairs.
Corpus.query attaches the corpus's provenance and a description
of the filters, so cite can cite the exact result set used in a paper.
cite ¶
Cite this exact result set: the source plus the query that produced it.
style: "plain" (one line), "bibtex" (a @misc entry), or
"apa". Raises ValueError when the results carry no provenance
(results from eval_query directly rather than Corpus.query).
WordEntry
dataclass
¶
Per-word index entry: the documents a (multi-sign) word appears in.
BootstrapCI
dataclass
¶
A percentile bootstrap interval: the statistic on the full corpus
(estimate) and the [low, high] band that contains the fraction level of the
resampled values.
Dispersion
dataclass
¶
How evenly one item spreads over the documents of a corpus.
dp is Gries' deviation of proportions: 0 = the item is distributed
exactly as the document sizes predict; values toward 1 = concentrated in
few documents. dp_norm rescales by the attainable maximum
(Lijffijt & Gries 2012) so corpora with different size profiles compare.
range is the number of documents attesting the item, out of the parts
documents that contain any items at all.
KeynessRow
dataclass
¶
KeynessRow(item: str, target_count: int, target_total: int, reference_count: int, reference_total: int, log_likelihood: float, log_ratio: float, p_value: float)
One item's keyness in a target (sub)corpus against a reference.
log_likelihood is Dunning's G² (significance: is the imbalance more
than chance?); p_value its χ²₁ tail. log_ratio is Hardie's log₂
ratio of relative frequencies (effect size: how big is the difference?)
— positive = overused in the target, negative = underused; each whole
point is a doubling. Zero counts are smoothed (default +0.5) for the
ratio only, never for G².
StructureCategory
dataclass
¶
A heuristic tablet-structure category (e.g. accounting/libation/list): key, label, description.
account_lines ¶
account_lines(document: Document) -> list[list[str]]
The document's physical lines as token-text lists.
balance_check ¶
balance_check(document: Document) -> list[BalanceCheck]
Verify every total line on a document against its summed item lines.
Uses the script's total markers (Linear A's KU-RO, Linear B's TO-SO/TO-SA).
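To illustrate what each BalanceCheck records, the sketch below reconciles one stated total against its item values. It is a standalone toy, not the library's implementation: the names `BalanceSketch` and `reconcile` and the tolerance `tol` are assumptions made for this example, and numeral parsing is skipped entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BalanceSketch:
    """Hypothetical stand-in for a few BalanceCheck fields."""
    stated_total: float
    computed_sum: float
    item_count: int
    difference: float  # signed: computed - stated
    balances: bool


def reconcile(items, stated_total, tol=1e-9):
    """Sum the item values and compare them with the stated total.

    `tol` is an assumed float tolerance, not a documented parameter.
    """
    computed = sum(items)
    difference = computed - stated_total
    return BalanceSketch(
        stated_total, computed, len(items), difference, abs(difference) <= tol
    )


ok = reconcile([5, 10, 16], 31)   # items add up to the total
short = reconcile([5, 10], 16)    # items fall one unit short
print(ok.balances, short.difference)
```

A negative difference means the items sum to less than the stated total.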
add_sequence ¶
Add one word sequence to a growing alignment via Needleman–Wunsch at the
word level (exact-token match rewarded, substitution columns allowed, gaps
penalized). prior_n is how many sequences are already in the alignment.
align_phonetic ¶
align_phonetic(a: str, b: str, w: PhoneticWeights = DEFAULT_WEIGHTS, cl: PhoneticClasses = DEFAULT_PHONETIC_CLASSES) -> list[AlignCell]
Run the weighted Levenshtein, then backtrace to emit a per-position alignment classifying each substitution as vowel / same-class / far.
align_sequences ¶
Progressive multiple alignment of word sequences (e.g. several inscriptions). Returns aligned positions, one column per input sequence.
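The pairwise step behind a progressive alignment can be sketched as a word-level Needleman–Wunsch. This is an independent illustration: `align_pair` and the scores (match 2, mismatch -1, gap -2) are assumptions for the example, and the library's actual scoring and tie-breaking may differ.

```python
def align_pair(a, b, match=2.0, mismatch=-1.0, gap=-2.0):
    """Globally align two word lists; None marks a gap in that sequence."""
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # match or substitution column
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Backtrace from the bottom-right corner, preferring the diagonal.
    columns, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                columns.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if j == 0 or (i > 0 and score[i][j] == score[i - 1][j] + gap):
            columns.append((a[i - 1], None))
            i -= 1
        else:
            columns.append((None, b[j - 1]))
            j -= 1
    columns.reverse()
    return columns


print(align_pair(["A", "B", "C"], ["A", "C"]))
```

Each returned tuple is one aligned column; a progressive multiple alignment repeats this step against the growing alignment.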
chi_squared_2x2 ¶
Yates-corrected chi-squared test statistic for the 2×2 table.
The continuity correction subtracts N/2 from |ad − bc| and clamps the
corrected deviation at 0 (so near-independent pairs score ~0, not a small
spurious positive). Returns 0 for degenerate tables.
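The description above corresponds to the usual Yates formula, which can be sketched as follows. The function name `yates_chi2` is hypothetical, and the cell order (a, b, c, d) is assumed to read the table row by row.

```python
def yates_chi2(a, b, c, d):
    """Yates-corrected chi-squared for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0  # degenerate table: some row or column is empty
    # Subtract N/2 from |ad - bc| and clamp the corrected deviation at 0.
    deviation = max(0.0, abs(a * d - b * c) - n / 2)
    return n * deviation * deviation / margins


print(yates_chi2(10, 20, 30, 40))  # modest association
print(yates_chi2(1, 1, 1, 1))      # independent: clamped to 0.0
```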
chi_squared_p_value ¶
p-value for chi-squared with 1 degree of freedom: P(X² ≥ x). In [0,1],
and non-increasing in x.
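With one degree of freedom the tail has a closed form, since a χ²₁ variable is the square of a standard normal: P(X² ≥ x) = erfc(√(x/2)). A minimal sketch, with `chi2_p_1df` as a hypothetical name:

```python
from math import erfc, sqrt


def chi2_p_1df(x):
    """Upper-tail p-value of chi-squared with 1 degree of freedom."""
    if x <= 0:
        return 1.0
    return erfc(sqrt(x / 2))


print(chi2_p_1df(3.841458820694124))  # the familiar 5% critical value
```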
fishers_exact ¶
Fisher's exact test, two-sided, for the 2×2 table: the summed hypergeometric probability of all tables with the same marginals whose probability is ≤ the observed table's. More accurate than χ² for small expected counts but O(N) per pair. Returns 1 for a degenerate margin.
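The two-sided rule described above can be sketched with exact binomial coefficients. This is a standalone illustration, not the library's code: `fisher_two_sided` is a hypothetical name and the small relative tolerance used when comparing probabilities is an assumption.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    if 0 in (row1, row2, col1, n - col1):
        return 1.0  # degenerate margin
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table with the same marginals that is no more likely
    # than the observed one (tolerance guards against float noise).
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


print(fisher_two_sided(3, 1, 1, 3))
```

The loop visits every feasible value of the top-left cell, which is where the linear cost per pair comes from.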
log_likelihood_ratio_2x2 ¶
Log-likelihood ratio (G²) for the 2×2 table — Dunning (1993), the
corpus-linguistics standard. G² = 2 · Σ O·ln(O/E) over the four cells;
more robust than χ² for the sparse, low-count pairs of a small corpus.
Returns 0 for degenerate tables; larger = stronger association.
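The formula G² = 2 · Σ O·ln(O/E) can be written out directly. In this sketch the name `g2_2x2` is hypothetical, expected counts come from the table's marginals, and empty cells contribute nothing to the sum.

```python
from math import log


def g2_2x2(a, b, c, d):
    """Log-likelihood ratio G^2 for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table
    cells = (
        (a, row1 * col1 / n),
        (b, row1 * col2 / n),
        (c, row2 * col1 / n),
        (d, row2 * col2 / n),
    )
    # O * ln(O / E) tends to 0 as O tends to 0, so empty cells are skipped.
    total = sum(o * log(o / e) for o, e in cells if o > 0)
    return max(0.0, 2 * total)


print(g2_2x2(10, 20, 30, 40))
```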
pmi_interval ¶
Propagate a Wilson interval on the joint probability into a confidence interval for pointwise mutual information (in log₂ space), holding the marginals fixed. A zero lower bound on the joint probability clamps the lower PMI bound to a finite floor (−20).
wilson_interval ¶
Wilson score interval for a binomial proportion p̂ = k/n. Stays inside
[0,1] with good coverage even at small/extreme p̂. z = 1.96 ≈ 95%.
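The Wilson interval, and the way a bound on the joint probability carries over to PMI, can be sketched as below. The names `wilson` and `pmi_bounds`, the (0, 1) return for n = 0, and the way the floor is applied are assumptions of this example; only the formula itself and the floor of -20 come from the descriptions above.

```python
from math import log2, sqrt


def wilson(k, n, z=1.96):
    """Wilson score interval for the proportion k / n."""
    if n == 0:
        return (0.0, 1.0)  # assumed convention for no observations
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def pmi_bounds(k_xy, n, p_x, p_y, z=1.96, floor=-20.0):
    """PMI interval in log2 space, with the marginals p_x and p_y held fixed."""
    low, high = wilson(k_xy, n, z)

    def pmi(p_joint):
        if p_joint <= 0:
            return floor
        return max(floor, log2(p_joint / (p_x * p_y)))

    return (pmi(low), pmi(high))


print(wilson(0, 10))
print(pmi_bounds(50, 100, 0.5, 0.5))
```

Because log₂ is monotone, the bounds of the joint interval map straight onto the bounds of the PMI interval.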
nearest ¶
nearest(word: str, script: str, candidates: Iterable[str], candidate_script: str, *, top: int = 5, weights: PhoneticWeights = DEFAULT_WEIGHTS, classes: PhoneticClasses = DEFAULT_PHONETIC_CLASSES, fold_aspiration: bool = False) -> list[tuple[str, float]]
Rank candidates (in candidate_script) by phonetic distance to
word (in script), nearest first; returns (candidate, distance)
for the top closest (top=0 = all).
The intended use is decipherment-adjacent triage — e.g. which alphabetic Greek words sound closest to a Linear B form — where the ordering is the result and the absolute distances are secondary (see the module caution). Candidates that cannot be romanized are skipped.
phonetic_compare ¶
phonetic_compare(word_a: str, script_a: str, word_b: str, script_b: str, *, weights: PhoneticWeights = DEFAULT_WEIGHTS, classes: PhoneticClasses = DEFAULT_PHONETIC_CLASSES, fold_aspiration: bool = False, overrides_a: dict[str, str] | None = None, overrides_b: dict[str, str] | None = None) -> PhoneticComparison
Compare two words across scripts by sound: romanize each, then run the weighted phonetic distance and the per-segment alignment.
The classic bridge is phonetic_compare("po-me", "linearb", "ποιμήν",
"greek") — Linear B po-me against Greek poimēn 'shepherd'. Tune the
metric with weights/classes (see distance) and meet defective
syllabic spelling with fold_aspiration.
romanize_greek ¶
Romanize alphabetic Greek to the Latin phoneme alphabet.
Strips accents, breathings, iota subscript, and diaeresis (NFD, then drop
combining marks), lowercases, and maps each letter: θ→th, φ→ph, χ→kh,
ξ→ks, ψ→ps, η→ē, ω→ō, and γ→n before a velar (γγ/γκ/γχ/γξ). Rough breathing
(the /h/) is dropped with the other diacritics — the syllabaries don't write
it either. fold_aspiration further maps θ/φ/χ → t/p/k for a fairer match
against aspiration-blind syllabic spelling. Non-Greek letters pass through.
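The steps listed above can be sketched with the standard library's `unicodedata`. This is an illustration under stated assumptions, not the library's table: `romanize_sketch` is a hypothetical name, and the plain single-letter values (for example υ→u and ζ→z) are assumed, since the text above specifies only the digraphs, the long vowels, and the γ rule.

```python
import unicodedata

# Assumed letter table; only th/ph/kh/ks/ps/ē/ō are given in the description.
_LETTERS = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z", "η": "ē",
    "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m", "ν": "n", "ξ": "ks",
    "ο": "o", "π": "p", "ρ": "r", "σ": "s", "ς": "s", "τ": "t", "υ": "u",
    "φ": "ph", "χ": "kh", "ψ": "ps", "ω": "ō",
}
_VELARS = set("γκχξ")
_FOLDED = {"θ": "t", "φ": "p", "χ": "k"}


def romanize_sketch(text, fold_aspiration=False):
    """Strip diacritics via NFD, lowercase, then map letter by letter."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()
    out = []
    for i, ch in enumerate(bare):
        following = bare[i + 1] if i + 1 < len(bare) else ""
        if ch == "γ" and following in _VELARS:
            out.append("n")  # gamma before a velar is nasal
        elif fold_aspiration and ch in _FOLDED:
            out.append(_FOLDED[ch])
        else:
            out.append(_LETTERS.get(ch, ch))  # non-Greek passes through
    return "".join(out)


print(romanize_sketch("ποιμήν"))
print(romanize_sketch("θεός", fold_aspiration=True))
```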
to_phonemes ¶
to_phonemes(word: str, script: str, *, fold_aspiration: bool = False, overrides: dict[str, str] | None = None) -> str
Reduce word (in script) to a Latin phoneme string.
greek romanizes alphabetic text; lineara / linearb / cypriot
map a hyphenated transliteration through their sign→sound tables (overrides
tests alternative sign values). Raises ValueError for an unsupported
script (e.g. undeciphered Cypro-Minoan).
build_phonetic_classes ¶
build_phonetic_classes(scheme: PhoneticScheme = DEFAULT_PHONETIC_SCHEME) -> PhoneticClasses
Assemble concrete class tables from a scheme.
describe_phonetic_scheme ¶
describe_phonetic_scheme(s: PhoneticScheme) -> str
One-line scheme description, for stamping into saved findings/reports so a match ranking stays reproducible.
extract_root ¶
The consonant skeleton of a word's phonetic form (vowels stripped),
e.g. KU-RO → kr. Exploratory root-cognate heuristic.
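A minimal sketch of the consonant-skeleton idea, assuming the hyphenated transliteration can stand in for the phonetic form and that the vowels are plain a, e, i, o, u (the name `consonant_skeleton` is hypothetical):

```python
VOWELS = set("aeiou")


def consonant_skeleton(word):
    """Drop hyphens, lowercase, and strip the vowels."""
    flat = word.replace("-", "").lower()
    return "".join(ch for ch in flat if ch not in VOWELS)


print(consonant_skeleton("KU-RO"))
```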
is_numeral_token ¶
True for digit / superscript / subscript / approx numeral tokens.
phonetic_distance ¶
phonetic_distance(a: str, b: str, w: PhoneticWeights = DEFAULT_WEIGHTS, cl: PhoneticClasses = DEFAULT_PHONETIC_CLASSES) -> float
Weighted Levenshtein over phonetic strings, normalized to [0,1] by the
longer length. Vowel↔vowel swaps cost 0.3, same-class consonants 0.5,
everything else 1 (see PhoneticWeights).
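The metric can be sketched as an ordinary edit-distance table with graded substitution costs. This is a toy version: the vowel set and the consonant classes below are invented for the example (the real tables live in PhoneticClasses), the indel cost of 1 is an assumption, and `toy_distance` is a hypothetical name.

```python
VOWELS = set("aeiouēō")
# Toy consonant classes, assumed for illustration only.
CLASSES = [set("pbm"), set("tdn"), set("kgq"), set("sz"), set("lr")]


def sub_cost(x, y):
    """0 for identity, 0.3 vowel/vowel, 0.5 same consonant class, else 1."""
    if x == y:
        return 0.0
    if x in VOWELS and y in VOWELS:
        return 0.3
    if any(x in cls and y in cls for cls in CLASSES):
        return 0.5
    return 1.0


def toy_distance(a, b, indel=1.0):
    """Weighted Levenshtein, normalized by the longer string's length."""
    if not a and not b:
        return 0.0
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * indel]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j - 1] + sub_cost(x, y),  # substitute or keep
                prev[j] + indel,               # delete x
                cur[j - 1] + indel,            # insert y
            ))
        prev = cur
    return prev[-1] / max(len(a), len(b))


print(toy_distance("pa", "pe"), toy_distance("pa", "ba"))
```

Since every cost is at most 1 and the total is divided by the longer length, the result cannot exceed 1.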
reference_key ¶
Bare comparison key for a reference word: drop hyphens (so syllables
concatenate like the Linear A side) and lowercase. With strip_notation,
also remove pure-notation marks (reconstruction *, PIE laryngeal
subscripts ₁₂₃, the labialization/aspiration modifiers ʰ ʷ, and the
combining syllabic ring U+0325). So PIE *ǵʰésr̥ → ǵésr.
sequence_distance ¶
Standard Levenshtein over arbitrary token sequences, so whole inscriptions can be compared as ordered sequences of words.
sequence_similarity ¶
Sequence distance normalized to a 0–1 similarity (1 = identical).
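A sketch of the two functions over word lists follows. The names are hypothetical, and dividing by the longer sequence's length is an assumed normalization, since the description above does not state the divisor.

```python
def seq_distance(a, b):
    """Unit-cost Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j - 1] + (x != y),  # substitute (free if equal)
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
            ))
        prev = cur
    return prev[-1]


def seq_similarity(a, b):
    """1 for identical sequences, 0 when nothing lines up."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - seq_distance(a, b) / longest


print(seq_similarity(["KU-RO", "A-DU"], ["KU-RO", "SA-RA₂"]))
```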
find_morphological_clusters ¶
find_morphological_clusters(words: Iterable[Mapping[str, object] | tuple[str, int]], min_suffix_productivity: int = 5, min_cluster_size: int = 2, max_suffix_len: int = 2) -> list[MorphCluster]
Cluster stems with their productive-suffix derivations.
words is an iterable of {"word": str, "count": int} mappings or
(word, count) pairs (e.g. straight from Corpus.word_frequencies).
A suffix is productive when it ends at least min_suffix_productivity
distinct words; clusters smaller than min_cluster_size are dropped;
suffixes are considered up to max_suffix_len signs long.
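The productivity test for suffixes can be sketched on its own, leaving out the clustering step. `productive_suffixes` is a hypothetical helper, and requiring at least one sign of stem before the suffix is an assumption of this example.

```python
from collections import defaultdict


def productive_suffixes(words, min_productivity=2, max_suffix_len=2):
    """Map each productive suffix to the distinct words it ends."""
    ends = defaultdict(set)
    for word in set(words):
        signs = word.split("-")
        # Leave at least one sign for the stem (assumed).
        longest = min(max_suffix_len, len(signs) - 1)
        for k in range(1, longest + 1):
            ends["-".join(signs[-k:])].add(word)
    return {
        suffix: sorted(group)
        for suffix, group in ends.items()
        if len(group) >= min_productivity
    }


print(productive_suffixes(["KU-RO", "KI-RO", "PA-I-TO", "KU-PA"]))
```

The example lowers the threshold to 2 so that a four-word list yields a result; the documented default of 5 would need a larger word list.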
compile_sign_pattern ¶
compile_sign_pattern(raw: str) -> CompiledSignPattern | None
Parse a wildcard sign pattern (KU-*-RO) into a CompiledSignPattern, or None if empty.
match_sign_pattern ¶
match_sign_pattern(signs: list[str], pattern: CompiledSignPattern) -> bool
Match a word's sign sequence against a compiled pattern.
normalize_sign_label ¶
Fold subscript digits to ASCII (RA₂ → RA2).
word_matches_sign_pattern ¶
Compile and match in one call. False for single-sign words / empty patterns.
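The wildcard semantics (* for exactly one sign, ** for zero or more) can be sketched with a short recursive matcher. All names here are hypothetical, the compiled form is reduced to a plain token list, and uppercasing the labels is an assumed normalization.

```python
_SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")


def norm_label(label):
    """Fold subscript digits to ASCII; uppercasing is an assumption."""
    return label.translate(_SUBSCRIPTS).upper()


def compile_pattern(raw):
    """Split 'KU-*-RO' into tokens, or return None for an empty pattern."""
    tokens = [norm_label(t) for t in raw.strip().split("-") if t]
    return tokens or None


def matches(signs, tokens):
    """True if the sign list fits the token list."""
    if not tokens:
        return not signs
    head, rest = tokens[0], tokens[1:]
    if head == "**":
        # Let '**' absorb 0, 1, 2, ... leading signs.
        return any(matches(signs[k:], rest) for k in range(len(signs) + 1))
    if not signs:
        return False
    return (head == "*" or head == norm_label(signs[0])) and matches(signs[1:], rest)


print(matches(["KU", "MA", "RO"], compile_pattern("KU-*-RO")))
print(matches(["KU", "RO"], compile_pattern("KU-**-RO")))
```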
build_cooccurrence_map ¶
build_cooccurrence_map(documents: Iterable[Document]) -> dict[str, set[str]]
Map each multi-sign word to the set of multi-sign words it shares a document with.
build_word_index ¶
Index every multi-sign word to the documents it appears in.
eval_query ¶
eval_query(filters: list[FilterRow], output: Output, documents: list[Document], word_index: dict[str, WordEntry], annotated_ids: set[str], cooccur_map: dict[str, set[str]]) -> QueryResults
Run a query (filters + output mode) over pre-built indices and return the result set in canonical shape.
inscription_matches ¶
True if a document satisfies the inscription-scope filter rows (AND/OR/NOT-combined).
run_query ¶
run_query(corpus: Any, filters: list[FilterRow], output: Output = 'inscriptions', annotated_ids: set[str] | None = None) -> QueryResults
Build the indices from a Corpus and evaluate filters.
Convenience over eval_query for the common whole-corpus case. The result
carries the corpus's provenance and a filter summary, so it is citable
via QueryResults.cite.
summarize_filters ¶
summarize_filters(filters: list[FilterRow]) -> str
One-line, human-readable label for a filter set.
word_matches ¶
word_matches(word: str, filters: Iterable[FilterRow], cooccur_map: dict[str, set[str]]) -> bool
True if a word satisfies the word-scope filter rows (AND/OR/NOT-combined).
bootstrap_ci ¶
bootstrap_ci(corpus: Any, statistic: Callable[[Sequence[Document]], float], *, n_resamples: int = 999, level: float = 0.95, seed: int = 0) -> BootstrapCI
Percentile bootstrap CI for statistic(documents).
Documents are the resampling unit (drawn with replacement, original size),
the right grain for corpus questions where tokens within a document are
not independent (Efron & Tibshirani 1993). The seed makes the interval
reproducible by default — vary it to see Monte-Carlo wobble. The band
quantifies sampling variability given these documents; it cannot speak
to what was never excavated.
>>> mean_doc_words = lambda docs: sum(
...     len([t for t in d.tokens if t.kind is TokenKind.WORD]) for d in docs
... ) / len(docs)
>>> bootstrap_ci(corpus, mean_doc_words)  # doctest: +SKIP
BootstrapCI(estimate=7.1, low=6.4, high=7.9, level=0.95, n_resamples=999)
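The resampling scheme itself can be sketched without the library. Here documents are stood in for by plain word counts, `percentile_bootstrap` is a hypothetical name, and the nearest-rank rule for picking the percentiles is an assumption.

```python
import random


def percentile_bootstrap(docs, statistic, n_resamples=999, level=0.95, seed=0):
    """Return (estimate, low, high) from a document-level bootstrap."""
    rng = random.Random(seed)  # fixed seed makes the band reproducible
    estimate = statistic(docs)
    # Resample whole documents with replacement, keeping the original size.
    values = sorted(
        statistic(rng.choices(docs, k=len(docs))) for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    low = values[round(alpha * (n_resamples - 1))]
    high = values[round((1 - alpha) * (n_resamples - 1))]
    return estimate, low, high


def mean(xs):
    return sum(xs) / len(xs)


word_counts = [3, 7, 12, 5, 9, 4, 8, 6]  # one entry per document (made up)
print(percentile_bootstrap(word_counts, mean))
```

Changing `seed` shows how much of the band's width is Monte-Carlo noise rather than sampling variability.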
dispersion ¶
dispersion(corpus: Any, item: str, *, kind: str = 'words') -> Dispersion
Gries' DP for one item over the documents of corpus.
DP = ½ · Σᵢ |vᵢ − sᵢ| where sᵢ is document i's share of the
corpus (in items of this kind) and vᵢ the share of the item's
occurrences falling in document i (Gries 2008). dp_norm divides by
the attainable maximum 1 − min(sᵢ) (Lijffijt & Gries 2012). Raises
ValueError if the item never occurs.
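The DP formula can be worked through on per-document counts. In this sketch `gries_dp` is a hypothetical name that takes each document's size and the item's count in it, and skips documents with no items.

```python
def gries_dp(doc_sizes, item_counts):
    """Return (dp, dp_norm) for one item over a set of documents."""
    total_item = sum(item_counts)
    if total_item == 0:
        raise ValueError("item never occurs")
    pairs = [(s, c) for s, c in zip(doc_sizes, item_counts) if s > 0]
    total_size = sum(s for s, _ in pairs)
    shares = [s / total_size for s, _ in pairs]   # s_i: expected share
    spread = [c / total_item for _, c in pairs]   # v_i: observed share
    dp = 0.5 * sum(abs(v - s) for v, s in zip(spread, shares))
    max_dp = 1 - min(shares)                      # attainable maximum
    dp_norm = dp / max_dp if max_dp > 0 else 0.0
    return dp, dp_norm


print(gries_dp([10, 30], [1, 3]))  # spread exactly as sizes predict
print(gries_dp([1, 9], [1, 0]))    # all in the smallest document
```

The second call shows why the rescaling matters: the raw dp tops out at 0.9 for this size profile, while dp_norm reaches 1.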
dispersions ¶
dispersions(corpus: Any, *, kind: str = 'words', min_frequency: int = 2, top: int = 0) -> list[Dispersion]
DP for every item with frequency ≥ min_frequency, most evenly
dispersed first (ties: higher frequency first). top truncates (0 = all).
Reading the ranking: a frequent item with low dp_norm is corpus-wide
vocabulary; a frequent item with high dp_norm lives in few documents
(a formulaic or genre/site-bound term) — on Aegean material often the more
interesting case.
keyness ¶
keyness(target: Any, reference: Any, *, kind: str = 'words', min_target: int = 2, smoothing: float = 0.5) -> list[KeynessRow]
Key items of target against reference, strongest first.
For each item the 2×2 table is (count in target, rest of target, count in
reference, rest of reference); G² follows Rayson & Garside (2000) and the
log-ratio Hardie (2014). Items need target_count ≥ min_target or to
be similarly frequent in the reference (so marked under-use surfaces
too). Sorted by G² descending — filter log_ratio > 0 for the target's
own vocabulary, < 0 for what it conspicuously lacks.
The two corpora must be distinct texts (a subset vs its complement is the
classic design: keyness(c.filter(site="Pylos"), rest)).
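The effect-size half of a KeynessRow, the log₂ ratio of relative frequencies, can be sketched as below. `log_ratio_sketch` is a hypothetical name, and replacing a zero count with the smoothing value is this example's reading of "zero counts are smoothed".

```python
from math import log2


def log_ratio_sketch(target_count, target_total, reference_count,
                     reference_total, smoothing=0.5):
    """log2 of (relative frequency in target / relative frequency in reference)."""
    t = target_count if target_count > 0 else smoothing
    r = reference_count if reference_count > 0 else smoothing
    return log2((t / target_total) / (r / reference_total))


print(log_ratio_sketch(20, 1000, 10, 1000))  # twice as frequent in the target
print(log_ratio_sketch(4, 100, 0, 100))      # absent from the reference
```

Each whole point is a doubling, so a value of 3 means the item is eight times as frequent in the target.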
classify_corpus ¶
Classify every document in a corpus, returning {category_key:
[doc_id, ...]} with every category present (empty lists included) and
documents in corpus order.
classify_structure ¶
classify_structure(document: Document) -> str
The heuristic category key for one inscription, from its content shape.
Mirrors the workbench precedence exactly: a KU-RO total marker (or numerals with several multi-sign words) ⇒ accounting; otherwise a libation formula ⇒ libation; otherwise many separators and no numerals ⇒ list; otherwise an extended hyphenated text with no numerals ⇒ text; else other.
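The precedence reads naturally as a cascade of early returns. The sketch below takes the document's features as ready-made booleans, because the thresholds behind "several" and "many" are not stated here; `classify_sketch` and its parameter names are hypothetical.

```python
def classify_sketch(*, has_total_marker=False, has_numerals=False,
                    several_multisign_words=False, has_libation_formula=False,
                    many_separators=False, extended_hyphenated_text=False):
    """Return the first category whose condition holds, in precedence order."""
    if has_total_marker or (has_numerals and several_multisign_words):
        return "accounting"
    if has_libation_formula:
        return "libation"
    if many_separators and not has_numerals:
        return "list"
    if extended_hyphenated_text and not has_numerals:
        return "text"
    return "other"


print(classify_sketch(has_total_marker=True, has_libation_formula=True))
```

Because the checks run in order, a document with both a total marker and a libation formula is classed as accounting.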