aegean.analysis

analysis

Script-agnostic and Aegean-specific analysis over the core model.

Ported faithfully from the Linear A Research Workbench (src/lib/*.ts) and checked against shared golden fixtures so the Python port can't silently diverge. Methods over the undeciphered Linear A material are exploratory — see each function's docstring.

BalanceCheck dataclass

BalanceCheck(stated_total: float, computed_sum: float, item_count: int, difference: float, balances: bool, marker: str, total_line_index: int)

One total line reconciled against the item lines feeding it: the stated total, the computed sum, their signed difference (computed − stated), whether they balance, the total marker (e.g. KU-RO), and the index of the total line.

AlignCell dataclass

AlignCell(a: str, b: str, op: AlignOp)

One aligned position: a char from a (or "" for ins), a char from b (or "" for del), and the operation that relates them.

PhoneticComparison dataclass

PhoneticComparison(word_a: str, word_b: str, script_a: str, script_b: str, phonemes_a: str, phonemes_b: str, distance: float, similarity: float, alignment: tuple[AlignCell, ...])

One cross-script comparison: the two words, their script ids, their romanized phoneme strings, the normalized distance (0 = identical, 1 = wholly different), its similarity complement, and the per-segment alignment.

PhoneticClasses dataclass

PhoneticClasses(vowels: str, consonant_classes: tuple[tuple[str, ...], ...])

Concrete vowel set + consonant-class tables for the distance metric.

PhoneticScheme dataclass

PhoneticScheme(interdentals: str = 'dental', pharyngeal_h: str = 'velar', voiced_postalveolar: str = 'sibilant', strip_notation: bool = True)

The four typologically ambiguous decisions, exposed for tuning.

PhoneticWeights dataclass

PhoneticWeights(vowel: float = 0.3, same_class: float = 0.5, far: float = 1.0, indel: float = 1.0)

Tunable substitution / indel costs, kept in [0,1] so the normalized distance stays in [0,1].

ClusterMember dataclass

ClusterMember(word: str, count: int, suffix: str)

A word in a cluster, with the signs it appends beyond the cluster stem ("" for the stem itself; "≠" flags a member that doesn't actually extend the stem).

MorphCluster dataclass

MorphCluster(stem: str, members: tuple[ClusterMember, ...], total_count: int, suffixes: tuple[str, ...])

A stem and its productive-suffix derivations.

CompiledSignPattern dataclass

CompiledSignPattern(tokens: tuple[str, ...], has_double_star: bool)

A parsed sign-pattern query: the normalized sign tokens (with * = one sign and ** = zero-or-more wildcards) and whether the pattern contains a **.

FieldDef dataclass

FieldDef(label: str, scope: Scope, kind: FieldKind)

A queryable field: its display label, scope (inscription/word), and value kind.

FilterRow dataclass

FilterRow(field: str, value: Any, connector: Connector | None = None, negate: bool = False)

One query row. connector joins this row to the running result within its scope (ignored on the first row); negate flips the row's own test.

QueryResults dataclass

QueryResults(inscriptions: list[Document], words: list[tuple[str, int]], provenance: Provenance | None = None, description: str = '')

A query's result set: the matching inscriptions and/or (word, count) pairs.

Corpus.query attaches the corpus's provenance and a description of the filters, so cite can cite the exact result set used in a paper.

cite

cite(style: str = 'plain') -> str

Cite this exact result set: the source plus the query that produced it.

style: "plain" (one line), "bibtex" (a @misc entry), or "apa". Raises ValueError when the results carry no provenance (results from eval_query directly rather than Corpus.query).

WordEntry dataclass

WordEntry(count: int, inscription_ids: tuple[str, ...], sites: frozenset[str] = frozenset())

Per-word index entry: the documents a (multi-sign) word appears in.

BootstrapCI dataclass

BootstrapCI(estimate: float, low: float, high: float, level: float, n_resamples: int)

A percentile bootstrap interval: the statistic on the full corpus (estimate) and the [low, high] band holding level of the resampled values.

Dispersion dataclass

Dispersion(item: str, frequency: int, range: int, parts: int, dp: float, dp_norm: float)

How evenly one item spreads over the documents of a corpus.

dp is Gries' deviation of proportions: 0 = the item is distributed exactly as the document sizes predict; values toward 1 = concentrated in few documents. dp_norm rescales by the attainable maximum (Lijffijt & Gries 2012) so corpora with different size profiles compare. range is the count of documents attesting the item (out of parts documents that have any items at all).

KeynessRow dataclass

KeynessRow(item: str, target_count: int, target_total: int, reference_count: int, reference_total: int, log_likelihood: float, log_ratio: float, p_value: float)

One item's keyness in a target (sub)corpus against a reference.

log_likelihood is Dunning's G² (significance: is the imbalance more than chance?); p_value its χ²₁ tail. log_ratio is Hardie's log₂ ratio of relative frequencies (effect size: how big is the difference?) — positive = overused in the target, negative = underused; each whole point is a doubling. Zero counts are smoothed (default +0.5) for the ratio only, never for G².

StructureCategory dataclass

StructureCategory(key: str, label: str, description: str)

A heuristic tablet-structure category (e.g. accounting/libation/list): key, label, description.

account_lines

account_lines(document: Document) -> list[list[str]]

The document's physical lines as token-text lists.

balance_check

balance_check(document: Document) -> list[BalanceCheck]

Verify every total line on a document against its summed item lines.

Uses the script's total markers (Linear A's KU-RO, Linear B's TO-SO/TO-SA).

add_sequence

add_sequence(aln: list[AlnPos], seq: list[str], prior_n: int) -> list[AlnPos]

Add one word sequence to a growing alignment via Needleman–Wunsch at the word level (exact-token match rewarded, substitution columns allowed, gaps penalized). prior_n is how many sequences are already in the alignment.

align_phonetic

align_phonetic(a: str, b: str, w: PhoneticWeights = DEFAULT_WEIGHTS, cl: PhoneticClasses = DEFAULT_PHONETIC_CLASSES) -> list[AlignCell]

Run the weighted Levenshtein, then backtrace to emit a per-position alignment classifying each substitution as vowel / same-class / far.

align_sequences

align_sequences(seqs: list[list[str]]) -> list[AlnPos]

Progressive multiple alignment of word sequences (e.g. several inscriptions). Returns aligned positions, one column per input sequence.

chi_squared_2x2

chi_squared_2x2(joint: int, count_a: int, count_b: int, total: int) -> float

Yates-corrected chi-squared test statistic for the 2×2 table.

The continuity correction subtracts N/2 from |ad − bc| and clamps the corrected deviation at 0 (so near-independent pairs score ~0, not a small spurious positive). Returns 0 for degenerate tables.
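The correction described above can be sketched as follows. This is an illustration, not the library's code: the name yates_chi_squared_sketch is hypothetical, and the cell layout derived from (joint, count_a, count_b, total) is an assumption.

```python
def yates_chi_squared_sketch(joint: int, count_a: int, count_b: int, total: int) -> float:
    """Yates-corrected chi-squared for a 2x2 co-occurrence table (sketch)."""
    # Assumed cell layout: a = both, b = A only, c = B only, d = neither.
    a = joint
    b = count_a - joint
    c = count_b - joint
    d = total - count_a - count_b + joint
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if total <= 0 or min(a, b, c, d) < 0 or denom == 0:
        return 0.0  # degenerate table
    # Subtract N/2 from |ad - bc| and clamp the corrected deviation at 0.
    deviation = max(0.0, abs(a * d - b * c) - total / 2)
    return total * deviation ** 2 / denom
```

With the clamp, a perfectly independent table scores exactly 0 instead of the small positive value an unclamped correction would give.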

chi_squared_p_value

chi_squared_p_value(x: float) -> float

p-value for chi-squared with 1 degree of freedom: P(X² ≥ x). In [0,1], and non-increasing in x.
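For one degree of freedom the tail has a closed form, because X² is the square of a standard normal: P(X² ≥ x) = erfc(√(x/2)). A minimal sketch under that identity (the name chi2_1df_p_value_sketch is hypothetical, and the library may compute the tail differently):

```python
import math


def chi2_1df_p_value_sketch(x: float) -> float:
    """Upper tail of chi-squared with 1 degree of freedom (sketch)."""
    if x <= 0:
        return 1.0  # the whole distribution lies at or above 0
    # X^2 = Z^2 for standard normal Z, so P(X^2 >= x) = P(|Z| >= sqrt(x)).
    return math.erfc(math.sqrt(x / 2.0))
```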

fishers_exact

fishers_exact(joint: int, count_a: int, count_b: int, total: int) -> float

Fisher's exact test, two-sided, for the 2×2 table: the summed hypergeometric probability of all tables with the same marginals whose probability is ≤ the observed table's. More accurate than χ² for small expected counts but O(N) per pair. Returns 1 for a degenerate margin.
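The two-sided rule above (sum every table at most as probable as the observed one) can be sketched with exact binomial coefficients. The name fishers_exact_sketch and the small relative tolerance used for the "≤" comparison are assumptions of this sketch.

```python
from math import comb


def fishers_exact_sketch(joint: int, count_a: int, count_b: int, total: int) -> float:
    """Two-sided Fisher's exact test for a 2x2 table (sketch)."""
    lo = max(0, count_a + count_b - total)  # smallest joint the margins allow
    hi = min(count_a, count_b)              # largest joint the margins allow
    if total <= 0 or lo >= hi or not lo <= joint <= hi:
        return 1.0  # degenerate margin: only one table is possible

    denom = comb(total, count_b)

    def prob(k: int) -> float:
        # Hypergeometric probability of seeing k joint occurrences.
        return comb(count_a, k) * comb(total - count_a, count_b - k) / denom

    observed = prob(joint)
    cutoff = observed * (1 + 1e-9)  # tolerance for float ties (assumption)
    p_value = sum(p for p in map(prob, range(lo, hi + 1)) if p <= cutoff)
    return min(1.0, p_value)
```

The loop over every admissible joint count is what makes the test O(N) per pair.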

log_likelihood_ratio_2x2

log_likelihood_ratio_2x2(joint: int, count_a: int, count_b: int, total: int) -> float

Log-likelihood ratio (G²) for the 2×2 table — Dunning (1993), the corpus-linguistics standard. G² = 2 · Σ O·ln(O/E) over the four cells; more robust than χ² for the sparse, low-count pairs of a small corpus. Returns 0 for degenerate tables; larger = stronger association.
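A minimal sketch of the four-cell sum, assuming the same cell layout as the other 2×2 helpers (a = both, b = A only, c = B only, d = neither). The name g2_2x2_sketch is hypothetical; empty cells contribute 0, following the usual convention 0·ln 0 = 0.

```python
import math


def g2_2x2_sketch(joint: int, count_a: int, count_b: int, total: int) -> float:
    """Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)) over four cells (sketch)."""
    a = joint
    b = count_a - joint
    c = count_b - joint
    d = total - count_a - count_b + joint
    observed = (a, b, c, d)
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if total <= 0 or min(observed) < 0 or 0 in rows or 0 in cols:
        return 0.0  # degenerate table
    # Expected counts under independence: row total * column total / N.
    expected = tuple(r * k / total for r in rows for k in cols)
    g2 = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return max(0.0, g2)  # guard against tiny negative rounding error
```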

pmi_interval

pmi_interval(joint: int, count_a: int, count_b: int, total: int) -> tuple[float, float]

Propagate a Wilson interval on the joint probability into a pointwise mutual information confidence interval (log₂ space), holding the marginals fixed. A zero lower joint clamps PMI low to a finite floor (−20).

wilson_interval

wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]

Wilson score interval for a binomial proportion p̂ = k/n. Stays inside [0,1] with good coverage even at small/extreme p̂. z = 1.96 ≈ 95%.
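The Wilson formula, and the way pmi_interval propagates it into log₂ space, can be sketched together. Both function names are hypothetical. The handling of n ≤ 0 and of empty marginals is a choice made for this sketch, and taking the band on joint/total is an assumed reading of "holding the marginals fixed".

```python
import math

PMI_FLOOR = -20.0  # the finite floor named in the pmi_interval description


def wilson_interval_sketch(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the proportion k/n (sketch)."""
    if n <= 0:
        return (0.0, 1.0)  # assumption: no trials, no information
    p_hat = k / n
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def pmi_interval_sketch(joint: int, count_a: int, count_b: int, total: int) -> tuple[float, float]:
    """PMI band from a Wilson band on the joint probability (sketch)."""
    if total <= 0 or count_a <= 0 or count_b <= 0:
        raise ValueError("marginals must be positive")  # sketch's own choice
    independent = (count_a / total) * (count_b / total)
    low_p, high_p = wilson_interval_sketch(joint, total)
    # A zero lower joint probability has no logarithm: clamp to the floor.
    low = math.log2(low_p / independent) if low_p > 0 else PMI_FLOOR
    high = math.log2(high_p / independent)
    return (max(low, PMI_FLOOR), high)
```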

nearest

nearest(word: str, script: str, candidates: Iterable[str], candidate_script: str, *, top: int = 5, weights: PhoneticWeights = DEFAULT_WEIGHTS, classes: PhoneticClasses = DEFAULT_PHONETIC_CLASSES, fold_aspiration: bool = False) -> list[tuple[str, float]]

Rank candidates (in candidate_script) by phonetic distance to word (in script), nearest first; returns (candidate, distance) pairs for the top closest candidates (top=0 returns all).

The intended use is decipherment-adjacent triage — e.g. which alphabetic Greek words sound closest to a Linear B form — where the ordering is the result and the absolute distances are secondary (see the module caution). Candidates that cannot be romanized are skipped.

phonetic_compare

phonetic_compare(word_a: str, script_a: str, word_b: str, script_b: str, *, weights: PhoneticWeights = DEFAULT_WEIGHTS, classes: PhoneticClasses = DEFAULT_PHONETIC_CLASSES, fold_aspiration: bool = False, overrides_a: dict[str, str] | None = None, overrides_b: dict[str, str] | None = None) -> PhoneticComparison

Compare two words across scripts by sound: romanize each, then run the weighted phonetic distance and the per-segment alignment.

The classic bridge is phonetic_compare("po-me", "linearb", "ποιμήν", "greek") — Linear B po-me against Greek poimēn 'shepherd'. Tune the metric with weights/classes (see distance) and meet defective syllabic spelling with fold_aspiration.

romanize_greek

romanize_greek(text: str, *, fold_aspiration: bool = False) -> str

Romanize alphabetic Greek to the Latin phoneme alphabet.

Strips accents, breathings, iota subscript, and diaeresis (NFD, then drop combining marks), lowercases, and maps each letter: θ→th, φ→ph, χ→kh, ξ→ks, ψ→ps, η→ē, ω→ō, and γ→n before a velar (γγ/γκ/γχ/γξ). Rough breathing (the /h/) is dropped with the other diacritics — the syllabaries don't write it either. fold_aspiration further maps θ/φ/χ → t/p/k for a fairer match against aspiration-blind syllabic spelling. Non-Greek letters pass through.
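The steps above (NFD, drop combining marks, map letters, nasal gamma, optional aspiration folding) can be sketched as below. The name romanize_greek_sketch is hypothetical. The digraph and long-vowel values come from the description; the remaining one-to-one letter values (for example υ→u, ζ→z) are assumptions made for illustration.

```python
import unicodedata

# Lowercase Greek letters U+03B1..U+03C9 in code-point order:
# alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu nu xi
# omicron pi rho final-sigma sigma tau upsilon phi chi psi omega
_LATIN = ["a", "b", "g", "d", "e", "z", "\u0113", "th", "i", "k", "l", "m", "n",
          "ks", "o", "p", "r", "s", "s", "t", "u", "ph", "kh", "ps", "\u014d"]
_GREEK_TO_LATIN = {chr(0x03B1 + i): latin for i, latin in enumerate(_LATIN)}

_GAMMA = "\u03b3"
_VELARS = {"\u03b3", "\u03ba", "\u03c7", "\u03be"}            # gamma kappa chi xi
_FOLDED = {"\u03b8": "t", "\u03c6": "p", "\u03c7": "k"}       # theta phi chi


def romanize_greek_sketch(text: str, *, fold_aspiration: bool = False) -> str:
    """Romanize alphabetic Greek to Latin phoneme letters (sketch)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Accents, breathings, iota subscript and diaeresis are combining marks.
    letters = [ch for ch in decomposed if not unicodedata.combining(ch)]
    out = []
    for i, ch in enumerate(letters):
        nxt = letters[i + 1] if i + 1 < len(letters) else ""
        if ch == _GAMMA and nxt in _VELARS:
            out.append("n")                      # gamma before a velar is nasal
        elif fold_aspiration and ch in _FOLDED:
            out.append(_FOLDED[ch])              # aspirate folded to plain stop
        else:
            out.append(_GREEK_TO_LATIN.get(ch, ch))  # non-Greek passes through
    return "".join(out)
```

Building the table from code points rather than typed letters avoids mixing up Greek letters with their Latin look-alikes.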

to_phonemes

to_phonemes(word: str, script: str, *, fold_aspiration: bool = False, overrides: dict[str, str] | None = None) -> str

Reduce word (in script) to a Latin phoneme string.

greek romanizes alphabetic text; lineara / linearb / cypriot map a hyphenated transliteration through their sign→sound tables (overrides tests alternative sign values). Raises ValueError for an unsupported script (e.g. undeciphered Cypro-Minoan).

build_phonetic_classes

build_phonetic_classes(scheme: PhoneticScheme = DEFAULT_PHONETIC_SCHEME) -> PhoneticClasses

Assemble concrete class tables from a scheme.

describe_phonetic_scheme

describe_phonetic_scheme(s: PhoneticScheme) -> str

One-line scheme description, for stamping into saved findings/reports so a match ranking stays reproducible.

extract_root

extract_root(word: str, overrides: dict[str, str] | None = None) -> str

The consonant skeleton of a word's phonetic form (vowels stripped), e.g. KU-RO → kr. Exploratory root-cognate heuristic.

is_numeral_token

is_numeral_token(w: str) -> bool

True for digit / superscript / subscript / approx numeral tokens.

phonetic_distance

phonetic_distance(a: str, b: str, w: PhoneticWeights = DEFAULT_WEIGHTS, cl: PhoneticClasses = DEFAULT_PHONETIC_CLASSES) -> float

Weighted Levenshtein over phonetic strings, normalized to [0,1] by the longer length. With the default weights, vowel↔vowel swaps cost 0.3, same-class consonants 0.5, and everything else 1 (see PhoneticWeights).
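A minimal sketch of the metric with the default costs. The function names are hypothetical, and the consonant-class table here is an invented placeholder, not the classes the library builds from a PhoneticScheme.

```python
VOWELS = set("aeiou")
# Hypothetical classes for illustration only.
CONSONANT_CLASSES = [set("pbm"), set("tdn"), set("kgq"), set("sz"), set("lr")]


def substitution_cost(x: str, y: str, vowel: float = 0.3,
                      same_class: float = 0.5, far: float = 1.0) -> float:
    """Cost of replacing segment x with segment y."""
    if x == y:
        return 0.0
    if x in VOWELS and y in VOWELS:
        return vowel
    if any(x in cls and y in cls for cls in CONSONANT_CLASSES):
        return same_class
    return far


def phonetic_distance_sketch(a: str, b: str, indel: float = 1.0) -> float:
    """Weighted Levenshtein, normalized by the longer length (sketch)."""
    if not a and not b:
        return 0.0
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        curr = [i * indel]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + indel,                       # delete x
                            curr[j - 1] + indel,                   # insert y
                            prev[j - 1] + substitution_cost(x, y)))  # substitute
        prev = curr
    return prev[-1] / max(len(a), len(b))
```

Because every cost is at most 1, the raw distance never exceeds the longer length, which is what keeps the normalized value in [0,1].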

reference_key

reference_key(raw_word: str, strip_notation: bool = True) -> str

Bare comparison key for a reference word: drop hyphens (so syllables concatenate like the Linear A side) and lowercase. With strip_notation, also remove pure-notation marks (reconstruction *, PIE laryngeal subscripts ₁₂₃, the aspiration/labialization modifiers ʰ ʷ, and the combining syllabic ring U+0325). So PIE *ǵʰésr̥ → ǵésr.
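The normalization can be sketched as a hyphen drop, a lowercase, and a character filter. The name reference_key_sketch is hypothetical, and the mark set is limited to the marks listed above; the library may strip more.

```python
# Marks listed in the description: reconstruction asterisk, laryngeal
# subscripts 1-3, modifier h and w, and the combining ring below.
_NOTATION = {"*", "\u2081", "\u2082", "\u2083", "\u02b0", "\u02b7", "\u0325"}


def reference_key_sketch(raw_word: str, strip_notation: bool = True) -> str:
    """Bare comparison key for a reference word (sketch)."""
    key = raw_word.replace("-", "").lower()
    if strip_notation:
        key = "".join(ch for ch in key if ch not in _NOTATION)
    return key
```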

sequence_distance

sequence_distance(a: Sequence[object], b: Sequence[object]) -> int

Standard Levenshtein over arbitrary token sequences; compares whole inscriptions as ordered sequences of words.

sequence_similarity

sequence_similarity(a: Sequence[object], b: Sequence[object]) -> float

Sequence distance normalized to a 0–1 similarity (1 = identical).
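A sketch of both functions over token lists. The names are hypothetical, and normalizing by the longer sequence length is an assumption: the description only says the result lies in 0–1 with 1 meaning identical.

```python
from typing import Sequence


def sequence_distance_sketch(a: Sequence[object], b: Sequence[object]) -> int:
    """Unit-cost Levenshtein distance over token sequences (sketch)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


def sequence_similarity_sketch(a: Sequence[object], b: Sequence[object]) -> float:
    """1 - distance / longer length; two empty sequences count as identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - sequence_distance_sketch(a, b) / longest
```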

find_morphological_clusters

find_morphological_clusters(words: Iterable[Mapping[str, object] | tuple[str, int]], min_suffix_productivity: int = 5, min_cluster_size: int = 2, max_suffix_len: int = 2) -> list[MorphCluster]

Cluster stems with their productive-suffix derivations.

words is an iterable of {"word": str, "count": int} mappings or (word, count) pairs (e.g. straight from Corpus.word_frequencies). A suffix is productive when it ends at least min_suffix_productivity distinct words; clusters smaller than min_cluster_size are dropped; suffixes are considered up to max_suffix_len signs long.

compile_sign_pattern

compile_sign_pattern(raw: str) -> CompiledSignPattern | None

Parse a wildcard sign pattern (KU-*-RO) into a CompiledSignPattern, or None if empty.

match_sign_pattern

match_sign_pattern(signs: list[str], pattern: CompiledSignPattern) -> bool

Match a word's sign sequence against a compiled pattern.
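The wildcard semantics (* = exactly one sign, ** = zero or more signs) can be sketched with a small recursive matcher. The name match_signs_sketch is hypothetical, it takes a plain token list rather than a CompiledSignPattern, and it assumes the pattern must cover the whole word.

```python
def match_signs_sketch(signs: list[str], tokens: list[str]) -> bool:
    """Match a sign sequence against wildcard tokens (sketch)."""

    def match(i: int, j: int) -> bool:
        if j == len(tokens):
            return i == len(signs)       # pattern used up: word must be too
        token = tokens[j]
        if token == "**":
            # Let ** absorb zero or more signs, then match the rest.
            return any(match(k, j + 1) for k in range(i, len(signs) + 1))
        if i == len(signs):
            return False                 # pattern still needs a sign
        if token == "*" or token == signs[i]:
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


# Usage: split the hyphenated pattern and word into sign tokens first.
pattern = "KU-*-RO".split("-")
word = "KU-MA-RO".split("-")
is_match = match_signs_sketch(word, pattern)
```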

normalize_sign_label

normalize_sign_label(label: str) -> str

Fold subscript digits to ASCII (RA₂ → RA2).
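The folding amounts to a translation table from the Unicode subscript digits U+2080..U+2089 to ASCII. A minimal sketch (the name normalize_sign_label_sketch is hypothetical, and the library may normalize more than this):

```python
# Map each subscript digit code point to its ASCII digit.
_SUBSCRIPT_TO_ASCII = {0x2080 + d: str(d) for d in range(10)}


def normalize_sign_label_sketch(label: str) -> str:
    """Fold subscript digits in a sign label to ASCII (sketch)."""
    return label.translate(_SUBSCRIPT_TO_ASCII)
```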

word_matches_sign_pattern

word_matches_sign_pattern(word: str, raw: str) -> bool

Compile and match in one call. False for single-sign words / empty patterns.

build_cooccurrence_map

build_cooccurrence_map(documents: Iterable[Document]) -> dict[str, set[str]]

Map each multi-sign word to the set of multi-sign words it shares a document with.

build_word_index

build_word_index(documents: Iterable[Document]) -> dict[str, WordEntry]

Index every multi-sign word to the documents it appears in.

default_value

default_value(field: str) -> Any

The neutral default value for a field, by kind.

eval_query

eval_query(filters: list[FilterRow], output: Output, documents: list[Document], word_index: dict[str, WordEntry], annotated_ids: set[str], cooccur_map: dict[str, set[str]]) -> QueryResults

Run a query (filters + output mode) over pre-built indices and return the result set in canonical shape.

inscription_matches

inscription_matches(doc: Document, filters: Iterable[FilterRow], annotated_ids: set[str]) -> bool

True if a document satisfies the inscription-scope filter rows (AND/OR/NOT-combined).

run_query

run_query(corpus: Any, filters: list[FilterRow], output: Output = 'inscriptions', annotated_ids: set[str] | None = None) -> QueryResults

Build the indices from a Corpus and evaluate filters.

Convenience over eval_query for the common whole-corpus case. The result carries the corpus's provenance and a filter summary, so it is citable via QueryResults.cite.

summarize_filters

summarize_filters(filters: list[FilterRow]) -> str

One-line, human-readable label for a filter set.

word_matches

word_matches(word: str, filters: Iterable[FilterRow], cooccur_map: dict[str, set[str]]) -> bool

True if a word satisfies the word-scope filter rows (AND/OR/NOT-combined).

bootstrap_ci

bootstrap_ci(corpus: Any, statistic: Callable[[Sequence[Document]], float], *, n_resamples: int = 999, level: float = 0.95, seed: int = 0) -> BootstrapCI

Percentile bootstrap CI for statistic(documents).

Documents are the resampling unit (drawn with replacement, original size), the right grain for corpus questions where tokens within a document are not independent (Efron & Tibshirani 1993). The seed makes the interval reproducible by default — vary it to see Monte-Carlo wobble. The band quantifies sampling variability given these documents; it cannot speak to what was never excavated.

    >>> mean_doc_words = lambda docs: sum(
    ...     len([t for t in d.tokens if t.kind is TokenKind.WORD]) for d in docs
    ... ) / len(docs)
    >>> bootstrap_ci(corpus, mean_doc_words)  # doctest: +SKIP
    BootstrapCI(estimate=7.1, low=6.4, high=7.9, level=0.95, n_resamples=999)
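The resampling scheme itself can be sketched with the standard library. Here documents are plain lists of words, the name percentile_bootstrap_sketch is hypothetical, and the quantile indexing is one simple choice among several that the library may not share.

```python
import random
from typing import Callable, Sequence


def percentile_bootstrap_sketch(
    documents: Sequence[object],
    statistic: Callable[[Sequence[object]], float],
    *,
    n_resamples: int = 999,
    level: float = 0.95,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (estimate, low, high) from a percentile bootstrap (sketch)."""
    rng = random.Random(seed)  # fixed seed makes the band reproducible
    n = len(documents)
    # Resample whole documents with replacement, keeping the original size.
    values = sorted(
        statistic(rng.choices(documents, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    low = values[int(alpha * (n_resamples - 1))]
    high = values[int((1.0 - alpha) * (n_resamples - 1))]
    return statistic(documents), low, high


def mean_length(docs: Sequence[Sequence[str]]) -> float:
    """Example statistic: mean number of words per document."""
    return sum(len(d) for d in docs) / len(docs)
```

Calling the sketch twice with the same seed returns the same band; changing the seed shows the Monte-Carlo variation.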

dispersion

dispersion(corpus: Any, item: str, *, kind: str = 'words') -> Dispersion

Gries' DP for one item over the documents of corpus.

DP = ½ · Σᵢ |vᵢ − sᵢ| where sᵢ is document i's share of the corpus (in items of this kind) and vᵢ the share of the item's occurrences falling in document i (Gries 2008). dp_norm divides by the attainable maximum 1 − min(sᵢ) (Lijffijt & Gries 2012). Raises ValueError if the item never occurs.
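The formula can be checked on toy numbers. The sketch below takes per-document sizes and per-document item counts directly. The name gries_dp_sketch is hypothetical, and returning dp_norm = 0 for a single-document corpus is an assumption of this sketch.

```python
def gries_dp_sketch(doc_sizes: list[int], item_counts: list[int]) -> tuple[float, float]:
    """Return (dp, dp_norm) for one item over a set of documents (sketch)."""
    # Keep only documents that contain any items at all.
    pairs = [(n, c) for n, c in zip(doc_sizes, item_counts) if n > 0]
    corpus_size = sum(n for n, _ in pairs)
    frequency = sum(c for _, c in pairs)
    if frequency == 0:
        raise ValueError("item never occurs")
    shares = [n / corpus_size for n, _ in pairs]    # s_i: document's corpus share
    observed = [c / frequency for _, c in pairs]    # v_i: share of the item's hits
    dp = 0.5 * sum(abs(v - s) for v, s in zip(observed, shares))
    ceiling = 1.0 - min(shares)                     # attainable maximum of DP
    dp_norm = dp / ceiling if ceiling > 0 else 0.0
    return dp, dp_norm
```

An item spread exactly in proportion to document size gives (0, 0); an item confined to the smallest document reaches dp_norm = 1.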

dispersions

dispersions(corpus: Any, *, kind: str = 'words', min_frequency: int = 2, top: int = 0) -> list[Dispersion]

DP for every item with frequency ≥ min_frequency, most evenly dispersed first (ties: higher frequency first). top truncates (0 = all).

Reading the ranking: a frequent item with low dp_norm is corpus-wide vocabulary; a frequent item with high dp_norm lives in few documents (a formulaic or genre/site-bound term) — on Aegean material often the more interesting case.

keyness

keyness(target: Any, reference: Any, *, kind: str = 'words', min_target: int = 2, smoothing: float = 0.5) -> list[KeynessRow]

Key items of target against reference, strongest first.

For each item the 2×2 table is (count in target, rest of target, count in reference, rest of reference); G² follows Rayson & Garside (2000) and the log-ratio Hardie (2014). Items need target_count ≥ min_target or to be similarly frequent in the reference (so marked under-use surfaces too). Sorted by G² descending — filter log_ratio > 0 for the target's own vocabulary, < 0 for what it conspicuously lacks.

The two corpora must be distinct texts (a subset vs its complement is the classic design: keyness(c.filter(site="Pylos"), rest)).
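The effect-size half of a keyness row can be sketched on its own. The name log_ratio_sketch is hypothetical, and replacing a zero count with the smoothing value is an assumed reading of "zero counts are smoothed (default +0.5)".

```python
import math


def log_ratio_sketch(target_count: int, target_total: int,
                     reference_count: int, reference_total: int,
                     smoothing: float = 0.5) -> float:
    """Log2 ratio of relative frequencies, target over reference (sketch)."""
    # Only zero counts are smoothed, so non-zero counts stay exact.
    t = target_count if target_count > 0 else smoothing
    r = reference_count if reference_count > 0 else smoothing
    return math.log2((t / target_total) / (r / reference_total))
```

An item twice as frequent (relatively) in the target scores about +1, and one half as frequent scores about −1.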

classify_corpus

classify_corpus(corpus: object) -> dict[str, list[str]]

Classify every document in a corpus, returning {category_key: [doc_id, ...]} with every category present (empty lists included) and documents in corpus order.

classify_structure

classify_structure(document: Document) -> str

The heuristic category key for one inscription, from its content shape.

Mirrors the workbench precedence exactly: a KU-RO total marker (or numerals with several multi-sign words) ⇒ accounting; otherwise a libation formula ⇒ libation; otherwise many separators and no numerals ⇒ list; otherwise an extended hyphenated text with no numerals ⇒ text; else other.