aegean.analysis¶

analysis ¶

Script-agnostic + Aegean-specific analysis over the core model.

Ported faithfully from the Linear A Research Workbench (src/lib/*.ts) and checked against shared golden fixtures so the Python port can't silently diverge. Methods over the undeciphered Linear A material are exploratory — see each function's docstring.

BalanceCheck `dataclass` ¶

BalanceCheck(stated_total: float, computed_sum: float, item_count: int, difference: float, balances: bool, marker: str, total_line_index: int)

One total line reconciled against the item lines feeding it: the stated total, the computed sum, their signed difference (computed − stated), whether they balance, the total marker (e.g. KU-RO), and the index of the total line.

AlignCell `dataclass` ¶

AlignCell(a: str, b: str, op: AlignOp)

One aligned position: a char from a (or "" for an insertion), a char from b (or "" for a deletion), and the operation that relates them.

PhoneticComparison `dataclass` ¶

PhoneticComparison(word_a: str, word_b: str, script_a: str, script_b: str, phonemes_a: str, phonemes_b: str, distance: float, similarity: float, alignment: tuple[AlignCell, ...])

One cross-script comparison: the two words, their script ids, their romanized phoneme strings, the normalized distance (0 = identical, 1 = wholly different), its similarity complement, and the per-segment alignment.

PhoneticClasses `dataclass` ¶

PhoneticClasses(vowels: str, consonant_classes: tuple[tuple[str, ...], ...])

Concrete vowel set + consonant-class tables for the distance metric.

PhoneticScheme `dataclass` ¶

PhoneticScheme(interdentals: str = 'dental', pharyngeal_h: str = 'velar', voiced_postalveolar: str = 'sibilant', strip_notation: bool = True)

The four typologically ambiguous decisions, exposed for tuning.

PhoneticWeights `dataclass` ¶

PhoneticWeights(vowel: float = 0.3, same_class: float = 0.5, far: float = 1.0, indel: float = 1.0)

Tunable substitution / indel costs, kept in [0,1] so the normalized distance stays in [0,1].

ClusterMember `dataclass` ¶

ClusterMember(word: str, count: int, suffix: str)

A word in a cluster, with the signs it appends beyond the cluster stem ("" for the stem itself; "≠" flags a member that doesn't actually extend the stem).

MorphCluster `dataclass` ¶

MorphCluster(stem: str, members: tuple[ClusterMember, ...], total_count: int, suffixes: tuple[str, ...])

A stem and its productive-suffix derivations.

CompiledSignPattern `dataclass` ¶

CompiledSignPattern(tokens: tuple[str, ...], has_double_star: bool)

A parsed sign-pattern query: the normalized sign tokens (where * matches exactly one sign and ** matches zero or more signs) and whether the pattern contains a **.

FieldDef `dataclass` ¶

FieldDef(label: str, scope: Scope, kind: FieldKind)

A queryable field: its display label, scope (inscription/word), and value kind.

FilterRow `dataclass` ¶

FilterRow(field: str, value: Any, connector: Connector | None = None, negate: bool = False)

One query row. connector joins this row to the running result within its scope (ignored on the first row); negate flips the row's own test.

QueryResults `dataclass` ¶

QueryResults(inscriptions: list[Document], words: list[tuple[str, int]], provenance: Provenance | None = None, description: str = '')

A query's result set: the matching inscriptions and/or (word, count) pairs.

Corpus.query attaches the corpus's provenance and a description of the filters, so cite can cite the exact result set used in a paper.

cite ¶

cite(style: str = 'plain') -> str

Cite this exact result set: the source plus the query that produced it.

style: "plain" (one line), "bibtex" (a @misc entry), or "apa". Raises ValueError when the results carry no provenance (for example, results obtained from eval_query directly rather than through Corpus.query).

WordEntry `dataclass` ¶

WordEntry(count: int, inscription_ids: tuple[str, ...], sites: frozenset[str] = frozenset())

Per-word index entry: the documents a (multi-sign) word appears in.

BootstrapCI `dataclass` ¶

BootstrapCI(estimate: float, low: float, high: float, level: float, n_resamples: int)

A percentile bootstrap interval: the statistic on the full corpus (estimate) and the [low, high] band holding level of the resampled values.

Dispersion `dataclass` ¶

Dispersion(item: str, frequency: int, range: int, parts: int, dp: float, dp_norm: float)

How evenly one item spreads over the documents of a corpus.

dp is Gries' deviation of proportions: 0 = the item is distributed exactly as the document sizes predict; values toward 1 = concentrated in few documents. dp_norm rescales by the attainable maximum (Lijffijt & Gries 2012) so corpora with different size profiles compare. range is the count of documents attesting the item (out of parts documents that have any items at all).

KeynessRow `dataclass` ¶

KeynessRow(item: str, target_count: int, target_total: int, reference_count: int, reference_total: int, log_likelihood: float, log_ratio: float, p_value: float)

One item's keyness in a target (sub)corpus against a reference.

log_likelihood is Dunning's G² (significance: is the imbalance more than chance?); p_value its χ²₁ tail. log_ratio is Hardie's log₂ ratio of relative frequencies (effect size: how big is the difference?) — positive = overused in the target, negative = underused; each whole point is a doubling. Zero counts are smoothed (default +0.5) for the ratio only, never for G².

StructureCategory `dataclass` ¶

StructureCategory(key: str, label: str, description: str)

A heuristic tablet-structure category (e.g. accounting/libation/list): key, label, description.

account_lines ¶

account_lines(document: Document) -> list[list[str]]

The document's physical lines as token-text lists.

balance_check ¶

balance_check(document: Document) -> list[BalanceCheck]

Verify every total line on a document against its summed item lines.

Uses the script's total markers (Linear A's KU-RO, Linear B's TO-SO/TO-SA).
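As a rough illustration of the reconciliation, the sketch below sums the numerals on item lines and compares them with the numeral on each total line. The line layout (token lists with plain integer numerals), the `TOTAL_MARKERS` set, and the `reconcile` helper are assumptions made for this example and are not the library's internals.

```python
# Hypothetical stand-in for the script's total markers (names taken from the text above).
TOTAL_MARKERS = {"KU-RO", "TO-SO", "TO-SA"}


def reconcile(lines: list[list[str]]) -> list[dict]:
    """Compare each total line with the item lines since the previous total."""
    results = []
    running, item_count = 0.0, 0
    for index, tokens in enumerate(lines):
        numbers = [float(t) for t in tokens if t.isdigit()]
        marker = next((t for t in tokens if t in TOTAL_MARKERS), None)
        if marker is not None and numbers:
            stated = numbers[-1]
            results.append({
                "marker": marker,
                "stated_total": stated,
                "computed_sum": running,
                "item_count": item_count,
                "difference": running - stated,  # computed minus stated
                "balances": running == stated,
                "total_line_index": index,
            })
            # Assumption: a total line closes its section and the sum restarts.
            running, item_count = 0.0, 0
        elif numbers:
            running += sum(numbers)
            item_count += 1
    return results


# Toy tablet invented for the example: two item lines and one total line.
tablet = [["A-DU", "5"], ["KI-RO", "3"], ["KU-RO", "8"]]
checks = reconcile(tablet)
```

A negative `difference` in this sketch means the items add up to less than the stated total.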

add_sequence ¶

add_sequence(aln: list[AlnPos], seq: list[str], prior_n: int) -> list[AlnPos]

Add one word sequence to a growing alignment via Needleman–Wunsch at the word level (exact-token match rewarded, substitution columns allowed, gaps penalized). prior_n is how many sequences are already in the alignment.

align_phonetic ¶

align_phonetic(a: str, b: str, w: PhoneticWeights = DEFAULT_WEIGHTS, cl: PhoneticClasses = DEFAULT_PHONETIC_CLASSES) -> list[AlignCell]

Run the weighted Levenshtein, then backtrace to emit a per-position alignment classifying each substitution as vowel / same-class / far.

align_sequences ¶

align_sequences(seqs: list[list[str]]) -> list[AlnPos]

Progressive multiple alignment of word sequences (e.g. several inscriptions). Returns aligned positions, one column per input sequence.

chi_squared_2x2 ¶

chi_squared_2x2(joint: int, count_a: int, count_b: int, total: int) -> float

Yates-corrected chi-squared test statistic for the 2×2 table.

The continuity correction subtracts N/2 from |ad − bc| and clamps the corrected deviation at 0 (so near-independent pairs score ~0, not a small spurious positive). Returns 0 for degenerate tables.
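The correction described above can be sketched as a standalone function. The cell layout derived from (joint, count_a, count_b, total) is an assumption about how the four arguments map onto the table, and `yates_chi_squared` is a hypothetical name, not the library function.

```python
def yates_chi_squared(joint: int, count_a: int, count_b: int, total: int) -> float:
    """Yates-corrected chi-squared statistic for a 2x2 co-occurrence table."""
    # Assumed layout: a = A and B, b = A only, c = B only, d = neither.
    a = joint
    b = count_a - joint
    c = count_b - joint
    d = total - count_a - count_b + joint
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if min(a, b, c, d) < 0 or denom == 0:
        return 0.0  # degenerate table: an empty row or column
    # Subtract N/2 from |ad - bc| and clamp at zero before squaring.
    deviation = max(0.0, abs(a * d - b * c) - total / 2)
    return total * deviation ** 2 / denom
```

With the clamp, a pair that co-occurs exactly as often as independence predicts scores 0 rather than a small positive value produced by squaring a negative corrected deviation.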

chi_squared_p_value ¶

chi_squared_p_value(x: float) -> float

p-value for chi-squared with 1 degree of freedom: P(X² ≥ x). In [0,1], and non-increasing in x.
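For one degree of freedom the upper tail has a closed form through the complementary error function, P(X² ≥ x) = erfc(√(x/2)). The sketch below uses that identity; `chi2_tail_1df` is a hypothetical name, and whether the library computes the tail this way is not stated in the source.

```python
import math


def chi2_tail_1df(x: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    if x <= 0:
        return 1.0  # the whole distribution lies at or above zero
    return math.erfc(math.sqrt(x / 2.0))
```

The familiar critical value 3.84 maps to a tail of about 0.05 under this formula.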

fishers_exact ¶

fishers_exact(joint: int, count_a: int, count_b: int, total: int) -> float

Fisher's exact test, two-sided, for the 2×2 table: the summed hypergeometric probability of all tables with the same marginals whose probability is ≤ the observed table's. More accurate than χ² for small expected counts but O(N) per pair. Returns 1 for a degenerate margin.
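A minimal sketch of the two-sided rule described above, written with `math.comb`. The name `fisher_two_sided`, the small relative tolerance used when comparing probabilities, and the error raised for an impossible joint count are assumptions of this example.

```python
from math import comb


def fisher_two_sided(joint: int, count_a: int, count_b: int, total: int) -> float:
    """Two-sided Fisher exact p-value for a 2x2 table with fixed margins."""
    if total <= 0 or count_a in (0, total) or count_b in (0, total):
        return 1.0  # degenerate margin: only one table is possible
    low = max(0, count_a + count_b - total)
    high = min(count_a, count_b)
    if not low <= joint <= high:
        raise ValueError("joint count is impossible for these margins")
    ways = comb(total, count_b)

    def prob(k: int) -> float:
        # Hypergeometric probability of seeing k joint occurrences.
        return comb(count_a, k) * comb(total - count_a, count_b - k) / ways

    p_observed = prob(joint)
    # Sum every table at least as unlikely as the observed one.
    # The tolerance guards against float noise between equal probabilities.
    p_value = sum(p for p in map(prob, range(low, high + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)
```

The loop over every admissible joint count is where the O(N) cost per pair comes from.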

log_likelihood_ratio_2x2 ¶

log_likelihood_ratio_2x2(joint: int, count_a: int, count_b: int, total: int) -> float

Log-likelihood ratio (G²) for the 2×2 table — Dunning (1993), the corpus-linguistics standard. G² = 2 · Σ O·ln(O/E) over the four cells; more robust than χ² for the sparse, low-count pairs of a small corpus. Returns 0 for degenerate tables; larger = stronger association.
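The formula G² = 2 · Σ O·ln(O/E) can be written out directly. In the sketch below the mapping of the four arguments onto table cells is an assumption, empty cells are treated as contributing zero, and `g_squared_2x2` is a hypothetical name.

```python
import math


def g_squared_2x2(joint: int, count_a: int, count_b: int, total: int) -> float:
    """Log-likelihood ratio G^2 over the four cells of a 2x2 table."""
    # Assumed layout: A and B, A only, B only, neither.
    observed = [joint, count_a - joint, count_b - joint,
                total - count_a - count_b + joint]
    row_totals = [count_a, count_a, total - count_a, total - count_a]
    col_totals = [count_b, total - count_b, count_b, total - count_b]
    if total <= 0 or min(observed) < 0:
        return 0.0  # degenerate table
    g = 0.0
    for o, row, col in zip(observed, row_totals, col_totals):
        expected = row * col / total
        if o > 0 and expected > 0:  # a zero cell contributes nothing
            g += o * math.log(o / expected)
    return max(0.0, 2.0 * g)
```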

pmi_interval ¶

pmi_interval(joint: int, count_a: int, count_b: int, total: int) -> tuple[float, float]

Propagate a Wilson interval on the joint probability into a confidence interval for pointwise mutual information (log₂ space), holding the marginals fixed. When the lower bound on the joint probability is zero, the lower PMI bound is clamped to a finite floor (−20).

wilson_interval ¶

wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]

Wilson score interval for a binomial proportion p̂ = k/n. Stays inside [0,1] with good coverage even at small/extreme p̂. z = 1.96 ≈ 95%.
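The standard Wilson formula is short enough to show in full. This is a sketch of the textbook interval; the function name `wilson` and the (0, 1) return for n = 0 are assumptions of the example.

```python
import math


def wilson(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n."""
    if n <= 0:
        return (0.0, 1.0)  # assumption: no trials means no information
    p_hat = k / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p_hat + z2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))
```

For k = 0 out of n = 10 the upper bound is roughly 0.28, which shows why the interval stays informative where a naive p̂ ± z·SE interval collapses to zero width.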

nearest ¶

nearest(word: str, script: str, candidates: Iterable[str], candidate_script: str, *, top: int = 5, weights: PhoneticWeights = DEFAULT_WEIGHTS, classes: PhoneticClasses = DEFAULT_PHONETIC_CLASSES, fold_aspiration: bool = False) -> list[tuple[str, float]]

Rank candidates (in candidate_script) by phonetic distance to word (in script), nearest first; returns (candidate, distance) pairs for the top closest candidates (top=0 returns all).

The intended use is decipherment-adjacent triage — e.g. which alphabetic Greek words sound closest to a Linear B form — where the ordering is the result and the absolute distances are secondary (see the module caution). Candidates that cannot be romanized are skipped.

phonetic_compare ¶

phonetic_compare(word_a: str, script_a: str, word_b: str, script_b: str, *, weights: PhoneticWeights = DEFAULT_WEIGHTS, classes: PhoneticClasses = DEFAULT_PHONETIC_CLASSES, fold_aspiration: bool = False, overrides_a: dict[str, str] | None = None, overrides_b: dict[str, str] | None = None) -> PhoneticComparison

Compare two words across scripts by sound: romanize each, then run the weighted phonetic distance and the per-segment alignment.

The classic bridge is phonetic_compare("po-me", "linearb", "ποιμήν", "greek") — Linear B po-me against Greek poimēn 'shepherd'. Tune the metric with weights/classes (see distance) and meet defective syllabic spelling with fold_aspiration.

romanize_greek ¶

romanize_greek(text: str, *, fold_aspiration: bool = False) -> str

Romanize alphabetic Greek to the Latin phoneme alphabet.

Strips accents, breathings, iota subscript, and diaeresis (NFD, then drop combining marks), lowercases, and maps each letter: θ→th, φ→ph, χ→kh, ξ→ks, ψ→ps, η→ē, ω→ō, and γ→n before a velar (γγ/γκ/γχ/γξ). Rough breathing (the /h/) is dropped with the other diacritics — the syllabaries don't write it either. fold_aspiration further maps θ/φ/χ → t/p/k for a fairer match against aspiration-blind syllabic spelling. Non-Greek letters pass through.
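The steps above (decompose, drop combining marks, lowercase, map letters) can be sketched with the standard `unicodedata` module. The table `LETTERS_ROMAN`, the helper name `romanize`, and the handling of final sigma are assumptions of this example; the library's own table may differ in detail.

```python
import unicodedata

# Illustrative letter table built from the mappings listed above.
LETTERS_ROMAN = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z", "η": "ē",
    "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m", "ν": "n", "ξ": "ks",
    "ο": "o", "π": "p", "ρ": "r", "σ": "s", "ς": "s", "τ": "t", "υ": "u",
    "φ": "ph", "χ": "kh", "ψ": "ps", "ω": "ō",
}
FOLDED = {"θ": "t", "φ": "p", "χ": "k"}
VELARS = {"γ", "κ", "χ", "ξ"}


def romanize(text: str, fold_aspiration: bool = False) -> str:
    """Romanize alphabetic Greek after stripping all diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    # Accents, breathings, iota subscript and diaeresis are combining marks.
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()
    out = []
    for i, ch in enumerate(bare):
        following = bare[i + 1] if i + 1 < len(bare) else ""
        if ch == "γ" and following in VELARS:
            out.append("n")  # gamma before a velar is a nasal
        elif fold_aspiration and ch in FOLDED:
            out.append(FOLDED[ch])
        else:
            out.append(LETTERS_ROMAN.get(ch, ch))  # non-Greek passes through
    return "".join(out)
```

Dropping every combining mark is what removes the rough breathing along with the accents, so no separate rule for /h/ is needed in this sketch.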

to_phonemes ¶

to_phonemes(word: str, script: str, *, fold_aspiration: bool = False, overrides: dict[str, str] | None = None) -> str

Reduce word (in script) to a Latin phoneme string.

greek romanizes alphabetic text; lineara / linearb / cypriot map a hyphenated transliteration through their sign→sound tables (overrides substitutes alternative sign values for testing). Raises ValueError for an unsupported script (e.g. undeciphered Cypro-Minoan).

build_phonetic_classes ¶

build_phonetic_classes(scheme: PhoneticScheme = DEFAULT_PHONETIC_SCHEME) -> PhoneticClasses

Assemble concrete class tables from a scheme.

describe_phonetic_scheme ¶

describe_phonetic_scheme(s: PhoneticScheme) -> str

One-line scheme description, for stamping into saved findings/reports so a match ranking stays reproducible.

extract_root ¶

extract_root(word: str, overrides: dict[str, str] | None = None) -> str

The consonant skeleton of a word's phonetic form (vowels stripped), e.g. KU-RO → kr. Exploratory root-cognate heuristic.

is_numeral_token ¶

is_numeral_token(w: str) -> bool

True for digit / superscript / subscript / approx numeral tokens.

phonetic_distance ¶

phonetic_distance(a: str, b: str, w: PhoneticWeights = DEFAULT_WEIGHTS, cl: PhoneticClasses = DEFAULT_PHONETIC_CLASSES) -> float

Weighted Levenshtein over phonetic strings, normalized to [0,1] by the longer length. With the default weights, vowel↔vowel swaps cost 0.3, same-class consonant swaps 0.5, and every other substitution or indel 1 (see PhoneticWeights).
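A minimal sketch of such a metric, assuming a toy vowel set and toy consonant classes. `TOY_VOWELS`, `TOY_CLASSES`, `substitution_cost`, and `weighted_distance` are hypothetical names, and the class table is illustrative rather than the one built by build_phonetic_classes.

```python
TOY_VOWELS = set("aeiou")
# Illustrative consonant classes only: labials, dentals, velars, sibilants, liquids.
TOY_CLASSES = [set("pbm"), set("tdn"), set("kg"), set("sz"), set("lr")]


def substitution_cost(x: str, y: str, vowel: float = 0.3,
                      same_class: float = 0.5, far: float = 1.0) -> float:
    if x == y:
        return 0.0
    if x in TOY_VOWELS and y in TOY_VOWELS:
        return vowel
    if any(x in cls and y in cls for cls in TOY_CLASSES):
        return same_class
    return far


def weighted_distance(a: str, b: str, indel: float = 1.0) -> float:
    """Weighted Levenshtein, divided by the longer length so it stays in [0, 1]."""
    if not a and not b:
        return 0.0
    previous = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        current = [i * indel]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + indel,                        # delete x
                current[j - 1] + indel,                     # insert y
                previous[j - 1] + substitution_cost(x, y),  # substitute or match
            ))
        previous = current
    return previous[-1] / max(len(a), len(b))
```

Because every cost is at most 1 and the total is divided by the longer length, the result cannot exceed 1.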

reference_key ¶

reference_key(raw_word: str, strip_notation: bool = True) -> str

Bare comparison key for a reference word: drop hyphens (so syllables concatenate like the Linear A side) and lowercase. With strip_notation, also remove pure-notation marks (reconstruction *, PIE laryngeal subscripts ₁₂₃, the aspiration/labialization modifiers ʰ ʷ, and the combining syllabic ring U+0325). For example, PIE *ǵʰésr̥ → ǵésr.

sequence_distance ¶

sequence_distance(a: Sequence[object], b: Sequence[object]) -> int

Standard Levenshtein distance over arbitrary token sequences; compares whole inscriptions as ordered sequences of words.

sequence_similarity ¶

sequence_similarity(a: Sequence[object], b: Sequence[object]) -> float

Sequence distance normalized to a 0–1 similarity (1 = identical).
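Both functions can be sketched in a few lines. Dividing by the longer sequence length is an assumption about how the similarity is normalized, and `token_distance` and `token_similarity` are hypothetical names.

```python
from typing import Sequence


def token_distance(a: Sequence[object], b: Sequence[object]) -> int:
    """Unit-cost Levenshtein distance where each token is one symbol."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def token_similarity(a: Sequence[object], b: Sequence[object]) -> float:
    """1 for identical sequences, 0 when nothing lines up."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty sequences count as identical
    return 1.0 - token_distance(a, b) / longest
```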

find_morphological_clusters ¶

find_morphological_clusters(words: Iterable[Mapping[str, object] | tuple[str, int]], min_suffix_productivity: int = 5, min_cluster_size: int = 2, max_suffix_len: int = 2) -> list[MorphCluster]

Cluster stems with their productive-suffix derivations.

words is an iterable of {"word": str, "count": int} mappings or (word, count) pairs (e.g. straight from Corpus.word_frequencies). A suffix is productive when it ends at least min_suffix_productivity distinct words; clusters smaller than min_cluster_size are dropped; suffixes are considered up to max_suffix_len signs long.

compile_sign_pattern ¶

compile_sign_pattern(raw: str) -> CompiledSignPattern | None

Parse a wildcard sign pattern (KU-*-RO) into a CompiledSignPattern, or None if empty.

match_sign_pattern ¶

match_sign_pattern(signs: list[str], pattern: CompiledSignPattern) -> bool

Match a word's sign sequence against a compiled pattern.
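The wildcard semantics (* for exactly one sign, ** for zero or more) can be illustrated with a small recursive matcher. Splitting on hyphens, upper-casing tokens, and anchoring the match to the whole word are assumptions of this sketch; `parse_pattern` and `matches` are hypothetical names.

```python
def parse_pattern(raw: str) -> tuple[str, ...]:
    """Split a hyphenated pattern such as KU-*-RO into tokens."""
    return tuple(t.strip().upper() for t in raw.split("-") if t.strip())


def matches(signs: list[str], tokens: tuple[str, ...]) -> bool:
    """True if the whole sign sequence fits the pattern."""

    def go(i: int, j: int) -> bool:
        if j == len(tokens):
            return i == len(signs)  # pattern used up: signs must be too
        if tokens[j] == "**":
            # Let ** absorb zero, one, two, ... of the remaining signs.
            return any(go(k, j + 1) for k in range(i, len(signs) + 1))
        if i < len(signs) and tokens[j] in ("*", signs[i]):
            return go(i + 1, j + 1)
        return False

    return go(0, 0)
```

Plain recursion is adequate for words of a handful of signs; a pattern with many ** tokens would benefit from memoizing `go`.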

normalize_sign_label ¶

normalize_sign_label(label: str) -> str

Fold subscript digits to ASCII (RA₂ → RA2).

word_matches_sign_pattern ¶

word_matches_sign_pattern(word: str, raw: str) -> bool

Compile and match in one call. False for single-sign words / empty patterns.

build_cooccurrence_map ¶

build_cooccurrence_map(documents: Iterable[Document]) -> dict[str, set[str]]

Map each multi-sign word to the set of multi-sign words it shares a document with.

build_word_index ¶

build_word_index(documents: Iterable[Document]) -> dict[str, WordEntry]

Index every multi-sign word to the documents it appears in.

default_value ¶

default_value(field: str) -> Any

The neutral default value for a field, by kind.

eval_query ¶

eval_query(filters: list[FilterRow], output: Output, documents: list[Document], word_index: dict[str, WordEntry], annotated_ids: set[str], cooccur_map: dict[str, set[str]]) -> QueryResults

Run a query (filters + output mode) over pre-built indices and return the result set in canonical shape.

inscription_matches ¶

inscription_matches(doc: Document, filters: Iterable[FilterRow], annotated_ids: set[str]) -> bool

True if a document satisfies the inscription-scope filter rows (AND/OR/NOT-combined).

run_query ¶

run_query(corpus: Any, filters: list[FilterRow], output: Output = 'inscriptions', annotated_ids: set[str] | None = None) -> QueryResults

Build the indices from a Corpus and evaluate filters.

Convenience over eval_query for the common whole-corpus case. The result carries the corpus's provenance and a filter summary, so it is citable via QueryResults.cite.

summarize_filters ¶

summarize_filters(filters: list[FilterRow]) -> str

One-line, human-readable label for a filter set.

word_matches ¶

word_matches(word: str, filters: Iterable[FilterRow], cooccur_map: dict[str, set[str]]) -> bool

True if a word satisfies the word-scope filter rows (AND/OR/NOT-combined).

bootstrap_ci ¶

bootstrap_ci(corpus: Any, statistic: Callable[[Sequence[Document]], float], *, n_resamples: int = 999, level: float = 0.95, seed: int = 0) -> BootstrapCI

Percentile bootstrap CI for statistic(documents).

Documents are the resampling unit (drawn with replacement, original size), the right grain for corpus questions where tokens within a document are not independent (Efron & Tibshirani 1993). The seed makes the interval reproducible by default — vary it to see Monte-Carlo wobble. The band quantifies sampling variability given these documents; it cannot speak to what was never excavated.

    >>> mean_doc_words = lambda docs: sum(
    ...     len([t for t in d.tokens if t.kind is TokenKind.WORD]) for d in docs
    ... ) / len(docs)
    >>> bootstrap_ci(corpus, mean_doc_words)  # doctest: +SKIP
    BootstrapCI(estimate=7.1, low=6.4, high=7.9, level=0.95, n_resamples=999)
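The resampling loop behind a percentile interval can be sketched without the library. Here documents are plain lists of words, and the nearest-rank quantile rule is an assumption; `percentile_bootstrap` is a hypothetical name and the toy corpus is invented for the example.

```python
import random
from typing import Callable, Sequence


def percentile_bootstrap(docs: Sequence, statistic: Callable[[Sequence], float],
                         n_resamples: int = 999, level: float = 0.95,
                         seed: int = 0) -> tuple[float, float, float]:
    """Return (estimate, low, high), resampling whole documents with replacement."""
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(docs)
    values = sorted(
        statistic([docs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    # Nearest-rank quantiles of the sorted resampled statistics (an assumption).
    low = values[round(tail * (n_resamples - 1))]
    high = values[round((1.0 - tail) * (n_resamples - 1))]
    return statistic(docs), low, high


# Toy corpus: eight documents of different lengths.
toy_docs = [["w"] * k for k in (3, 5, 7, 9, 4, 6, 8, 2)]
mean_words = lambda docs: sum(len(d) for d in docs) / len(docs)
estimate, low, high = percentile_bootstrap(toy_docs, mean_words)
```

Resampling documents rather than tokens keeps each document's internal dependence intact in every replicate.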

dispersion ¶

dispersion(corpus: Any, item: str, *, kind: str = 'words') -> Dispersion

Gries' DP for one item over the documents of corpus.

DP = ½ · Σᵢ |vᵢ − sᵢ| where sᵢ is document i's share of the corpus (in items of this kind) and vᵢ the share of the item's occurrences falling in document i (Gries 2008). dp_norm divides by the attainable maximum 1 − min(sᵢ) (Lijffijt & Gries 2012). Raises ValueError if the item never occurs.
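A direct transcription of the two formulas, taking per-document sizes and per-document counts of the item as parallel lists. The function name and the zero returned for a single-document corpus are assumptions of this sketch.

```python
def deviation_of_proportions(doc_sizes: list[int],
                             item_counts: list[int]) -> tuple[float, float]:
    """Return (dp, dp_norm) for one item across the documents of a corpus."""
    total_size = sum(doc_sizes)
    total_item = sum(item_counts)
    if total_item == 0:
        raise ValueError("item never occurs")
    shares = [n / total_size for n in doc_sizes]        # s_i: expected share
    observed = [c / total_item for c in item_counts]    # v_i: observed share
    dp = 0.5 * sum(abs(v - s) for v, s in zip(observed, shares))
    ceiling = 1.0 - min(shares)  # attainable maximum of dp
    # Assumption: with a single document the ceiling is 0, so report 0.
    dp_norm = dp / ceiling if ceiling > 0 else 0.0
    return dp, dp_norm
```

With four equal-sized documents, an item spread evenly scores 0, and an item confined to one document scores dp = 0.75 and dp_norm = 1.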

dispersions ¶

dispersions(corpus: Any, *, kind: str = 'words', min_frequency: int = 2, top: int = 0) -> list[Dispersion]

DP for every item with frequency ≥ min_frequency, most evenly dispersed first (ties: higher frequency first). top truncates (0 = all).

Reading the ranking: a frequent item with low dp_norm is corpus-wide vocabulary; a frequent item with high dp_norm lives in few documents (a formulaic or genre/site-bound term) — on Aegean material often the more interesting case.

keyness ¶

keyness(target: Any, reference: Any, *, kind: str = 'words', min_target: int = 2, smoothing: float = 0.5) -> list[KeynessRow]

Key items of target against reference, strongest first.

For each item the 2×2 table is (count in target, rest of target, count in reference, rest of reference); G² follows Rayson & Garside (2000) and the log-ratio Hardie (2014). Items need target_count ≥ min_target or to be similarly frequent in the reference (so marked under-use surfaces too). Sorted by G² descending — filter log_ratio > 0 for the target's own vocabulary, < 0 for what it conspicuously lacks.

The two corpora must be distinct texts (a subset vs its complement is the classic design: keyness(c.filter(site="Pylos"), rest)).
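The effect-size half of a keyness row is simple arithmetic: the log₂ of the ratio of relative frequencies. The sketch below replaces only zero counts with the smoothing constant, which is one reading of the smoothing rule and is flagged as an assumption; `log_ratio` is a hypothetical name and the counts in the usage lines are made up.

```python
import math


def log_ratio(target_count: int, target_total: int,
              reference_count: int, reference_total: int,
              smoothing: float = 0.5) -> float:
    """log2 of (relative frequency in target / relative frequency in reference)."""
    # Assumption: only a zero count is replaced by the smoothing constant.
    t = target_count if target_count > 0 else smoothing
    r = reference_count if reference_count > 0 else smoothing
    return math.log2((t / target_total) / (r / reference_total))


# Made-up counts: twice as frequent in the target gives +1.
doubled = log_ratio(20, 1000, 10, 1000)
# A quarter as frequent in the target gives -2.
quartered = log_ratio(5, 1000, 20, 1000)
# Absent from the reference: the zero is smoothed so the ratio stays finite.
absent = log_ratio(4, 1000, 0, 1000)
```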

classify_corpus ¶

classify_corpus(corpus: object) -> dict[str, list[str]]

Classify every document in a corpus, returning {category_key: [doc_id, ...]} with every category present (empty lists included) and documents in corpus order.

classify_structure ¶

classify_structure(document: Document) -> str

The heuristic category key for one inscription, from its content shape.

Mirrors the workbench precedence exactly: a KU-RO total marker (or numerals with several multi-sign words) ⇒ accounting; otherwise a libation formula ⇒ libation; otherwise many separators and no numerals ⇒ list; otherwise an extended hyphenated text with no numerals ⇒ text; else other.