aegean.data

Bundled-data access plus a download-to-cache layer.

Compact text data ships in the wheel (read via importlib.resources). Large or license-restricted assets — notably the Linear A facsimile mirror (~116 MB) — are NOT bundled; they are fetched on demand from upstream into a user cache. This is how the package stays small regardless of how large the source corpora are.

Downloads are sha256-verified (when a checksum is pinned), atomic (written to a .part file then renamed), and idempotent (a present, valid cache file is a no-op). A dataset's URL can be overridden without a code change via PYAEGEAN_<NAME>_URL (e.g. PYAEGEAN_LINEARA_IMAGES_URL), so a researcher can point at their own mirror before an official release is pinned.
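The override rule can be illustrated with a minimal sketch. It is not the package's code: `resolve_dataset_url` is a hypothetical helper, and it assumes the environment variable name is built by upper-casing the dataset name (so a dataset called `lineara_images` maps to `PYAEGEAN_LINEARA_IMAGES_URL`). The example URLs are placeholders.

```python
import os


def resolve_dataset_url(name: str, pinned_url: str = "") -> str:
    """Hypothetical helper: prefer PYAEGEAN_<NAME>_URL over the pinned URL.

    Assumption: <NAME> is the dataset name upper-cased.
    Returns "" when neither an override nor a pinned URL exists.
    """
    env_var = f"PYAEGEAN_{name.upper()}_URL"
    # An unset or empty variable falls through to the pinned URL.
    return os.environ.get(env_var) or pinned_url


# Point the (assumed) "lineara_images" dataset at a private mirror.
os.environ["PYAEGEAN_LINEARA_IMAGES_URL"] = "https://mirror.example/lineara.tar"
print(resolve_dataset_url("lineara_images", "https://pinned.example/lineara.tar"))
```

In practice the variable would usually be exported in the shell before starting Python, so no code change is needed.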

DataNotAvailableError

Bases: RuntimeError

Raised when a non-bundled dataset has not been fetched (or can't be).

load_bundled_json

load_bundled_json(*parts: str) -> Any

Load a JSON file shipped inside the wheel, e.g. load_bundled_json("lineara", "signs.json").

cache_dir

cache_dir() -> Path

Where fetched datasets are cached (override with PYAEGEAN_CACHE).

bundled_data_version

bundled_data_version() -> str

The version of the bundled datasets.

Bundled data ships inside the wheel and is immutable for a given release, so its version is the package version; versions() gives per-file sha256s.

versions

versions() -> dict[str, Any]

A reproducibility manifest of every dataset pyaegean can touch.

Returns {"package": …, "bundled": {…}, "fetched": {…}}: each bundled JSON file with its sha256 + size (hashed from the installed wheel contents), and each registered fetchable asset with its pinned URL/sha256, license, and whether it is present in the local cache.

Pinning for papers: record aegean.__version__ and this manifest (e.g. json.dump(aegean.data.versions(), f)) alongside your results; anyone with the same package version and matching sha256s is analyzing byte-identical data. Fetched assets are sha256-verified on download, so a matching pin in this manifest is the byte-level guarantee.
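A recorded manifest is only useful if it can be checked later. The sketch below saves a manifest and compares two of them. It assumes each entry under "bundled" and "fetched" is a mapping with a "sha256" key, which the description above suggests but does not spell out; `save_manifest` and `changed_entries` are hypothetical helpers, not part of the package.

```python
import json
from pathlib import Path
from typing import Any


def save_manifest(manifest: dict[str, Any], path: Path) -> None:
    """Write a manifest as stable, diff-friendly JSON."""
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")


def changed_entries(recorded: dict[str, Any], current: dict[str, Any]) -> list[str]:
    """List entries whose sha256 differs between two manifests.

    Assumption: manifest[section][name]["sha256"] holds the checksum.
    An entry present in only one manifest also counts as changed.
    """
    changed = []
    for section in ("bundled", "fetched"):
        old = recorded.get(section, {})
        new = current.get(section, {})
        for name in sorted(set(old) | set(new)):
            if old.get(name, {}).get("sha256") != new.get(name, {}).get("sha256"):
                changed.append(f"{section}/{name}")
    return changed
```

With the package installed, `recorded` would be the JSON saved alongside a paper's results and `current` the return value of `aegean.data.versions()`; an empty list then indicates matching checksums for every listed file.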

sha256_file

sha256_file(path: Path, *, chunk: int = 1 << 20) -> str

Streaming sha256 of a file. The file is read in blocks of chunk bytes (1 MiB by default), so a 500 MB asset is never loaded into memory whole.
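Chunked hashing of this kind is commonly written with `hashlib` as below. This is a sketch with the same signature, not the package's source; the name `sha256_file_sketch` is chosen to make that explicit.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file_sketch(path: Path, *, chunk: int = 1 << 20) -> str:
    """Hash a file block by block so memory use stays near `chunk` bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel calls fh.read(chunk) until it returns b"".
        for block in iter(lambda: fh.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


# Demo: a tiny chunk size forces several reads and must give the same digest.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    payload = b"ku-ro " * 100
    sample.write_bytes(payload)
    demo_digest = sha256_file_sketch(sample, chunk=4)
    demo_expected = hashlib.sha256(payload).hexdigest()
```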

download_file

download_file(url: str, dest: Path, *, sha256: str = '') -> Path

Download a single URL to dest atomically (a .part temp then rename), optionally sha256-verified. Returns dest; raises DataNotAvailableError on a network failure or checksum mismatch. Shared by fetch and the on-demand dataset downloaders (e.g. the Greek treebank).
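The write-then-rename pattern described here can be sketched with the standard library alone. Everything named below is an assumption for illustration: `download_atomic` is a hypothetical function, and `DownloadError` is a local stand-in for `DataNotAvailableError`. The demo uses a `file://` URL so it runs offline.

```python
import hashlib
import os
import tempfile
import urllib.request
from pathlib import Path


class DownloadError(RuntimeError):
    """Local stand-in for the package's DataNotAvailableError."""


def download_atomic(url: str, dest: Path, *, sha256: str = "") -> Path:
    """Stream url into dest.part, verify, then rename onto dest."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_name(dest.name + ".part")
    digest = hashlib.sha256()
    try:
        with urllib.request.urlopen(url) as resp, open(part, "wb") as out:
            for block in iter(lambda: resp.read(1 << 20), b""):
                digest.update(block)
                out.write(block)
        if sha256 and digest.hexdigest() != sha256:
            raise DownloadError(f"checksum mismatch for {url}")
        # os.replace is atomic when source and target share a filesystem,
        # so readers see either no file or the complete, verified file.
        os.replace(part, dest)
    except OSError as exc:  # urllib's URLError is an OSError subclass
        raise DownloadError(f"could not download {url}: {exc}") from exc
    finally:
        part.unlink(missing_ok=True)  # no-op after a successful rename
    return dest


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.bin"
    src.write_bytes(b"linear a")
    good = hashlib.sha256(b"linear a").hexdigest()
    got = download_atomic(src.as_uri(), Path(tmp) / "cache" / "asset.bin", sha256=good)
    demo_bytes = got.read_bytes()
    demo_part_left = got.with_name(got.name + ".part").exists()
    try:
        download_atomic(src.as_uri(), Path(tmp) / "cache" / "bad.bin", sha256="0" * 64)
        demo_mismatch_raised = False
    except DownloadError:
        demo_mismatch_raised = True
    demo_bad_exists = (Path(tmp) / "cache" / "bad.bin").exists()
```

Because the final path only ever appears via the rename, an interrupted or corrupt download never leaves a file that a later call could mistake for a valid cache entry.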

fetch

fetch(name: str, *, force: bool = False) -> Path

Download a registered remote dataset into the cache and return its path.

Verifies the sha256 when one is pinned, downloads atomically, and is a no-op when the cache already holds the dataset. For extract datasets the download is a tar archive that is unpacked into a cache directory, and that directory is returned; otherwise the downloaded file path is returned. Raises DataNotAvailableError for unknown datasets, un-pinned URLs, checksum mismatches, unsafe archives, or network failures. Failures are never silent, and fetching never blocks import.

fetch_prebuilt

fetch_prebuilt(name: str, dest: Path, *, member: str | None = None) -> bool

Place a hosted prebuilt artifact at dest; return True on success.

Lets an opt-in backend prefer a small hosted index/model over a slow local build (a ~270 MB download, or minutes of training), while keeping build-from-source as the fallback: any failure — no pinned URL, network error, checksum mismatch — returns False instead of raising, so the caller proceeds to build. member names a file inside an extract dataset's unpacked directory.
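The calling pattern this enables might look like the following sketch. `ensure_artifact` and both stub callables are hypothetical; in real use the fetch step would be a call such as `fetch_prebuilt(name, dest)`, and the build step would be the backend's own build-from-source routine.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_artifact(
    dest: Path,
    fetch: Callable[[Path], bool],
    build: Callable[[Path], None],
) -> str:
    """Prefer a hosted artifact; fall back to a local build if fetching fails."""
    if dest.exists():
        return "cached"
    if fetch(dest):  # returns False on any failure instead of raising
        return "prebuilt"
    build(dest)
    return "built"


def _failing_fetch(dest: Path) -> bool:
    """Stub that simulates a missing pin or a network error."""
    return False


def _local_build(dest: Path) -> None:
    """Stub standing in for a slow local build."""
    dest.write_text("index", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "index.bin"
    demo_first = ensure_artifact(target, _failing_fetch, _local_build)
    demo_second = ensure_artifact(target, _failing_fetch, _local_build)
```

Returning a boolean rather than raising keeps the fallback a plain `if`, so the caller needs no exception handling around the optional fast path.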