aegean.data¶
data ¶
Bundled-data access and a download-to-cache layer.
Compact text data ships in the wheel (read via importlib.resources). Large or license-restricted assets — notably the Linear A facsimile mirror (~116 MB) — are NOT bundled; they are fetched on demand from upstream into a user cache. This is how the package stays small regardless of how large the source corpora are.
Downloads are sha256-verified (when a checksum is pinned), atomic (written to a
.part file then renamed), and idempotent (a present, valid cache file is a
no-op). A dataset's URL can be overridden without a code change via
PYAEGEAN_<NAME>_URL (e.g. PYAEGEAN_LINEARA_IMAGES_URL), so a researcher
can point at their own mirror before an official release is pinned.
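The override rule can be pictured with a small stand-alone sketch. It does not use pyaegean's own code: resolve_url is a hypothetical helper, the dataset name "lineara_images" is inferred from the variable name above, and the mirror URLs are placeholders.

```python
import os
from typing import Optional


def resolve_url(name: str, pinned_url: Optional[str]) -> Optional[str]:
    """Hypothetical helper: prefer PYAEGEAN_<NAME>_URL over the pinned URL."""
    env_var = f"PYAEGEAN_{name.upper()}_URL"
    # An unset or empty variable falls through to the pinned URL (may be None).
    return os.environ.get(env_var) or pinned_url


# Point the (assumed) "lineara_images" dataset at a private mirror.
os.environ["PYAEGEAN_LINEARA_IMAGES_URL"] = "https://mirror.example.org/lineara.tar"
override = resolve_url("lineara_images", None)

# A dataset without an override keeps its pinned URL.
fallback = resolve_url("some_other_dataset", "https://example.org/pinned.tar")
```

In practice the variable would normally be exported in the shell before Python starts, so that no code changes at all.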
DataNotAvailableError ¶
Bases: RuntimeError
Raised when a non-bundled dataset has not been fetched (or can't be).
load_bundled_json ¶
Load a JSON file shipped inside the wheel, e.g.
load_bundled_json("lineara", "signs.json").
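As a rough illustration of reading packaged data through importlib.resources, the sketch below loads a JSON file from a throwaway package created on the fly. load_packaged_json, the package name, and the file contents are all hypothetical; how load_bundled_json maps its first argument to a location inside the wheel is not shown here and may differ. It assumes Python 3.9+ for resources.files.

```python
import importlib
import json
import sys
import tempfile
from importlib import resources
from pathlib import Path


def load_packaged_json(package: str, filename: str):
    """Hypothetical stand-in: read a JSON resource that lives inside a package."""
    resource = resources.files(package).joinpath(filename)
    with resource.open("r", encoding="utf-8") as fh:
        return json.load(fh)


# Build a tiny importable package so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg_for_json_sketch"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("", encoding="utf-8")
    (pkg / "signs.json").write_text(json.dumps({"example": 1}), encoding="utf-8")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        loaded = load_packaged_json("demo_pkg_for_json_sketch", "signs.json")
    finally:
        sys.path.remove(tmp)
```

Going through importlib.resources rather than building a path from __file__ keeps the read working however the package happens to be installed.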
bundled_data_version ¶
The version of the bundled datasets.
Bundled data ships inside the wheel and is immutable for a given release, so
its version is the package version; versions() gives per-file sha256 digests.
versions ¶
A reproducibility manifest of every dataset pyaegean can touch.
Returns {"package": …, "bundled": {…}, "fetched": {…}}: each bundled
JSON file with its sha256 + size (hashed from the installed wheel contents),
and each registered fetchable asset with its pinned URL/sha256, license, and
whether it is present in the local cache.
Pinning for papers: record aegean.__version__ and this manifest
(e.g. json.dump(aegean.data.versions(), f)) alongside your results;
anyone with the same package version and matching sha256s is analyzing
byte-identical data. Fetched assets are sha256-verified on download, so a
matching pin in this manifest is the byte-level guarantee.
sha256_file ¶
Compute the sha256 of a file by streaming it in chunks, so a 500 MB asset is never loaded into memory whole.
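A minimal sketch of chunked hashing with the standard library, for readers who want to see the idea. sha256_streaming and its chunk_size parameter are illustrative names, not the signature of sha256_file.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_streaming(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file chunk by chunk; memory use is bounded by chunk_size."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a temporary file with a deliberately small chunk size.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    payload = b"linear a " * 100_000
    sample.write_bytes(payload)
    streamed = sha256_streaming(sample, chunk_size=4096)
```

The digest is identical to hashing the whole payload at once; only the peak memory differs.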
download_file ¶
Download a single URL to dest atomically (written to a .part temporary file, then renamed),
optionally sha256-verified. Returns dest; raises DataNotAvailableError
on a network failure or checksum mismatch. Shared by fetch and the
on-demand dataset downloaders (e.g. the Greek treebank).
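The write-then-rename pattern described above might look roughly like the following. This is a sketch under assumptions, not download_file itself: download_atomic and DownloadError are hypothetical names (the real function raises DataNotAvailableError), and a file:// URL stands in for a real upstream so the example runs offline.

```python
import hashlib
import os
import tempfile
import urllib.request
from pathlib import Path
from typing import Optional


class DownloadError(RuntimeError):
    """Hypothetical stand-in for DataNotAvailableError."""


def download_atomic(url: str, dest, sha256: Optional[str] = None) -> Path:
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_name(dest.name + ".part")
    digest = hashlib.sha256()
    try:
        with urllib.request.urlopen(url) as resp, open(part, "wb") as out:
            while chunk := resp.read(1 << 20):
                digest.update(chunk)
                out.write(chunk)
    except OSError as exc:  # urllib's URLError is a subclass of OSError
        part.unlink(missing_ok=True)
        raise DownloadError(f"could not download {url}") from exc
    if sha256 is not None and digest.hexdigest() != sha256:
        part.unlink(missing_ok=True)
        raise DownloadError(f"checksum mismatch for {url}")
    # os.replace is atomic on the same filesystem: dest is complete or absent.
    os.replace(part, dest)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "upstream.bin"
    src.write_bytes(b"facsimile bytes")
    good = hashlib.sha256(b"facsimile bytes").hexdigest()

    dest = download_atomic(src.as_uri(), Path(tmp) / "cache" / "asset.bin", sha256=good)
    downloaded = dest.read_bytes()
    leftover_part = dest.with_name(dest.name + ".part").exists()

    bad_dest = Path(tmp) / "cache" / "bad.bin"
    try:
        download_atomic(src.as_uri(), bad_dest, sha256="0" * 64)
        mismatch_raised = False
    except DownloadError:
        mismatch_raised = True
    bad_exists = bad_dest.exists()
```

Because the final name only appears after verification, an interrupted or corrupted download can never be mistaken for a valid cache entry.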
fetch ¶
Download a registered remote dataset into the cache and return its path.
Verifies the sha256 when one is pinned, downloads atomically, and is a no-op
when the cache already holds it. For extract datasets the download is a
tar archive that is unpacked into a cache directory (returned); otherwise the
downloaded file path is returned. Raises DataNotAvailableError for
unknown datasets, un-pinned URLs, checksum mismatches, unsafe archives, or
network failures — never silently, and never blocking import.
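What counts as an "unsafe archive" is not spelled out here. One common check, sketched below as an assumption about the kind of guard involved, rejects tar members that would land outside the target directory, as well as links and special files. safe_extract and UnsafeArchiveError are hypothetical names, and the member names are made up for the demonstration.

```python
import io
import shutil
import tarfile
import tempfile
from pathlib import Path


class UnsafeArchiveError(RuntimeError):
    """Hypothetical error for archives that fail the safety check."""


def safe_extract(archive, target_dir) -> Path:
    target = Path(target_dir).resolve()
    with tarfile.open(archive) as tar:
        members = tar.getmembers()
        # Validate every member before writing anything to disk.
        for member in members:
            resolved = (target / member.name).resolve()
            if resolved != target and target not in resolved.parents:
                raise UnsafeArchiveError(f"path escapes target: {member.name}")
            if not (member.isfile() or member.isdir()):
                raise UnsafeArchiveError(f"links/devices not allowed: {member.name}")
        target.mkdir(parents=True, exist_ok=True)
        for member in members:
            out_path = target / member.name
            if member.isdir():
                out_path.mkdir(parents=True, exist_ok=True)
                continue
            out_path.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src, open(out_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
    return target


def _make_tar(path, member_name: str, data: bytes) -> None:
    """Test helper: write a one-member tar archive."""
    with tarfile.open(path, "w") as tar:
        info = tarfile.TarInfo(member_name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))


with tempfile.TemporaryDirectory() as tmp:
    ok_tar = Path(tmp) / "ok.tar"
    bad_tar = Path(tmp) / "bad.tar"
    _make_tar(ok_tar, "sub/file.txt", b"payload")
    _make_tar(bad_tar, "../escape.txt", b"oops")

    out = safe_extract(ok_tar, Path(tmp) / "out")
    extracted = (out / "sub" / "file.txt").read_bytes()

    try:
        safe_extract(bad_tar, Path(tmp) / "out2")
        rejected = False
    except UnsafeArchiveError:
        rejected = True
    escaped = (Path(tmp) / "escape.txt").exists()
```

On recent Python versions, tarfile's built-in extraction filters offer a comparable safeguard; the manual check is shown only because it makes the rule explicit.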
fetch_prebuilt ¶
Place a hosted prebuilt artifact at dest; return True on success.
Lets an opt-in backend prefer a small hosted index/model over a slow local
build (a ~270 MB download, or minutes of training), while keeping
build-from-source as the fallback: any failure — no pinned URL, network
error, checksum mismatch — returns False instead of raising, so the
caller proceeds to build. member names a file inside an extract
dataset's unpacked directory.
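The boolean return value exists to support a "try the prebuilt, otherwise build" flow in the caller. The sketch below shows that control flow with stand-in callables; ensure_index, try_prebuilt, and build_from_source are hypothetical names, and the real fetch_prebuilt takes arguments not reproduced here.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_index(
    dest: Path,
    try_prebuilt: Callable[[Path], bool],
    build_from_source: Callable[[Path], None],
) -> str:
    """Return how dest was obtained: 'cached', 'prebuilt', or 'built'."""
    if dest.exists():
        return "cached"
    # A False return means "no prebuilt available", never an exception.
    if try_prebuilt(dest):
        return "prebuilt"
    build_from_source(dest)
    return "built"


def _offline_prebuilt(dest: Path) -> bool:
    """Stand-in for a failed prebuilt fetch (e.g. no network)."""
    return False


def _slow_build(dest: Path) -> None:
    """Stand-in for a slow local build."""
    dest.write_text("built locally", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    index = Path(tmp) / "index.bin"
    first = ensure_index(index, _offline_prebuilt, _slow_build)
    second = ensure_index(index, _offline_prebuilt, _slow_build)
```

Keeping the failure path exception-free means the caller needs no try/except around the optimization: the slow path is simply what runs when the fast path declines.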