feat(datasets): reusable dataset registry + downloader #40
Merged
Add a `datasets` subpackage (behind the `datasets` extra) that packages can share instead of each reimplementing pooch-based dataset downloading:

- DatasetRegistry / DatasetEntry / FileEntry: declarative YAML registry, supporting both full URLs (e.g. Zenodo) and base_url + s3_key.
- Fetcher: pooch download with SHA-256 verification, URL fallback, caching, archive extraction (via FetchContext helpers).
- register_loader: pluggable loader registry keyed by the free-form dataset `type` string, so domain loaders (image, spatialdata, visium, ...) are registered by the consuming package. Ships a built-in `anndata` loader.

Tests cover registry parsing, URL building, loader dispatch and extraction (no network). Verified end-to-end against a real S3-hosted SpatialData zip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report

✅ All modified and coverable lines are covered by tests.

| | main | #40 | +/- |
|---|---|---|---|
| Coverage | 91.36% | 93.60% | +2.24% |
| Files | 8 | 11 | +3 |
| Lines | 440 | 532 | +92 |
| Hits | 402 | 498 | +96 |
| Misses | 38 | 34 | -4 |
- type register_loader via overloads so the decorator preserves loader types
- narrow file() suffix matching for mypy
- positional-only ctx in the built-in anndata loader to match the Loader protocol
- ignore_missing_imports for stubless optional deps (pooch, anndata, yaml) via a single mypy override

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Multi-file datasets (e.g. a per-sample Visium layout) need to place files in a subdirectory rather than the shared type cache dir.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- add a built-in spatialdata loader (zip -> .zarr -> read_zarr), behind the new 'spatialdata' extra
- promote anndata to a core dependency so the anndata loader works out of the box
- [datasets] extra is now just the download machinery (pooch, pyyaml, tqdm)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Matches the Loader protocol; the pre-commit mypy hook type-checks tests too (local runs over src/ alone missed it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r, retries)

- replace the hand-rolled url-fallback loop with pooch.create(...).fetch(), gaining retry_if_failed for free
- replace manual shutil.unpack_archive with pooch's Unzip/Untar processors (passed via FetchContext.download(processor=...))
- drop the now-redundant extract_archive helper and module logger
- Fetcher gains a 'retries' arg (pooch retry_if_failed, default 3)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
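The pooch calls named in this commit can be combined roughly as follows. This is a sketch under assumptions, not the PR's implementation: `fetch_and_unzip` and its parameters are hypothetical, and only `pooch.create`, `Pooch.fetch`, `retry_if_failed` and `pooch.Unzip` are taken from pooch itself.

```python
from __future__ import annotations

from pathlib import Path


def fetch_and_unzip(url: str, sha256: str, cache_dir: Path, *, retries: int = 3) -> list[str]:
    """Download ``url`` into ``cache_dir``, verify its SHA-256, and unzip it."""
    import pooch  # imported lazily because pooch is an optional dependency here

    fname = url.rsplit("/", 1)[-1]
    pup = pooch.create(
        path=cache_dir,
        base_url="",  # not used for this file: a full URL is supplied via `urls`
        registry={fname: f"sha256:{sha256}"},
        urls={fname: url},
        retry_if_failed=retries,  # pooch retries failed downloads this many times
    )
    # With a processor, fetch() returns the processor's result; for Unzip
    # that is the list of paths of the extracted files.
    return pup.fetch(fname, processor=pooch.Unzip())
```

A call would look like `fetch_and_unzip(url, sha256, Path("~/.cache/my-package").expanduser())`, where the cache location is again only an example.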
- FileEntry.urls() returned a candidate list but only [0] was ever used and pooch.create takes one url per key -> collapse to resolve_url() -> str
- remove unused FetchContext.download_all

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
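A single-URL `resolve_url()` covering the two addressing modes mentioned earlier (a full URL, or `base_url` + `s3_key`) might look like the sketch below. The field names, the `base_url` argument and the error behaviour are assumptions for illustration; the PR's `FileEntry` may be shaped differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical file entry: either a full `url` or an `s3_key`."""

    name: str
    url: str | None = None
    s3_key: str | None = None

    def resolve_url(self, base_url: str | None = None) -> str:
        """Return the one URL to download this file from."""
        if self.url is not None:
            return self.url  # a full URL (e.g. Zenodo) wins outright
        if self.s3_key is not None and base_url is not None:
            # Join without doubling or dropping the slash between the parts.
            return f"{base_url.rstrip('/')}/{self.s3_key.lstrip('/')}"
        raise ValueError(f"{self.name!r} has neither a url nor base_url + s3_key")
```

Returning one string instead of a candidate list matches the observation in the commit that pooch takes one URL per registry key.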
Address review feedback (over-engineered): remove DatasetRegistry, Fetcher and FetchContext. Keep the typed FileEntry/DatasetEntry dataclasses; the registry is now a plain dict[str, DatasetEntry] from parse_registry(), and downloading is a fetch() function. Loaders are (entry, target, download, **kwargs) callables.

parse_registry folds every YAML key except type/files into entry.metadata, so it no longer hardcodes domain fields (shape/library_id).

~216 -> ~96 lines, 6 classes -> 2 (both pure-data).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
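The metadata folding described above can be sketched on an already-loaded mapping (what `yaml.safe_load` would return). The registry layout (top-level mapping of dataset name to spec), the dataclass fields and the example dataset are assumptions made for this illustration, not the PR's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class FileEntry:
    name: str
    url: str


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    type: str
    files: tuple[FileEntry, ...]
    metadata: dict[str, Any] = field(default_factory=dict)


def parse_registry(raw: dict[str, dict[str, Any]]) -> dict[str, DatasetEntry]:
    """Turn a parsed YAML mapping into a plain dict of typed entries."""
    registry: dict[str, DatasetEntry] = {}
    for name, spec in raw.items():
        files = tuple(FileEntry(**fd) for fd in spec.get("files", []))
        # Everything that is not `type` or `files` is domain-specific metadata.
        metadata = {k: v for k, v in spec.items() if k not in {"type", "files"}}
        registry[name] = DatasetEntry(name, spec["type"], files, metadata)
    return registry


# Hypothetical registry content, as it might look after YAML parsing.
raw = {
    "visium_demo": {
        "type": "visium",
        "library_id": "V1",
        "shape": [10, 20],
        "files": [{"name": "counts.h5", "url": "https://example.org/counts.h5"}],
    }
}
registry = parse_registry(raw)
```

Here `registry["visium_demo"].metadata` holds `library_id` and `shape` without `parse_registry` knowing either key, which is the point of the change.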
flying-sheep
requested changes
Jun 17, 2026
Co-authored-by: Philipp A. <flying-sheep@web.de>
flying-sheep
requested changes
Jun 17, 2026
flying-sheep
left a comment
Member
test coverage seems quite poor, can you improve that?
flying-sheep
requested changes
Jun 17, 2026
flying-sheep
left a comment
Member
Apart from the above, also needs param docs.
- sphinx_ext: read __scverse_misc_canonical_instance_name__ (the attr the
namespace decorator sets); the old __scverse_misc_namespace_name__ never
existed, so namespace-decorator docstrings were silently never rendered
- datasets: parse_registry drops unknown per-file YAML keys so extras
(e.g. `description`) no longer crash FileEntry(**fd)
- datasets: _load_spatialdata extracts into a per-dataset dir and finds the
store by glob("*.zarr") instead of hardcoding <name>.zarr — decouples from
zip layout, avoids collisions in the shared target
- tests: cover the download closure, both built-in loaders, file(name=...),
extra-key tolerance, and the new glob/error paths (datasets module now 100%)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
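Locating the extracted store by `glob("*.zarr")` rather than by a hardcoded name could be done along these lines. The helper name `find_zarr_store` and the "exactly one store" policy with its error types are assumptions; the PR only states that the store is found by glob and that error paths exist.

```python
from __future__ import annotations

from pathlib import Path


def find_zarr_store(extract_dir: Path) -> Path:
    """Return the single ``*.zarr`` store directly inside ``extract_dir``."""
    stores = sorted(extract_dir.glob("*.zarr"))
    if not stores:
        raise FileNotFoundError(f"no *.zarr store found in {extract_dir}")
    if len(stores) > 1:
        names = [s.name for s in stores]
        raise ValueError(f"expected one *.zarr store in {extract_dir}, found {names}")
    return stores[0]
```

Because each dataset extracts into its own directory, the glob cannot pick up a store belonging to a different dataset, whatever the zip happens to name its top-level folder.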
flying-sheep
requested changes
Jun 19, 2026
flying-sheep
left a comment
Member
looks great! one small docs problem:
Co-authored-by: Philipp A. <flying-sheep@web.de>
- parse_registry now warns on (and still drops) unrecognised per-file keys so typos surface, via a small `_file_entry` helper
- remove unused `calls["processor"]` capture in test_download_drives_pooch (the FakePup.fetch method itself backs the real pup.fetch call and stays)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
flying-sheep
approved these changes
Jun 19, 2026
Motivation
Several scverse packages each reimplement the same thing: a registry of downloadable datasets + a pooch-based downloader with hash verification. squidpy, scanpy and pertpy all roll their own; there is no shared, reusable building block. This PR adds one to `scverse-misc` so packages can drop their bespoke infra and register only their domain-specific loaders.