feat(datasets): reusable dataset registry + downloader #40

Merged
flying-sheep merged 26 commits into scverse:main from timtreis:feat/datasets on Jun 19, 2026

@timtreis timtreis commented Jun 15, 2026

Member

Motivation

Several scverse packages each reimplement the same thing: a registry of downloadable datasets + a pooch-based downloader with hash verification. squidpy, scanpy and pertpy all roll their own; there is no shared, reusable building block. This PR adds one to scverse-misc so packages can drop their bespoke infra and register only their domain-specific loaders.

```python
from scverse_misc.datasets import Fetcher, register_loader

@register_loader("spatialdata")
def load_sd(ctx):
    zip_path = ctx.download(ctx.entry.file(suffix=".zip"))
    ctx.extract_archive(zip_path)
    import spatialdata as sd
    return sd.read_zarr(ctx.target_dir / f"{ctx.entry.name}.zarr")

sdata = Fetcher("datasets.yaml").fetch("cells")
```

timtreis and others added 2 commits June 15, 2026 12:44
Add a `datasets` subpackage (behind the `datasets` extra) that packages can
share instead of each reimplementing pooch-based dataset downloading:

- DatasetRegistry / DatasetEntry / FileEntry: declarative YAML registry,
  supporting both full URLs (e.g. Zenodo) and base_url + s3_key.
- Fetcher: pooch download with SHA-256 verification, URL fallback, caching,
  archive extraction (via FetchContext helpers).
- register_loader: pluggable loader registry keyed by the free-form dataset
  `type` string, so domain loaders (image, spatialdata, visium, ...) are
  registered by the consuming package. Ships a built-in `anndata` loader.

Tests cover registry parsing, URL building, loader dispatch and extraction
(no network). Verified end-to-end against a real S3-hosted SpatialData zip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
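
To make the two ideas in this commit message concrete, here is a minimal sketch of a loader registry keyed by the free-form `type` string, plus URL resolution that accepts either a full URL or `base_url` + `s3_key`. It is not the code from this PR: `_LOADERS`, `dispatch`, `resolve_url` as written here, and the `image` loader are hypothetical names chosen for illustration.

```python
from __future__ import annotations

from collections.abc import Callable
from typing import Any

# Hypothetical module-level table: dataset `type` string -> loader callable.
_LOADERS: dict[str, Callable[..., Any]] = {}


def register_loader(type_name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Return a decorator that registers a loader under ``type_name``."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        _LOADERS[type_name] = func
        return func  # hand the function back unchanged

    return decorator


def dispatch(type_name: str, *args: Any, **kwargs: Any) -> Any:
    """Look up the loader for ``type_name`` and call it."""
    try:
        loader = _LOADERS[type_name]
    except KeyError:
        raise KeyError(f"no loader registered for dataset type {type_name!r}") from None
    return loader(*args, **kwargs)


def resolve_url(
    url: str | None = None,
    *,
    base_url: str | None = None,
    s3_key: str | None = None,
) -> str:
    """Prefer an explicit full URL; otherwise join ``base_url`` and ``s3_key``."""
    if url is not None:
        return url
    if base_url is None or s3_key is None:
        raise ValueError("need either `url` or both `base_url` and `s3_key`")
    return f"{base_url.rstrip('/')}/{s3_key.lstrip('/')}"


# A consuming package would register its own domain loader like this.
@register_loader("image")
def load_image(path: str) -> str:
    return f"image:{path}"
```

Because the key is a plain string, the shared package does not need to know which dataset types exist; it only needs the lookup to fail loudly when nothing is registered.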
@codecov

codecov Bot commented Jun 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.60%. Comparing base (e084b6a) to head (9fa4c8d).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #40      +/-   ##
==========================================
+ Coverage   91.36%   93.60%   +2.24%     
==========================================
  Files           8       11       +3     
  Lines         440      532      +92     
==========================================
+ Hits          402      498      +96     
+ Misses         38       34       -4     
Files with missing lines Coverage Δ
src/scverse_misc/datasets/__init__.py 100.00% <100.00%> (ø)
src/scverse_misc/datasets/_fetcher.py 100.00% <100.00%> (ø)
src/scverse_misc/datasets/_registry.py 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes

timtreis and others added 2 commits June 15, 2026 12:57
- type register_loader via overloads so the decorator preserves loader types
- narrow file() suffix matching for mypy
- positional-only ctx in the built-in anndata loader to match the Loader protocol
- ignore_missing_imports for stubless optional deps (pooch, anndata, yaml) via a
  single mypy override

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Multi-file datasets (e.g. a per-sample Visium layout) need to place files in a
subdirectory rather than the shared type cache dir.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- add a built-in spatialdata loader (zip -> .zarr -> read_zarr), behind the new
  'spatialdata' extra
- promote anndata to a core dependency so the anndata loader works out of the box
- [datasets] extra is now just the download machinery (pooch, pyyaml, tqdm)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
timtreis and others added 4 commits June 15, 2026 13:37
Matches the Loader protocol; the pre-commit mypy hook type-checks tests too
(local runs over src/ alone missed it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r, retries)

- replace the hand-rolled url-fallback loop with pooch.create(...).fetch(),
  gaining retry_if_failed for free
- replace manual shutil.unpack_archive with pooch's Unzip/Untar processors
  (passed via FetchContext.download(processor=...))
- drop the now-redundant extract_archive helper and module logger
- Fetcher gains a 'retries' arg (pooch retry_if_failed, default 3)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
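
A rough sketch of what the pooch-based path described above might look like. `pooch.create`, its `registry`, `urls` and `retry_if_failed` arguments, `Pooch.fetch(..., processor=...)` and `pooch.Unzip` are real pooch API; the helper names `pooch_kwargs` and `fetch_and_unzip`, and the `(url, sha256)` tuple layout, are assumptions made for this example and are not taken from the PR.

```python
from __future__ import annotations

from pathlib import Path


def pooch_kwargs(
    files: dict[str, tuple[str, str]],
    *,
    cache_dir: str | Path,
    retries: int = 3,
) -> dict:
    """Build keyword arguments for ``pooch.create``.

    ``files`` maps a file name to a hypothetical ``(url, sha256_hex)`` pair.
    """
    return {
        "path": Path(cache_dir),
        # Every file carries its own URL, so no shared base URL is needed.
        "base_url": "",
        "registry": {name: f"sha256:{sha}" for name, (_, sha) in files.items()},
        "urls": {name: url for name, (url, _) in files.items()},
        "retry_if_failed": retries,
    }


def fetch_and_unzip(
    name: str,
    files: dict[str, tuple[str, str]],
    *,
    cache_dir: str | Path,
    retries: int = 3,
) -> list[str]:
    """Download ``name`` (verifying its hash) and unpack it with pooch's Unzip."""
    import pooch  # imported lazily: only needed when a download actually happens

    pup = pooch.create(**pooch_kwargs(files, cache_dir=cache_dir, retries=retries))
    # With a processor, fetch() returns the processor's result:
    # for Unzip that is the list of extracted file paths.
    return pup.fetch(name, processor=pooch.Unzip())
```

Keeping the argument construction separate from the network call makes the former testable without a download, which matches the "no network" testing approach mentioned earlier in this PR.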
- FileEntry.urls() returned a candidate list but only [0] was ever used and
  pooch.create takes one url per key -> collapse to resolve_url() -> str
- remove unused FetchContext.download_all

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review feedback (over-engineered): remove DatasetRegistry, Fetcher and
FetchContext. Keep the typed FileEntry/DatasetEntry dataclasses; the registry is
now a plain dict[str, DatasetEntry] from parse_registry(), and downloading is a
fetch() function. Loaders are (entry, target, download, **kwargs) callables.

parse_registry folds every YAML key except type/files into entry.metadata, so it
no longer hardcodes domain fields (shape/library_id). ~216 -> ~96 lines, 6 classes
-> 2 (both pure-data).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
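
The simplified design described above could look roughly like the sketch below. It assumes the YAML has already been parsed into a plain mapping (for example with `yaml.safe_load`), and the field names on `FileEntry` (`name`, `url`, `sha256`) are guesses for illustration; only the two class names, `parse_registry`, and the "everything except `type`/`files` goes into `metadata`" rule come from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class FileEntry:
    # Field names here are assumptions, not the PR's actual schema.
    name: str
    url: str
    sha256: str | None = None


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    type: str
    files: tuple[FileEntry, ...] = ()
    metadata: dict[str, Any] = field(default_factory=dict)


def parse_registry(raw: dict[str, dict[str, Any]]) -> dict[str, DatasetEntry]:
    """Turn an already-parsed YAML mapping into ``{name: DatasetEntry}``."""
    registry: dict[str, DatasetEntry] = {}
    for name, spec in raw.items():
        files = tuple(FileEntry(**fd) for fd in spec.get("files", []))
        # Everything that is not `type` or `files` is domain metadata.
        metadata = {k: v for k, v in spec.items() if k not in {"type", "files"}}
        registry[name] = DatasetEntry(
            name=name, type=spec["type"], files=files, metadata=metadata
        )
    return registry
```

Folding unknown dataset-level keys into `metadata` means a field such as `shape` or `library_id` needs no support in the shared code; the consuming package's loader reads it from `entry.metadata`.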
Comment thread .gitignore Outdated
Comment thread src/scverse_misc/datasets/_fetcher.py Outdated
Comment thread src/scverse_misc/datasets/_fetcher.py Outdated
Co-authored-by: Philipp A. <flying-sheep@web.de>
Comment thread src/scverse_misc/datasets/_fetcher.py Outdated

@flying-sheep flying-sheep left a comment

Member

Apart from the above, also needs param docs.

timtreis and others added 2 commits June 18, 2026 17:43
- sphinx_ext: read __scverse_misc_canonical_instance_name__ (the attr the
  namespace decorator sets); the old __scverse_misc_namespace_name__ never
  existed, so namespace-decorator docstrings were silently never rendered
- datasets: parse_registry drops unknown per-file YAML keys so extras
  (e.g. `description`) no longer crash FileEntry(**fd)
- datasets: _load_spatialdata extracts into a per-dataset dir and finds the
  store by glob("*.zarr") instead of hardcoding <name>.zarr — decouples from
  zip layout, avoids collisions in the shared target
- tests: cover the download closure, both built-in loaders, file(name=...),
  extra-key tolerance, and the new glob/error paths (datasets module now 100%)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
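
The glob-based store lookup from the third bullet might be sketched as follows. The helper name `find_zarr_store` and the exact error types are assumptions; in particular, whether the real loader rejects archives containing more than one `.zarr` store is not stated in this PR.

```python
from __future__ import annotations

from pathlib import Path


def find_zarr_store(extract_dir: Path) -> Path:
    """Return the single ``*.zarr`` store directly inside ``extract_dir``."""
    stores = sorted(extract_dir.glob("*.zarr"))
    if not stores:
        raise FileNotFoundError(f"no .zarr store found in {extract_dir}")
    if len(stores) > 1:
        # Assumed policy: refuse to guess when the archive is ambiguous.
        names = ", ".join(p.name for p in stores)
        raise ValueError(f"expected one .zarr store in {extract_dir}, found: {names}")
    return stores[0]
```

Extracting into a per-dataset directory first is what makes the glob safe: nothing from another dataset can end up in the same directory and match the pattern.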
@flying-sheep flying-sheep self-requested a review June 18, 2026 16:23

@flying-sheep flying-sheep left a comment

Member

looks great! one small docs problem:

Comment thread src/scverse_misc/datasets/_registry.py Outdated
Comment thread tests/test_datasets.py Outdated
Comment thread docs/api.md Outdated
flying-sheep and others added 4 commits June 19, 2026 15:53
Co-authored-by: Philipp A. <flying-sheep@web.de>
- parse_registry now warns on (and still drops) unrecognised per-file keys
  so typos surface, via a small `_file_entry` helper
- remove unused `calls["processor"]` capture in test_download_drives_pooch
  (the FakePup.fetch method itself backs the real pup.fetch call and stays)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@flying-sheep flying-sheep merged commit 3926c37 into scverse:main Jun 19, 2026
10 checks passed
@timtreis timtreis deleted the feat/datasets branch June 19, 2026 14:11