feat(datasets): reusable dataset registry + downloader by timtreis · Pull Request #40 · scverse/scverse-misc

timtreis · 2026-06-15T10:45:41Z

Motivation

Several scverse packages each reimplement the same thing: a registry of downloadable datasets + a pooch-based downloader with hash verification. squidpy, scanpy and pertpy all roll their own; there is no shared, reusable building block. This PR adds one to scverse-misc so packages can drop their bespoke infra and register only their domain-specific loaders.

from scverse_misc.datasets import Fetcher, register_loader

@register_loader("spatialdata")
def load_sd(ctx):
    zip_path = ctx.download(ctx.entry.file(suffix=".zip"))
    ctx.extract_archive(zip_path)
    import spatialdata as sd
    return sd.read_zarr(ctx.target_dir / f"{ctx.entry.name}.zarr")

sdata = Fetcher("datasets.yaml").fetch("cells")

Add a `datasets` subpackage (behind the `datasets` extra) that packages can share instead of each reimplementing pooch-based dataset downloading:

- DatasetRegistry / DatasetEntry / FileEntry: declarative YAML registry, supporting both full URLs (e.g. Zenodo) and base_url + s3_key.
- Fetcher: pooch download with SHA-256 verification, URL fallback, caching, archive extraction (via FetchContext helpers).
- register_loader: pluggable loader registry keyed by the free-form dataset `type` string, so domain loaders (image, spatialdata, visium, ...) are registered by the consuming package. Ships a built-in `anndata` loader.

Tests cover registry parsing, URL building, loader dispatch and extraction (no network). Verified end-to-end against a real S3-hosted SpatialData zip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
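The SHA-256 verification mentioned in the commit message is delegated to pooch in this PR. To show what that check amounts to, here is a minimal stdlib-only sketch. `sha256_of` and `verify_download` are hypothetical names for illustration and are not part of scverse-misc or pooch.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file's digest differs from the registered hash."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"hash mismatch for {path.name}: expected {expected_sha256}, got {actual}"
        )
```

Reading in chunks keeps memory use flat for large archives, which matters for multi-gigabyte datasets.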

for more information, see https://pre-commit.ci

codecov · 2026-06-15T10:47:41Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.60%. Comparing base (e084b6a) to head (9fa4c8d).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #40      +/-   ##
==========================================
+ Coverage   91.36%   93.60%   +2.24%     
==========================================
  Files           8       11       +3     
  Lines         440      532      +92     
==========================================
+ Hits          402      498      +96     
+ Misses         38       34       -4

Files with missing lines	Coverage Δ
src/scverse_misc/datasets/__init__.py	`100.00% <100.00%> (ø)`
src/scverse_misc/datasets/_fetcher.py	`100.00% <100.00%> (ø)`
src/scverse_misc/datasets/_registry.py	`100.00% <100.00%> (ø)`

... and 2 files with indirect coverage changes

- type register_loader via overloads so the decorator preserves loader types
- narrow file() suffix matching for mypy
- positional-only ctx in the built-in anndata loader to match the Loader protocol
- ignore_missing_imports for stubless optional deps (pooch, anndata, yaml) via a single mypy override

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Multi-file datasets (e.g. a per-sample Visium layout) need to place files in a subdirectory rather than the shared type cache dir.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- add a built-in spatialdata loader (zip -> .zarr -> read_zarr), behind the new 'spatialdata' extra
- promote anndata to a core dependency so the anndata loader works out of the box
- [datasets] extra is now just the download machinery (pooch, pyyaml, tqdm)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Matches the Loader protocol; the pre-commit mypy hook type-checks tests too (local runs over src/ alone missed it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…r, retries)

- replace the hand-rolled url-fallback loop with pooch.create(...).fetch(), gaining retry_if_failed for free
- replace manual shutil.unpack_archive with pooch's Unzip/Untar processors (passed via FetchContext.download(processor=...))
- drop the now-redundant extract_archive helper and module logger
- Fetcher gains a 'retries' arg (pooch retry_if_failed, default 3)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
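The pooch calls named in this commit can be combined roughly as follows. This is a sketch, not the PR's implementation: `pooch.create`, its `retry_if_failed` argument, `Pooch.fetch` and `pooch.Unzip` are real pooch APIs, while `pooch_kwargs`, `fetch_and_unzip`, the empty `base_url`, and any file names or URLs passed in are assumptions made for illustration.

```python
from pathlib import Path


def pooch_kwargs(name: str, url: str, sha256: str, cache_dir: Path, retries: int = 3) -> dict:
    """Build the keyword arguments for pooch.create for one registered file."""
    return {
        "path": cache_dir,
        "base_url": "",  # assumed unused, since an explicit URL is given per file
        "registry": {name: f"sha256:{sha256}"},
        "urls": {name: url},
        "retry_if_failed": retries,
    }


def fetch_and_unzip(name: str, url: str, sha256: str, cache_dir: Path, retries: int = 3):
    """Download `name` (or reuse the cached copy) and extract it."""
    import pooch  # imported lazily; only needed when a download is requested

    pup = pooch.create(**pooch_kwargs(name, url, sha256, cache_dir, retries))
    # The Unzip processor runs after the hash check and returns the extracted paths.
    return pup.fetch(name, processor=pooch.Unzip())
```

Keeping the argument construction separate from the network call makes the former testable without a connection, in the spirit of the "no network" tests described for this PR.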

- FileEntry.urls() returned a candidate list but only [0] was ever used and pooch.create takes one url per key -> collapse to resolve_url() -> str
- remove unused FetchContext.download_all

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
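A single-URL resolver covering the two addressing schemes from the first commit (a full URL, or base_url + s3_key) might look like the sketch below. The field names, the `base_url` parameter, and the precedence of `url` over `s3_key` are assumptions based on the commit text, not the PR's actual `FileEntry`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Illustrative stand-in for a registry file entry."""

    name: str
    url: str | None = None
    s3_key: str | None = None

    def resolve_url(self, base_url: str | None = None) -> str:
        """Return exactly one download URL for this file."""
        if self.url is not None:
            return self.url  # assumed precedence: a full URL wins over s3_key
        if base_url is not None and self.s3_key is not None:
            # Normalise slashes so the join never yields '//' or a missing '/'.
            return f"{base_url.rstrip('/')}/{self.s3_key.lstrip('/')}"
        raise ValueError(f"{self.name}: need either url, or base_url plus s3_key")
```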

Address review feedback (over-engineered): remove DatasetRegistry, Fetcher and FetchContext. Keep the typed FileEntry/DatasetEntry dataclasses; the registry is now a plain dict[str, DatasetEntry] from parse_registry(), and downloading is a fetch() function. Loaders are (entry, target, download, **kwargs) callables.

parse_registry folds every YAML key except type/files into entry.metadata, so it no longer hardcodes domain fields (shape/library_id).

~216 -> ~96 lines, 6 classes -> 2 (both pure-data).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
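The metadata folding described here can be sketched as follows. This is an illustration under assumptions, not the merged code: the input is taken to be the already-loaded mapping (what a YAML loader would return) keyed by dataset name, and the dataclass fields shown are simplified guesses.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class FileEntry:
    name: str
    url: str


@dataclass
class DatasetEntry:
    name: str
    type: str
    files: list[FileEntry]
    metadata: dict[str, Any] = field(default_factory=dict)


def parse_registry(raw: dict[str, dict[str, Any]]) -> dict[str, DatasetEntry]:
    """Turn a loaded registry mapping into a plain dict of typed entries."""
    registry: dict[str, DatasetEntry] = {}
    for name, spec in raw.items():
        # Everything except `type` and `files` is domain-specific and kept verbatim.
        metadata = {k: v for k, v in spec.items() if k not in {"type", "files"}}
        files = [FileEntry(**fd) for fd in spec.get("files", [])]
        registry[name] = DatasetEntry(name, spec["type"], files, metadata)
    return registry
```

With this shape, a field such as `library_id` needs no special handling in the shared code; a consuming package reads it from `entry.metadata` inside its own loader.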

Co-authored-by: Philipp A. <flying-sheep@web.de>

flying-sheep

test coverage seems quite poor, can you improve that?

https://app.codecov.io/gh/scverse/scverse-misc/pull/40/blob/src/scverse_misc/datasets/_fetcher.py?dropdown=coverage

flying-sheep

Apart from the above, also needs param docs.

- sphinx_ext: read __scverse_misc_canonical_instance_name__ (the attr the namespace decorator sets); the old __scverse_misc_namespace_name__ never existed, so namespace-decorator docstrings were silently never rendered
- datasets: parse_registry drops unknown per-file YAML keys so extras (e.g. `description`) no longer crash FileEntry(**fd)
- datasets: _load_spatialdata extracts into a per-dataset dir and finds the store by glob("*.zarr") instead of hardcoding <name>.zarr; decouples from zip layout, avoids collisions in the shared target
- tests: cover the download closure, both built-in loaders, file(name=...), extra-key tolerance, and the new glob/error paths (datasets module now 100%)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
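Locating the extracted store by glob rather than by a hardcoded name could be sketched like this. `find_zarr_store` is a hypothetical helper, and the choice to raise `FileNotFoundError` unless exactly one store is present is an assumption; the commit only says that glob and error paths exist.

```python
from pathlib import Path


def find_zarr_store(extract_dir: Path) -> Path:
    """Return the single *.zarr store directly inside a per-dataset directory."""
    stores = sorted(extract_dir.glob("*.zarr"))
    if len(stores) != 1:
        # Zero matches means a bad archive; several would make the choice ambiguous.
        raise FileNotFoundError(
            f"expected exactly one .zarr store in {extract_dir}, found {len(stores)}"
        )
    return stores[0]
```

Because each dataset is extracted into its own directory, the glob cannot pick up a store belonging to another dataset.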

flying-sheep

looks great! one small docs problem:

Co-authored-by: Philipp A. <flying-sheep@web.de>

- parse_registry now warns on (and still drops) unrecognised per-file keys so typos surface, via a small `_file_entry` helper
- remove unused `calls["processor"]` capture in test_download_drives_pooch (the FakePup.fetch method itself backs the real pup.fetch call and stays)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

timtreis and others added 2 commits June 15, 2026 12:44

[pre-commit.ci] auto fixes from pre-commit.com hooks

3ad1929

timtreis and others added 2 commits June 15, 2026 12:57

feat(datasets): allow FetchContext.download into a custom dest dir

74d5217

timtreis mentioned this pull request Jun 15, 2026

refactor(datasets): use the shared scverse-misc dataset registry + downloader scverse/squidpy#1213

Draft

timtreis force-pushed the feat/datasets branch from ef33516 to 592b180 Compare June 15, 2026 11:14

timtreis and others added 4 commits June 15, 2026 13:37

test(datasets): positional-only ctx in the dummy loader for mypy

86e2527

flying-sheep reviewed Jun 17, 2026

Comment thread .gitignore Outdated

flying-sheep requested changes Jun 17, 2026

Merge branch 'main' into pr/timtreis/40

ddc2e50

flying-sheep requested changes Jun 17, 2026

Comment thread src/scverse_misc/datasets/_fetcher.py Outdated

flying-sheep reviewed Jun 17, 2026

Comment thread src/scverse_misc/datasets/_fetcher.py Outdated

Update .gitignore

412c074

Co-authored-by: Philipp A. <flying-sheep@web.de>

flying-sheep reviewed Jun 17, 2026

Comment thread src/scverse_misc/datasets/_fetcher.py Outdated

flying-sheep added 3 commits June 17, 2026 15:06

fix types

80e33af

Merge branch 'main' into pr/timtreis/40

f1f87db

fix types

026d096

flying-sheep requested changes Jun 17, 2026

flying-sheep added 3 commits June 17, 2026 15:41

docstrings

291f85d

fix

51b12ac

fix docs

8e5a623

flying-sheep requested changes Jun 17, 2026

ilan-gold mentioned this pull request Jun 18, 2026

Use scverse-misc for data downloads scverse/SnapATAC2#472

Open

timtreis and others added 2 commits June 18, 2026 17:43

Merge branch 'main' into pr/timtreis/40

79886df

flying-sheep added 2 commits June 18, 2026 18:15

fix tests

392ba3c

fix pre tests

dec25bb

flying-sheep self-requested a review June 18, 2026 16:23

Merge branch 'main' into feat/datasets

826abfb

flying-sheep requested changes Jun 19, 2026

Comment thread src/scverse_misc/datasets/_registry.py Outdated

Comment thread tests/test_datasets.py Outdated

Comment thread docs/api.md Outdated

flying-sheep and others added 4 commits June 19, 2026 15:53

Merge branch 'main' into pr/timtreis/40

4c30bff

Update docs/api.md

a6c2eb6

Co-authored-by: Philipp A. <flying-sheep@web.de>

module

9fa4c8d

flying-sheep approved these changes Jun 19, 2026

flying-sheep merged commit 3926c37 into scverse:main Jun 19, 2026
10 checks passed

timtreis deleted the feat/datasets branch June 19, 2026 14:11

timtreis mentioned this pull request Jun 19, 2026

Add downloadable cells dataset via scverse-misc scverse/spatialdata#1149

Open

ilia-kats mentioned this pull request Jun 19, 2026

feat(logging): shared logger skeleton with Rule extension #46

Open

flying-sheep merged 26 commits into scverse:main from timtreis:feat/datasets
