[Roadmap] Fill / sentinel-value handling in VirtualiZarr #371

@maxrjones

Tracking the work to resolve the 15 open issues labelled fill-sentinel-values plus the related upstream gaps in xarray's FillValueCoder.

Framing

VirtualiZarr correctness is measured against the Zarr spec, not against xarray equivalence. Several recent reports (zarr-developers/VirtualiZarr#989, zarr-developers/VirtualiZarr#485, zarr-developers/VirtualiZarr#628) hit xarray's FillValueCoder.decode failing on JSON-native scalars in zarr metadata: the parser produces spec-compliant output that xarray's HDF5-style coder can't consume. Those are upstream xarray issues, not virtualizarr bugs, and are tracked at pydata/xarray#11332.

The property-test infrastructure added in zarr-developers/VirtualiZarr#990 distinguishes failure categories:

| Failure shape | Attribution | Action |
| --- | --- | --- |
| Both engines fail identically | Upstream xarray / zarr-python | Track upstream; no virtualizarr PR |
| Observed (virtualizarr) fails; reference ok | Virtualizarr-specific bug | Fix in VirtualiZarr |
| Both succeed but differ | Real correctness gap | Fix in VirtualiZarr |
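
The attribution rules in the table can be sketched as a small classifier. This is only an illustration, not the infrastructure from zarr-developers/VirtualiZarr#990: the names `Attribution`, `classify` and `_run` are hypothetical, "fail identically" is assumed to mean same exception type and message, and results are compared with plain `==` (a real suite would need array-aware, NaN-aware comparison). Outcomes the table does not mention are reported as unclassified rather than guessed at.

```python
from enum import Enum


class Attribution(Enum):
    UPSTREAM = "track upstream, no virtualizarr PR"
    VIRTUALIZARR_BUG = "fix in VirtualiZarr (observed engine fails)"
    CORRECTNESS_GAP = "fix in VirtualiZarr (results differ)"
    UNCLASSIFIED = "outside the table, needs manual triage"
    OK = "no action"


def _run(fn):
    """Call fn and return (value, None) on success or (None, exception) on failure."""
    try:
        return fn(), None
    except Exception as exc:
        return None, exc


def classify(observed, reference):
    """Attribute the outcome of an observed (virtualizarr) and a reference open."""
    obs_val, obs_err = _run(observed)
    ref_val, ref_err = _run(reference)

    if obs_err is not None and ref_err is not None:
        # Assumption: "identical" means same exception type and same message.
        same = type(obs_err) is type(ref_err) and str(obs_err) == str(ref_err)
        return Attribution.UPSTREAM if same else Attribution.UNCLASSIFIED
    if obs_err is not None:
        return Attribution.VIRTUALIZARR_BUG
    if ref_err is not None:
        # Reference fails while virtualizarr succeeds: not covered by the table.
        return Attribution.UNCLASSIFIED
    return Attribution.OK if obs_val == ref_val else Attribution.CORRECTNESS_GAP
```

Passing callables rather than results keeps exception capture in one place, so both engines are treated the same way.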

Root-cause clusters

The 15 open issues plus two new findings collapse into 8 underlying problems. Each issue is listed under its primary cluster; cross-cluster cascades are noted inline.

A. Parser crashes during fill extraction — local parser fixes, ~5-20 lines each.

B. HDF parser _FillValue encoding gaps — local parser fix; emit base64 for kind S per docs/custom_parsers.md.

C. xarray FillValueCoder lacking branches — upstream, tracked at pydata/xarray#11332. Out of virtualizarr scope.

D. h5py default fillvalue propagated indiscriminately — parser fix: use dataset.id.get_create_plist().fill_value_defined() to skip propagating defaults. Fixing D removes the cascade into C for vlen-string-without-_FillValue cases.

E. Cross-parser inconsistency — different parsers produce different fill defaults / metadata for the same source. Architectural fix.

F. Writer-side fill semantics — writer-API design questions, distinct from parser fixes.

G. Attribute serialization fidelity — zarr v3 metadata is JSON; lossy for some attribute shapes.

H. Cross-cutting encoding model — meta-discussion; closes via the totality of the other clusters.
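
For cluster D, the check could look roughly like the sketch below. It relies on h5py's low-level `PropDCID.fill_value_defined()`, which reports whether the fill value is undefined, the library default, or user-defined. The helper name `explicit_fill_value` and its `user_defined_status` parameter are hypothetical, and returning `None` for "no explicit fill" is an assumption about how the parser would represent that case.

```python
def explicit_fill_value(dataset, user_defined_status=None):
    """Return the dataset's fill value only if it was set explicitly, else None.

    `dataset` is expected to behave like an h5py.Dataset.
    """
    if user_defined_status is None:
        # Imported lazily so the helper can be exercised with stand-in objects.
        import h5py

        user_defined_status = h5py.h5d.FILL_VALUE_USER_DEFINED

    # Ask the dataset creation property list how the fill value was defined.
    status = dataset.id.get_create_plist().fill_value_defined()
    if status == user_defined_status:
        return dataset.fillvalue
    # Undefined or library-default fill: do not propagate it.
    return None
```

With a real file this would be called as `explicit_fill_value(f["var"])`, where `"var"` is a placeholder dataset name.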

Phases

  1. Local parser fixes (low risk): "Correctly handle HDF5 fillvalue for string dtype arrays" (zarr-developers/VirtualiZarr#988), structured-dtype guard at _extract_attrs, ZarrParser default lookup, S-dtype base64 encoding. ~50 lines total across several small PRs.
  2. Upstream advocacy (parallel track, no virtualizarr PRs): pydata/xarray#11332 tracks the FillValueCoder JSON-native-scalar gap; engage with zarr-specs#351, zarr-extensions#33.
  3. fill_value_defined() distinction: stop propagating h5py-default fills to zarr storage.
  4. Cross-parser consistency: extend the property-test suite to Kerchunk, TIFF; port HDFParser conventions; document the contract in docs/custom_parsers.md.
  5. Writer-side round-trips: Icechunk / Kerchunk writers preserve fill semantics.
  6. Attribute fidelity: policy for non-JSON-serializable attrs ("How to handle non-JSON serializable attributes?", zarr-developers/VirtualiZarr#715), scalar dtype preservation across JSON metadata.
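
To make the attribute-fidelity concern in phase 6 concrete, the sketch below uses only the standard-library `json` module to show two kinds of loss: values that serialize but come back with a different shape, and values that cannot be serialized at all. The attribute names and the helpers `json_round_trip` and `is_json_serializable` are made up for illustration; they are not VirtualiZarr code.

```python
import json


def json_round_trip(attrs):
    """Serialize attrs to JSON text and parse them back."""
    return json.loads(json.dumps(attrs))


def is_json_serializable(value):
    """Return True if value can be written as strict JSON (no NaN/Infinity)."""
    try:
        json.dumps(value, allow_nan=False)
    except (TypeError, ValueError):
        return False
    return True


# Hypothetical attributes: a tuple and a mapping with integer keys.
original = {"valid_range": (0, 255), "levels": {1: "low", 2: "high"}}
restored = json_round_trip(original)

# Tuples come back as lists and integer keys come back as strings,
# so both attributes differ after the round trip.
changed = sorted(k for k in original if restored[k] != original[k])
```

Here `bytes` and `complex` values fail `is_json_serializable` with a `TypeError`, and NaN fails under `allow_nan=False` because strict JSON has no NaN literal.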

Phase 2 runs in parallel with all others. BothEnginesFailedIdenticallyError cases auto-resolve when xarray ships the fix; no virtualizarr code change required.
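
As an illustration of the S-dtype base64 item in phase 1: the Zarr v2 spec stores fill values for fixed-length byte-string dtypes as base64-encoded ASCII. A minimal sketch follows. The function names are hypothetical, and padding the value with NUL bytes to the dtype's item size is an assumption. The convention documented in docs/custom_parsers.md is the authority.

```python
import base64


def encode_bytes_fill(fill: bytes, itemsize: int) -> str:
    """Encode a fixed-width byte-string fill value as base64 ASCII text."""
    if len(fill) > itemsize:
        raise ValueError(f"fill value is longer than itemsize={itemsize}")
    # Assumption: shorter values are NUL-padded to the dtype's item size.
    padded = fill.ljust(itemsize, b"\x00")
    return base64.standard_b64encode(padded).decode("ascii")


def decode_bytes_fill(encoded: str) -> bytes:
    """Inverse of encode_bytes_fill; returns the padded raw bytes."""
    return base64.standard_b64decode(encoded)
```

For a `|S4` array with fill `b"N/A"`, `encode_bytes_fill(b"N/A", 4)` yields a JSON-safe string that decodes back to the four padded bytes.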

Status

References
