Skip to content

QA/QC: validate sieve output against GDAL/rasterio reference #1168

@brendancol

Description

@brendancol

Reason or problem

The sieve module (xrspatial/sieve.py) was added recently and claims equivalence with GDAL's gdal_sieve.py. The existing tests cover internal correctness (region sizes, connectivity, skip_values, NaN handling), but nothing actually compares output against GDAL's sieve. Without that cross-check, we don't know if the behavior matches what users coming from GDAL/rasterio expect.

Proposal

Add a QA/QC test module that runs xrspatial.sieve() and rasterio.features.sieve() on identical inputs and asserts matching output. Cover:

  • Multiple thresholds (1, 5, 10, 50)
  • Both connectivity modes (4 and 8)
  • Rasters with NaN/nodata regions
  • Integer vs float inputs
  • A realistic classified raster, not just toy grids
  • Edge cases: single-value raster, checkerboard pattern, L-shaped regions

Where outputs differ, document why (e.g., tie-breaking order, nodata semantics).

Design

  • Test file: xrspatial/tests/test_sieve_gdal_parity.py
  • Use rasterio.features.sieve as reference (already an optional dep in CI)
  • Generate synthetic rasters in test fixtures so tests are self-contained
  • Use pytest.importorskip("rasterio") so tests skip if rasterio isn't installed
  • Document behavioral differences in inline comments, not a separate doc

Value

  • Lets us verify that migrating from GDAL sieve to xrspatial sieve produces the same results
  • Catches regressions if our output drifts from GDAL's behavior
  • Documents intentional differences

Drawbacks

  • Adds rasterio as a test-time dependency for this file (already available in CI)
  • If GDAL changes sieve behavior in future versions, tests may need updating

Alternatives

  • Compare against scipy.ndimage.label + manual merge, but that just reimplements the algorithm rather than validating against the accepted standard
  • Use GDAL python bindings directly, but rasterio.features.sieve is the more common user-facing API

Unresolved questions

  • Tie-breaking: GDAL may pick a different neighbor when multiple neighbors have the same size. We should document these cases rather than trying to match GDAL's arbitrary tie-breaking.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions