-
Notifications
You must be signed in to change notification settings - Fork 86
QA/QC: validate sieve output against GDAL/rasterio reference #1168
Copy link
Copy link
Closed
Labels
Description
Reason or problem
The sieve module (xrspatial/sieve.py) was added recently and claims equivalence with GDAL's gdal_sieve.py. The existing tests cover internal correctness (region sizes, connectivity, skip_values, NaN handling), but nothing actually compares output against GDAL's sieve. Without that cross-check, we don't know if the behavior matches what users coming from GDAL/rasterio expect.
Proposal
Add a QA/QC test module that runs xrspatial.sieve() and rasterio.features.sieve() on identical inputs and asserts matching output. Cover:
- Multiple thresholds (1, 5, 10, 50)
- Both connectivity modes (4 and 8)
- Rasters with NaN/nodata regions
- Integer vs float inputs
- A realistic classified raster, not just toy grids
- Edge cases: single-value raster, checkerboard pattern, L-shaped regions
Where outputs differ, document why (e.g., tie-breaking order, nodata semantics).
Design
- Test file:
xrspatial/tests/test_sieve_gdal_parity.py - Use
rasterio.features.sieveas reference (already an optional dep in CI) - Generate synthetic rasters in test fixtures so tests are self-contained
- Use
pytest.importorskip("rasterio")so tests skip if rasterio isn't installed - Document behavioral differences in inline comments, not a separate doc
Value
- Lets us verify that migrating from GDAL sieve to xrspatial sieve produces the same results
- Catches regressions if our output drifts from GDAL's behavior
- Documents intentional differences
Drawbacks
- Adds rasterio as a test-time dependency for this file (already available in CI)
- If GDAL changes sieve behavior in future versions, tests may need updating
Alternatives
- Compare against scipy.ndimage.label + manual merge, but that just reimplements the algorithm rather than validating against the accepted standard
- Use GDAL python bindings directly, but rasterio.features.sieve is the more common user-facing API
Unresolved questions
- Tie-breaking: GDAL may pick a different neighbor when multiple neighbors have the same size. We should document these cases rather than trying to match GDAL's arbitrary tie-breaking.
Reactions are currently unavailable