Skip to content

QA/QC: align sieve with GDAL and add parity tests#1169

Merged
brendancol merged 5 commits intomasterfrom
issue-1168
Apr 6, 2026
Merged

QA/QC: align sieve with GDAL and add parity tests#1169
brendancol merged 5 commits intomasterfrom
issue-1168

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

  • Add test_sieve_gdal_parity.py with 30 tests comparing xrspatial.sieve() against rasterio.features.sieve() (GDAL wrapper) on identical inputs
  • Match GDAL's single-pass semantics: only merge a small region into a neighbor already >= threshold. Regions with no above-threshold neighbor stay put.
  • Remove the iteration loop and convergence warning (single pass, nothing to converge)
  • Drop "Author of Proposal" from the issue template (GitHub already shows who filed it)

Closes #1168

Test plan

  • pytest xrspatial/tests/test_sieve.py -- 45 existing tests updated for single-pass behavior
  • pytest xrspatial/tests/test_sieve_gdal_parity.py -- 30 new parity tests, all assert exact equality with GDAL (one tie-breaking test allows < 0.5% mismatch)
  • Verify no other test files import _MAX_ITERATIONS or reference convergence warning

_extract_transect was calling .compute() on the full dask array just to
read a handful of transect cells. Now uses vindex fancy indexing so only
the relevant chunks are materialized.

cumulative_viewshed was allocating a full-size np.zeros count array and
calling .values on each viewshed result, forcing materialization every
iteration. Now accumulates lazily with da.zeros and dask array addition
when the input is dask-backed.
The dask Tier B memory guard underestimated peak usage at 280 bytes/pixel.
Actual peak during lexsort reaches ~360 bytes/pixel (sorted + unsorted
event_list coexist) plus 8 bytes/pixel for the computed raster. Updated
estimate to 368 bytes/pixel to prevent borderline OOM.

Also use astype(copy=False) to skip the float64 copy when data is already
float64.
Cross-validates xrspatial.sieve() against rasterio.features.sieve()
(GDAL wrapper) on identical inputs. 30 tests covering thresholds,
connectivity modes, nodata, dtypes, and edge cases.

Documents one behavioral difference: GDAL runs a single pass while
xrspatial iterates. When all regions are below threshold, GDAL leaves
the raster unchanged; xrspatial cascades merges to convergence.
GitHub already shows who opened an issue. Removed the field from the
feature-proposal template and updated the rockout command to note it
should be skipped.
GDAL's sieve only merges a small region into a neighbor whose size
is already >= threshold. If no such neighbor exists, the region stays.
Previously xrspatial iterated up to 50 passes and merged into any
neighbor regardless of its size, causing cascading merges that GDAL
does not perform.

Changes:
- _sieve_numpy: remove iteration loop, filter merge targets to
  neighbors >= threshold
- Remove _MAX_ITERATIONS and convergence warning (single pass
  always completes)
- Update docstrings to document GDAL-matching semantics
- Update tests: checkerboard/stripe/all-small-regions cases now
  correctly leave the raster unchanged
- All 30 GDAL parity tests now assert exact equality
@github-actions github-actions bot added the performance PR touches performance-sensitive code label Apr 6, 2026
@brendancol brendancol merged commit 641d74a into master Apr 6, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

QA/QC: validate sieve output against GDAL/rasterio reference

1 participant