Problem
ctd_measurement (released parquet, v2026.04.08) has duplicate (ctd_cast_uuid, depth_m, measurement_type) keys — its declared natural key. assign_deterministic_uuids_md5() only warns on non-unique keys, so ctd_measurement_uuid is itself non-unique and the duplication shipped silently.
Evidence
Sampled the first 4 cruise partitions of data/parquet/calcofi_ctd-cast/ctd_measurement/:
| cruise_key |
rows |
duplicate-key rows |
| 1998-02-31JD |
1,250,773 |
0 |
| 1998-04-31JD |
3,040,202 |
1,520,101 (every row doubled) |
| 1998-07-32NM |
1,582,430 |
0 |
| 2003-02-31JD |
6,006,342 |
3,002,285 (every row doubled) |
Across those 4 cruises: 4,522,386 duplicate-key groups. 96.6% have identical measurement_value (true exact duplicates); 3.4% have conflicting values (max spread 43.76 in measurement units) — likely the same cast present in both a final and a preliminary source file, both unzipped and loaded.
Likely cause
In ingest_calcofi_ctd-cast.qmd, the read/bind/filter step or the final-vs-preliminary selection is not deduplicating: some cruises' casts are loaded twice. The "selects final if available, otherwise preliminary" logic may not be excluding the preliminary file when a final exists, or a file is bound twice.
Impact
ctd_measurement_uuid is non-unique → breaks any downstream join/PK assumption.
- Doubled rows inflate the table (~38% of sampled rows are duplicates).
- Conflicting-value dups mean some
(cast, depth, type) positions have two different recorded values with no recorded provenance to disambiguate.
Workaround in place
The new ctd_thin table (#51) de-duplicates defensively — QUALIFY ROW_NUMBER() OVER (PARTITION BY ctd_cast_uuid, depth_m, measurement_type ORDER BY measurement_qual NULLS LAST, measurement_value) = 1 — so it has a unique key regardless. But ctd_measurement itself should be fixed at the source.
Suggested fix
Problem
ctd_measurement(released parquet, v2026.04.08) has duplicate(ctd_cast_uuid, depth_m, measurement_type)keys — its declared natural key.assign_deterministic_uuids_md5()only warns on non-unique keys, soctd_measurement_uuidis itself non-unique and the duplication shipped silently.Evidence
Sampled the first 4 cruise partitions of
data/parquet/calcofi_ctd-cast/ctd_measurement/:Across those 4 cruises: 4,522,386 duplicate-key groups. 96.6% have identical
measurement_value(true exact duplicates); 3.4% have conflicting values (max spread 43.76 in measurement units) — likely the same cast present in both afinaland apreliminarysource file, both unzipped and loaded.Likely cause
In
ingest_calcofi_ctd-cast.qmd, the read/bind/filter step or the final-vs-preliminary selection is not deduplicating: some cruises' casts are loaded twice. The "selects final if available, otherwise preliminary" logic may not be excluding the preliminary file when a final exists, or a file is bound twice.Impact
ctd_measurement_uuidis non-unique → breaks any downstream join/PK assumption.(cast, depth, type)positions have two different recorded values with no recorded provenance to disambiguate.Workaround in place
The new
ctd_thintable (#51) de-duplicates defensively —QUALIFY ROW_NUMBER() OVER (PARTITION BY ctd_cast_uuid, depth_m, measurement_type ORDER BY measurement_qual NULLS LAST, measurement_value) = 1— so it has a unique key regardless. Butctd_measurementitself should be fixed at the source.Suggested fix
ctd_measurement's key after the pivot, failing the ingest if violatedfinal) and record it