Problem
ctd_cast.ctd_cast_uuid is declared the primary key but is not unique — 6,063,332 rows / 5,550,014 distinct uuid → 513,318 duplicate-uuid rows (319,767 dup groups). Pre-existing (the v2026.04.08 release has it too — that ctd_cast had 594,011 dup rows). Distinct from #52 (different mechanism).
Found while verifying the #52 fix: the ctd_raw de-dup (#52) reduced ctd_cast dups 594K→513K but a second mechanism remains.
Two causes
Column-variance breakdown across the 319,767 dup-uuid groups:
| column |
groups where it varies |
ox_aveu_m_sta_corr |
277,893 |
geom / lat_dec / lon_dec |
140,990 / 81,807 / 95,646 |
everything else (datetime_utc, site_key, cast_key, data_stage, …) |
0 |
Cause 1 — OxAveuM_StaCorr is a mis-modeled measurement (dominant, ~87%). The source CSV column OxAveuM_StaCorr (oxygen µM average, station-corrected) is a per-depth measurement value, but it is not registered in metadata/measurement_type.csv (the umol/kg oxygen types registered are only ox1u_m* and ox2u_m*, not the ave). So meas_cols doesn't include it, cast_cols <- setdiff(raw_cols, c(meas_cols, ...)) keeps it, and it lands in ctd_cast as if it were cast-level metadata. Because it varies per depth, SELECT DISTINCT cast_cols emits multiple ctd_cast rows per (cruise_key, cast_key, cast_dir, datetime_utc).
Cause 2 — GPS jitter. lat_dec/lon_dec differ in the ~5th–6th decimal (sub-meter) between scans that share a 1-second datetime_utc.
Effect
ctd_cast_uuid is not a valid unique PK.
ctd_summary (built via ctd_measurement m JOIN ctd_cast c ON ctd_cast_uuid) fans out — confirmed +137,482 extra rows for cruise 2003-02-31JD alone (~4.6% inflation), so ctd_summary.n_obs is over-counted for affected positions.
- Joining
ctd_thin → ctd_cast for cast metadata (e.g. in the ctd-viz app) would duplicate cast points.
Suggested fix
- Register
OxAveuM_StaCorr in metadata/measurement_type.csv (e.g. oxygen_umol_kg_ave_sta_corr, mirroring the existing oxygen_ml_l_ave_sta_corr) so it pivots into ctd_measurement and leaves cast_cols. Consider making it the canonical umol/kg oxygen type (the ave is generally preferred over sensor-1).
- De-duplicate
ctd_cast to one row per (cruise_key, cast_key, cast_dir, datetime_utc) for the residual GPS-jitter case (pick one position deterministically).
- Add a guard: assert every raw measurement-like column is registered in
measurement_type.csv (so an unregistered column can't silently leak into ctd_cast again).
Once fixed, ctd_cast_uuid is a valid PK and ctd_summary is no longer inflated.
Problem
ctd_cast.ctd_cast_uuidis declared the primary key but is not unique — 6,063,332 rows / 5,550,014 distinct uuid → 513,318 duplicate-uuid rows (319,767 dup groups). Pre-existing (the v2026.04.08 release has it too — thatctd_casthad 594,011 dup rows). Distinct from #52 (different mechanism).Found while verifying the #52 fix: the
ctd_rawde-dup (#52) reducedctd_castdups 594K→513K but a second mechanism remains.Two causes
Column-variance breakdown across the 319,767 dup-uuid groups:
ox_aveu_m_sta_corrgeom/lat_dec/lon_decdatetime_utc,site_key,cast_key,data_stage, …)Cause 1 —
OxAveuM_StaCorris a mis-modeled measurement (dominant, ~87%). The source CSV columnOxAveuM_StaCorr(oxygen µM average, station-corrected) is a per-depth measurement value, but it is not registered inmetadata/measurement_type.csv(the umol/kg oxygen types registered are onlyox1u_m*andox2u_m*, not theave). Someas_colsdoesn't include it,cast_cols <- setdiff(raw_cols, c(meas_cols, ...))keeps it, and it lands inctd_castas if it were cast-level metadata. Because it varies per depth,SELECT DISTINCT cast_colsemits multiplectd_castrows per(cruise_key, cast_key, cast_dir, datetime_utc).Cause 2 — GPS jitter.
lat_dec/lon_decdiffer in the ~5th–6th decimal (sub-meter) between scans that share a 1-seconddatetime_utc.Effect
ctd_cast_uuidis not a valid unique PK.ctd_summary(built viactd_measurement m JOIN ctd_cast c ON ctd_cast_uuid) fans out — confirmed +137,482 extra rows for cruise2003-02-31JDalone (~4.6% inflation), soctd_summary.n_obsis over-counted for affected positions.ctd_thin→ctd_castfor cast metadata (e.g. in the ctd-viz app) would duplicate cast points.Suggested fix
OxAveuM_StaCorrinmetadata/measurement_type.csv(e.g.oxygen_umol_kg_ave_sta_corr, mirroring the existingoxygen_ml_l_ave_sta_corr) so it pivots intoctd_measurementand leavescast_cols. Consider making it the canonical umol/kg oxygen type (theaveis generally preferred over sensor-1).ctd_castto one row per(cruise_key, cast_key, cast_dir, datetime_utc)for the residual GPS-jitter case (pick one position deterministically).measurement_type.csv(so an unregistered column can't silently leak intoctd_castagain).Once fixed,
ctd_cast_uuidis a valid PK andctd_summaryis no longer inflated.