Skip to content

ctd_cast_uuid is not a unique PK: OxAveuM_StaCorr mis-modeled + GPS jitter #53

@bbest

Description

@bbest

Problem

ctd_cast.ctd_cast_uuid is declared the primary key but is not unique — 6,063,332 rows / 5,550,014 distinct uuid → 513,318 duplicate-uuid rows (319,767 dup groups). Pre-existing (the v2026.04.08 release has it too — that ctd_cast had 594,011 dup rows). Distinct from #52 (different mechanism).

Found while verifying the #52 fix: the ctd_raw de-dup (#52) reduced ctd_cast dups 594K→513K but a second mechanism remains.

Two causes

Column-variance breakdown across the 319,767 dup-uuid groups:

column groups where it varies
ox_aveu_m_sta_corr 277,893
geom / lat_dec / lon_dec 140,990 / 81,807 / 95,646
everything else (datetime_utc, site_key, cast_key, data_stage, …) 0

Cause 1 — OxAveuM_StaCorr is a mis-modeled measurement (dominant, ~87%). The source CSV column OxAveuM_StaCorr (oxygen µM average, station-corrected) is a per-depth measurement value, but it is not registered in metadata/measurement_type.csv (the umol/kg oxygen types registered are only ox1u_m* and ox2u_m*, not the ave). So meas_cols doesn't include it, cast_cols <- setdiff(raw_cols, c(meas_cols, ...)) keeps it, and it lands in ctd_cast as if it were cast-level metadata. Because it varies per depth, SELECT DISTINCT cast_cols emits multiple ctd_cast rows per (cruise_key, cast_key, cast_dir, datetime_utc).

Cause 2 — GPS jitter. lat_dec/lon_dec differ in the ~5th–6th decimal (sub-meter) between scans that share a 1-second datetime_utc.

Effect

  • ctd_cast_uuid is not a valid unique PK.
  • ctd_summary (built via ctd_measurement m JOIN ctd_cast c ON ctd_cast_uuid) fans out — confirmed +137,482 extra rows for cruise 2003-02-31JD alone (~4.6% inflation), so ctd_summary.n_obs is over-counted for affected positions.
  • Joining ctd_thinctd_cast for cast metadata (e.g. in the ctd-viz app) would duplicate cast points.

Suggested fix

  1. Register OxAveuM_StaCorr in metadata/measurement_type.csv (e.g. oxygen_umol_kg_ave_sta_corr, mirroring the existing oxygen_ml_l_ave_sta_corr) so it pivots into ctd_measurement and leaves cast_cols. Consider making it the canonical umol/kg oxygen type (the ave is generally preferred over sensor-1).
  2. De-duplicate ctd_cast to one row per (cruise_key, cast_key, cast_dir, datetime_utc) for the residual GPS-jitter case (pick one position deterministically).
  3. Add a guard: assert every raw measurement-like column is registered in measurement_type.csv (so an unregistered column can't silently leak into ctd_cast again).

Once fixed, ctd_cast_uuid is a valid PK and ctd_summary is no longer inflated.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions