Skip to content

ctd_measurement has duplicate (ctd_cast_uuid, depth_m, measurement_type) keys #52

@bbest

Description

@bbest

Problem

ctd_measurement (released parquet, v2026.04.08) has duplicate (ctd_cast_uuid, depth_m, measurement_type) keys — its declared natural key. assign_deterministic_uuids_md5() only warns on non-unique keys, so ctd_measurement_uuid is itself non-unique and the duplication shipped silently.

Evidence

Sampled the first 4 cruise partitions of data/parquet/calcofi_ctd-cast/ctd_measurement/:

cruise_key rows duplicate-key rows
1998-02-31JD 1,250,773 0
1998-04-31JD 3,040,202 1,520,101 (every row doubled)
1998-07-32NM 1,582,430 0
2003-02-31JD 6,006,342 3,002,285 (every row doubled)

Across those 4 cruises: 4,522,386 duplicate-key groups. 96.6% have identical measurement_value (true exact duplicates); 3.4% have conflicting values (max spread 43.76 in measurement units) — likely the same cast present in both a final and a preliminary source file, both unzipped and loaded.

Likely cause

In ingest_calcofi_ctd-cast.qmd, the read/bind/filter step or the final-vs-preliminary selection is not deduplicating: some cruises' casts are loaded twice. The "selects final if available, otherwise preliminary" logic may not be excluding the preliminary file when a final exists, or a file is bound twice.

Impact

  • ctd_measurement_uuid is non-unique → breaks any downstream join/PK assumption.
  • Doubled rows inflate the table (~38% of sampled rows are duplicates).
  • Conflicting-value dups mean some (cast, depth, type) positions have two different recorded values with no recorded provenance to disambiguate.

Workaround in place

The new ctd_thin table (#51) de-duplicates defensively — QUALIFY ROW_NUMBER() OVER (PARTITION BY ctd_cast_uuid, depth_m, measurement_type ORDER BY measurement_qual NULLS LAST, measurement_value) = 1 — so it has a unique key regardless. But ctd_measurement itself should be fixed at the source.

Suggested fix

  • Identify why those cruises double-load (final/preliminary dedup, or a double bind)
  • Add a uniqueness assertion (not just a warning) on ctd_measurement's key after the pivot, failing the ingest if violated
  • For genuinely conflicting final/preliminary values, decide a precedence rule (prefer final) and record it

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions