feat: defer track import until after dedup #5

Merged
jakebromberg merged 3 commits into main from feat/defer-track-import on Feb 15, 2026

@jakebromberg
Member

Summary

  • Defer track table import (release_track, release_track_artist) until after dedup, so the ~88% of track rows that dedup would discard are never imported, deduplicated, or indexed
  • Pre-compute track counts from CSV into a release_track_count table so dedup ranking works without track data in the database
  • New 8-step pipeline (v2): create_schema -> import_csv (base) -> create_indexes (base) -> dedup (base) -> import_tracks -> create_track_indexes -> prune -> vacuum
  • Pipeline state v2 with automatic v1 migration on --resume
  • Dedup falls back to live release_track count if release_track_count table doesn't exist (backward compat for standalone usage)
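
The pre-computed track counts described above could be gathered in one streaming pass over the track CSV. The following is a minimal sketch, not the PR's actual count_tracks_from_csv: the function name `count_tracks`, the `release_id` column header, and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_tracks(csv_file: TextIO, id_column: str = "release_id") -> Counter:
    """Count track rows per release ID without loading the file into memory."""
    counts: Counter = Counter()
    for row in csv.DictReader(csv_file):
        # Assumption: the CSV has a header row and integer release IDs.
        counts[int(row[id_column])] += 1
    return counts


# Hypothetical sample data standing in for the real track CSV.
sample = io.StringIO(
    "release_id,position,title\n"
    "1,A1,Intro\n"
    "1,A2,Outro\n"
    "2,1,Single\n"
)
track_counts = count_tracks(sample)
```

The resulting mapping is small (one integer per release), so it can be loaded into a release_track_count table before dedup runs, long before any track rows reach the database.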

Test plan

  • 209 unit tests pass (including new tests for count_tracks_from_csv, table split, pipeline state v2, v1 migration)
  • Integration tests: track count table, filtered track import, dedup with pre-computed counts, dedup fallback, db_introspect split trigram checks, new step inference, schema index split (need Postgres)
  • E2E tests: full pipeline, resume with 8 steps, state file version 2 (need Postgres)

Track tables (release_track, release_track_artist) are now imported
after dedup instead of before, avoiding importing/deduplicating/indexing
millions of track rows that would be discarded. Pre-computed track counts
from CSV drive the dedup ranking instead of a live JOIN on release_track.
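
As a rough illustration of how pre-computed counts can drive the ranking, the sketch below picks the count source and then chooses one survivor per duplicate group. Everything beyond the table names is an assumption: the `track_count` column name, the `group_key` grouping, and the lowest-ID tie-break are hypothetical and may differ from the real dedup logic.

```python
from typing import Dict, Iterable, Set, Tuple


def track_count_query(existing_tables: Iterable[str]) -> str:
    """Pick the SQL that supplies (release_id, track_count) pairs for ranking."""
    if "release_track_count" in set(existing_tables):
        # Pre-computed path: tracks are not in the database yet.
        return "SELECT release_id, track_count FROM release_track_count"
    # Fallback path for standalone dedup runs against a fully imported database.
    return (
        "SELECT release_id, COUNT(*) AS track_count "
        "FROM release_track GROUP BY release_id"
    )


def pick_survivors(
    releases: Iterable[Tuple[int, str]], track_counts: Dict[int, int]
) -> Set[int]:
    """Keep one release per group: most tracks wins, lowest ID breaks ties."""
    best: Dict[str, Tuple[int, int]] = {}
    for release_id, group_key in releases:
        # Negating the ID makes the lower ID compare as "larger" on ties.
        candidate = (track_counts.get(release_id, 0), -release_id)
        if group_key not in best or candidate > best[group_key]:
            best[group_key] = candidate
    return {-neg_id for _, neg_id in best.values()}


# Hypothetical duplicate groups: releases 1 and 2 collide, as do 3 and 4.
releases = [(1, "album-a"), (2, "album-a"), (3, "album-b"), (4, "album-b")]
survivors = pick_survivors(releases, {1: 10, 2: 12, 3: 8, 4: 8})
```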

New pipeline step order (8 steps, pipeline state v2):
  create_schema -> import_csv (base only) -> create_indexes (base only)
  -> dedup (base only) -> import_tracks -> create_track_indexes
  -> prune -> vacuum
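
The step order and the v1 migration on --resume might be modeled roughly as follows. This is a sketch under stated assumptions, not the code in run_pipeline.py: the state layout (`version`, `completed`) and the v1 step names are hypothetical. The mapping assumes that a v1 import_csv or create_indexes run already covered the track tables, since v1 imported tracks before dedup.

```python
from typing import Dict, List, Optional

# The eight v2 steps, in the order given in the PR description.
STEPS_V2: List[str] = [
    "create_schema",
    "import_csv",
    "create_indexes",
    "dedup",
    "import_tracks",
    "create_track_indexes",
    "prune",
    "vacuum",
]


def migrate_state(state: Dict) -> Dict:
    """Upgrade a v1 state dict to v2; a v2 state passes through unchanged."""
    if state.get("version") == 2:
        return state
    done = set(state.get("completed", []))
    # Assumption: v1 handled track tables inside these two steps.
    if "import_csv" in done:
        done.add("import_tracks")
    if "create_indexes" in done:
        done.add("create_track_indexes")
    return {"version": 2, "completed": [s for s in STEPS_V2 if s in done]}


def next_step(state: Dict) -> Optional[str]:
    """Return the first v2 step not yet completed, or None when finished."""
    done = set(state["completed"])
    return next((s for s in STEPS_V2 if s not in done), None)


# Hypothetical v1 state file contents from an interrupted run.
v1_state = {"version": 1, "completed": ["create_schema", "import_csv"]}
v2_state = migrate_state(v1_state)
```

Keying the state on step names rather than positions means adding or reordering steps only changes the list, not the saved state.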

Key changes:
- Split TABLES into BASE_TABLES + TRACK_TABLES with --base-only/--tracks-only flags
- Pre-compute release_track_count table from CSV for dedup ranking
- Filter track import to surviving release IDs after dedup
- Split add_constraints_and_indexes() into base and track versions
- Split create_indexes.sql; new create_track_indexes.sql for track indexes
- Split trigram_indexes_exist() into base and track variants
- Pipeline state v2 with v1 migration support
- Dedup falls back to release_track if release_track_count doesn't exist
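
Filtering the track import to surviving release IDs can be pictured as a streaming CSV filter placed in front of the database load. The sketch below is illustrative only: the function name `filter_track_rows`, the `release_id` column header, and the sample rows are assumptions, and the real pipeline may filter differently (for example, inside the import itself).

```python
import csv
import io
from typing import Set, TextIO


def filter_track_rows(
    csv_in: TextIO,
    csv_out: TextIO,
    surviving_ids: Set[int],
    id_column: str = "release_id",
) -> int:
    """Copy only rows whose release survived dedup; return the rows kept."""
    reader = csv.DictReader(csv_in)
    writer = csv.DictWriter(
        csv_out, fieldnames=reader.fieldnames or [], lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        # Assumption: release IDs are integers in the column named id_column.
        if int(row[id_column]) in surviving_ids:
            writer.writerow(row)
            kept += 1
    return kept


# Hypothetical track CSV; release 2 was discarded by dedup.
tracks_in = io.StringIO(
    "release_id,position,title\n"
    "1,A1,Intro\n"
    "1,A2,Outro\n"
    "2,1,Single\n"
    "3,1,Other\n"
)
tracks_out = io.StringIO()
kept_rows = filter_track_rows(tracks_in, tracks_out, surviving_ids={1, 3})
```
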
jakebromberg force-pushed the feat/defer-track-import branch from d344c2f to 60a070d on February 14, 2026 23:21
Jake Bromberg added 2 commits February 14, 2026 15:45
- Add comment on f-string SQL noting values are trusted internal constants
- Rename misleading test_track_tables_empty_before_track_import to
  test_deduped_release_has_no_tracks
- Replace hardcoded step numbers with step name labels in run_pipeline.py
  comments to avoid renumbering when steps are added or reordered

Add label data (release_id, label_name) as a base table throughout the
pipeline: schema, CSV filter, import, dedup copy-swap, verify/copy-to,
vacuum, and reporting. Labels follow the same pattern as release_artist
-- FK CASCADE child table, imported before dedup, copy-swapped during
dedup, streamed during copy-to.

Also fix a pre-existing bug where copy-to targets were missing track
trigram indexes (_create_target_indexes now runs create_track_indexes.sql
in addition to create_indexes.sql).
jakebromberg merged commit 8506e80 into main on Feb 15, 2026
3 checks passed
jakebromberg deleted the feat/defer-track-import branch on February 15, 2026 07:13