feat: defer track import until after dedup #5
Merged
jakebromberg merged 3 commits into main on Feb 15, 2026
Track tables (release_track, release_track_artist) are now imported after dedup instead of before, avoiding importing/deduplicating/indexing millions of track rows that would be discarded. Pre-computed track counts from CSV drive the dedup ranking instead of a live JOIN on release_track.

New pipeline step order (8 steps, pipeline state v2):

create_schema -> import_csv (base only) -> create_indexes (base only) -> dedup (base only) -> import_tracks -> create_track_indexes -> prune -> vacuum

Key changes:

- Split TABLES into BASE_TABLES + TRACK_TABLES with --base-only/--tracks-only flags
- Pre-compute release_track_count table from CSV for dedup ranking
- Filter track import to surviving release IDs after dedup
- Split add_constraints_and_indexes() into base and track versions
- Split create_indexes.sql; new create_track_indexes.sql for track indexes
- Split trigram_indexes_exist() into base and track variants
- Pipeline state v2 with v1 migration support
- Dedup falls back to release_track if release_track_count doesn't exist
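To illustrate the idea behind the pre-computed counts and the filtered track import, here is a minimal sketch in plain Python. It is not the code from this PR: the function names, the CSV column names, the dedup key, and the tie-break rule (lower release ID wins) are all assumptions made for the example. Only the overall flow (count tracks from CSV, rank duplicates by that count, then import only tracks of surviving releases) comes from the description above.

```python
import csv
import io
from collections import Counter


def count_tracks(track_csv):
    """Count track rows per release_id straight from the CSV (no database needed)."""
    counts = Counter()
    for row in csv.DictReader(track_csv):
        counts[int(row["release_id"])] += 1
    return counts


def pick_survivors(releases, track_counts):
    """Keep one release per dedup key, preferring the highest track count.

    The dedup key and the tie-break (lower release ID) are hypothetical.
    """
    best = {}
    for release_id, dedup_key in releases:
        rank = (track_counts.get(release_id, 0), -release_id)
        if dedup_key not in best or rank > best[dedup_key][0]:
            best[dedup_key] = (rank, release_id)
    return {release_id for _, release_id in best.values()}


def filter_tracks(track_csv, surviving_ids):
    """Yield only the track rows whose release survived dedup."""
    for row in csv.DictReader(track_csv):
        if int(row["release_id"]) in surviving_ids:
            yield row


# Hypothetical sample data: releases 1 and 2 are duplicates of each other.
TRACK_CSV = """release_id,position,title
1,1,Intro
1,2,Outro
2,1,Intro
3,1,Solo
"""
releases = [(1, "artist-a|album-x"), (2, "artist-a|album-x"), (3, "artist-b|album-y")]

track_counts = count_tracks(io.StringIO(TRACK_CSV))
survivors = pick_survivors(releases, track_counts)
kept_rows = list(filter_tracks(io.StringIO(TRACK_CSV), survivors))
```

In this sketch release 2 loses to release 1 because it has fewer tracks, so its track row is never imported. That is the saving the reordering is after: rows for discarded releases are skipped at import time rather than imported, indexed, and then deleted.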
d344c2f to 60a070d
added 2 commits February 14, 2026 15:45
- Add comment on f-string SQL noting values are trusted internal constants
- Rename misleading test_track_tables_empty_before_track_import to test_deduped_release_has_no_tracks
- Replace hardcoded step numbers with step name labels in run_pipeline.py comments to avoid renumbering when steps are added or reordered
Add label data (release_id, label_name) as a base table throughout the pipeline: schema, CSV filter, import, dedup copy-swap, verify/copy-to, vacuum, and reporting. Labels follow the same pattern as release_artist: an FK CASCADE child table that is imported before dedup, copy-swapped during dedup, and streamed during copy-to.

Also fix a pre-existing bug where copy-to targets were missing track trigram indexes (_create_target_indexes now runs create_track_indexes.sql in addition to create_indexes.sql).
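As a rough illustration of what "FK CASCADE child table" means for the label data, the sketch below uses Python's built-in sqlite3 module as a stand-in for the real database. The table name release_label and the sample rows are hypothetical; only the column names release_id and label_name come from the commit message. It shows one consequence of the pattern: deleting a release removes its label rows automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: release_label is a child of release with ON DELETE CASCADE.
conn.executescript("""
    CREATE TABLE release (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE release_label (
        release_id INTEGER NOT NULL REFERENCES release(id) ON DELETE CASCADE,
        label_name TEXT NOT NULL
    );
    INSERT INTO release VALUES (1, 'Album X'), (2, 'Album X (reissue)');
    INSERT INTO release_label VALUES (1, 'Label A'), (2, 'Label B');
""")

# Removing a release also removes its label rows through the cascade.
conn.execute("DELETE FROM release WHERE id = ?", (2,))
conn.commit()

remaining_labels = conn.execute(
    "SELECT release_id, label_name FROM release_label ORDER BY release_id"
).fetchall()
```

After the delete, only the label row for release 1 remains, so any step that removes releases does not need separate cleanup logic for labels.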
Summary
- Pre-compute a release_track_count table so dedup ranking works without track data in the database
- New pipeline step order: create_schema -> import_csv (base) -> create_indexes (base) -> dedup (base) -> import_tracks -> create_track_indexes -> prune -> vacuum
- Pipeline state v2 with v1 migration support (--resume)
- Dedup falls back to a release_track count if the release_track_count table doesn't exist (backward compat for standalone usage)

Test plan