ETL pipeline for building a PostgreSQL cache of Discogs release data, filtered to artists in the WXYC radio library catalog. Reduces Discogs API calls by providing a local cache for album lookups, track validation, and artwork URLs.
The pipeline processes monthly Discogs data dumps (~40 GB XML) into a focused PostgreSQL database (~3 GB) containing only releases by artists in the WXYC library catalog. This provides:
- Fast local lookups instead of rate-limited Discogs API calls
- Accent-insensitive trigram fuzzy text search via pg_trgm + unaccent
- Shared data resource for multiple consuming services
- Python 3.11+
- PostgreSQL with the `pg_trgm` and `unaccent` extensions (or use Docker Compose)
- Discogs monthly data dump (XML) from https://discogs-data-dumps.s3.us-west-2.amazonaws.com/index.html
- discogs-xml2db -- clone separately; not a PyPI package
- `library_artists.txt` and `library.db` (produced by request-o-matic's library sync)
```bash
pip install -e ".[dev]"
```

All 9 steps are automated by `run_pipeline.py` (or Docker Compose). The script supports two modes: full pipeline from XML, or database build from pre-filtered CSVs.
| Step | Script | Description |
|---|---|---|
| 1. Convert | `discogs-xml2db` | XML data dump to CSV |
| 2. Fix newlines | `scripts/fix_csv_newlines.py` | Clean embedded newlines in CSV fields |
| 2.5. Enrich | `scripts/enrich_library_artists.py` | Enrich artist list with cross-references (optional) |
| 3. Filter | `scripts/filter_csv.py` | Keep only library artists, with diacritics normalization (~70% reduction) |
| 4. Create schema | `schema/create_database.sql`, `schema/create_functions.sql` | Set up tables, extensions, and functions |
| 5. Import | `scripts/import_csv.py` | Bulk load CSVs via psycopg COPY |
| 6. Create indexes | `schema/create_indexes.sql` | Accent-insensitive trigram GIN indexes for fuzzy search |
| 7. Deduplicate | `scripts/dedup_releases.py` | Keep best release per master_id (most tracks; see the sketch below) |
| 8. Prune/Copy | `scripts/verify_cache.py` | Remove non-library releases or copy matches to target DB |
| 9. Vacuum | `VACUUM FULL` | Reclaim disk space |
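To make step 7 concrete, the deduplication rule ("keep the release with the most tracks for each master_id") can be sketched in SQL. This is an illustration only, not the implementation in `scripts/dedup_releases.py`; join-column names such as `release_id` are assumptions.

```sql
-- Illustrative sketch of the step 7 rule: for each master_id, keep the
-- release with the most tracks and delete the rest.
WITH ranked AS (
    SELECT r.id,
           ROW_NUMBER() OVER (
               PARTITION BY r.master_id
               ORDER BY count(rt.release_id) DESC, r.id
           ) AS rn
    FROM release r
    LEFT JOIN release_track rt ON rt.release_id = r.id
    WHERE r.master_id IS NOT NULL
    GROUP BY r.id, r.master_id
)
DELETE FROM release
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);
-- Rows in child tables (release_artist, release_track, ...) would need the
-- same cleanup, e.g. via ON DELETE CASCADE foreign keys. The real script
-- also drops the master_id column afterwards.
```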
Step 2.5 generates `library_artists.txt` from `library.db` and optionally enriches it with alternate artist names and cross-references from the WXYC MySQL catalog database. This reduces false negatives at the filtering stage for artists known by multiple names (e.g., "Body Count" filed under Ice-T).
The easiest way to run the full pipeline:
```bash
# Clone discogs-xml2db (one-time setup)
git clone https://github.com/philipmat/discogs-xml2db.git
# Place input files in data/
mkdir -p data
cp /path/to/releases.xml.gz data/
cp /path/to/library_artists.txt data/
cp /path/to/library.db data/
docker compose up --build
```

Full pipeline from XML (steps 1-9):
```bash
python scripts/run_pipeline.py \
--xml /path/to/releases.xml.gz \
--xml2db /path/to/discogs-xml2db/ \
--library-artists /path/to/library_artists.txt \
--library-db /path/to/library.db \
--database-url postgresql://localhost:5432/discogs
```

To enrich `library_artists.txt` with alternate names and cross-references from the WXYC catalog database, add `--wxyc-db-url`:
```bash
python scripts/run_pipeline.py \
--xml /path/to/releases.xml.gz \
--xml2db /path/to/discogs-xml2db/ \
--library-artists /path/to/library_artists.txt \
--library-db /path/to/library.db \
--wxyc-db-url mysql://user:pass@host:port/wxycmusic \
--database-url postgresql://localhost:5432/discogs
```

Database build from pre-filtered CSVs (steps 4-9):
```bash
python scripts/run_pipeline.py \
--csv-dir /path/to/filtered/ \
--library-db /path/to/library.db \
--database-url postgresql://localhost:5432/discogs
```

- `--library-db` is optional; if omitted, the prune step is skipped
- `--database-url` defaults to the `DATABASE_URL` env var or `postgresql://localhost:5432/discogs`
Instead of pruning releases in place (which destroys the full imported dataset), you can copy only matched releases to a separate target database:
```bash
python scripts/run_pipeline.py \
--csv-dir /path/to/filtered/ \
--library-db /path/to/library.db \
--database-url postgresql://localhost:5432/discogs \
--target-db-url postgresql://localhost:5432/discogs_cache
```

This preserves the full `discogs` database and creates a lean `discogs_cache` database with only KEEP and REVIEW releases, complete with schema, FK constraints, and trigram indexes. The target database is created automatically if it doesn't exist.
You can also use `--copy-to` directly with `verify_cache.py`:
```bash
python scripts/verify_cache.py \
--copy-to postgresql://localhost:5432/discogs_cache \
/path/to/library.db \
postgresql://localhost:5432/discogs
```

`--copy-to` and `--prune` are mutually exclusive.
If a pipeline run fails mid-way (e.g., disk full during index creation), you can resume from where it left off instead of restarting from scratch:
```bash
python scripts/run_pipeline.py \
--csv-dir /path/to/filtered/ \
--library-db /path/to/library.db \
--database-url postgresql://localhost:5432/discogs \
--resume
```

The pipeline tracks step completion in a JSON state file (default: `.pipeline_state.json`). On resume, completed steps are skipped. You can specify a custom state file path with `--state-file`:
```bash
python scripts/run_pipeline.py \
--csv-dir /path/to/filtered/ \
--resume \
--state-file /tmp/my_pipeline_state.json
```

If no state file exists when `--resume` is used, the pipeline infers completed steps from database state (e.g., schema exists, tables have rows, indexes present, `master_id` column dropped by dedup).
`--resume` is only valid with `--csv-dir` mode, not `--xml` mode.
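As a rough illustration of that inference, the checks amount to queries of this kind (the actual logic lives in `run_pipeline.py`; the table name and index-matching pattern below are assumptions):

```sql
-- Schema created?
SELECT to_regclass('public.release') IS NOT NULL AS schema_exists;

-- Import done? (tables have rows)
SELECT EXISTS (SELECT 1 FROM release) AS has_rows;

-- Indexes built?
SELECT EXISTS (
    SELECT 1 FROM pg_indexes
    WHERE tablename = 'release' AND indexdef ILIKE '%gin_trgm_ops%'
) AS trgm_indexes_present;

-- Dedup done? (master_id column dropped)
SELECT NOT EXISTS (
    SELECT 1 FROM information_schema.columns
    WHERE table_name = 'release' AND column_name = 'master_id'
) AS dedup_done;
```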
Individual steps can also be run directly:
```bash
# 1. Convert XML to CSV (run from discogs-xml2db directory)
cd /path/to/discogs-xml2db
python run.py --export release --output /path/to/raw/ /path/to/releases.xml.gz
# 2. Fix CSV newlines
python scripts/fix_csv_newlines.py /path/to/raw/release.csv /path/to/cleaned/release.csv
# 3. Filter to library artists
python scripts/filter_csv.py /path/to/library_artists.txt /path/to/cleaned/ /path/to/filtered/
# 4. Create schema and functions
psql -d discogs -f schema/create_database.sql
psql -d discogs -f schema/create_functions.sql
# 5. Import CSVs
python scripts/import_csv.py /path/to/filtered/ [database_url]
# 6. Create indexes (10-30 min on large datasets)
psql -d discogs -f schema/create_indexes.sql
# 7. Deduplicate
python scripts/dedup_releases.py [database_url]
# 8. Prune (dry run first, then with --prune or --copy-to)
python scripts/verify_cache.py /path/to/library.db [database_url]
python scripts/verify_cache.py --prune /path/to/library.db [database_url]
# Or copy to a target database instead:
python scripts/verify_cache.py --copy-to postgresql:///discogs_cache /path/to/library.db [database_url]
# 9. Vacuum
psql -d discogs -c "VACUUM FULL;"
```

The schema files in `schema/` define the shared contract between this ETL pipeline and all consumers.
| Table | Description |
|---|---|
| `release` | Release metadata: id, title, release_year, artwork_url |
| `release_artist` | Artists on releases (main + extra credits) |
| `release_track` | Tracks on releases with position and duration |
| `release_track_artist` | Artists on specific tracks (for compilations) |
| `cache_metadata` | Data freshness tracking (cached_at, source) |
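For orientation, a typical cache lookup touches these tables roughly as follows. This is illustrative only: join-column names such as `release_id` are assumptions, and the authoritative definitions live in `schema/create_database.sql`.

```sql
-- Fetch a release with its artists and tracks
SELECT r.id, r.title, r.release_year, r.artwork_url,
       ra.artist_name,
       rt.position, rt.duration
FROM release r
JOIN release_artist ra ON ra.release_id = r.id
JOIN release_track  rt ON rt.release_id = r.id
WHERE r.id = $1;

-- Check data freshness
SELECT source, max(cached_at) AS last_cached
FROM cache_metadata
GROUP BY source;
```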
- Foreign key indexes on all child tables
- Accent-insensitive trigram GIN indexes (`pg_trgm` + `unaccent`) on `title` and `artist_name` columns for fuzzy text search. Uses an immutable `f_unaccent()` wrapper to enable index expressions with `lower(f_unaccent(column))`; see the sketch below.
- Cache metadata indexes for freshness queries
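A minimal sketch of that setup, assuming the common immutable-wrapper pattern (the authoritative definitions are in `schema/create_functions.sql` and `schema/create_indexes.sql`; the index name below is made up):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;

-- unaccent() is not IMMUTABLE, so wrap it to allow use in index expressions
CREATE OR REPLACE FUNCTION f_unaccent(text)
RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
AS $$ SELECT public.unaccent('public.unaccent', $1) $$;

-- Trigram GIN index over the normalized expression...
CREATE INDEX IF NOT EXISTS release_title_trgm_idx
    ON release USING gin (lower(f_unaccent(title)) gin_trgm_ops);

-- ...so fuzzy, accent-insensitive lookups can use it
SELECT id, title
FROM release
WHERE lower(f_unaccent(title)) % lower(f_unaccent('Motorhead'))
ORDER BY similarity(lower(f_unaccent(title)), lower(f_unaccent('Motorhead'))) DESC
LIMIT 10;
```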
Consumers connect via the `DATABASE_URL_DISCOGS` environment variable:
```bash
DATABASE_URL_DISCOGS=postgresql://user:pass@host:5432/discogs
```
Current consumers:
- request-o-matic (`discogs/cache_service.py`) - Python/asyncpg
- Backend-Service - TypeScript/Node.js (planned)
Tests are organized into three layers:
```bash
# Unit tests (no external dependencies, run by default)
pytest tests/unit/ -v
# Integration tests (needs PostgreSQL)
DATABASE_URL_TEST=postgresql://discogs:discogs@localhost:5433/postgres \
pytest -m postgres -v
# MySQL integration tests (needs WXYC MySQL on port 3307)
pytest -m mysql -v
# E2E tests (needs PostgreSQL, runs full pipeline as subprocess)
DATABASE_URL_TEST=postgresql://discogs:discogs@localhost:5433/postgres \
pytest -m e2e -v
```

Integration and E2E tests are excluded by default (`pytest` with no args runs only unit tests). Start the test database with:
```bash
docker compose up db -d
```

The `migrations/` directory contains historical one-time migrations:
- `01_optimize_schema.sql` - Initial schema optimization (drops unused tables/columns, adds artwork_url and release_year, deduplicates by master_id). Already applied to the production database.
- Cache Technical Overview -- design rationale, benchmarks, and pipeline architecture details