Toolkit to process and enrich your personal Spotify listening history, with memory-safe merges and resumable Spotify API enrichment.
This project processes your extended Spotify streaming history data to:
- Combine multiple JSON history files into a single dataset
- Clean and enrich the data with temporal features
- Enrich metadata using multiple sources (Spotify API + external datasets)
- Analyze listening patterns, genre evolution, and music trends
- Generate visualizations of your musical journey
- Data Processing: Combine and clean multiple Spotify history JSON files
- Multi-Source Metadata Enrichment:
- Spotify Web API (complete but rate-limited)
- External Kaggle datasets (fast, high coverage)
- Automatic matching and merging
- Comprehensive Analysis: Audio features, genres, popularity, and trends
- Organized Structure: Clean, scalable codebase with proper separation
- Resume Capability: All processes can be safely interrupted and resumed
spotify-data/
├── README.md
├── PROJECT_STATUS.md
├── requirements.txt
├── Makefile
├── docs/                          # Original Spotify data from export
├── data/
│   ├── processed/
│   │   ├── cleaned_streaming_history.csv
│   │   └── combined_streaming_history.csv
│   └── enriched/                  # (symlinked under scripts/spotify_api/data)
│       ├── ultimate_spotify_enriched_streaming_history.csv
│       ├── spotify_api_metadata.csv                    # append-only API results
│       ├── progress.sqlite                             # crash-safe progress
│       ├── spotify_api_enriched_streaming_history.csv  # merged final (base LEFT JOIN meta)
│       ├── spotify_api_enriched_streaming_history_songs.csv
│       └── spotify_api_enriched_streaming_history_podcasts.csv
├── scripts/
│   ├── data_processing/
│   │   ├── clean-history.py
│   │   └── combine-history.py
│   ├── external_matching/
│   │   └── ultimate_spotify_matcher.py
│   ├── enrichment/
│   │   └── merge_enrichments.py   # legacy (superseded by DuckDB)
│   ├── analysis/
│   ├── app/
│   │   └── orchestrate.py
│   └── spotify_api/
│       ├── __init__.py
│       ├── duckdb_merge.py               # memory-safe on-disk LEFT JOIN (DuckDB)
│       ├── smart_metadata_enrichment.py  # API enrichment + merge-only mode
│       ├── split_media_types.py          # split final CSV into songs/podcasts
│       └── data -> ../../data            # symlink so BASE_DIR-relative paths work
└── external_datasets/
- Final merged CSV generated by DuckDB without large in-memory joins.
- Songs vs. podcasts split available for downstream analysis.
- Spotify API enrichment runs in batches, appends to CSV, and records progress in SQLite for safe resume.
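The append-and-checkpoint pattern above can be sketched roughly as follows. The `progress` table and `meta_json` column mirror the names used in this project, but `fetch_metadata` and `csv_append` are hypothetical stand-ins for the actual API call and CSV writer:

```python
import json
import sqlite3


def enrich_batch(conn, keys, fetch_metadata, csv_append):
    """Process only keys with no recorded result yet; checkpoint each one."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress (key TEXT PRIMARY KEY, meta_json TEXT)"
    )
    done = {row[0] for row in conn.execute(
        "SELECT key FROM progress WHERE meta_json IS NOT NULL")}
    for key in keys:
        if key in done:
            continue  # already enriched in a previous run
        meta = fetch_metadata(key)  # e.g. a Spotify API call
        csv_append(key, meta)       # append-only results CSV
        conn.execute(
            "INSERT OR REPLACE INTO progress (key, meta_json) VALUES (?, ?)",
            (key, json.dumps(meta)),
        )
        conn.commit()  # commit per item -> safe to kill and resume anytime
```

Because progress is committed per item, interrupting the process loses at most the in-flight request, and a restart simply skips everything already in `progress`.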
Coverage snapshot (Aug 15, 2025):
- Base unique tracks: ~27,400
- Kaggle-covered unique: ~6,069
- API-covered unique (deduped): ~20,777 (previously ~10,866; ~9,911 added in the latest run)
- Final merged rows: 138,762 (includes header)
- Split outputs: songs 127,099 rows; podcasts 11,664 rows (each includes header)
- Estimated remaining unique tracks for the next API pass: ~3,000 (the run started with 12,896 to process; ~9.9k were covered)
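A coverage snapshot like the one above reduces to set arithmetic over unique track keys. A rough sketch, assuming an (artist, track) key and illustrative column names:

```python
import pandas as pd


def coverage_snapshot(base: pd.DataFrame, covered_keys: set) -> dict:
    """Count unique (artist, track) keys in the base history and how many
    are already covered by an enrichment source.

    The 'artist'/'track' column names are assumptions; adapt them to the
    actual CSVs."""
    keys = set(zip(base["artist"], base["track"]))
    covered = keys & covered_keys
    return {
        "base_unique": len(keys),
        "covered_unique": len(covered),
        "remaining": len(keys) - len(covered),
    }
```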
- Audio Features: Danceability, energy, valence, tempo, acousticness, instrumentalness, liveness, loudness, speechiness
- Track Info: Release dates, popularity scores, explicit content flags
- Genre Data: Detailed genre classifications with play counts
- Artist Info: Complete artist metadata and classifications
- Request your extended streaming history from Spotify Privacy Settings
- Wait for the email with your data (can take up to 30 days)
- Extract the JSON files to a docs/ folder in this project
# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install required packages
pip install -r requirements.txt
While external datasets provide good coverage, you may want API access for additional data:
- Go to Spotify Developer Dashboard
- Create a new app to get your Client ID and Client Secret
- Copy .env.example to .env: cp .env.example .env
- Fill in your credentials:
SPOTIFY_CLIENT_ID=your_client_id_here
SPOTIFY_CLIENT_SECRET=your_client_secret_here
For the fastest results with good metadata coverage:
- Process Your Raw Data
python scripts/data_processing/combine-history.py
python scripts/data_processing/clean-history.py
- Enrich with External Datasets (Fast, no API limits)
python scripts/external_matching/ultimate_spotify_matcher.py
This provides ~30% coverage with complete audio features and genres.
For Maximum Coverage: Combine external datasets with Spotify API enrichment
- Set up your .env with Spotify API credentials
- Run external matching first (above)
- Run API enrichment (background, resumable):
. .venv/bin/activate
[ -f scripts/__init__.py ] || touch scripts/__init__.py
[ -f scripts/spotify_api/__init__.py ] || touch scripts/spotify_api/__init__.py
[ -e scripts/spotify_api/data ] || ln -s ../../data scripts/spotify_api/data
LOG=/tmp/enrich_$(date +%Y%m%d-%H%M%S).log
TQDM_DISABLE=1 nohup python -m scripts.spotify_api.smart_metadata_enrichment > "$LOG" 2>&1 &
echo "PID=$! LOG=$LOG"
Monitor and resume safely:
tail -n 80 "$LOG"
pgrep -af scripts.spotify_api.smart_metadata_enrichment
sqlite3 scripts/spotify_api/data/enriched/progress.sqlite "SELECT COUNT(*) FROM progress WHERE meta_json IS NOT NULL;"
Merge-only (low-memory, on-disk):
LOG=/tmp/merge_$(date +%Y%m%d-%H%M%S).log
nohup python -m scripts.spotify_api.smart_metadata_enrichment --merge-only > "$LOG" 2>&1 &
echo "PID=$! LOG=$LOG"
python scripts/data_processing/combine-history.py
- Combines all JSON files from the docs/ folder
- Creates data/processed/combined_streaming_history.csv
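Conceptually, the combine step just reads every JSON export and concatenates the records. A minimal sketch (not the actual script; filenames and fields depend on your export):

```python
import glob
import json

import pandas as pd


def combine_history(docs_dir: str) -> pd.DataFrame:
    """Concatenate all Spotify export JSON files in docs_dir into one DataFrame."""
    frames = []
    for path in sorted(glob.glob(f"{docs_dir}/*.json")):
        with open(path, encoding="utf-8") as f:
            frames.append(pd.DataFrame(json.load(f)))
    return pd.concat(frames, ignore_index=True)
```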
python scripts/data_processing/clean-history.py
- Creates data/processed/cleaned_streaming_history.csv with:
- Converted timestamps and time-based features
- Filtered short plays (<30 seconds)
- Play duration in minutes
- Weekday, hour, and temporal analysis features
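The cleaning transformations listed above can be sketched like this. The 'ts' and 'ms_played' field names follow Spotify's extended export, but verify them against your own files:

```python
import pandas as pd


def clean_history(df: pd.DataFrame) -> pd.DataFrame:
    """Add temporal features and drop plays shorter than 30 seconds."""
    out = df.copy()
    out["ts"] = pd.to_datetime(out["ts"], utc=True)
    out = out[out["ms_played"] >= 30_000]        # filter short plays (<30s)
    out["minutes_played"] = out["ms_played"] / 60_000
    out["weekday"] = out["ts"].dt.day_name()
    out["hour"] = out["ts"].dt.hour
    return out
```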
Option A: External Datasets (Recommended First)
python scripts/external_matching/ultimate_spotify_matcher.py
- Downloads Ultimate Spotify DB from Kaggle
- Matches tracks using fuzzy string matching
- Provides audio features, genres, popularity data
- No rate limits, fast processing
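A bare-bones version of the fuzzy matching step looks like the following. The real matcher is more elaborate and may use a dedicated library such as rapidfuzz; stdlib difflib keeps this sketch dependency-free:

```python
from difflib import SequenceMatcher


def best_match(query: str, candidates: list, threshold: float = 0.85):
    """Return the candidate most similar to query, or None if nothing
    clears the similarity threshold."""
    def norm(s: str) -> str:
        # Normalize case and whitespace before comparing
        return " ".join(s.lower().split())

    q = norm(query)
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, q, norm(cand)).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None
```

The threshold is a trade-off: lower values raise coverage but risk wrong matches, which is one reason match rates vary with special characters in track/artist names.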
Option B: Spotify Web API (memory-safe, resumable)
LOG=/tmp/enrich_$(date +%Y%m%d-%H%M%S).log
nohup python -m scripts.spotify_api.smart_metadata_enrichment > "$LOG" 2>&1 &
- Skips tracks covered by external datasets and already-processed API keys
- Streams results to spotify_api_metadata.csv and records progress in SQLite
- Performs the final DuckDB LEFT JOIN into spotify_api_enriched_streaming_history.csv
Both options produce an enriched history CSV with additional metadata, including:
- Spotify track IDs
After processing, you'll have these datasets:
- data/processed/cleaned_streaming_history.csv - Cleaned listening history with temporal features
- data/enriched/ultimate_spotify_enriched_*.csv - External dataset enrichment results
- data/enriched/spotify_api_enriched_*.csv - API enrichment results (when available)
Temporal Patterns:
- Listening habits by time of day, day of week, season
- Evolution of music taste over 15 years
- Monthly and yearly listening volume trends
Audio Feature Analysis:
- Preference for danceability, energy, valence over time
- Tempo preferences and changes
- Acousticness vs. electronic music trends
Genre Evolution:
- Top genres by play count and listening time
- Genre diversity and discovery patterns
- Seasonal genre preferences
Discovery Patterns:
- Track popularity vs. personal preference
- Artist loyalty and discovery rates
- Repeat listening behavior
scripts/spotify_api/data/enriched/
├── ultimate_spotify_enriched_streaming_history.csv
├── spotify_api_metadata.csv
├── progress.sqlite
├── spotify_api_enriched_streaming_history.csv
├── spotify_api_enriched_streaming_history_songs.csv
└── spotify_api_enriched_streaming_history_podcasts.csv
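The songs/podcasts split can be approximated by checking the podcast-only fields of the extended export; the actual split_media_types.py may use a different heuristic:

```python
import pandas as pd


def split_media_types(df: pd.DataFrame):
    """Split the merged history into (songs, podcasts).

    The extended export fills 'episode_name' only for podcast rows; song
    rows leave it null. Verify this holds for your export before relying
    on it."""
    is_podcast = df["episode_name"].notna()
    return df[~is_podcast], df[is_podcast]
```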
With enriched data, you can analyze:
- Total hours and play counts by year, month, season
- Peak listening hours and days of week
- Seasonal and temporal trends across 15 years
- Musical taste evolution over time
- Trends in danceability, energy, valence, tempo
- Acoustic vs. electronic preferences
- Mood patterns (valence/energy correlations)
- Top genres by play count and listening time
- Genre discovery and evolution patterns
- Seasonal genre preferences
- Musical diversity metrics
- Track popularity vs. personal preference
- Artist loyalty and discovery rates
- Repeat listening behavior analysis
- Mainstream vs. niche music preferences
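Most of these analyses reduce to a pandas groupby over the cleaned history. For example, total listening hours per year, assuming the 'ts' and 'minutes_played' columns produced by the cleaning step:

```python
import pandas as pd


def listening_by_year(df: pd.DataFrame) -> pd.Series:
    """Total listening hours per calendar year."""
    df = df.copy()
    df["ts"] = pd.to_datetime(df["ts"], utc=True)
    return (df.groupby(df["ts"].dt.year)["minutes_played"].sum() / 60).round(1)
```

Swapping the grouping key for `df["ts"].dt.hour`, `dt.day_name()`, or a genre column yields the other temporal and genre breakdowns listed above.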
The enrichment is designed to run daily until the quota is hit (≈10k calls/day), then resume the next day:
- Default pacing: 50 items/batch, 0.3s/request, 5s between batches
- Auto-resume and skip logic ensure no redundant calls
- Safe to stop and restart anytime
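The pacing described above amounts to a generator that sleeps between requests and between batches, and stops once the daily budget is spent. A simplified sketch (the real script interleaves this with the API calls and checkpointing):

```python
import time


def paced_items(items, batch_size=50, per_request_delay=0.3,
                between_batches=5.0, daily_budget=10_000):
    """Yield items one at a time under the configured pacing, stopping at
    the daily call budget so the next run can pick up where this one left off."""
    calls = 0
    for i in range(0, len(items), batch_size):
        for item in items[i:i + batch_size]:
            if calls >= daily_budget:
                return  # budget exhausted; resume tomorrow
            yield item
            calls += 1
            time.sleep(per_request_delay)   # pacing between requests
        time.sleep(between_batches)         # pause between batches
```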
- No Rate Limits: Process entire dataset quickly
- High Coverage: 30%+ match rates with quality data
- Complete Features: All audio features and genres included
- Reliable: No authentication or quota concerns
- Fuzzy string matching for robust track identification
- Duplicate detection and handling
- Comprehensive metadata validation
- Multiple source integration for maximum coverage
- Your .env file with API credentials is excluded from git
- All CSV files with personal listening data are gitignored
- External datasets contain only public metadata
- Never commit or share personal listening data
This project welcomes contributions! Areas for improvement:
- Additional external dataset integrations
- Advanced analysis and visualization scripts
- Performance optimizations
- New matching algorithms
- API Rate Limits: Use external datasets first, then supplement with API
- Low Match Rates: Check for special characters in track/artist names
- Missing Data: Some tracks may not exist in any dataset
- Memory Issues: Process in smaller batches if needed
- Check the progress files for API enrichment status
- Review console output for match statistics
- Verify your raw data format matches expected structure
This project is for personal use. Respect Spotify's API terms of service and your data privacy.