Toolkit to process and enrich your personal Spotify listening history, with memory-safe merges and resumable Spotify API enrichment.
This project processes your extended Spotify streaming history data to:
- Combine multiple JSON history files into a single dataset
- Clean and enrich the data with temporal features
- Enrich metadata using multiple sources (Spotify API + external datasets)
- Analyze listening patterns, genre evolution, and music trends
- Generate visualizations of your musical journey
- Data Processing: Combine and clean multiple Spotify history JSON files
- Multi-Source Metadata Enrichment:
- Spotify Web API (complete but rate-limited)
- External Kaggle datasets (fast, high coverage)
- Automatic matching and merging
- Comprehensive Analysis: Audio features, genres, popularity, and trends
- Organized Structure: Clean, scalable codebase with proper separation
- Resume Capability: All processes can be safely interrupted and resumed
spotify-data/
├── README.md
├── PROJECT_STATUS.md
├── requirements.txt
├── Makefile
├── docs/                          # Original Spotify data from export
├── data/
│   ├── processed/
│   │   ├── cleaned_streaming_history.csv
│   │   └── combined_streaming_history.csv
│   └── enriched/                  # (symlinked under scripts/spotify_api/data)
│       ├── ultimate_spotify_enriched_streaming_history.csv
│       ├── spotify_api_metadata.csv                    # append-only API results
│       ├── progress.sqlite                             # crash-safe progress
│       ├── spotify_api_enriched_streaming_history.csv  # merged final (base LEFT JOIN meta)
│       ├── spotify_api_enriched_streaming_history_songs.csv
│       └── spotify_api_enriched_streaming_history_podcasts.csv
├── scripts/
│   ├── data_processing/
│   │   ├── clean-history.py
│   │   └── combine-history.py
│   ├── external_matching/
│   │   └── ultimate_spotify_matcher.py
│   ├── enrichment/
│   │   └── merge_enrichments.py   # legacy (superseded by DuckDB)
│   ├── analysis/
│   ├── app/
│   │   └── orchestrate.py
│   └── spotify_api/
│       ├── __init__.py
│       ├── duckdb_merge.py               # memory-safe on-disk LEFT JOIN (DuckDB)
│       ├── smart_metadata_enrichment.py  # API enrichment + merge-only mode
│       ├── split_media_types.py          # split final CSV into songs/podcasts
│       └── data -> ../../data            # symlink so BASE_DIR-relative paths work
└── external_datasets/
- Final merged CSV generated by DuckDB without large in-memory joins.
- Songs vs. podcasts split available for downstream analysis.
- Spotify API enrichment runs in batches, appends to CSV, and records progress in SQLite for safe resume.
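The append-and-checkpoint pattern above can be sketched roughly as follows. The `progress` table and `meta_json` column mirror the names used in this project, but `fetch_metadata` and `csv_append` are hypothetical stand-ins for the actual API call and CSV writer:

```python
import json
import sqlite3


def enrich_batch(conn, keys, fetch_metadata, csv_append):
    """Process only keys with no recorded result yet; checkpoint each one."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress (key TEXT PRIMARY KEY, meta_json TEXT)"
    )
    done = {row[0] for row in conn.execute(
        "SELECT key FROM progress WHERE meta_json IS NOT NULL")}
    for key in keys:
        if key in done:
            continue  # already enriched in a previous run
        meta = fetch_metadata(key)  # e.g. a Spotify API call
        csv_append(key, meta)       # append-only results CSV
        conn.execute(
            "INSERT OR REPLACE INTO progress (key, meta_json) VALUES (?, ?)",
            (key, json.dumps(meta)),
        )
        conn.commit()  # commit per item -> safe to kill and resume anytime
```

Because progress is committed per item, interrupting the process loses at most the in-flight request, and a restart simply skips everything already in `progress`.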
Coverage snapshot (Aug 15, 2025):
- Base unique tracks: ~27,400
- Kaggle-covered unique: ~6,069
- API-covered unique (deduped): ~20,777 (previously ~10,866; ~9,911 added in the latest run)
- Final merged rows: 138,762 (includes header)
- Split outputs: songs 127,099 rows; podcasts 11,664 rows (each includes header)
- Estimated remaining unique tracks for the next API pass: ~3,000 (the run started with 12,896 to process; ~9.9k were covered)
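A coverage snapshot like the one above reduces to set arithmetic over unique track keys. A rough sketch, assuming an (artist, track) key and illustrative column names:

```python
import pandas as pd


def coverage_snapshot(base: pd.DataFrame, covered_keys: set) -> dict:
    """Count unique (artist, track) keys in the base history and how many
    are already covered by an enrichment source.

    The 'artist'/'track' column names are assumptions; adapt them to the
    actual CSVs."""
    keys = set(zip(base["artist"], base["track"]))
    covered = keys & covered_keys
    return {
        "base_unique": len(keys),
        "covered_unique": len(covered),
        "remaining": len(keys) - len(covered),
    }
```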
- Audio Features: Danceability, energy, valence, tempo, acousticness, instrumentalness, liveness, loudness, speechiness
- Track Info: Release dates, popularity scores, explicit content flags
- Genre Data: Detailed genre classifications with play counts
- Artist Info: Complete artist metadata and classifications
- Request your extended streaming history from Spotify Privacy Settings
- Wait for the email with your data (can take up to 30 days)
- Extract the JSON files to a docs/ folder in this project
# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install required packages
pip install -r requirements.txt
While external datasets provide good coverage, you may want API access for additional data:
- Go to Spotify Developer Dashboard
- Create a new app to get your Client ID and Client Secret
- Copy .env.example to .env: cp .env.example .env
- Fill in your credentials:
SPOTIFY_CLIENT_ID=your_client_id_here
SPOTIFY_CLIENT_SECRET=your_client_secret_here
For the fastest results with good metadata coverage:
- Process Your Raw Data
python scripts/data_processing/combine-history.py
python scripts/data_processing/clean-history.py
- Enrich with External Datasets (Fast, no API limits)
python scripts/external_matching/ultimate_spotify_matcher.py
This provides ~30% coverage with complete audio features and genres.
For Maximum Coverage: Combine external datasets with Spotify API enrichment
- Set up your .env with Spotify API credentials
- Run external matching first (above)
- Run API enrichment (background, resumable):
. .venv/bin/activate
[ -f scripts/__init__.py ] || touch scripts/__init__.py
[ -f scripts/spotify_api/__init__.py ] || touch scripts/spotify_api/__init__.py
[ -e scripts/spotify_api/data ] || ln -s ../../data scripts/spotify_api/data
LOG=/tmp/enrich_$(date +%Y%m%d-%H%M%S).log
TQDM_DISABLE=1 nohup python -m scripts.spotify_api.smart_metadata_enrichment > "$LOG" 2>&1 &
echo "PID=$! LOG=$LOG"
Monitor and resume safely:
tail -n 80 "$LOG"
pgrep -af scripts.spotify_api.smart_metadata_enrichment
sqlite3 scripts/spotify_api/data/enriched/progress.sqlite "SELECT COUNT(*) FROM progress WHERE meta_json IS NOT NULL;"
Merge-only (low-memory, on-disk):
LOG=/tmp/merge_$(date +%Y%m%d-%H%M%S).log
nohup python -m scripts.spotify_api.smart_metadata_enrichment --merge-only > "$LOG" 2>&1 &
echo "PID=$! LOG=$LOG"
python scripts/data_processing/combine-history.py
- Combines all JSON files from the docs/ folder
- Creates data/processed/combined_streaming_history.csv
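Conceptually, the combine step just reads every JSON export and concatenates the records. A minimal sketch (not the actual script; filenames and fields depend on your export):

```python
import glob
import json

import pandas as pd


def combine_history(docs_dir: str) -> pd.DataFrame:
    """Concatenate all Spotify export JSON files in docs_dir into one DataFrame."""
    frames = []
    for path in sorted(glob.glob(f"{docs_dir}/*.json")):
        with open(path, encoding="utf-8") as f:
            frames.append(pd.DataFrame(json.load(f)))
    return pd.concat(frames, ignore_index=True)
```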
python scripts/data_processing/clean-history.py
- Creates data/processed/cleaned_streaming_history.csv with:
- Converted timestamps and time-based features
- Filtered short plays (<30 seconds)
- Play duration in minutes
- Weekday, hour, and temporal analysis features
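The cleaning transformations listed above can be sketched like this. The 'ts' and 'ms_played' field names follow Spotify's extended export, but verify them against your own files:

```python
import pandas as pd


def clean_history(df: pd.DataFrame) -> pd.DataFrame:
    """Add temporal features and drop plays shorter than 30 seconds."""
    out = df.copy()
    out["ts"] = pd.to_datetime(out["ts"], utc=True)
    out = out[out["ms_played"] >= 30_000]        # filter short plays (<30s)
    out["minutes_played"] = out["ms_played"] / 60_000
    out["weekday"] = out["ts"].dt.day_name()
    out["hour"] = out["ts"].dt.hour
    return out
```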
Option A: External Datasets (Recommended First)
python scripts/external_matching/ultimate_spotify_matcher.py
- Downloads Ultimate Spotify DB from Kaggle
- Matches tracks using fuzzy string matching
- Provides audio features, genres, popularity data
- No rate limits, fast processing
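A bare-bones version of the fuzzy matching step looks like the following. The real matcher is more elaborate and may use a dedicated library such as rapidfuzz; stdlib difflib keeps this sketch dependency-free:

```python
from difflib import SequenceMatcher


def best_match(query: str, candidates: list, threshold: float = 0.85):
    """Return the candidate most similar to query, or None if nothing
    clears the similarity threshold."""
    def norm(s: str) -> str:
        # Normalize case and whitespace before comparing
        return " ".join(s.lower().split())

    q = norm(query)
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, q, norm(cand)).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None
```

The threshold is a trade-off: lower values raise coverage but risk wrong matches, which is one reason match rates vary with special characters in track/artist names.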
Option B: Spotify Web API (memory-safe, resumable)
LOG=/tmp/enrich_$(date +%Y%m%d-%H%M%S).log
nohup python -m scripts.spotify_api.smart_metadata_enrichment > "$LOG" 2>&1 &
- Skips tracks covered by external datasets and already-processed API keys
- Streams results to spotify_api_metadata.csv and records progress in SQLite
- Performs the final DuckDB LEFT JOIN into spotify_api_enriched_streaming_history.csv
Both options produce an enriched history CSV with additional metadata, including:
- Spotify track IDs
After processing, you'll have these datasets:
- data/processed/cleaned_streaming_history.csv - Cleaned listening history with temporal features
- data/enriched/ultimate_spotify_enriched_*.csv - External dataset enrichment results
- data/enriched/spotify_api_enriched_*.csv - API enrichment results (when available)
Temporal Patterns:
- Listening habits by time of day, day of week, season
- Evolution of music taste over 15 years
- Monthly and yearly listening volume trends
Audio Feature Analysis:
- Preference for danceability, energy, valence over time
- Tempo preferences and changes
- Acousticness vs. electronic music trends
Genre Evolution:
- Top genres by play count and listening time
- Genre diversity and discovery patterns
- Seasonal genre preferences
Discovery Patterns:
- Track popularity vs. personal preference
- Artist loyalty and discovery rates
- Repeat listening behavior
scripts/spotify_api/data/enriched/
├── ultimate_spotify_enriched_streaming_history.csv
├── spotify_api_metadata.csv
├── progress.sqlite
├── spotify_api_enriched_streaming_history.csv
├── spotify_api_enriched_streaming_history_songs.csv
└── spotify_api_enriched_streaming_history_podcasts.csv
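The songs/podcasts split can be approximated by checking the podcast-only fields of the extended export; the actual split_media_types.py may use a different heuristic:

```python
import pandas as pd


def split_media_types(df: pd.DataFrame):
    """Split the merged history into (songs, podcasts).

    The extended export fills 'episode_name' only for podcast rows; song
    rows leave it null. Verify this holds for your export before relying
    on it."""
    is_podcast = df["episode_name"].notna()
    return df[~is_podcast], df[is_podcast]
```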
With enriched data, you can analyze:
- Total hours and play counts by year, month, season
- Peak listening hours and days of week
- Seasonal and temporal trends across 15 years
- Musical taste evolution over time
- Trends in danceability, energy, valence, tempo
- Acoustic vs. electronic preferences
- Mood patterns (valence/energy correlations)
- Top genres by play count and listening time
- Genre discovery and evolution patterns
- Seasonal genre preferences
- Musical diversity metrics
- Track popularity vs. personal preference
- Artist loyalty and discovery rates
- Repeat listening behavior analysis
- Mainstream vs. niche music preferences
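Most of these analyses reduce to a pandas groupby over the cleaned history. For example, total listening hours per year, assuming the 'ts' and 'minutes_played' columns produced by the cleaning step:

```python
import pandas as pd


def listening_by_year(df: pd.DataFrame) -> pd.Series:
    """Total listening hours per calendar year."""
    df = df.copy()
    df["ts"] = pd.to_datetime(df["ts"], utc=True)
    return (df.groupby(df["ts"].dt.year)["minutes_played"].sum() / 60).round(1)
```

Swapping the grouping key for `df["ts"].dt.hour`, `dt.day_name()`, or a genre column yields the other temporal and genre breakdowns listed above.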
The enrichment is designed to run daily until the quota is hit (≈10k calls/day), then resume the next day:
- Default pacing: 50 items/batch, 0.3s/request, 5s between batches
- Auto-resume and skip logic ensure no redundant calls
- Safe to stop and restart anytime
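The pacing described above amounts to a generator that sleeps between requests and between batches, and stops once the daily budget is spent. A simplified sketch (the real script interleaves this with the API calls and checkpointing):

```python
import time


def paced_items(items, batch_size=50, per_request_delay=0.3,
                between_batches=5.0, daily_budget=10_000):
    """Yield items one at a time under the configured pacing, stopping at
    the daily call budget so the next run can pick up where this one left off."""
    calls = 0
    for i in range(0, len(items), batch_size):
        for item in items[i:i + batch_size]:
            if calls >= daily_budget:
                return  # budget exhausted; resume tomorrow
            yield item
            calls += 1
            time.sleep(per_request_delay)   # pacing between requests
        time.sleep(between_batches)         # pause between batches
```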
- No Rate Limits: Process entire dataset quickly
- High Coverage: 30%+ match rates with quality data
- Complete Features: All audio features and genres included
- Reliable: No authentication or quota concerns
- Fuzzy string matching for robust track identification
- Duplicate detection and handling
- Comprehensive metadata validation
- Multiple source integration for maximum coverage
- Your .env file with API credentials is excluded from git
- All CSV files with personal listening data are gitignored
- External datasets contain only public metadata
- Never commit or share personal listening data
This project welcomes contributions! Areas for improvement:
- Additional external dataset integrations
- Advanced analysis and visualization scripts
- Performance optimizations
- New matching algorithms
- API Rate Limits: Use external datasets first, then supplement with API
- Low Match Rates: Check for special characters in track/artist names
- Missing Data: Some tracks may not exist in any dataset
- Memory Issues: Process in smaller batches if needed
- Check the progress files for API enrichment status
- Review console output for match statistics
- Verify your raw data format matches expected structure
This project is for personal use. Respect Spotify's API terms of service and your data privacy.