Video ingress pipeline for collecting and extracting frames from YouTube.
search → frames
Search finds videos and records them in a SQLite pipeline DB. Frames pulls pending URLs from the DB, fetches each stream directly, and extracts JPEGs — no full download required. Everything is resumable and idempotent.
# 1. Build the Docker image
./docker/build.sh
# 2. Add a search_terms.yml to your workdir (see format below)
# 3. Run the pipeline
./docker/run.sh python search.py
./docker/run.sh python frames.py --interval 20 --sleep 1
Output lands in $GRABBY_WORKDIR/{source}/{category}/ as JPEGs, with pipeline.db and search_terms.yml at the workdir root.
workdir/                  (or any dir set via GRABBY_WORKDIR)
  search_terms.yml        ← you provide this
  pipeline.db             ← created automatically
  youtube/
    sports/               ← frames land here
    auto/
    gym/
search_terms:
  sports:
    - top sports highlights
    - best sports reactions
  auto:
    - top cars of the year
    - driving videos
  gym:
    - workouts
    - weight lifting technique
Categories become subdirectories. Terms are searched once and skipped on re-runs.
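As a rough illustration of how categories map to output directories, the sketch below builds the `{workdir}/{source}/{category}` paths from the parsed terms. The function name `category_dirs` is hypothetical and not part of Grabby; the dictionary stands in for the mapping found under the `search_terms` key once the YAML is parsed.

```python
from pathlib import Path


def category_dirs(workdir, search_terms, source="youtube"):
    """Map each category to the directory its frames would land in."""
    root = Path(workdir) / source
    return {category: root / category for category in search_terms}


# Stand-in for the mapping under the `search_terms` key of the YAML file.
search_terms = {
    "sports": ["top sports highlights", "best sports reactions"],
    "auto": ["top cars of the year", "driving videos"],
    "gym": ["workouts", "weight lifting technique"],
}

for category, path in category_dirs("workdir", search_terms).items():
    print(f"{category}: {path}")
```

Only the category names affect the directory layout; the individual terms just determine which videos end up in each category.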
search.py searches YouTube for each term and records discovered URLs in pipeline.db. Already-searched terms are skipped automatically (use --rerun to re-search and merge new finds).
# Search all terms in workdir/search_terms.yml
./docker/run.sh python search.py
# Single explicit query
./docker/run.sh python search.py --query "best gym workouts" --limit 50
# Re-search all terms and merge any new URLs
./docker/run.sh python search.py --rerun
frames.py reads pending URLs from pipeline.db and extracts JPEG frames directly from the YouTube stream; no full video download is needed. It marks each URL done/failed in the DB so runs are resumable.
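The resume behaviour can be pictured with a small status-tracking loop. This is a minimal sketch, not Grabby's actual code: the `urls` table layout, the `process_pending` function, and the `fake_extract` stand-in are all assumptions made for illustration; only the `frames_status` values (pending/done/failed) come from the description above.

```python
import sqlite3


def process_pending(conn, extract):
    """Run `extract` on each pending URL and record the outcome."""
    rows = conn.execute(
        "SELECT id, url FROM urls WHERE frames_status = 'pending'"
    ).fetchall()
    for url_id, url in rows:
        try:
            extract(url)
            status = "done"
        except Exception:
            status = "failed"
        # Commit per URL so an interrupted run loses at most one video's work.
        conn.execute(
            "UPDATE urls SET frames_status = ? WHERE id = ?", (status, url_id)
        )
        conn.commit()


# Hypothetical minimal schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT UNIQUE, "
    "frames_status TEXT DEFAULT 'pending')"
)
conn.executemany(
    "INSERT INTO urls (url) VALUES (?)",
    [("https://example.com/a",), ("https://example.com/b",)],
)


def fake_extract(url):
    # Stand-in for real frame extraction; fails on one URL to show both outcomes.
    if url.endswith("/b"):
        raise RuntimeError("stream unavailable")


process_pending(conn, fake_extract)
process_pending(conn, fake_extract)  # a second run finds nothing pending
result = conn.execute("SELECT url, frames_status FROM urls ORDER BY id").fetchall()
print(result)
```

Because only pending rows are selected, re-running after an interruption picks up where the previous run stopped, and failed rows stay untouched until they are explicitly retried.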
# Extract a frame every 20 seconds, 1s pause between videos
./docker/run.sh python frames.py --interval 20 --sleep 1
# Specific timestamps
./docker/run.sh python frames.py --timestamps 1m30s 2m45s 4m10s
# Filter to one category
./docker/run.sh python frames.py --interval 20 --category gym
# Retry previously failed videos
./docker/run.sh python frames.py --interval 20 --retry-failed
# Single URL (bypasses DB, useful for testing)
./docker/run.sh python frames.py https://youtube.com/watch?v=abc123 --interval 30
download.py downloads full MP4 files when you need the video itself rather than just frames. It is not required for the frames pipeline.
# Download a single URL
./docker/run.sh python download.py https://youtube.com/watch?v=abc123
# Clip: extract a time range only
./docker/run.sh python download.py https://youtube.com/watch?v=abc123 --start 1m30s --end 2m45s
# Browser cookies for gated content
./docker/run.sh python download.py --browser firefox
All scripts resolve paths relative to a workdir. Set it via:
.env file (recommended, gitignored):
GRABBY_WORKDIR=/path/to/workdir
Inline (one-off override):
GRABBY_WORKDIR=/tmp/test ./docker/run.sh python search.py
--workdir flag (explicit per-command):
./docker/run.sh python frames.py --workdir /path/to/dataset --interval 20
All scripts accept --source (default: youtube) and --db (default: <workdir>/pipeline.db). Run any script with --help for the full list.
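One plausible way the shared options could be resolved is sketched below. The precedence (the --workdir flag over the GRABBY_WORKDIR environment variable, then a fallback of `workdir`) is an assumption, as is the helper name `parse_common_args`; the sketch also only sees the environment variable and does not read a .env file itself.

```python
import argparse
import os
from pathlib import Path


def parse_common_args(argv=None, env=None):
    """Resolve workdir, source, and db path from flags and environment."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # Assumed precedence: explicit flag, then GRABBY_WORKDIR, then "workdir".
    parser.add_argument("--workdir", default=env.get("GRABBY_WORKDIR", "workdir"))
    parser.add_argument("--source", default="youtube")
    parser.add_argument("--db", default=None)
    args = parser.parse_args(argv)
    args.workdir = Path(args.workdir)
    # The DB path defaults to <workdir>/pipeline.db unless given explicitly.
    args.db = Path(args.db) if args.db else args.workdir / "pipeline.db"
    return args


args = parse_common_args(
    ["--workdir", "/path/to/dataset"], env={"GRABBY_WORKDIR": "/tmp/test"}
)
print(args.workdir, args.source, args.db)
```

Deferring the --db default until after parsing lets it follow whichever workdir was finally chosen.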
pipeline.db is a SQLite file at the workdir root. It tracks:
- search_log: which terms have been searched (enables resume)
- urls: every discovered URL with frames_status (pending/done/failed)
- frames: one row per extracted JPEG with its timestamp
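To make the three tables concrete, here is a hypothetical schema built in an in-memory database, followed by a count of URLs per status. Only the table names and the `frames_status`, `category`, `id`, and `url_id` columns are taken from this README; every other column name (`term`, `timestamp_s`, `path`) is an assumption, and the real schema may differ.

```python
import sqlite3

# Hypothetical schema: column names beyond those named in the README are guesses.
SCHEMA = """
CREATE TABLE search_log (
    term TEXT PRIMARY KEY
);
CREATE TABLE urls (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE,
    category TEXT,
    frames_status TEXT DEFAULT 'pending'
);
CREATE TABLE frames (
    id INTEGER PRIMARY KEY,
    url_id INTEGER REFERENCES urls(id),
    timestamp_s REAL,
    path TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO urls (url, category, frames_status) VALUES (?, ?, ?)",
    [
        ("https://example.com/a", "gym", "done"),
        ("https://example.com/b", "gym", "pending"),
        ("https://example.com/c", "auto", "failed"),
    ],
)

# Count URLs in each state to see how far a run has progressed.
status_counts = dict(
    conn.execute("SELECT frames_status, COUNT(*) FROM urls GROUP BY frames_status")
)
print(status_counts)
```

The same GROUP BY query can be pointed at a real pipeline.db to check progress between runs.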
Query it directly if you need to inspect state:
# How many frames per category?
python3 -c "
import sqlite3
conn = sqlite3.connect('pipeline.db')
for row in conn.execute('SELECT category, COUNT(*) FROM frames JOIN urls ON urls.id=frames.url_id GROUP BY category'):
    print(row)
"
Grabby's output is designed to feed directly into samantics, an auto-labelling pipeline that deduplicates frames, runs SAM3 segmentation, and exports COCO-format annotations. Point GRABBY_WORKDIR in samantics at the same directory.