broomhead/grabby

Grabby

Video ingress pipeline for collecting and extracting frames from YouTube.

search → frames

search.py finds videos and records them in a SQLite pipeline DB. frames.py pulls pending URLs from the DB, fetches each stream directly, and extracts JPEGs, so no full download is required. Both steps are resumable and idempotent.

Quick start

# 1. Build the Docker image
./docker/build.sh

# 2. Add a search_terms.yml to your workdir (see format below)

# 3. Run the pipeline
./docker/run.sh python search.py
./docker/run.sh python frames.py --interval 20 --sleep 1

Output lands in $GRABBY_WORKDIR/{source}/{category}/ as JPEGs, with pipeline.db and search_terms.yml at the workdir root.

Workdir layout

workdir/                      (or any dir set via GRABBY_WORKDIR)
  search_terms.yml            ← you provide this
  pipeline.db                 ← created automatically
  youtube/
    sports/                  ← frames land here
    auto/
    gym/
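
As a rough illustration of the layout above, the sketch below builds the frame directory for a source and category and counts the JPEGs found in each category. The helpers `frames_dir` and `count_frames` are hypothetical and not part of Grabby, and the `.jpg` extension is an assumption.

```python
from pathlib import Path


def frames_dir(workdir: str, source: str, category: str) -> Path:
    """Return <workdir>/<source>/<category>, where frames are expected to land."""
    return Path(workdir) / source / category


def count_frames(workdir: str, source: str = "youtube") -> dict[str, int]:
    """Count *.jpg files per category directory (the extension is an assumption)."""
    root = Path(workdir) / source
    if not root.is_dir():
        return {}
    return {
        d.name: sum(1 for _ in d.glob("*.jpg"))
        for d in sorted(root.iterdir())
        if d.is_dir()
    }
```

Calling `count_frames("workdir")` on the tree shown above would return one entry each for sports, auto, and gym.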

search_terms.yml format

search_terms:
  sports:
    - top sports highlights
    - best sports reactions
  auto:
    - top cars of the year
    - driving videos
  gym:
    - workouts
    - weight lifting technique

Each category becomes a subdirectory under {source}/. Each term is searched once and skipped on later runs unless you pass --rerun.
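
To make the category-to-terms mapping concrete, here is a minimal sketch that flattens a parsed search_terms.yml into (category, term) pairs. It assumes the file has already been loaded into a dict (for example with PyYAML's `yaml.safe_load`). `iter_terms` is a hypothetical helper, not a Grabby function.

```python
from typing import Iterator


def iter_terms(config: dict) -> Iterator[tuple[str, str]]:
    """Yield (category, term) pairs from a parsed search_terms.yml mapping."""
    terms_by_category = config.get("search_terms")
    if not isinstance(terms_by_category, dict):
        raise ValueError("expected a top-level 'search_terms' mapping")
    for category, terms in terms_by_category.items():
        if not isinstance(terms, list):
            raise ValueError(f"category {category!r} must map to a list of terms")
        for term in terms:
            yield category, str(term)


# The same structure as the YAML example, written as a Python dict.
example = {
    "search_terms": {
        "sports": ["top sports highlights", "best sports reactions"],
        "gym": ["workouts"],
    }
}
pairs = list(iter_terms(example))
```

A check like this catches a malformed file (for instance a category mapped to a single string instead of a list) before any search is attempted.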

Scripts

search.py — find videos

Searches YouTube for each term and records discovered URLs in pipeline.db. Already-searched terms are skipped automatically (use --rerun to re-search and merge new finds).

# Search all terms in workdir/search_terms.yml
./docker/run.sh python search.py

# Single explicit query
./docker/run.sh python search.py --query "best gym workouts" --limit 50

# Re-search all terms and merge any new URLs
./docker/run.sh python search.py --rerun
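
The skip-on-re-run behaviour can be pictured as a lookup against search_log before each search. The sketch below is only an illustration: the table name comes from the Pipeline DB section, but the `term` column, the `should_search` and `mark_searched` helpers, and the `rerun` parameter are assumptions about how such a check might be written, not Grabby's actual code.

```python
import sqlite3


def should_search(conn: sqlite3.Connection, term: str, rerun: bool = False) -> bool:
    """Return True if the term is new, or if a re-search was requested."""
    if rerun:
        return True
    row = conn.execute(
        "SELECT 1 FROM search_log WHERE term = ? LIMIT 1", (term,)
    ).fetchone()
    return row is None


def mark_searched(conn: sqlite3.Connection, term: str) -> None:
    """Record the term; INSERT OR IGNORE keeps repeated calls idempotent."""
    conn.execute("INSERT OR IGNORE INTO search_log (term) VALUES (?)", (term,))
    conn.commit()


# Hypothetical single-column schema, used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_log (term TEXT PRIMARY KEY)")
```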

frames.py — extract frames

Reads pending URLs from pipeline.db and extracts JPEG frames directly from the YouTube stream — no full video download needed. Marks each URL done/failed in the DB so runs are resumable.

# Extract a frame every 20 seconds, 1s pause between videos
./docker/run.sh python frames.py --interval 20 --sleep 1

# Specific timestamps
./docker/run.sh python frames.py --timestamps 1m30s 2m45s 4m10s

# Filter to one category
./docker/run.sh python frames.py --interval 20 --category gym

# Retry previously failed videos
./docker/run.sh python frames.py --interval 20 --retry-failed

# Single URL (bypasses DB, useful for testing)
./docker/run.sh python frames.py "https://youtube.com/watch?v=abc123" --interval 30
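
Timestamps such as 1m30s and a fixed --interval both reduce to a list of offsets in seconds. The sketch below shows one way to do that conversion. `parse_timestamp` and `interval_offsets` are hypothetical helpers, and both the optional hours field and the choice to start at offset 0 are assumptions rather than documented frames.py behaviour.

```python
import re

# Optional hours, minutes, and seconds fields, e.g. "1m30s" or "45s".
_TS = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def parse_timestamp(text: str) -> int:
    """Convert a string like '1m30s' to a number of seconds."""
    match = _TS.match(text)
    if not text or match is None:
        raise ValueError(f"unrecognised timestamp: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def interval_offsets(duration: int, interval: int) -> list[int]:
    """Offsets 0, interval, 2*interval, ... strictly below the duration."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return list(range(0, duration, interval))
```

Under these assumptions, a 65-second video sampled with an interval of 20 gives four offsets: 0, 20, 40, and 60 seconds.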

download.py — download full videos (optional)

Downloads full MP4 files when you need the video itself rather than just frames. Not required for the frames pipeline.

# Download a single URL
./docker/run.sh python download.py "https://youtube.com/watch?v=abc123"

# Clip: extract a time range only
./docker/run.sh python download.py "https://youtube.com/watch?v=abc123" --start 1m30s --end 2m45s

# Browser cookies for gated content
./docker/run.sh python download.py --browser firefox

Configuration

GRABBY_WORKDIR

All scripts resolve paths relative to a workdir. Set it via:

.env file (recommended, gitignored):

GRABBY_WORKDIR=/path/to/workdir

Inline (one-off override):

GRABBY_WORKDIR=/tmp/test ./docker/run.sh python search.py

--workdir flag (explicit per-command):

./docker/run.sh python frames.py --workdir /path/to/dataset --interval 20
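
The three options above suggest a lookup order, sketched below. The order used here (the --workdir flag first, then the GRABBY_WORKDIR environment variable, then a ./workdir fallback) is an assumption, as is the helper name `resolve_workdir`. The scripts themselves may resolve the path differently.

```python
import argparse
from pathlib import Path


def resolve_workdir(argv: list[str], env: dict[str, str]) -> Path:
    """Pick the workdir: --workdir flag, then GRABBY_WORKDIR, then ./workdir.

    The precedence order and the ./workdir fallback are assumptions.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--workdir")
    # parse_known_args ignores flags this sketch does not define (e.g. --interval).
    args, _unknown = parser.parse_known_args(argv)
    chosen = args.workdir or env.get("GRABBY_WORKDIR") or "workdir"
    return Path(chosen)
```

In a script this could be called as `resolve_workdir(sys.argv[1:], dict(os.environ))`.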

Other flags

All scripts accept --source (default: youtube) and --db (default: <workdir>/pipeline.db). Run any script with --help for the full list.

Pipeline DB

pipeline.db is a SQLite file at the workdir root. It tracks:

  • search_log — which terms have been searched (enables resume)
  • urls — every discovered URL with frames_status (pending/done/failed)
  • frames — one row per extracted JPEG with its timestamp

Query it directly if you need to inspect state:

# How many frames per category? (run from the workdir root, where pipeline.db lives)
python3 -c "
import sqlite3
conn = sqlite3.connect('pipeline.db')
for row in conn.execute('SELECT category, COUNT(*) FROM frames JOIN urls ON urls.id=frames.url_id GROUP BY category'):
    print(row)
"
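
In the same spirit, a per-status count of the urls table shows how much work is still pending. This is a sketch: `frames_status` and its three values come from the list above, while the `status_counts` helper and the demo schema (including the `url` column) are hypothetical.

```python
import sqlite3


def status_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Return {frames_status: number of URLs} from the urls table."""
    rows = conn.execute(
        "SELECT frames_status, COUNT(*) FROM urls GROUP BY frames_status"
    )
    return dict(rows.fetchall())


# Demo against an in-memory DB; the real table may have other columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT, frames_status TEXT)"
)
conn.executemany(
    "INSERT INTO urls (url, frames_status) VALUES (?, ?)",
    [("u1", "pending"), ("u2", "done"), ("u3", "done"), ("u4", "failed")],
)
counts = status_counts(conn)
```

Against a real workdir, replace the in-memory connection with `sqlite3.connect("pipeline.db")`, run from the workdir root.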

Downstream

Grabby's output is designed to feed directly into samantics, an auto-labelling pipeline that deduplicates frames, runs SAM3 segmentation, and exports COCO-format annotations. Point GRABBY_WORKDIR in samantics at the same directory.

About

Downloads content off social media platforms
