Grabby

Video ingestion pipeline that finds YouTube videos and extracts frames from them.

search → frames

Search finds videos and records them in a SQLite pipeline DB. Frames pulls pending URLs from the DB, fetches each stream directly, and extracts JPEGs — no full download required. Everything is resumable and idempotent.

Quick start

# 1. Build the Docker image
./docker/build.sh

# 2. Add a search_terms.yml to your workdir (see format below)

# 3. Run the pipeline
./docker/run.sh python search.py
./docker/run.sh python frames.py --interval 20 --sleep 1

Output lands in $GRABBY_WORKDIR/{source}/{category}/ as JPEGs, with pipeline.db and search_terms.yml at the workdir root.

Workdir layout

workdir/                      (or any dir set via GRABBY_WORKDIR)
  search_terms.yml            ← you provide this
  pipeline.db                 ← created automatically
  youtube/
    sports/                  ← frames land here
    auto/
    gym/

search_terms.yml format

search_terms:
  sports:
    - top sports highlights
    - best sports reactions
  auto:
    - top cars of the year
    - driving videos
  gym:
    - workouts
    - weight lifting technique

Categories become subdirectories. Each term is searched once and skipped on re-runs unless you pass --rerun.
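
As a rough illustration of how this file maps onto the workdir, the sketch below walks an already-parsed search_terms mapping and yields one (category, term, output directory) triple per term. The helper name iter_search_jobs and the plain dict standing in for a YAML loader are assumptions made for the example, not Grabby's actual code.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterator

# Hypothetical: the parsed contents of search_terms.yml, written as a dict
# so the example runs without a YAML library.
CONFIG = {
    "search_terms": {
        "sports": ["top sports highlights", "best sports reactions"],
        "gym": ["workouts"],
    }
}


def iter_search_jobs(
    config: dict, workdir: Path, source: str = "youtube"
) -> Iterator[tuple[str, str, Path]]:
    """Yield (category, term, output_dir) for every term in the config."""
    for category, terms in config.get("search_terms", {}).items():
        out_dir = workdir / source / category  # categories become subdirectories
        for term in terms or []:
            yield category, term, out_dir


jobs = list(iter_search_jobs(CONFIG, Path("workdir")))
for category, term, out_dir in jobs:
    print(f"{category}: {term!r} -> {out_dir}")
```

A category with an empty term list produces no jobs in this sketch, so it would also produce no directory.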

Scripts

`search.py` — find videos

Searches YouTube for each term and records discovered URLs in pipeline.db. Already-searched terms are skipped automatically (use --rerun to re-search and merge new finds).

# Search all terms in workdir/search_terms.yml
./docker/run.sh python search.py

# Single explicit query
./docker/run.sh python search.py --query "best gym workouts" --limit 50

# Re-search all terms and merge any new URLs
./docker/run.sh python search.py --rerun
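
The skip-on-re-run behaviour can be pictured as a lookup against search_log before each query. The sketch below is one way this might work, not a copy of search.py: the column names, the search_fn callback, and the run_search helper are all hypothetical.

```python
import sqlite3
from typing import Callable, Iterable


def run_search(
    conn: sqlite3.Connection,
    category: str,
    term: str,
    search_fn: Callable[[str], Iterable[str]],
    rerun: bool = False,
) -> int:
    """Search one term unless it is already logged; return the number of new URLs."""
    # Hypothetical minimal tables; the real pipeline.db schema may differ.
    conn.execute("CREATE TABLE IF NOT EXISTS search_log (term TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        "url TEXT PRIMARY KEY, category TEXT, frames_status TEXT DEFAULT 'pending')"
    )
    logged = conn.execute("SELECT 1 FROM search_log WHERE term = ?", (term,)).fetchone()
    if logged and not rerun:
        return 0  # term was searched on an earlier run
    before = conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
    for url in search_fn(term):
        # INSERT OR IGNORE leaves existing rows alone, so a re-search only merges new URLs.
        conn.execute(
            "INSERT OR IGNORE INTO urls (url, category) VALUES (?, ?)", (url, category)
        )
    conn.execute("INSERT OR IGNORE INTO search_log (term) VALUES (?)", (term,))
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
first = run_search(conn, "gym", "workouts", lambda t: ["https://example.com/a", "https://example.com/b"])
second = run_search(conn, "gym", "workouts", lambda t: ["https://example.com/z"])
merged = run_search(
    conn, "gym", "workouts",
    lambda t: ["https://example.com/b", "https://example.com/c"],
    rerun=True,
)
print(first, second, merged)
```

Because existing rows are never overwritten in this sketch, a re-search would not reset the frames_status of a URL that has already been processed.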

`frames.py` — extract frames

Reads pending URLs from pipeline.db and extracts JPEG frames directly from the YouTube stream — no full video download needed. Marks each URL done/failed in the DB so runs are resumable.
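
A minimal sketch of the done/failed bookkeeping described above, assuming a urls table with id, url, and frames_status columns. The process_pending helper and the extract callback are hypothetical stand-ins for whatever frames.py does internally.

```python
import sqlite3
from typing import Callable


def process_pending(
    conn: sqlite3.Connection,
    extract: Callable[[str], None],
    retry_failed: bool = False,
) -> dict:
    """Run extract(url) for each unfinished URL and record the outcome."""
    statuses = ("pending", "failed") if retry_failed else ("pending",)
    marks = ",".join("?" for _ in statuses)
    rows = conn.execute(
        f"SELECT id, url FROM urls WHERE frames_status IN ({marks}) ORDER BY id",
        statuses,
    ).fetchall()
    counts = {"done": 0, "failed": 0}
    for url_id, url in rows:
        try:
            extract(url)
            status = "done"
        except Exception:
            status = "failed"
        conn.execute("UPDATE urls SET frames_status = ? WHERE id = ?", (status, url_id))
        conn.commit()  # commit per URL so an interrupted run loses at most one video
        counts[status] += 1
    return counts


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT, frames_status TEXT DEFAULT 'pending')"
)
conn.executemany(
    "INSERT INTO urls (url) VALUES (?)",
    [("https://example.com/ok",), ("https://example.com/bad",)],
)


def fake_extract(url: str) -> None:
    # Stand-in for real frame extraction; fails for one URL to exercise the failed path.
    if url.endswith("bad"):
        raise RuntimeError("stream unavailable")


first = process_pending(conn, fake_extract)
second = process_pending(conn, fake_extract)
retried = process_pending(conn, lambda url: None, retry_failed=True)
print(first, second, retried)
```

The second call finds nothing to do because no URL is still pending, which is the property that makes a re-run after an interruption safe.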

# Extract a frame every 20 seconds, 1s pause between videos
./docker/run.sh python frames.py --interval 20 --sleep 1

# Specific timestamps
./docker/run.sh python frames.py --timestamps 1m30s 2m45s 4m10s

# Filter to one category
./docker/run.sh python frames.py --interval 20 --category gym

# Retry previously failed videos
./docker/run.sh python frames.py --interval 20 --retry-failed

# Single URL (bypasses DB, useful for testing)
./docker/run.sh python frames.py https://youtube.com/watch?v=abc123 --interval 30
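
The --timestamps values above use a compact 1m30s form, and --interval implies a list of evenly spaced offsets. The sketch below shows one way both could be turned into seconds. It is illustrative only: the function names are hypothetical, and support for an hours component and sampling from offset 0 are assumptions rather than documented behaviour.

```python
from __future__ import annotations

import re

# Optional hours, minutes, and seconds parts, e.g. "1m30s", "45s", "1h2m".
_TIMESTAMP = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def parse_timestamp(text: str) -> int:
    """Convert a string such as '1m30s' to whole seconds."""
    match = _TIMESTAMP.match(text)
    if not text or match is None:
        raise ValueError(f"unrecognised timestamp: {text!r}")
    hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def interval_timestamps(duration: int, interval: int) -> list[int]:
    """Offsets (in seconds) for one frame every `interval` seconds, starting at 0."""
    return list(range(0, duration, interval))


print([parse_timestamp(t) for t in ["1m30s", "2m45s", "4m10s"]])
print(interval_timestamps(65, 20))
```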

`download.py` — download full videos (optional)

Downloads full MP4 files when you need the video itself rather than just frames. Not required for the frames pipeline.

# Download a single URL
./docker/run.sh python download.py https://youtube.com/watch?v=abc123

# Clip: extract a time range only
./docker/run.sh python download.py https://youtube.com/watch?v=abc123 --start 1m30s --end 2m45s

# Browser cookies for gated content
./docker/run.sh python download.py --browser firefox

Configuration

GRABBY_WORKDIR

All scripts resolve paths relative to a workdir. Set it via:

.env file (recommended, gitignored):

GRABBY_WORKDIR=/path/to/workdir

Inline (one-off override):

GRABBY_WORKDIR=/tmp/test ./docker/run.sh python search.py

--workdir flag (explicit per-command):

./docker/run.sh python frames.py --workdir /path/to/dataset --interval 20

Other flags

All scripts accept --source (default: youtube) and --db (default: <workdir>/pipeline.db). Run any script with --help for the full list.
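
To show how these settings might combine, here is a small argparse sketch. The precedence it uses (the --workdir flag first, then GRABBY_WORKDIR, then a local workdir/ directory) is an assumption, as is the resolve_config helper; loading the .env file is left out because it is not described here.

```python
import argparse
import os
from pathlib import Path


def resolve_config(argv, env=None):
    """Return (workdir, source, db_path) from CLI arguments and the environment."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--workdir")
    parser.add_argument("--source", default="youtube")
    parser.add_argument("--db")
    args = parser.parse_args(argv)
    # Assumed precedence: --workdir flag, then GRABBY_WORKDIR, then ./workdir.
    workdir = Path(args.workdir or env.get("GRABBY_WORKDIR") or "workdir")
    # --db defaults to <workdir>/pipeline.db when not given explicitly.
    db_path = Path(args.db) if args.db else workdir / "pipeline.db"
    return workdir, args.source, db_path


print(resolve_config([], env={}))
print(resolve_config(["--workdir", "/data"], env={"GRABBY_WORKDIR": "/tmp/test"}))
```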

Pipeline DB

pipeline.db is a SQLite file at the workdir root. It tracks:

search_log — which terms have been searched (enables resume)
urls — every discovered URL with frames_status (pending/done/failed)
frames — one row per extracted JPEG with its timestamp
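
To make the three tables concrete, the sketch below builds a hypothetical schema in memory and summarises it. Only the table names, frames_status, and the urls.id / frames.url_id / category columns are taken from this README; every other column and the status_summary helper are assumptions, so check the real schema with `.schema` in the sqlite3 shell before relying on them.

```python
import sqlite3

# Hypothetical schema: column names beyond those mentioned in the README are guesses.
SCHEMA = """
CREATE TABLE search_log (term TEXT PRIMARY KEY);
CREATE TABLE urls (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE,
    category TEXT,
    frames_status TEXT DEFAULT 'pending'
);
CREATE TABLE frames (
    id INTEGER PRIMARY KEY,
    url_id INTEGER REFERENCES urls(id),
    timestamp REAL,
    path TEXT
);
"""


def status_summary(conn: sqlite3.Connection) -> dict:
    """Return {category: {status: count}} for every discovered URL."""
    summary = {}
    query = "SELECT category, frames_status, COUNT(*) FROM urls GROUP BY category, frames_status"
    for category, status, count in conn.execute(query):
        summary.setdefault(category, {})[status] = count
    return summary


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO urls (url, category, frames_status) VALUES (?, ?, ?)",
    [
        ("https://example.com/a", "gym", "done"),
        ("https://example.com/b", "gym", "pending"),
        ("https://example.com/c", "auto", "failed"),
    ],
)
conn.executemany(
    "INSERT INTO frames (url_id, timestamp, path) VALUES (?, ?, ?)",
    [(1, 0.0, "youtube/gym/a_0000.jpg"), (1, 20.0, "youtube/gym/a_0020.jpg")],
)

summary = status_summary(conn)
# Frames per category, joining each frame back to the URL it came from.
frames_per_category = dict(
    conn.execute(
        "SELECT category, COUNT(*) FROM frames "
        "JOIN urls ON urls.id = frames.url_id GROUP BY category"
    )
)
print(summary)
print(frames_per_category)
```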

Query it directly if you need to inspect state:

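# How many frames per category?
# Run from the workdir root so that pipeline.db resolves to the existing DB.
# Otherwise sqlite3.connect() creates an empty file and the query fails.
python3 -c "
import sqlite3
conn = sqlite3.connect('pipeline.db')
for row in conn.execute('SELECT category, COUNT(*) FROM frames JOIN urls ON urls.id=frames.url_id GROUP BY category'):
    print(row)
"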

Downstream

Grabby's output is designed to feed directly into samantics, an auto-labelling pipeline that deduplicates frames, runs SAM3 segmentation, and exports COCO-format annotations. Point GRABBY_WORKDIR in samantics at the same directory.
