Large-scale multimodal dataset annotation pipeline for the Whissle Deterministic Perception Model. Processes 100K+ hours of audio and audio-visual data through a 10-stage pipeline, producing consistently annotated JSONL manifests with 75 fixed fields per sample.
┌─────────────┐
│  Connectors │  LibriSpeech, Common Voice, MLS, YouTube, Generic
└──────┬──────┘
       ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          10-Stage Pipeline                           │
│  s01_ingest → s02_diarize → s03_transcribe → s04_audio_classify      │
│  → s05_entity_intent → s06_visual_extract → s07_visual_classify      │
│  → s08_crossmodal → s09_quality_check → s10_merge_finalize           │
└──────────────────────────────────┬───────────────────────────────────┘
                    ┌──────────────┼───────────────┐
                    ▼              ▼               ▼
               full.jsonl   ctc_train.jsonl     splits/
              (75 fields)   (text | tokens)  train/dev/test
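Each stage reads the previous stage's JSONL manifest and writes its own (see the GCS layout below), with SQLite checkpoint DBs in runners/ tracking progress. A minimal sketch of that per-stage contract; `process_sample` is a hypothetical stand-in for the real stage logic in stages/:

```python
# Sketch of the per-stage manifest contract (illustrative, not the real
# runner): each stage consumes <prev_stage>.jsonl and emits <stage>.jsonl.
# The s01_ingest manifest is produced beforehand by a connector.
import json
from pathlib import Path
from typing import Callable

STAGES = [
    "s01_ingest", "s02_diarize", "s03_transcribe", "s04_audio_classify",
    "s05_entity_intent", "s06_visual_extract", "s07_visual_classify",
    "s08_crossmodal", "s09_quality_check", "s10_merge_finalize",
]

def run_pipeline(manifest_dir: Path,
                 process_sample: Callable[[str, dict], dict]) -> None:
    for prev, stage in zip(STAGES, STAGES[1:]):
        # Load every sample annotated so far by the previous stage.
        samples = [
            json.loads(line)
            for line in (manifest_dir / f"{prev}.jsonl").read_text().splitlines()
        ]
        # Annotate and write this stage's manifest alongside the others.
        with open(manifest_dir / f"{stage}.jsonl", "w", encoding="utf-8") as f:
            for sample in samples:
                f.write(json.dumps(process_sample(stage, sample)) + "\n")
```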
whissle-annotator/
schema/ # 75-field JSONL schema, token vocabulary, validators
stages/ # 10 processing stages (ingest → merge)
connectors/ # Dataset-specific ingest adapters
runners/ # Local runner, checkpoint DB, GCP storage
configs/ # Domain-specific Gemini prompt configs (YAML)
server/ # FastAPI annotation server + frontend UI
evaluation/ # WER, NER evaluation tools
tools/ # YouTube data collection agent
pipeline_cli.py # Main CLI entry point
pipeline_config.py # YAML pipeline configuration
# Install
pip install -r requirements.txt
# Generate a starter config
python pipeline_cli.py init-config --output pipeline.yaml
# Edit pipeline.yaml — set input_path values for your datasets
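The real key set is defined in pipeline_config.py; the sketch below is only a guess at the shape, with input_path being the one field the comment above refers to:

```yaml
# Illustrative shape only; generate the real starter file with
# `python pipeline_cli.py init-config` and treat every key here
# except input_path as an assumption.
datasets:
  librispeech_clean:
    connector: librispeech
    input_path: /data/LibriSpeech/train-clean-100
    language: en
```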
# Ingest a dataset
python pipeline_cli.py ingest --connector librispeech \
--input /data/LibriSpeech/train-clean-100 \
--output manifests/librispeech.jsonl
# Run the full pipeline
python pipeline_cli.py run --config pipeline.yaml --dataset librispeech_clean
# Validate output
python pipeline_cli.py validate --input pipeline_output/librispeech/s10_merge_finalize.jsonl
# Show stats
python pipeline_cli.py stats --input pipeline_output/librispeech/s10_merge_finalize.jsonl

| Command | Description |
|---|---|
| run | Run pipeline stages for a dataset from config |
| ingest | Run a connector to produce initial JSONL manifest |
| validate | Schema validation with field distribution report |
| stats | Dataset statistics (hours, sources, completeness) |
| upload | Upload finalized dataset to GCS |
| download | Download training data from GCS |
| gcp-setup | Create GCS bucket with lifecycle rules |
| init-config | Generate starter pipeline.yaml |
Every sample has 75 fixed fields (same for all samples, "NA" when not applicable):
- Metadata (6): sample_id, audio/video paths, duration, source, language
- Text (1): Transcript with inline `ENTITY_TYPE word END` annotations
- NLP (13): Intent, sentiment, topic, speech act, formality, etc.
- Audio (21): Emotion, age, gender, speech rate, noise, SNR, etc.
- Visual (22): Face emotion, gaze, gesture, scene, objects, etc.
- Cross-modal (5): AV sync, active speaker, lip sync, etc.
- Processing (7): Pipeline version, models used, quality score
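Because the schema is fixed, a loader never has to branch on modality: every line parses to the same 75 keys, with "NA" standing in for fields that don't apply. A minimal sketch of that invariant; the real field list and validators live in schema/:

```python
# Sketch of the fixed-schema invariant: every JSONL line carries the
# same 75 keys. The authoritative validators live in schema/.
import json

EXPECTED_FIELDS = 75

def check_manifest(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            sample = json.loads(line)
            if len(sample) != EXPECTED_FIELDS:
                raise ValueError(
                    f"line {line_no}: {len(sample)} fields, "
                    f"expected {EXPECTED_FIELDS}"
                )
            # Not-applicable fields hold the literal string "NA",
            # e.g. the 22 visual fields for an audio-only sample.
```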
| Connector | Dataset | Format |
|---|---|---|
| librispeech | LibriSpeech | .flac + .trans.txt |
| common_voice | Mozilla Common Voice | TSV + clips/ |
| mls | Multilingual LibriSpeech | transcripts.txt + audio/ |
| youtube | YouTube (yt-dlp) | Download + VTT subtitles |
| generic_audio | Any audio directory | WAV/MP3/FLAC + optional sidecar |
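A connector only has to emit the initial (s01-stage) manifest; later stages fill in the remaining fields. A rough sketch of what a generic_audio-style adapter does; the field names and the .txt sidecar convention here are assumptions, not the real implementation:

```python
# Hypothetical sketch of a generic audio connector: scan a directory
# for audio files, attach a sidecar transcript when one exists, and
# emit an initial JSONL manifest. Field names are illustrative.
import json
from pathlib import Path

AUDIO_EXTS = {".wav", ".mp3", ".flac"}

def ingest_generic(audio_dir: str, output_jsonl: str,
                   source: str = "generic") -> None:
    with open(output_jsonl, "w", encoding="utf-8") as out:
        for path in sorted(Path(audio_dir).rglob("*")):
            if path.suffix.lower() not in AUDIO_EXTS:
                continue
            # Optional transcript sidecar next to the audio file.
            sidecar = path.with_suffix(".txt")
            sample = {
                "sample_id": path.stem,
                "audio_path": str(path),
                "text": sidecar.read_text(encoding="utf-8").strip()
                        if sidecar.exists() else "NA",
                "source": source,
            }
            out.write(json.dumps(sample, ensure_ascii=False) + "\n")
```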
gs://whissle-datasets/
raw/{source}/{language}/ # Raw downloads
manifests/{dataset}/{stage}.jsonl # Per-stage manifests
processed/{dataset}/ # Final annotated data + splits
checkpoints/{dataset}/ # SQLite checkpoint DBs
models/{experiment}/ # Trained model artifacts
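The upload and download commands are the supported path; for orientation, a minimal equivalent of upload using the standard google-cloud-storage client and the processed/ prefix from the layout above:

```python
# Minimal sketch of uploading a finalized dataset into the layout
# above with the standard google-cloud-storage client. The CLI's
# `upload` command is the supported way to do this.
from pathlib import Path
from google.cloud import storage

def upload_processed(dataset: str, local_dir: str,
                     bucket_name: str = "whissle-datasets") -> None:
    bucket = storage.Client().bucket(bucket_name)
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            rel = path.relative_to(local_dir).as_posix()
            bucket.blob(f"processed/{dataset}/{rel}").upload_from_filename(str(path))
```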
YAML configs in configs/ define domain-specific Gemini prompts for entity/intent extraction. Available domains: automotive, generic, interview, kitchen, meeting, movies, podcast, product-demo, scoorer, tv-show, wellness.
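The per-domain file structure isn't documented here; the sketch below is only a plausible shape (every key is a guess), so consult the files in configs/ for the real format:

```yaml
# Purely illustrative guess at a domain prompt config; check
# configs/*.yaml for the actual structure.
domain: meeting
entity_types: [PERSON_NAME, ORG, PROJECT, DATE]
intents: [schedule, decide, assign_action_item]
prompt_template: |
  You are annotating a meeting transcript. Extract entities of the
  types {entity_types} and classify the speaker's intent as one of
  {intents}.
  Transcript: {text}
```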
MIT License