Large-scale multimodal dataset annotation pipeline for the Whissle Deterministic Perception Model. Processes 100K+ hours of audio and audio-visual data through a 10-stage pipeline, producing consistently annotated JSONL manifests with 75 fixed fields per sample.
┌─────────────┐
│  Connectors │  LibriSpeech, Common Voice, MLS, YouTube, Generic
└──────┬──────┘
       ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          10-Stage Pipeline                           │
│  s01_ingest → s02_diarize → s03_transcribe → s04_audio_classify      │
│  → s05_entity_intent → s06_visual_extract → s07_visual_classify      │
│  → s08_crossmodal → s09_quality_check → s10_merge_finalize           │
└──────────────────────────────────┬───────────────────────────────────┘
                    ┌──────────────┼───────────────┐
                    ▼              ▼               ▼
               full.jsonl   ctc_train.jsonl     splits/
              (75 fields)   (text | tokens)  train/dev/test
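Each stage reads the previous stage's JSONL manifest and writes its own (see the GCS layout below), with SQLite checkpoint DBs in runners/ tracking progress. A minimal sketch of that per-stage contract; `process_sample` is a hypothetical stand-in for the real stage logic in stages/:

```python
# Sketch of the per-stage manifest contract (illustrative, not the real
# runner): each stage consumes <prev_stage>.jsonl and emits <stage>.jsonl.
# The s01_ingest manifest is produced beforehand by a connector.
import json
from pathlib import Path
from typing import Callable

STAGES = [
    "s01_ingest", "s02_diarize", "s03_transcribe", "s04_audio_classify",
    "s05_entity_intent", "s06_visual_extract", "s07_visual_classify",
    "s08_crossmodal", "s09_quality_check", "s10_merge_finalize",
]

def run_pipeline(manifest_dir: Path,
                 process_sample: Callable[[str, dict], dict]) -> None:
    for prev, stage in zip(STAGES, STAGES[1:]):
        # Load every sample annotated so far by the previous stage.
        samples = [
            json.loads(line)
            for line in (manifest_dir / f"{prev}.jsonl").read_text().splitlines()
        ]
        # Annotate and write this stage's manifest alongside the others.
        with open(manifest_dir / f"{stage}.jsonl", "w", encoding="utf-8") as f:
            for sample in samples:
                f.write(json.dumps(process_sample(stage, sample)) + "\n")
```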
whissle-annotator/
schema/ # 75-field JSONL schema, token vocabulary, validators
stages/ # 10 processing stages (ingest → merge)
connectors/ # Dataset-specific ingest adapters
runners/ # Local runner, checkpoint DB, GCP storage
configs/ # Domain-specific Gemini prompt configs (YAML)
server/ # FastAPI annotation server + frontend UI
evaluation/ # WER, NER evaluation tools
tools/ # YouTube data collection agent
pipeline_cli.py # Main CLI entry point
pipeline_config.py # YAML pipeline configuration
# Install
pip install -r requirements.txt
# Generate a starter config
python pipeline_cli.py init-config --output pipeline.yaml
# Edit pipeline.yaml — set input_path values for your datasets
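The real key set is defined in pipeline_config.py; the sketch below is only a guess at the shape, with input_path being the one field the comment above refers to:

```yaml
# Illustrative shape only; generate the real starter file with
# `python pipeline_cli.py init-config` and treat every key here
# except input_path as an assumption.
datasets:
  librispeech_clean:
    connector: librispeech
    input_path: /data/LibriSpeech/train-clean-100
    language: en
```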
# Ingest a dataset
python pipeline_cli.py ingest --connector librispeech \
--input /data/LibriSpeech/train-clean-100 \
--output manifests/librispeech.jsonl
# Run the full pipeline
python pipeline_cli.py run --config pipeline.yaml --dataset librispeech_clean
# Validate output
python pipeline_cli.py validate --input pipeline_output/librispeech/s10_merge_finalize.jsonl
# Show stats
python pipeline_cli.py stats --input pipeline_output/librispeech/s10_merge_finalize.jsonl

| Command | Description |
|---|---|
| run | Run pipeline stages for a dataset from config |
| ingest | Run a connector to produce initial JSONL manifest |
| validate | Schema validation with field distribution report |
| stats | Dataset statistics (hours, sources, completeness) |
| upload | Upload finalized dataset to GCS |
| download | Download training data from GCS |
| gcp-setup | Create GCS bucket with lifecycle rules |
| init-config | Generate starter pipeline.yaml |
Every sample has 75 fixed fields (same for all samples, "NA" when not applicable):
- Metadata (6): sample_id, audio/video paths, duration, source, language
- Text (1): Transcript with inline `ENTITY_TYPE word END` annotations
- NLP (13): Intent, sentiment, topic, speech act, formality, etc.
- Audio (21): Emotion, age, gender, speech rate, noise, SNR, etc.
- Visual (22): Face emotion, gaze, gesture, scene, objects, etc.
- Cross-modal (5): AV sync, active speaker, lip sync, etc.
- Processing (7): Pipeline version, models used, quality score
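Because the schema is fixed, a loader never has to branch on modality: every line parses to the same 75 keys, with "NA" standing in for fields that don't apply. A minimal sketch of that invariant; the real field list and validators live in schema/:

```python
# Sketch of the fixed-schema invariant: every JSONL line carries the
# same 75 keys. The authoritative validators live in schema/.
import json

EXPECTED_FIELDS = 75

def check_manifest(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            sample = json.loads(line)
            if len(sample) != EXPECTED_FIELDS:
                raise ValueError(
                    f"line {line_no}: {len(sample)} fields, "
                    f"expected {EXPECTED_FIELDS}"
                )
            # Not-applicable fields hold the literal string "NA",
            # e.g. the 22 visual fields for an audio-only sample.
```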
| Connector | Dataset | Format |
|---|---|---|
| librispeech | LibriSpeech | .flac + .trans.txt |
| common_voice | Mozilla Common Voice | TSV + clips/ |
| mls | Multilingual LibriSpeech | transcripts.txt + audio/ |
| youtube | YouTube (yt-dlp) | Download + VTT subtitles |
| generic_audio | Any audio directory | WAV/MP3/FLAC + optional sidecar |
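A connector only has to emit the initial (s01-stage) manifest; later stages fill in the remaining fields. A rough sketch of what a generic_audio-style adapter does; the field names and the .txt sidecar convention here are assumptions, not the real implementation:

```python
# Hypothetical sketch of a generic audio connector: scan a directory
# for audio files, attach a sidecar transcript when one exists, and
# emit an initial JSONL manifest. Field names are illustrative.
import json
from pathlib import Path

AUDIO_EXTS = {".wav", ".mp3", ".flac"}

def ingest_generic(audio_dir: str, output_jsonl: str,
                   source: str = "generic") -> None:
    with open(output_jsonl, "w", encoding="utf-8") as out:
        for path in sorted(Path(audio_dir).rglob("*")):
            if path.suffix.lower() not in AUDIO_EXTS:
                continue
            # Optional transcript sidecar next to the audio file.
            sidecar = path.with_suffix(".txt")
            sample = {
                "sample_id": path.stem,
                "audio_path": str(path),
                "text": sidecar.read_text(encoding="utf-8").strip()
                        if sidecar.exists() else "NA",
                "source": source,
            }
            out.write(json.dumps(sample, ensure_ascii=False) + "\n")
```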
gs://whissle-datasets/
raw/{source}/{language}/ # Raw downloads
manifests/{dataset}/{stage}.jsonl # Per-stage manifests
processed/{dataset}/ # Final annotated data + splits
checkpoints/{dataset}/ # SQLite checkpoint DBs
models/{experiment}/ # Trained model artifacts
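The upload and download commands are the supported path; for orientation, a minimal equivalent of upload using the standard google-cloud-storage client and the processed/ prefix from the layout above:

```python
# Minimal sketch of uploading a finalized dataset into the layout
# above with the standard google-cloud-storage client. The CLI's
# `upload` command is the supported way to do this.
from pathlib import Path
from google.cloud import storage

def upload_processed(dataset: str, local_dir: str,
                     bucket_name: str = "whissle-datasets") -> None:
    bucket = storage.Client().bucket(bucket_name)
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            rel = path.relative_to(local_dir).as_posix()
            bucket.blob(f"processed/{dataset}/{rel}").upload_from_filename(str(path))
```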
YAML configs in configs/ define domain-specific Gemini prompts for entity/intent extraction. Available domains: automotive, generic, interview, kitchen, meeting, movies, podcast, product-demo, scoorer, tv-show, wellness.
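The per-domain file structure isn't documented here; the sketch below is only a plausible shape (every key is a guess), so consult the files in configs/ for the real format:

```yaml
# Purely illustrative guess at a domain prompt config; check
# configs/*.yaml for the actual structure.
domain: meeting
entity_types: [PERSON_NAME, ORG, PROJECT, DATE]
intents: [schedule, decide, assign_action_item]
prompt_template: |
  You are annotating a meeting transcript. Extract entities of the
  types {entity_types} and classify the speaker's intent as one of
  {intents}.
  Transcript: {text}
```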
MIT License