Whissle Annotator

Large-scale multimodal dataset annotation pipeline for the Whissle Deterministic Perception Model. Processes 100K+ hours of audio and audio-visual data through a 10-stage pipeline, producing consistently annotated JSONL manifests with 75 fixed fields per sample.

Architecture

                    ┌─────────────┐
                    │ Connectors  │  LibriSpeech, Common Voice, MLS, YouTube, Generic
                    └──────┬──────┘
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│                     10-Stage Pipeline                                │
│  s01_ingest → s02_diarize → s03_transcribe → s04_audio_classify    │
│  → s05_entity_intent → s06_visual_extract → s07_visual_classify    │
│  → s08_crossmodal → s09_quality_check → s10_merge_finalize         │
└──────────────────────────────────────────────────────────────────────┘
                           ▼
              ┌────────────┼────────────┐
              ▼            ▼            ▼
         full.jsonl   ctc_train.jsonl  splits/
         (75 fields)  (text | tokens)  train/dev/test
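
The pipeline's CTC output pairs each transcript with its sample. A minimal sketch of a consumer for such a manifest, assuming one JSON object per line with `sample_id` and `text` keys (the exact key names are taken from the schema section below, not confirmed for `ctc_train.jsonl`):

```python
import json

def iter_ctc_pairs(manifest_path):
    """Yield (sample_id, text) pairs from a JSONL manifest.

    Sketch only: assumes one JSON object per line with 'sample_id'
    and 'text' keys; the real ctc_train.jsonl layout may differ.
    """
    with open(manifest_path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            yield record["sample_id"], record["text"]
```
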

Repository Structure

whissle-annotator/
  schema/              # 75-field JSONL schema, token vocabulary, validators
  stages/              # 10 processing stages (ingest → merge)
  connectors/          # Dataset-specific ingest adapters
  runners/             # Local runner, checkpoint DB, GCP storage
  configs/             # Domain-specific Gemini prompt configs (YAML)
  server/              # FastAPI annotation server + frontend UI
  evaluation/          # WER, NER evaluation tools
  tools/               # YouTube data collection agent
  pipeline_cli.py      # Main CLI entry point
  pipeline_config.py   # YAML pipeline configuration

Quick Start

# Install
pip install -r requirements.txt

# Generate a starter config
python pipeline_cli.py init-config --output pipeline.yaml
# Edit pipeline.yaml — set input_path values for your datasets

# Ingest a dataset
python pipeline_cli.py ingest --connector librispeech \
  --input /data/LibriSpeech/train-clean-100 \
  --output manifests/librispeech.jsonl

# Run the full pipeline
python pipeline_cli.py run --config pipeline.yaml --dataset librispeech_clean

# Validate output
python pipeline_cli.py validate --input pipeline_output/librispeech/s10_merge_finalize.jsonl

# Show stats
python pipeline_cli.py stats --input pipeline_output/librispeech/s10_merge_finalize.jsonl
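
The authoritative `pipeline.yaml` comes from `init-config`; as a hypothetical sketch of the shape you then edit (the keys `datasets`, `connector`, `input_path`, and `stages` are illustrative assumptions, not the confirmed schema):

```yaml
# Illustrative only -- generate the real file with init-config.
datasets:
  librispeech_clean:
    connector: librispeech
    input_path: /data/LibriSpeech/train-clean-100   # set this per dataset
    output_dir: pipeline_output/librispeech
stages:
  - s01_ingest
  - s02_diarize
  - s03_transcribe
  # ... remaining stages through s10_merge_finalize
```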

CLI Commands

Command      Description
run          Run pipeline stages for a dataset from config
ingest       Run a connector to produce the initial JSONL manifest
validate     Schema validation with field distribution report
stats        Dataset statistics (hours, sources, completeness)
upload       Upload finalized dataset to GCS
download     Download training data from GCS
gcp-setup    Create GCS bucket with lifecycle rules
init-config  Generate starter pipeline.yaml

Schema

Every sample has 75 fixed fields (same for all samples, "NA" when not applicable):

  • Metadata (6): sample_id, audio/video paths, duration, source, language
  • Text (1): Transcript with inline ENTITY_TYPE word END annotations
  • NLP (13): Intent, sentiment, topic, speech act, formality, etc.
  • Audio (21): Emotion, age, gender, speech rate, noise, SNR, etc.
  • Visual (22): Face emotion, gaze, gesture, scene, objects, etc.
  • Cross-modal (5): AV sync, active speaker, lip sync, etc.
  • Processing (7): Pipeline version, models used, quality score
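
The fixed-field convention above can be sketched as follows. The field names and the inline `ENTITY_TYPE word END` example are illustrative assumptions; the authoritative 75-field list lives in `schema/`:

```python
# Sketch: every record carries the same keys, with "NA" filled in
# for fields that do not apply. Abbreviated field list for illustration.
SCHEMA_FIELDS = ["sample_id", "audio_path", "duration", "source",
                 "language", "text", "intent", "sentiment"]  # ... 75 total

def finalize(partial: dict) -> dict:
    """Fill every schema field, defaulting missing ones to 'NA'."""
    return {field: partial.get(field, "NA") for field in SCHEMA_FIELDS}

record = finalize({
    "sample_id": "ls-0001",
    # Inline entity tags follow the ENTITY_TYPE word END convention:
    "text": "drive to ENTITY_CITY boston END tomorrow",
})
```
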

Connectors

Connector      Dataset                    Format
librispeech    LibriSpeech                .flac + .trans.txt
common_voice   Mozilla Common Voice       TSV + clips/
mls            Multilingual LibriSpeech   transcripts.txt + audio/
youtube        YouTube (yt-dlp)           Download + VTT subtitles
generic_audio  Any audio directory        WAV/MP3/FLAC + optional sidecar
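
A connector's job is to turn a dataset layout into the initial JSONL manifest that `s01_ingest` consumes. A minimal sketch in the spirit of `generic_audio`, assuming one manifest line per audio file (the function name and manifest keys are hypothetical; the real connector also handles optional sidecar transcripts):

```python
import json
from pathlib import Path

AUDIO_EXTS = {".wav", ".mp3", ".flac"}

def generic_audio_ingest(input_dir: str, output_jsonl: str) -> int:
    """Walk a directory tree and emit one manifest line per audio file.

    Returns the number of samples written. Sketch only: field names
    are assumptions, not the project's confirmed manifest schema.
    """
    count = 0
    with open(output_jsonl, "w", encoding="utf-8") as out:
        for path in sorted(Path(input_dir).rglob("*")):
            if path.suffix.lower() in AUDIO_EXTS:
                out.write(json.dumps({
                    "sample_id": path.stem,
                    "audio_path": str(path),
                    "source": "generic_audio",
                }) + "\n")
                count += 1
    return count
```
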

GCP Lifecycle

gs://whissle-datasets/
  raw/{source}/{language}/        # Raw downloads
  manifests/{dataset}/{stage}.jsonl  # Per-stage manifests
  processed/{dataset}/            # Final annotated data + splits
  checkpoints/{dataset}/          # SQLite checkpoint DBs
  models/{experiment}/            # Trained model artifacts

Domain Configs

YAML configs in configs/ define domain-specific Gemini prompts for entity/intent extraction. Available domains: automotive, generic, interview, kitchen, meeting, movies, podcast, product-demo, scoorer, tv-show, wellness.
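
As a hypothetical sketch of what such a domain config might contain (the real keys are defined by the files in `configs/`; everything below is an illustrative assumption):

```yaml
# Illustrative only -- see configs/ for the real format.
domain: kitchen
prompt:
  entities: [FOOD_ITEM, APPLIANCE, QUANTITY]
  intents: [set_timer, read_recipe, convert_units]
  instructions: >
    Tag each entity inline as ENTITY_TYPE word END and assign one
    intent per utterance.
```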

License

MIT License
