Simple local TTS dataset creation pipeline with Silero VAD-based silence removal, DeepFilterNet denoising, and Label Studio integration.
- Silero VAD-based silence removal - Removes long silences while preserving natural speech gaps
- DeepFilterNet denoising - CPU-optimized audio denoising (enabled by default)
- AssemblyAI transcription - High-quality speech-to-text with speaker diarization
- Existing transcript support - Use pre-existing transcripts to skip transcription
- Label Studio integration - Manual annotation and quality control
- Precise segmentation - Exact audio segment extraction matching transcript timings
- Unique file naming - Source-aware segment names to prevent collisions
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync# Option A: use uv for dependency management
!pip install uv
!uv pip install -r requirements.txt
# Option B: plain pip (if you prefer)
!pip install -r requirements.txt# Copy environment template
cp env.example .env
# Edit .env with your API keys
# ASSEMBLYAI_API_KEY=your_key_here
# LABEL_STUDIO_API_KEY=your_key_here (optional)# Install Label Studio
pip install label-studio
# Start Label Studio
label-studio start
# Start static file server for audio files (in another terminal)
python3 scripts/serve_static.py dataset 8000# Create your config file first
# Look at configs/example_config.md and create configs/dataset_config.json
# Basic processing (with VAD and silence removal)
uv run python scripts/process.py configs/dataset_config.json
# or
python scripts/process.py configs/dataset_config.json
# Process without VAD (keep all audio)
VAD_ENABLED=false uv run python scripts/process.py configs/dataset_config.json
# or
VAD_ENABLED=false python scripts/process.py configs/dataset_config.json
# Process without silence removal
REMOVE_LONG_SILENCES=false uv run python scripts/process.py configs/dataset_config.json
# or
REMOVE_LONG_SILENCES=false python scripts/process.py configs/dataset_config.json
# Process with custom settings
MIN_SEGMENT_DURATION=2.0 MAX_SILENCE_DURATION=0.5 uv run python scripts/process.py configs/dataset_config.json
# or
MIN_SEGMENT_DURATION=2.0 MAX_SILENCE_DURATION=0.5 python scripts/process.py configs/dataset_config.jsonCreate your config file:
- Look at the example: Open
configs/example_config.mdto see the JSON structure - Create your config: Create a new file
configs/dataset_config.jsonwith your settings - Edit the paths: Update the
sourcesarray with your audio file paths
{
"name": "my_dataset",
"sources": [
"/path/to/audio1.mp3",
"/path/to/audio2.wav"
]
}{
"name": "my_dataset",
"sources": [
{
"file": "/path/to/audio1.mp3",
"transcript": "/path/to/transcript1.json"
},
{
"file": "/path/to/audio2.wav",
"transcript": "/path/to/transcript2.json"
}
]
}If you have existing transcripts, they should be JSON files with this structure:
{
"utterances": [
{
"text": "Hello world",
"speaker": "A",
"start": 0.0,
"end": 1.5,
"confidence": 0.95
}
]
}ASSEMBLYAI_API_KEY- Your AssemblyAI API key (required)
LABEL_STUDIO_URL- Label Studio URL (default: http://localhost:8080)LABEL_STUDIO_API_KEY- Label Studio API key (optional)
VAD_ENABLED- Enable Voice Activity Detection (default: true)VAD_METHOD- VAD method: silero, webrtc (default: silero)REMOVE_LONG_SILENCES- Remove long silences (default: true)MAX_SILENCE_DURATION- Max silence duration to keep (default: 1.0s)SILENCE_THRESHOLD- Silence detection threshold in dB (default: -30.0)MIN_SPEECH_DURATION- Minimum speech segment duration (default: 0.5s)MIN_SPEECH_RATIO- Minimum speech ratio in segment (default: 0.3)
MIN_SEGMENT_DURATION- Minimum segment duration (default: 1.0s)MAX_SEGMENT_DURATION- Maximum segment duration (default: 3600.0s)MIN_CONFIDENCE_SCORE- Minimum transcription confidence (default: 0.7)
Processed datasets are saved to dataset/:
audio/- Individual audio segments with unique names (e.g.,pb1_segment_0.wav)metadata.json- Complete segment metadata for Label Studiotranscripts/- Per-file transcript JSON files
- Python 3.10+
- PyTorch (CPU)
- Silero VAD
- DeepFilterNet
- AssemblyAI
- Label Studio SDK
- Librosa
- SoundFile
MIT