fix(tts): Remove 440Hz beep, implement ALBERT encoder (#179) #183

m96-chan · 2026-01-01T12:27:33Z

Summary

Fixes #179 - TTS sample outputs beep sound (440Hz sine wave) instead of actual speech.

Changes:

Removed 440Hz sine wave placeholder in _forward_simple() that was causing the beep
Implemented ALBERT encoder (Kokoro uses ALBERT architecture with shared weights, not standard BERT)
Added specialized layers for Kokoro TTS:
- WeightNormConv1d: Convolution with weight normalization (weight_g/weight_v decomposition)
- InstanceNorm1d: Per-channel instance normalization
- AdaIN: Adaptive Instance Normalization for style conditioning
- ALBERTLayer/ALBERTEncoder: ALBERT with shared layer weights
- KokoroTextEncoder: CNN (3 layers) + BiLSTM architecture
- AdaINResBlock: Residual blocks with AdaIN for style-conditioned decoding
Added builder functions:
- build_albert_from_weights(): Constructs ALBERT from weight dict
- build_text_encoder_from_weights(): Constructs text encoder from weight dict
Updated model.py to use actual neural network layers instead of placeholder

Current State:

Text encoding pipeline (ALBERT + text encoder) is implemented
Generates silent audio placeholder instead of beep when full decoder is not yet available
Full decoder/vocoder implementation requires additional weight structure verification

Build Requirements

No C++/CUDA build required. This PR contains Python-only changes.

Linux CMake build should pass in CI without issues.

Test Plan

Testing not yet implemented - Will be done separately as noted in Issue bug(tts): Kokoro TTS outputs 440Hz sine wave instead of speech #179
Verify model loads without errors
Verify ALBERT encoder produces valid hidden states
Verify text encoder produces valid features
Integration test with actual audio generation (pending decoder implementation)

🤖 Generated with Claude Code

Fixes #179 - TTS sample outputs beep sound instead of speech Changes: - Remove 440Hz sine wave placeholder generation in _forward_simple() - Implement ALBERT encoder (Kokoro uses ALBERT, not standard BERT) - Add WeightNormConv1d for weight-normalized convolutions - Add InstanceNorm1d for per-channel normalization - Add AdaIN (Adaptive Instance Normalization) for style conditioning - Add KokoroTextEncoder (CNN + BiLSTM architecture) - Add AdaINResBlock for style-conditioned residual blocks - Add builder functions: build_albert_from_weights(), build_text_encoder_from_weights() - Update model.py to use actual neural network layers - Generate silence placeholder instead of beep when decoder not implemented Note: Full decoder/vocoder implementation requires additional weight mapping. Current implementation runs through ALBERT and text encoder, generating placeholder audio while decoder pipeline is being completed. Testing: Not yet verified - requires model weights and audio playback. Testing will be done separately as noted in Issue #179. Build: No C++/CUDA build required. Python-only changes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

m96-chan merged commit abd01b9 into main Jan 1, 2026
13 checks passed

m96-chan mentioned this pull request Jan 1, 2026

test(tts): Verify Kokoro TTS implementation (#183) #184

Open

16 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(tts): Remove 440Hz beep, implement ALBERT encoder (#179) #183

fix(tts): Remove 440Hz beep, implement ALBERT encoder (#179) #183

Uh oh!

m96-chan commented Jan 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix(tts): Remove 440Hz beep, implement ALBERT encoder (#179) #183

fix(tts): Remove 440Hz beep, implement ALBERT encoder (#179) #183

Uh oh!

Conversation

m96-chan commented Jan 1, 2026

Summary

Build Requirements

Test Plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants