27 changes: 27 additions & 0 deletions models/tts/kittentts/README.md
@@ -0,0 +1,27 @@
# KittenTTS CoreML Conversion

## Status: Complete

KittenTTS Nano is a distilled Kokoro/StyleTTS2 model (15M params, 24kHz) that ships only as an INT8-quantized ONNX file.
It was converted to CoreML by reconstructing the PyTorch model from the ONNX graph, dequantizing the weights, and tracing.

## Architecture

| Component | KittenTTS Nano | Kokoro-82M |
|-----------|---------------|------------|
| BERT embedding dim | 128 | 768 |
| BERT hidden (ALBERT) | 768 | 768 |
| bert_encoder output | 128 | 768 |
| Style dim | 128 | 128 |
| Generator channels | 256->128->64 | 512->256->128 |
| Total params | 15M | 82M |
| Source format | ONNX INT8 | PyTorch |

## Usage

```bash
cd coreml
python convert_kittentts.py --seconds 5 --output kittentts_5s.mlpackage
```

See [coreml/README.md](coreml/README.md) for full documentation.
195 changes: 195 additions & 0 deletions models/tts/kittentts/coreml/README.md
@@ -0,0 +1,195 @@
# KittenTTS Nano CoreML

Convert [KittenTTS Nano](https://huggingface.co/KittenML/kitten-tts-nano-0.1) (15M param distilled Kokoro/StyleTTS2) from ONNX to CoreML for on-device inference on iOS and macOS.

**Features**: 15M params (tiny) | 24kHz audio | CPU-optimized | FP32 CoreML | Single model

---

## Quick Start

### Prerequisites

```bash
python3.10 -m venv venv
source venv/bin/activate
pip install torch coremltools onnx onnxruntime numpy scipy phonemizer
```

Also requires `espeak-ng` for phonemization:
```bash
brew install espeak-ng
```

### Convert

```bash
# 5-second model (70 max tokens)
python convert_kittentts.py --seconds 5 --output kittentts_5s.mlpackage

# 10-second model (140 max tokens)
python convert_kittentts.py --seconds 10 --output kittentts_10s.mlpackage

# Verify weights only (no conversion)
python convert_kittentts.py --verify-only
```

### Inference (Python)

```python
import numpy as np
import coremltools as ct

model = ct.models.MLModel("kittentts_5s.mlpackage")

# Phonemize text with espeak
import phonemizer, re
backend = phonemizer.backend.EspeakBackend(language="en-us", preserve_punctuation=True, with_stress=True)
phonemes = backend.phonemize(["Hello world"])[0]
tokens = ' '.join(re.findall(r"\w+|[^\w\s]", phonemes))

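# Build input_ids from vocab (see convert_kittentts.py for full vocab string)
input_ids = np.zeros((1, 70), dtype=np.int32)
# ... fill with token indices ...

# Build the matching attention mask: 1 = valid token, 0 = padding
num_tokens = 0  # set to the number of token indices written into input_ids
attention_mask = np.zeros((1, 70), dtype=np.int32)
attention_mask[0, :num_tokens] = 1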

# Load voice
voices = np.load("voices.npz")
ref_s = voices["expr-voice-2-m"].reshape(1, -1).astype(np.float32)

out = model.predict({
"input_ids": input_ids,
"ref_s": ref_s,
"random_phases": np.random.randn(1, 9).astype(np.float32),
"attention_mask": attention_mask,
"source_noise": np.random.randn(1, 120000, 9).astype(np.float32),
})

audio = out["audio"].flatten()
length = int(out["audio_length_samples"].flatten()[0])
audio = audio[:length] # tail is already zeroed
```

---

## Model Architecture

```
Text -> Phonemes -> ALBERT -> Duration -> Alignment -> F0/Energy -> Style -> Decoder -> Generator -> Audio
```

KittenTTS Nano is a distilled version of Kokoro/StyleTTS2 with the same architecture but smaller dimensions:

| Component | Kokoro-82M | KittenTTS Nano |
|-----------|-----------|----------------|
| ALBERT hidden | 768 | 768 |
| ALBERT embed | 128 | 128 |
| Style dim | 128 | 128 |
| Text encoder | 512 | 512 |
| Decoder blocks | 4 | 4 |
| Generator channels | 512->256->128 | 256->128->64 |
| Parameters | 82M | 15M |
| Quantization | None | INT8 (ONNX) |

### Key Components

- **ALBERT Encoder**: Shared-weight transformer for phoneme context (4 repeats of 1 layer)
- **Predictor**: Duration, F0, and energy prediction with bidirectional LSTMs
- **Decoder**: 4 AdaIN decode blocks with style conditioning
- **Generator**: ISTFTNet vocoder with Snake activations, harmonic source module
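
The Snake activation used in the generator can be sketched in a few lines. This is a minimal NumPy illustration of the formula `x + (1/a)*sin^2(a*x)`, assuming a single scalar `alpha`; the function name `snake` is hypothetical, and the real model may well learn a separate `alpha` per channel.

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x)."""
    x = np.asarray(x, dtype=np.float32)
    # The periodic sin^2 term is added on top of the identity path.
    return x + np.sin(alpha * x) ** 2 / alpha

# Example call on a few sample values (alpha chosen arbitrarily).
y = snake(np.array([-1.0, 0.0, 1.0]), alpha=2.0)
```

Unlike LeakyReLU, the periodic term gives the network an inductive bias toward oscillating signals, which may be why substituting LeakyReLU produced robotic audio.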

---

## Conversion Details

The ONNX model uses INT8 quantization (ConvInteger, MatMulInteger, DynamicQuantizeLSTM). The conversion pipeline:

1. **Extract & dequantize** ONNX weights ((INT8 - zero_point) * scale -> FP32)
2. **Reconstruct** PyTorch model architecture from ONNX graph analysis
3. **Load** dequantized weights into PyTorch model (561/573 parameters)
4. **Trace** with `torch.jit.trace` using fixed input shapes
5. **Convert** to CoreML mlprogram format (FP32, iOS 17+)
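
The dequantization in step 1 can be sketched as follows. This is a minimal illustration of standard affine dequantization, `real = (q - zero_point) * scale`, and not the code from `convert_kittentts.py`; the helper name `dequantize` and the sample values are assumptions.

```python
import numpy as np

def dequantize(q, scale, zero_point=0):
    """Affine dequantization: real = (q - zero_point) * scale."""
    # Widen to int32 first so int8/uint8 subtraction cannot overflow.
    q = np.asarray(q).astype(np.int32)
    zero_point = np.asarray(zero_point, dtype=np.int32)
    scale = np.asarray(scale, dtype=np.float32)
    return (q - zero_point).astype(np.float32) * scale

# Hypothetical signed weights with a symmetric (zero_point = 0) scheme.
w_int8 = np.array([-128, 0, 64, 127], dtype=np.int8)
w_fp32 = dequantize(w_int8, scale=0.02, zero_point=0)

# Hypothetical unsigned values with a non-zero zero point.
a_fp32 = dequantize(np.array([0, 128, 255], dtype=np.uint8), scale=0.5, zero_point=128)
```

Because `scale` and `zero_point` go through `np.asarray`, the same helper also accepts per-channel arrays as long as they broadcast against `q`.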

### Bugs Fixed During Conversion

| Bug | Impact | Fix |
|-----|--------|-----|
| LSTM gate order ONNX [i,o,f,c] vs PyTorch [i,f,g,o] | Silent wrong output | Reorder gate weights during loading |
| BERT weight mapping (embedding_hidden_mapping_in swapped) | Wrong text encoding | Swap weight assignment |
| BatchNorm1d instead of LayerNorm in TextEncoder | Different normalization | Replace with LayerNorm + LeakyReLU |
| LeakyReLU instead of Snake activation in resblocks | Robotic audio | Implement Snake: x + (1/a)*sin^2(a*x) |
| Resblock dilations (1,1,1) instead of (1,3,5) | ~2x volume loss | Set convs1 dilations to (1,3,5) |
| NoiseResBlock missing dilations | Degraded noise path | Add dilations parameter (1,3,5) |
| reflection_pad (3,3) instead of (1,0) | Wrong padding | Fix to nn.ReflectionPad1d((1,0)) |
| conv_post missing padding=3 | Frequency response error | Add padding=3 to Conv1d |
| Phase accumulation fp32 drift in CoreML | Robotic harmonics | Chunked cumsum with periodic wrapping |
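
The LSTM gate reordering from the first row can be sketched like this. ONNX stacks the four gate blocks as [i, o, f, c] along the first axis, while PyTorch expects [i, f, g, o], where PyTorch's `g` is ONNX's `c`. The helper name `onnx_to_torch_gates` is hypothetical, and it assumes the input is one direction's weight matrix (or bias vector) of leading size `4 * hidden_size`.

```python
import numpy as np

def onnx_to_torch_gates(w, hidden_size):
    """Reorder stacked gate blocks from ONNX [i, o, f, c] to PyTorch [i, f, g, o]."""
    # Slice the four equally sized gate blocks in ONNX order.
    i, o, f, c = (w[k * hidden_size:(k + 1) * hidden_size] for k in range(4))
    # PyTorch's cell-candidate gate g corresponds to ONNX's c.
    return np.concatenate([i, f, c, o], axis=0)

# Toy weight matrix whose rows are labeled by their ONNX gate index (0=i, 1=o, 2=f, 3=c).
hidden, inputs = 2, 3
w_onnx = np.repeat(np.arange(4), hidden)[:, None] * np.ones((1, inputs))
w_torch = onnx_to_torch_gates(w_onnx, hidden)
```

Skipping this step does not raise an error because the shapes already match, which is why the bug shows up only as silently wrong output.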

### CoreML-Specific Fixes

- **Phase accumulation**: `torch.cumsum` over 42k steps causes fp32 precision drift between CoreML and PyTorch runtimes. Higher harmonics (9th at 1800Hz) lose correlation (0.79). Fix: reshape into 300-step frames, cumsum per frame, carry wrapped inter-frame phase.
- **Fixed frame count**: Model uses `fixed_total_frames` to avoid dynamic shape issues in traced graph.
- **Tail zeroing**: Audio buffer is zeroed past `audio_length_samples` so consumers don't need to trim.
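
The chunked phase accumulation can be sketched in NumPy as below. This is an illustration of the idea under stated assumptions, not the traced PyTorch code: phase is measured in cycles (so wrapping is `% 1.0`), `increments` is the per-sample step `f0 / sample_rate`, and the name `chunked_phase` is hypothetical.

```python
import numpy as np

def chunked_phase(increments, frame=300):
    """Cumulative phase in cycles, wrapped to [0, 1), computed frame by frame."""
    increments = np.asarray(increments)
    n = increments.shape[0]
    # Zero-pad so the signal splits evenly into frames of `frame` steps.
    pad = (-n) % frame
    frames = np.pad(increments, (0, pad)).reshape(-1, frame)
    # Short cumsum inside each frame keeps the running sum small.
    within = np.cumsum(frames, axis=1)
    # Wrapped phase advance contributed by each frame.
    totals = within[:, -1] % 1.0
    # Phase at the start of each frame: running sum of wrapped totals, wrapped again.
    start = np.zeros(1, dtype=totals.dtype)
    carry = np.concatenate([start, np.cumsum(totals)[:-1]]) % 1.0
    phase = (within + carry[:, None]) % 1.0
    return phase.reshape(-1)[:n]

# 42k steps of a constant 1800 Hz harmonic at 24 kHz.
inc = np.full(42000, 1800.0 / 24000.0)
phase = chunked_phase(inc)
```

Multiplying the result by `2 * np.pi` gives the phase in radians. The long sum never grows beyond one frame's worth of increments plus a wrapped carry, which is what limits the accumulated rounding error.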

---

## Inputs & Outputs

### Inputs

| Name | Shape | Type | Description |
|------|-------|------|-------------|
| `input_ids` | [1, N] | INT32 | Phoneme token IDs (0-padded) |
| `ref_s` | [1, 256] | FLOAT32 | Voice style vector (from voices.npz) |
| `random_phases` | [1, 9] | FLOAT32 | Initial harmonic phases |
| `attention_mask` | [1, N] | INT32 | 1=valid token, 0=padding |
| `source_noise` | [1, T, 9] | FLOAT32 | Stochastic noise for unvoiced regions |

N = max tokens (e.g. 70 for 5s model). T = max audio samples (e.g. 120000 for 5s).
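
Padding a token sequence to N and building the matching mask can be sketched as follows. The helper name `pad_inputs` and the token IDs in the example are hypothetical; real IDs come from the vocab string in `convert_kittentts.py`.

```python
import numpy as np

def pad_inputs(token_ids, max_tokens=70):
    """Return (input_ids, attention_mask), each shaped [1, max_tokens] as int32."""
    if len(token_ids) > max_tokens:
        raise ValueError(f"{len(token_ids)} tokens exceed the model limit of {max_tokens}")
    input_ids = np.zeros((1, max_tokens), dtype=np.int32)       # 0-padded
    attention_mask = np.zeros((1, max_tokens), dtype=np.int32)  # 0 = padding
    input_ids[0, :len(token_ids)] = token_ids
    attention_mask[0, :len(token_ids)] = 1                      # 1 = valid token
    return input_ids, attention_mask

# Hypothetical token IDs, for illustration only.
input_ids, attention_mask = pad_inputs([12, 47, 3, 98], max_tokens=70)
```

Raising on overlong input is a design choice in this sketch; a caller could instead split the text and synthesize it in several passes.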

### Outputs

| Name | Shape | Type | Description |
|------|-------|------|-------------|
| `audio` | [1, 1, T+20] | FLOAT32 | Audio waveform (24kHz), zeroed past valid length |
| `audio_length_samples` | [1] | INT32 | Number of valid audio samples |
| `pred_dur` | [1, N] | FLOAT32 | Predicted duration per token (frames) |
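
The `audio` and `audio_length_samples` outputs can be written to a 16-bit WAV file as sketched here. This uses only the standard-library `wave` module plus NumPy; the helper name `save_wav` is hypothetical, and the sketch assumes the waveform is roughly in the range [-1, 1], clipping anything outside it.

```python
import wave
import numpy as np

def save_wav(path, audio, length, sample_rate=24000):
    """Trim to the valid length and write mono 16-bit PCM. Returns samples written."""
    samples = np.asarray(audio, dtype=np.float32).reshape(-1)[:length]
    # Clip, then scale floats to little-endian int16.
    pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)          # 2 bytes = 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
    return len(pcm)

# Stand-in for a model output of shape [1, 1, T+20] with 1000 valid samples.
fake_audio = np.zeros((1, 1, 120020), dtype=np.float32)
```

With a real prediction, the call would look like `save_wav("out.wav", out["audio"], int(out["audio_length_samples"].flatten()[0]))`.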

---

## Verification Results

Comparison with ONNX reference (same text, matched frame count):

| Metric | Value |
|--------|-------|
| CoreML vs PyTorch correlation | 0.963 |
| RMS ratio (CoreML/ONNX) | 0.99 |
| Whisper transcription match | Identical |
| Parameters loaded | 561/573 (12 use defaults) |

The 12 unloaded parameters are `predictor.text_encoder.lstms.{1,3,5,7,9,11}.norm.{weight,bias}` — LayerNorm layers that default to weight=1, bias=0, matching the ONNX constants.
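
The waveform metrics in the table can be computed along these lines. This sketch assumes "correlation" means the Pearson correlation of time-aligned waveforms and "RMS ratio" means the ratio of root-mean-square amplitudes; the README does not spell out either definition, and the helper name `compare_waveforms` is hypothetical.

```python
import numpy as np

def compare_waveforms(test, ref):
    """Return (Pearson correlation, RMS ratio test/ref) over the common length."""
    n = min(len(test), len(ref))
    a = np.asarray(test[:n], dtype=np.float64)
    b = np.asarray(ref[:n], dtype=np.float64)
    corr = float(np.corrcoef(a, b)[0, 1])
    rms_ratio = float(np.sqrt(np.mean(a ** 2)) / np.sqrt(np.mean(b ** 2)))
    return corr, rms_ratio

# Synthetic check: a half-amplitude copy of a 200 Hz tone sampled at 24 kHz.
t = np.arange(24000) / 24000.0
ref_wave = np.sin(2 * np.pi * 200.0 * t)
corr, ratio = compare_waveforms(0.5 * ref_wave, ref_wave)
```

Correlation is insensitive to overall gain, so the RMS ratio is reported alongside it to catch volume errors such as the resblock dilation bug.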

---

## Source Model

- **Model**: [KittenML/kitten-tts-nano-0.1](https://huggingface.co/KittenML/kitten-tts-nano-0.1)
- **Format**: ONNX (INT8 quantized, 23.8 MB)
- **Sample rate**: 24kHz
- **Voices**: 6 voices in `voices.npz`
- **Architecture**: Distilled Kokoro/StyleTTS2

---

## Files

```
coreml/
├── convert_kittentts.py # Conversion script (model architecture + weight loading + CoreML export)
├── README.md # This file
├── kitten_tts_nano_weights.npz # Extracted dequantized weights (numpy)
└── kitten_tts_nano_weights.pt # Extracted weights (PyTorch state dict)
```
Comment on lines +186 to +191

🔴 Missing pyproject.toml and uv.lock required by AGENTS.md for target directories

Both AGENTS.md files explicitly require that each target directory bundles its own pyproject.toml and uv.lock: "Each target directory is self-contained: pyproject.toml, uv.lock, conversion scripts, docs, and sample assets." Every other coreml target directory in the repo has these files (e.g. models/tts/kokoro/coreml/, models/tts/magpie/coreml/, models/vad/silero-vad/coreml/, etc.). The models/tts/kittentts/coreml/ directory only contains README.md and convert_kittentts.py, missing both pyproject.toml and uv.lock. This means uv sync cannot be run from this target directory, breaking the standard development workflow described in AGENTS.md.

Prompt for agents
Add a pyproject.toml and uv.lock to models/tts/kittentts/coreml/. Follow the pattern from models/tts/kokoro/coreml/pyproject.toml. The pyproject.toml should declare the project dependencies (torch, coremltools, onnx, onnxruntime, numpy, scipy, phonemizer, huggingface_hub) with requires-python = ">=3.10". Then run uv lock to generate the uv.lock file. Also update the README.md Files section (lines 186-191) to include pyproject.toml and uv.lock in the directory listing, and change the Quick Start prerequisites section (lines 13-17) to use uv sync instead of raw pip install.


---

**Requires**: iOS 17+ / macOS 14+ | Python 3.10+ | coremltools 9.0+