19 commits
a20f040
feat: add Qwen3-TTS backend for multilingual text-to-speech
Alex-Wengg Feb 5, 2026
147142b
feat: auto-download Qwen3-TTS models from HuggingFace
Alex-Wengg Feb 5, 2026
334aded
fix: include .npy and .bin files in model downloads
Alex-Wengg Feb 12, 2026
fa90426
fix: add repetition penalty to prevent LM getting stuck
Alex-Wengg Feb 12, 2026
ca6ae7d
fix: align CB0/CP sampling with verified PyTorch pipeline
Alex-Wengg Feb 13, 2026
bfbf3ac
fix: move Qwen3TTS to FluidAudioEspeak after module rename
Alex-Wengg Feb 13, 2026
07c3c40
Merge branch 'main' into feature/qwen3-tts-coreml
Alex-Wengg Mar 22, 2026
0acb749
Merge remote-tracking branch 'origin/main' into feature/qwen3-tts-coreml
Alex-Wengg Mar 22, 2026
bd53d76
fix: move Qwen3TTS into FluidAudio module
Alex-Wengg Mar 22, 2026
cb946c9
Fix Qwen3-TTS audio quality issues
Alex-Wengg Mar 22, 2026
c8a5056
Add missing source_noise input to Kokoro TTS models
Alex-Wengg Mar 22, 2026
0dd3038
Fix ModelNamesTests for Qwen3-TTS models
Alex-Wengg Mar 22, 2026
1a43043
Mark Qwen3-TTS APIs as beta
Alex-Wengg Mar 22, 2026
9fc4353
Address Devin review findings for Qwen3-TTS
Alex-Wengg Mar 22, 2026
1f686ec
Add Qwen3-TTS documentation
Alex-Wengg Mar 22, 2026
7a9b345
Remove internal Qwen3-TTS issues doc from tracking
Alex-Wengg Mar 22, 2026
cbb1684
Address second Devin review: guard-let, Float16 safety, cleanup, logger
Alex-Wengg Mar 22, 2026
021fb77
Include embeddings in isLoaded check
Alex-Wengg Mar 22, 2026
f4477b5
Add beta testing call-to-action in Qwen3-TTS docs
Alex-Wengg Mar 22, 2026
104 changes: 104 additions & 0 deletions Documentation/TTS/Qwen3TTS.md
@@ -0,0 +1,104 @@
# Qwen3-TTS: Multilingual Text-to-Speech (Beta)

## Overview

Qwen3-TTS is an LLM-based multilingual TTS backend built on the Qwen3 language model. It supports 10 languages including English and Chinese, producing natural speech at 24 kHz via a 4-stage CoreML pipeline.

> **Beta.** Qwen3-TTS is in early beta. It does not yet include a built-in text tokenizer — input must be pre-tokenized externally (e.g., via the Python `qwen-tts` package). If you run into issues or have feedback, please open an issue. We'd love help testing across languages and hardware configs.

## Quick Start

### CLI

```bash
# English
swift run fluidaudiocli tts --backend qwen3 \
"Hello world, this is a test of the text to speech system." \
--output hello.wav

# Chinese
swift run fluidaudiocli tts --backend qwen3 \
"你好世界,这是一个文字转语音系统的测试。" \
--output chinese.wav
```

Models are auto-downloaded from HuggingFace on first run.

### Swift

```swift
import FluidAudio

let manager = Qwen3TtsManager()
try await manager.loadIfNeeded()

// Token IDs must be generated externally (e.g., via Python qwen-tts processor)
let tokenIds = [9707, 1879, 11, 419, 374, 264, 1273, 315, 279, 1467, 4686, 1331, 39586, 1849, 13]
let result = try await manager.synthesize(text: "Hello world", tokenIds: tokenIds)

let outputURL = URL(fileURLWithPath: "/tmp/qwen3_output.wav")
try result.audio.write(to: outputURL)
```

## Pipeline

```
text tokens ──► Prefill ──► LM Decode Loop ──► Audio Decoder ──► WAV
│ │
│ ┌────┴────┐
│ │ CB0 │ (greedy with repetition penalty)
│ │ CB1-15 │ (code predictor, temperature sampling)
│ └─────────┘
role_ids + text_ids + speaker_embed + TTS special tokens
```

### Stages

| Stage | Model | Description |
|-------|-------|-------------|
| 1. Prefill | `qwen3_tts_lm_prefill_v9` | Encodes text context → initial logits, KV cache, past hidden state |
| 2. LM Decode | `qwen3_tts_lm_decode_v10` | Autoregressive loop generating CB0 tokens (main codebook) |
| 3. Code Predictor | `qwen3_tts_cp_prefill` + `qwen3_tts_cp_decode` | Generates CB1-15 from past hidden + CB0 per step |
| 4. Audio Decoder | `qwen3_tts_decoder_10s` | Converts 16-layer codebook frames to 24 kHz waveform |
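The interaction between stages 2 and 3 can be sketched roughly as follows. This is an illustrative outline only, not the shipped implementation: `synthesizeFrames`, `lmStep`, and `residualCodes` are hypothetical names standing in for the CoreML decode and code-predictor models, and the CB0 logit processing described below is omitted for brevity.

```swift
/// Sketch of the per-step generation loop (stages 2-3).
/// `lmStep` stands in for qwen3_tts_lm_decode_v10, `residualCodes`
/// for the cp_prefill/cp_decode pair; both signatures are hypothetical.
func synthesizeFrames(
    initialLogits: [Float],
    lmStep: (Int) -> [Float],       // last CB0 token -> next CB0 logits
    residualCodes: (Int) -> [Int],  // CB0 token -> CB1-15 tokens
    maxFrames: Int = 125,
    eosToken: Int = 2150
) -> [[Int]] {
    var frames: [[Int]] = []
    var logits = initialLogits
    while frames.count < maxFrames {
        // Greedy pick of the main codebook token (repetition penalty
        // and token suppression omitted here for brevity).
        let cb0 = logits.indices.max(by: { logits[$0] < logits[$1] })!
        if cb0 == eosToken { break }
        frames.append([cb0] + residualCodes(cb0))  // 16-entry frame per step
        logits = lmStep(cb0)
    }
    return frames  // handed to the audio decoder (stage 4)
}
```

Each loop iteration yields one 16-layer codec frame; the audio decoder then turns the accumulated frames into the 24 kHz waveform.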

## Files

| File | Role |
|------|------|
| `Qwen3TtsManager.swift` | Public API — `loadIfNeeded()`, `synthesize()` |
| `Qwen3TtsSynthesizer.swift` | Core inference pipeline — prefill, decode loop, code predictor, audio decoder |
| `Qwen3TtsModelStore.swift` | Loads and stores 5 CoreML models + embeddings from `.npy` files |
| `Qwen3TtsConstants.swift` | Model dimensions, special token IDs, sampling parameters |
| `Qwen3TtsResourceDownloader.swift` | Auto-downloads models from HuggingFace |

## Sampling

CB0 (main language model) uses greedy decoding with logit processors:
- Repetition penalty (1.05) on all previously generated CB0 tokens
- Token suppression: tokens 2048-3071 masked except EOS (2150)
- `min_new_tokens`: EOS suppressed for first 2 steps
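A minimal sketch of these three logit processors, using the constant values listed above. The function name, argument shapes, and processor ordering are illustrative assumptions, not the production code:

```swift
/// Applies the CB0 logit processors in place (illustrative sketch;
/// `processCb0Logits` is a hypothetical name, not the shipped API).
func processCb0Logits(
    _ logits: inout [Float],
    history: [Int],        // previously generated CB0 tokens
    step: Int,             // 0-based generation step
    repetitionPenalty: Float = 1.05,
    eosToken: Int = 2150,
    minNewTokens: Int = 2
) {
    // Repetition penalty on every previously generated CB0 token.
    for token in Set(history) {
        let l = logits[token]
        logits[token] = l > 0 ? l / repetitionPenalty : l * repetitionPenalty
    }
    // Suppress the 2048-3071 range, keeping only EOS reachable.
    for token in 2048...3071 where token != eosToken && token < logits.count {
        logits[token] = -.infinity
    }
    // min_new_tokens: keep EOS suppressed for the first steps.
    if step < minNewTokens {
        logits[eosToken] = -.infinity
    }
}
```

After these processors run, CB0 is chosen by a plain argmax over the masked logits.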

CB1-15 (code predictor) uses temperature sampling:
- Temperature: 0.9
- Top-K: 50
- Greedy code prediction produces silent/broken audio; temperature sampling is required.
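For reference, a minimal top-K temperature sampler consistent with the parameters above (a generic sketch; the production sampler in `Qwen3TtsSynthesizer.swift` may differ in detail):

```swift
import Foundation

/// Samples one token from the top-k logits at the given temperature
/// (illustrative sketch, not the shipped implementation).
func sampleTopK(_ logits: [Float], k: Int = 50, temperature: Float = 0.9) -> Int {
    // Keep the k highest-scoring token indices.
    let top = logits.indices.sorted { logits[$0] > logits[$1] }.prefix(k)
    // Temperature-scaled softmax over the kept logits (max-subtracted
    // for numerical stability).
    let scaled = top.map { Double(logits[$0]) / Double(temperature) }
    let maxV = scaled.max() ?? 0
    let exps = scaled.map { exp($0 - maxV) }
    let total = exps.reduce(0, +)
    // Draw from the resulting distribution.
    var r = Double.random(in: 0..<total)
    for (i, e) in exps.enumerated() {
        r -= e
        if r <= 0 { return top[top.startIndex + i] }
    }
    return top.last!
}
```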

## Languages

Qwen3-TTS supports 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian.

Language IDs are embedded via the codec embedding table during prefill (e.g., English = 2050, Chinese = 2055).

## Limitations

- **No built-in tokenizer.** Text must be pre-tokenized using the Qwen3 tokenizer externally. The CLI currently supports two hardcoded test sentences.
- **Max 128 text tokens.** Longer inputs are truncated.
- **Max 125 codec frames.** Generates up to ~10 seconds of audio per call.
- **CPU+GPU compute.** Models run on `cpuAndGPU` compute units (no ANE optimization yet).

## Model Source

Models are hosted at [alexwengg/qwen3-tts-coreml](https://huggingface.co/alexwengg/qwen3-tts-coreml) on HuggingFace.

Based on [Qwen/Qwen3-TTS-12Hz-0.6B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base).
1 change: 1 addition & 0 deletions Sources/FluidAudio/DownloadUtils.swift
@@ -329,6 +329,7 @@ public class DownloadUtils {
shouldInclude =
patterns.isEmpty || patterns.contains { itemPath.hasPrefix($0) }
|| itemPath.hasSuffix(".json") || itemPath.hasSuffix(".txt")
|| itemPath.hasSuffix(".npy") || itemPath.hasSuffix(".bin")
}
if shouldInclude {
let fileSize = item["size"] as? Int ?? -1
48 changes: 48 additions & 0 deletions Sources/FluidAudio/ModelNames.swift
@@ -17,6 +17,7 @@ public enum Repo: String, CaseIterable {
case pocketTts = "FluidInference/pocket-tts-coreml"
case qwen3Asr = "FluidInference/qwen3-asr-0.6b-coreml/f32"
case qwen3AsrInt8 = "FluidInference/qwen3-asr-0.6b-coreml/int8"
case qwen3Tts = "alexwengg/qwen3-tts-coreml"

/// Repository slug (without owner)
public var name: String {
@@ -51,6 +52,8 @@
return "qwen3-asr-0.6b-coreml/f32"
case .qwen3AsrInt8:
return "qwen3-asr-0.6b-coreml/int8"
case .qwen3Tts:
return "qwen3-tts-coreml"
}
}

@@ -69,6 +72,8 @@
return "FluidInference/ls-eend-coreml"
case .qwen3Asr, .qwen3AsrInt8:
return "FluidInference/qwen3-asr-0.6b-coreml"
case .qwen3Tts:
return "alexwengg/qwen3-tts-coreml"
default:
return "FluidInference/\(name)"
}
@@ -109,6 +114,8 @@
return "ls-eend"
case .pocketTts:
return "pocket-tts"
case .qwen3Tts:
return "qwen3-tts"
default:
return name
}
@@ -423,6 +430,45 @@ public enum ModelNames {
]
}

/// Qwen3-TTS model names (LLM-based multilingual TTS)
public enum Qwen3TTS {
public static let lmPrefill = "qwen3_tts_lm_prefill_v9"
public static let lmDecode = "qwen3_tts_lm_decode_v10"
public static let cpPrefill = "qwen3_tts_cp_prefill"
public static let cpDecode = "qwen3_tts_cp_decode"
public static let audioDecoder = "qwen3_tts_decoder_10s"

public static let lmPrefillFile = lmPrefill + ".mlmodelc"
public static let lmDecodeFile = lmDecode + ".mlmodelc"
public static let cpPrefillFile = cpPrefill + ".mlmodelc"
public static let cpDecodeFile = cpDecode + ".mlmodelc"
public static let audioDecoderFile = audioDecoder + ".mlmodelc"

/// Speaker embedding file.
public static let speakerEmbeddingFile = "speaker_embedding_official.npy"

/// Code predictor embedding tables [15, 2048, 1024].
public static let cpEmbeddingsFile = "cp_embeddings.npy"

/// TTS special token embedding files.
public static let ttsBosEmbedFile = "tts_bos_embed.npy"
public static let ttsPadEmbedFile = "tts_pad_embed.npy"
public static let ttsEosEmbedFile = "tts_eos_embed.npy"

public static let requiredModels: Set<String> = [
lmPrefillFile,
lmDecodeFile,
cpPrefillFile,
cpDecodeFile,
audioDecoderFile,
speakerEmbeddingFile,
cpEmbeddingsFile,
ttsBosEmbedFile,
ttsPadEmbedFile,
ttsEosEmbedFile,
]
}

/// Multilingual G2P (CharsiuG2P ByT5) model names
public enum MultilingualG2P {
public static let encoder = "MultilingualG2PEncoder"
@@ -540,6 +586,8 @@ public enum ModelNames {
return ModelNames.LSEEND.requiredModels
case .qwen3Asr, .qwen3AsrInt8:
return ModelNames.Qwen3ASR.requiredModelsFull
case .qwen3Tts:
return ModelNames.Qwen3TTS.requiredModels
}
}
}
@@ -304,7 +304,22 @@ public struct KokoroSynthesizer {
zeroFill: true
)

// Source noise for newer Kokoro models
let maxSeconds = variant.maxDurationSeconds
let noiseLength = TtsConstants.audioSampleRate * maxSeconds
let sourceNoise = try await multiArrayPool.rent(
shape: [1, noiseLength, 9],
dataType: .float16,
zeroFill: false
)
let noisePointer = sourceNoise.dataPointer.bindMemory(to: UInt16.self, capacity: noiseLength * 9)
for i in 0..<(noiseLength * 9) {
let randomValue = Float.random(in: -1...1)
noisePointer[i] = Float16(randomValue).bitPattern
}

func recycleModelArrays() async {
await multiArrayPool.recycle(sourceNoise, zeroFill: false)
await multiArrayPool.recycle(phasesArray, zeroFill: true)
await multiArrayPool.recycle(attentionMask, zeroFill: false)
await multiArrayPool.recycle(inputArray, zeroFill: false)
@@ -338,6 +353,7 @@
"attention_mask": attentionMask,
"ref_s": refStyle,
"random_phases": phasesArray,
"source_noise": sourceNoise,
])

let predictionStart = Date()
65 changes: 65 additions & 0 deletions Sources/FluidAudio/TTS/Qwen3TTS/Qwen3TtsConstants.swift
@@ -0,0 +1,65 @@
import Foundation

/// Constants for the Qwen3-TTS language model TTS backend.
public enum Qwen3TtsConstants {

// MARK: - Audio

public static let audioSampleRate: Int = 24_000

// MARK: - Model dimensions

public static let hiddenSize: Int = 1024
public static let numHeads: Int = 16
public static let numKvHeads: Int = 8
public static let headDim: Int = 128
public static let numLayers: Int = 28
public static let vocabSize: Int = 152064
public static let numCodebooks: Int = 16
public static let numCodeGroups: Int = 16

// MARK: - Special token IDs

public static let ttsBosTokenId: Int = 151672
public static let ttsPadTokenId: Int = 151671
public static let ttsEosTokenId: Int = 151673
public static let codecBosTokenId: Int = 2149
public static let codecEosTokenId: Int = 2150
public static let codecPadTokenId: Int = 2050

// MARK: - Language IDs

public static let languageEnglish: Int = 2050
public static let languageChinese: Int = 2055

// MARK: - Role prefix tokens

public static let rolePrefixTokens: [Int] = [151644, 77091, 198]

// MARK: - Generation parameters

public static let maxTextLength: Int = 128
public static let maxCodecTokens: Int = 125

/// CB0 (outer LM) sampling parameters
public static let temperature: Float = 0.7
public static let topK: Int = 30
public static let repetitionPenalty: Float = 1.05
public static let minNewTokens: Int = 2

/// CB1-15 (code predictor) sampling parameters
public static let cpTemperature: Float = 0.9
public static let cpTopK: Int = 50

// MARK: - KV cache

/// Maximum KV cache length (prefill + generated tokens)
public static let maxKvLength: Int = 200

/// Number of KV cache entries (2 per layer: key + value)
public static let kvCacheEntries: Int = 56

// MARK: - Default voice

public static let defaultVoice: String = "default"
}