19 commits
a20f040
feat: add Qwen3-TTS backend for multilingual text-to-speech
Alex-Wengg Feb 5, 2026
147142b
feat: auto-download Qwen3-TTS models from HuggingFace
Alex-Wengg Feb 5, 2026
334aded
fix: include .npy and .bin files in model downloads
Alex-Wengg Feb 12, 2026
fa90426
fix: add repetition penalty to prevent LM getting stuck
Alex-Wengg Feb 12, 2026
ca6ae7d
fix: align CB0/CP sampling with verified PyTorch pipeline
Alex-Wengg Feb 13, 2026
bfbf3ac
fix: move Qwen3TTS to FluidAudioEspeak after module rename
Alex-Wengg Feb 13, 2026
07c3c40
Merge branch 'main' into feature/qwen3-tts-coreml
Alex-Wengg Mar 22, 2026
0acb749
Merge remote-tracking branch 'origin/main' into feature/qwen3-tts-coreml
Alex-Wengg Mar 22, 2026
bd53d76
fix: move Qwen3TTS into FluidAudio module
Alex-Wengg Mar 22, 2026
cb946c9
Fix Qwen3-TTS audio quality issues
Alex-Wengg Mar 22, 2026
c8a5056
Add missing source_noise input to Kokoro TTS models
Alex-Wengg Mar 22, 2026
0dd3038
Fix ModelNamesTests for Qwen3-TTS models
Alex-Wengg Mar 22, 2026
1a43043
Mark Qwen3-TTS APIs as beta
Alex-Wengg Mar 22, 2026
9fc4353
Address Devin review findings for Qwen3-TTS
Alex-Wengg Mar 22, 2026
1f686ec
Add Qwen3-TTS documentation
Alex-Wengg Mar 22, 2026
7a9b345
Remove internal Qwen3-TTS issues doc from tracking
Alex-Wengg Mar 22, 2026
cbb1684
Address second Devin review: guard-let, Float16 safety, cleanup, logger
Alex-Wengg Mar 22, 2026
021fb77
Include embeddings in isLoaded check
Alex-Wengg Mar 22, 2026
f4477b5
Add beta testing call-to-action in Qwen3-TTS docs
Alex-Wengg Mar 22, 2026
104 changes: 104 additions & 0 deletions Documentation/TTS/Qwen3TTS.md
@@ -0,0 +1,104 @@
# Qwen3-TTS: Multilingual Text-to-Speech (Beta)

## Overview

Qwen3-TTS is an LLM-based multilingual TTS backend built on the Qwen3 language model. It supports 10 languages including English and Chinese, producing natural speech at 24 kHz via a 4-stage CoreML pipeline.

> **Beta.** Qwen3-TTS is in early beta. It does not yet include a built-in text tokenizer — input must be pre-tokenized externally (e.g., via the Python `qwen-tts` package). If you run into issues or have feedback, please open an issue. We'd love help testing across languages and hardware configs.

## Quick Start

### CLI

```bash
# English
swift run fluidaudiocli tts --backend qwen3 \
"Hello world, this is a test of the text to speech system." \
--output hello.wav

# Chinese
swift run fluidaudiocli tts --backend qwen3 \
"你好世界,这是一个文字转语音系统的测试。" \
--output chinese.wav
```

Models are auto-downloaded from HuggingFace on first run.

### Swift

```swift
import FluidAudio

let manager = Qwen3TtsManager()
try await manager.loadIfNeeded()

// Token IDs must be generated externally (e.g., via Python qwen-tts processor)
let tokenIds = [9707, 1879, 11, 419, 374, 264, 1273, 315, 279, 1467, 4686, 1331, 39586, 1849, 13]
let result = try await manager.synthesize(text: "Hello world", tokenIds: tokenIds)

let outputURL = URL(fileURLWithPath: "/tmp/qwen3_output.wav")
try result.audio.write(to: outputURL)
```

## Pipeline

```
text tokens ──► Prefill ──► LM Decode Loop ──► Audio Decoder ──► WAV
│ │
│ ┌────┴────┐
│ │ CB0 │ (greedy with repetition penalty)
│ │ CB1-15 │ (code predictor, temperature sampling)
│ └─────────┘
role_ids + text_ids + speaker_embed + TTS special tokens
```

### Stages

| Stage | Model | Description |
|-------|-------|-------------|
| 1. Prefill | `qwen3_tts_lm_prefill_v9` | Encodes text context → initial logits, KV cache, past hidden state |
| 2. LM Decode | `qwen3_tts_lm_decode_v10` | Autoregressive loop generating CB0 tokens (main codebook) |
| 3. Code Predictor | `qwen3_tts_cp_prefill` + `qwen3_tts_cp_decode` | Generates CB1-15 from past hidden + CB0 per step |
| 4. Audio Decoder | `qwen3_tts_decoder_10s` | Converts 16-layer codebook frames to 24 kHz waveform |
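The interaction between stages 2 and 3 can be sketched roughly as follows. This is an illustrative outline only, not the shipped implementation: `synthesizeFrames`, `lmStep`, and `residualCodes` are hypothetical names standing in for the CoreML decode and code-predictor models, and the CB0 logit processing described below is omitted for brevity.

```swift
/// Sketch of the per-step generation loop (stages 2-3).
/// `lmStep` stands in for qwen3_tts_lm_decode_v10, `residualCodes`
/// for the cp_prefill/cp_decode pair; both signatures are hypothetical.
func synthesizeFrames(
    initialLogits: [Float],
    lmStep: (Int) -> [Float],       // last CB0 token -> next CB0 logits
    residualCodes: (Int) -> [Int],  // CB0 token -> CB1-15 tokens
    maxFrames: Int = 125,
    eosToken: Int = 2150
) -> [[Int]] {
    var frames: [[Int]] = []
    var logits = initialLogits
    while frames.count < maxFrames {
        // Greedy pick of the main codebook token (repetition penalty
        // and token suppression omitted here for brevity).
        let cb0 = logits.indices.max(by: { logits[$0] < logits[$1] })!
        if cb0 == eosToken { break }
        frames.append([cb0] + residualCodes(cb0))  // 16-entry frame per step
        logits = lmStep(cb0)
    }
    return frames  // handed to the audio decoder (stage 4)
}
```

Each loop iteration yields one 16-layer codec frame; the audio decoder then turns the accumulated frames into the 24 kHz waveform.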

## Files

| File | Role |
|------|------|
| `Qwen3TtsManager.swift` | Public API — `loadIfNeeded()`, `synthesize()` |
| `Qwen3TtsSynthesizer.swift` | Core inference pipeline — prefill, decode loop, code predictor, audio decoder |
| `Qwen3TtsModelStore.swift` | Loads and stores 5 CoreML models + embeddings from `.npy` files |
| `Qwen3TtsConstants.swift` | Model dimensions, special token IDs, sampling parameters |
| `Qwen3TtsResourceDownloader.swift` | Auto-downloads models from HuggingFace |

## Sampling

CB0 (main language model) uses greedy decoding with logit processors:
- Repetition penalty (1.05) on all previously generated CB0 tokens
- Token suppression: tokens 2048-3071 masked except EOS (2150)
- `min_new_tokens`: EOS suppressed for first 2 steps
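A minimal sketch of these three logit processors, using the constant values listed above. The function name, argument shapes, and processor ordering are illustrative assumptions, not the production code:

```swift
/// Applies the CB0 logit processors in place (illustrative sketch;
/// `processCb0Logits` is a hypothetical name, not the shipped API).
func processCb0Logits(
    _ logits: inout [Float],
    history: [Int],        // previously generated CB0 tokens
    step: Int,             // 0-based generation step
    repetitionPenalty: Float = 1.05,
    eosToken: Int = 2150,
    minNewTokens: Int = 2
) {
    // Repetition penalty on every previously generated CB0 token.
    for token in Set(history) {
        let l = logits[token]
        logits[token] = l > 0 ? l / repetitionPenalty : l * repetitionPenalty
    }
    // Suppress the 2048-3071 range, keeping only EOS reachable.
    for token in 2048...3071 where token != eosToken && token < logits.count {
        logits[token] = -.infinity
    }
    // min_new_tokens: keep EOS suppressed for the first steps.
    if step < minNewTokens {
        logits[eosToken] = -.infinity
    }
}
```

After these processors run, CB0 is chosen by a plain argmax over the masked logits.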

CB1-15 (code predictor) uses temperature sampling:
- Temperature: 0.9
- Top-K: 50
- Greedy code prediction produces silent/broken audio; temperature sampling is required.
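For reference, a minimal top-K temperature sampler consistent with the parameters above (a generic sketch; the production sampler in `Qwen3TtsSynthesizer.swift` may differ in detail):

```swift
import Foundation

/// Samples one token from the top-k logits at the given temperature
/// (illustrative sketch, not the shipped implementation).
func sampleTopK(_ logits: [Float], k: Int = 50, temperature: Float = 0.9) -> Int {
    // Keep the k highest-scoring token indices.
    let top = logits.indices.sorted { logits[$0] > logits[$1] }.prefix(k)
    // Temperature-scaled softmax over the kept logits (max-subtracted
    // for numerical stability).
    let scaled = top.map { Double(logits[$0]) / Double(temperature) }
    let maxV = scaled.max() ?? 0
    let exps = scaled.map { exp($0 - maxV) }
    let total = exps.reduce(0, +)
    // Draw from the resulting distribution.
    var r = Double.random(in: 0..<total)
    for (i, e) in exps.enumerated() {
        r -= e
        if r <= 0 { return top[top.startIndex + i] }
    }
    return top.last!
}
```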

## Languages

Qwen3-TTS supports 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian.

Language IDs are embedded via the codec embedding table during prefill (e.g., English = 2050, Chinese = 2055).

## Limitations

- **No built-in tokenizer.** Text must be pre-tokenized using the Qwen3 tokenizer externally. The CLI currently supports two hardcoded test sentences.
- **Max 128 text tokens.** Longer inputs are truncated.
- **Max 125 codec frames.** Generates up to ~10 seconds of audio per call.
- **CPU+GPU compute.** Models run on `cpuAndGPU` compute units (no ANE optimization yet).

## Model Source

Models are hosted at [alexwengg/qwen3-tts-coreml](https://huggingface.co/alexwengg/qwen3-tts-coreml) on HuggingFace.

Based on [Qwen/Qwen3-TTS-12Hz-0.6B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base).
1 change: 1 addition & 0 deletions Sources/FluidAudio/DownloadUtils.swift
@@ -329,6 +329,7 @@ public class DownloadUtils {
shouldInclude =
patterns.isEmpty || patterns.contains { itemPath.hasPrefix($0) }
|| itemPath.hasSuffix(".json") || itemPath.hasSuffix(".txt")
|| itemPath.hasSuffix(".npy") || itemPath.hasSuffix(".bin")
}
if shouldInclude {
let fileSize = item["size"] as? Int ?? -1
48 changes: 48 additions & 0 deletions Sources/FluidAudio/ModelNames.swift
@@ -17,6 +17,7 @@ public enum Repo: String, CaseIterable {
case pocketTts = "FluidInference/pocket-tts-coreml"
case qwen3Asr = "FluidInference/qwen3-asr-0.6b-coreml/f32"
case qwen3AsrInt8 = "FluidInference/qwen3-asr-0.6b-coreml/int8"
case qwen3Tts = "alexwengg/qwen3-tts-coreml"

/// Repository slug (without owner)
public var name: String {
@@ -51,6 +52,8 @@
return "qwen3-asr-0.6b-coreml/f32"
case .qwen3AsrInt8:
return "qwen3-asr-0.6b-coreml/int8"
case .qwen3Tts:
return "qwen3-tts-coreml"
}
}

@@ -69,6 +72,8 @@
return "FluidInference/ls-eend-coreml"
case .qwen3Asr, .qwen3AsrInt8:
return "FluidInference/qwen3-asr-0.6b-coreml"
case .qwen3Tts:
return "alexwengg/qwen3-tts-coreml"
default:
return "FluidInference/\(name)"
}
@@ -109,6 +114,8 @@
return "ls-eend"
case .pocketTts:
return "pocket-tts"
case .qwen3Tts:
return "qwen3-tts"
default:
return name
}
@@ -423,6 +430,45 @@ public enum ModelNames {
]
}

/// Qwen3-TTS model names (LLM-based multilingual TTS)
public enum Qwen3TTS {
public static let lmPrefill = "qwen3_tts_lm_prefill_v9"
public static let lmDecode = "qwen3_tts_lm_decode_v10"
public static let cpPrefill = "qwen3_tts_cp_prefill"
public static let cpDecode = "qwen3_tts_cp_decode"
public static let audioDecoder = "qwen3_tts_decoder_10s"

public static let lmPrefillFile = lmPrefill + ".mlmodelc"
public static let lmDecodeFile = lmDecode + ".mlmodelc"
public static let cpPrefillFile = cpPrefill + ".mlmodelc"
public static let cpDecodeFile = cpDecode + ".mlmodelc"
public static let audioDecoderFile = audioDecoder + ".mlmodelc"

/// Speaker embedding file.
public static let speakerEmbeddingFile = "speaker_embedding_official.npy"

/// Code predictor embedding tables [15, 2048, 1024].
public static let cpEmbeddingsFile = "cp_embeddings.npy"

/// TTS special token embedding files.
public static let ttsBosEmbedFile = "tts_bos_embed.npy"
public static let ttsPadEmbedFile = "tts_pad_embed.npy"
public static let ttsEosEmbedFile = "tts_eos_embed.npy"

public static let requiredModels: Set<String> = [
lmPrefillFile,
lmDecodeFile,
cpPrefillFile,
cpDecodeFile,
audioDecoderFile,
speakerEmbeddingFile,
cpEmbeddingsFile,
ttsBosEmbedFile,
ttsPadEmbedFile,
ttsEosEmbedFile,
]
}

/// Multilingual G2P (CharsiuG2P ByT5) model names
public enum MultilingualG2P {
public static let encoder = "MultilingualG2PEncoder"
@@ -540,6 +586,8 @@ public enum ModelNames {
return ModelNames.LSEEND.requiredModels
case .qwen3Asr, .qwen3AsrInt8:
return ModelNames.Qwen3ASR.requiredModelsFull
case .qwen3Tts:
return ModelNames.Qwen3TTS.requiredModels
}
}
}
@@ -304,7 +304,22 @@ public struct KokoroSynthesizer {
zeroFill: true
)

// Source noise for newer Kokoro models
let maxSeconds = variant.maxDurationSeconds
let noiseLength = TtsConstants.audioSampleRate * maxSeconds
let sourceNoise = try await multiArrayPool.rent(
shape: [1, noiseLength, 9],
dataType: .float16,
zeroFill: false
)
let noisePointer = sourceNoise.dataPointer.bindMemory(to: UInt16.self, capacity: noiseLength * 9)
for i in 0..<(noiseLength * 9) {
let randomValue = Float.random(in: -1...1)
noisePointer[i] = Float16(randomValue).bitPattern
}

func recycleModelArrays() async {
await multiArrayPool.recycle(sourceNoise, zeroFill: false)
await multiArrayPool.recycle(phasesArray, zeroFill: true)
await multiArrayPool.recycle(attentionMask, zeroFill: false)
await multiArrayPool.recycle(inputArray, zeroFill: false)
@@ -338,6 +353,7 @@
"attention_mask": attentionMask,
"ref_s": refStyle,
"random_phases": phasesArray,
"source_noise": sourceNoise,
])

let predictionStart = Date()
65 changes: 65 additions & 0 deletions Sources/FluidAudio/TTS/Qwen3TTS/Qwen3TtsConstants.swift
@@ -0,0 +1,65 @@
import Foundation

/// Constants for the Qwen3-TTS language model TTS backend.
public enum Qwen3TtsConstants {

// MARK: - Audio

public static let audioSampleRate: Int = 24_000

// MARK: - Model dimensions

public static let hiddenSize: Int = 1024
public static let numHeads: Int = 16
public static let numKvHeads: Int = 8
public static let headDim: Int = 128
public static let numLayers: Int = 28
public static let vocabSize: Int = 152064
public static let numCodebooks: Int = 16
public static let numCodeGroups: Int = 16

// MARK: - Special token IDs

public static let ttsBosTokenId: Int = 151672
public static let ttsPadTokenId: Int = 151671
public static let ttsEosTokenId: Int = 151673
public static let codecBosTokenId: Int = 2149
public static let codecEosTokenId: Int = 2150
public static let codecPadTokenId: Int = 2050

// MARK: - Language IDs

public static let languageEnglish: Int = 2050
public static let languageChinese: Int = 2055

// MARK: - Role prefix tokens

public static let rolePrefixTokens: [Int] = [151644, 77091, 198]

// MARK: - Generation parameters

public static let maxTextLength: Int = 128
public static let maxCodecTokens: Int = 125

/// CB0 (outer LM) sampling parameters
public static let temperature: Float = 0.7
public static let topK: Int = 30
public static let repetitionPenalty: Float = 1.05
public static let minNewTokens: Int = 2

/// CB1-15 (code predictor) sampling parameters
public static let cpTemperature: Float = 0.9
public static let cpTopK: Int = 50

// MARK: - KV cache

/// Maximum KV cache length (prefill + generated tokens)
public static let maxKvLength: Int = 200

/// Number of KV cache entries (2 per layer: key + value)
public static let kvCacheEntries: Int = 56

// MARK: - Default voice

public static let defaultVoice: String = "default"
}