feat: add Qwen3-TTS backend for multilingual text-to-speech #290
Alex-Wengg wants to merge 11 commits into main from
Conversation
Speaker Diarization Benchmark Results

Speaker Diarization Performance: evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown: time spent in each stage of speaker diarization
Speaker Diarization Research Comparison: research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 49.8s diarization time • Test runtime: 4m 21s • 03/22/2026, 01:04 AM EST
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed

Performance Metrics
Streaming Metrics
Test runtime: 0m17s • 03/22/2026, 12:50 AM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O
Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode): optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown: time spent in each stage of batch diarization
Speaker Diarization Research Comparison: offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 261.8s processing • Test runtime: 4m 36s • 03/22/2026, 01:06 AM EST
ASR Benchmark Results ✅
Status: All benchmarks passed

Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 6m53s • 03/22/2026, 01:04 AM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Expected RTFx Performance on Physical M1 Hardware:
• M1 Mac: ~28x (clean), ~25x (other)
Testing methodology follows HuggingFace Open ASR Leaderboard
Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 3m 24s • 2026-03-22T04:54:08.284Z
VAD Benchmark Results

Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
Add CoreML-based Qwen3-TTS inference pipeline supporting English and Chinese synthesis. The pipeline implements prefill → LM decode (CB0) → code predictor (CB1-15) → audio decoder with temperature+top_k sampling for natural speech generation and proper EOS detection.

Key components:
- Qwen3TtsSynthesizer: Full inference pipeline with KV-cache management, 16-codebook generation, and automatic silence trimming
- Qwen3TtsModelStore: CoreML model loading for prefill, decode, code predictor, and audio decoder models
- Qwen3TtsManager: High-level API for model loading and synthesis
- Qwen3TtsConstants: Model dimensions, special tokens, and generation parameters matching the PyTorch reference implementation
- CLI support via --backend qwen3 flag with bilingual test sentences
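The CB0 decode step above ends with temperature + top_k sampling over the language model's logits. A minimal sketch of that sampling step, assuming a plain `[Float]` logits vector; the function name and parameter values are illustrative, not the actual Qwen3TtsSynthesizer API:

```swift
import Foundation

// Sketch of temperature + top_k sampling over raw logits. Illustrative only;
// not the real synthesizer API.
func sampleTopK(logits: [Float], temperature: Float, k: Int) -> Int {
    // 1. Scale logits by temperature (lower temperature => sharper distribution).
    let scaled = logits.map { $0 / temperature }
    // 2. Keep only the k highest-scoring token indices.
    let topK = Array(
        scaled.enumerated()
            .sorted { $0.element > $1.element }
            .prefix(k))
    guard let maxLogit = topK.first?.element else { return 0 }
    // 3. Softmax over the retained logits (shift by the max for stability).
    let exps = topK.map { exp(Double($0.element - maxLogit)) }
    let total = exps.reduce(0, +)
    // 4. Draw one index from the resulting categorical distribution.
    var r = Double.random(in: 0..<total)
    for (entry, e) in zip(topK, exps) {
        r -= e
        if r <= 0 { return entry.offset }
    }
    return topK[topK.count - 1].offset
}
```

With `k == 1` this degenerates to greedy decoding, which is why the later "clean up unused CB0 sampling code" commit can drop the sampler once CB0 switches to greedy.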
Add automatic model download from the alexwengg/qwen3-tts-coreml repo, matching the PocketTTS download pattern. Models are cached locally at ~/.cache/fluidaudio/Models/qwen3-tts/.

Changes:
- Add qwen3Tts repo to ModelNames.swift with model file definitions
- Add Qwen3TtsResourceDownloader for HuggingFace auto-download
- Update Qwen3TtsModelStore to use mlmodelc bundles and support both auto-download (loadIfNeeded) and local directory loading
- Add Qwen3TtsManager.initialize() for the auto-download workflow
- Update CLI to auto-download by default (QWEN3_TTS_MODEL_DIR env var still supported for local override)
- Add repetition_penalty=1.3 matching PyTorch default
- Penalize last 20 CB0 tokens to prevent repetitive loops
- Fix Chinese TTS producing silent audio
- Adjust temperature (0.7) and topK (30) for cleaner output
- Add audio post-processing with de-essing
- Document issues and fixes in docs/qwen3-tts-coreml-issues.md

Before: CB0 stuck at same values, only 27/125 unique, Chinese silent
After: 98% unique CB0, natural EOS, both EN/ZH transcribe correctly

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
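The windowed repetition penalty described in this commit can be sketched as below; this assumes the HuggingFace convention (divide positive logits, multiply negative ones) and a hypothetical function name:

```swift
// Sketch of a repetition penalty applied only to the last `window` generated
// CB0 tokens (the commit uses penalty = 1.3, window = 20). Illustrative only.
func applyRepetitionPenalty(
    logits: inout [Float], history: [Int], penalty: Float, window: Int
) {
    for token in Set(history.suffix(window)) where logits.indices.contains(token) {
        // HF convention: divide positive logits, multiply negative ones,
        // so the penalty always reduces the token's probability.
        logits[token] = logits[token] > 0 ? logits[token] / penalty : logits[token] * penalty
    }
}
```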
- CB0: repetition_penalty 1.3→1.05 on ALL prior tokens (was last 20)
- CB0: add min_new_tokens=2 (suppress EOS for first 2 steps)
- CB0: fix processing order to match transformers _get_logits_processor (rep_penalty → suppress → min_new_tokens → temp → top_k)
- CP: temperature 0.7→0.9, topK 30→50 (matches PyTorch CP generate)
- Disable audio post-processing (de-essing was muffling output)
- Add codebook dump for debugging comparison with Python pipeline

Python CoreML pipeline verified byte-for-byte identical to PyTorch with these params. Swift pipeline untested with new params.

Co-Authored-By: Claude <noreply@anthropic.com>
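The processor ordering above (rep_penalty → suppress → min_new_tokens → temp → top_k) can be sketched as a composable chain; the `LogitsProcessor` type and stage names below are illustrative sketches, not the transformers API:

```swift
// Each stage maps logits to logits, and stages compose left to right,
// mirroring how transformers applies its logits-processor list in order.
typealias LogitsProcessor = ([Float]) -> [Float]

func chained(_ stages: [LogitsProcessor]) -> LogitsProcessor {
    { logits in stages.reduce(logits) { acc, stage in stage(acc) } }
}

// Example stage: mask EOS while fewer than minNewTokens steps have run,
// mirroring the min_new_tokens=2 behavior from the commit.
func minNewTokensMask(eosId: Int, step: Int, minNewTokens: Int) -> LogitsProcessor {
    { logits in
        guard step < minNewTokens, logits.indices.contains(eosId) else { return logits }
        var out = logits
        out[eosId] = -.infinity   // EOS cannot be sampled yet
        return out
    }
}
```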
FluidAudioTTS was renamed to FluidAudioEspeak on main. Move Qwen3TTS files to the new module location so the package builds correctly.
Amazing @Alex-Wengg! What's blocking the completion of this? :)
It's not fully developed to a satisfactory level yet.
Resolve merge conflicts in ModelNames.swift by merging both:
- Qwen3-TTS support from feature branch
- Qwen3 ASR Int8 + G2P models from main
Qwen3TTS files were in Sources/FluidAudioEspeak/ which was never declared as a target in Package.swift, causing TTSCommand.swift to fail with "cannot find Qwen3TtsManager in scope". Move files into Sources/FluidAudio/TTS/Qwen3TTS/ and remove self-imports.
```swift
// 3. Run greedy decode loop to generate all 16 codebooks per step
let decodeStart = Date()
let actualPrefillLen = textTokens.count + 11  // role(3) + text + think(7) + speaker(1)
```
🔴 Decode loop startPosition not capped at maxTextLength, mismatches trimmed KV cache
When textTokens.count > Qwen3TtsConstants.maxTextLength (128), the actualPrefillLen at line 97 is computed as textTokens.count + 11 (uncapped), but the KV cache is trimmed to min(textTokens.count, 128) + 11 at Qwen3TtsSynthesizer.swift:222. The createTextInputs function (Qwen3TtsSynthesizer.swift:444) also caps the actual text length to 128. This means the decode loop's startPosition will exceed the KV cache length, causing the rotary position embeddings in the decode model to be computed at incorrect positions, producing garbled output for any input with more than 128 tokens.
Suggested change:

```diff
- let actualPrefillLen = textTokens.count + 11 // role(3) + text + think(7) + speaker(1)
+ let actualPrefillLen = min(textTokens.count, Qwen3TtsConstants.maxTextLength) + 11 // role(3) + text + think(7) + speaker(1)
```
```swift
// DEBUG: Dump codebooks for comparison with PyTorch
do {
    let dumpPath = "/tmp/swift_codebooks.txt"
    var lines: [String] = ["# Swift CoreML codebooks: \(allCodebooks.count) frames x 16 codebooks"]
    for (t, frame) in allCodebooks.enumerated() {
        lines.append("frame \(t): \(frame)")
    }
    try lines.joined(separator: "\n").write(toFile: dumpPath, atomically: true, encoding: .utf8)
    logger.info("Dumped codebooks to \(dumpPath)")
} catch {
    logger.warning("Failed to dump codebooks: \(error)")
}
```
🔴 Debug file dump to /tmp left in production synthesis path
Lines 111-122 write codebook data to /tmp/swift_codebooks.txt on every call to synthesize(). This is debug code that should not be in production: it performs unnecessary file I/O on every synthesis, writes potentially sensitive data to a world-readable temp directory, and the surrounding do/catch silently swallows errors. Per the repo rules (CLAUDE.md), logging should use AppLogger — not file writes.
Suggested change (remove the debug dump):

```diff
- // DEBUG: Dump codebooks for comparison with PyTorch
- do {
-     let dumpPath = "/tmp/swift_codebooks.txt"
-     var lines: [String] = ["# Swift CoreML codebooks: \(allCodebooks.count) frames x 16 codebooks"]
-     for (t, frame) in allCodebooks.enumerated() {
-         lines.append("frame \(t): \(frame)")
-     }
-     try lines.joined(separator: "\n").write(toFile: dumpPath, atomically: true, encoding: .utf8)
-     logger.info("Dumped codebooks to \(dumpPath)")
- } catch {
-     logger.warning("Failed to dump codebooks: \(error)")
- }
```
```swift
guard data.count >= 10 else {
    throw TTSError.processingFailed("Invalid NPY file: too small")
```
🔴 NPY v2 header parsing accesses out-of-bounds indices
The minimum size guard at Qwen3TtsModelStore.swift:210 only checks data.count >= 10, but for NPY version 2+ files, line 230 reads data[10] and data[11], which requires at least 12 bytes. A truncated or corrupt v2 NPY file with 10 or 11 bytes would pass the guard but crash with an out-of-bounds access.
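A bounds-checked version of this read could look like the sketch below, following the NPY format spec (version 1 stores a 2-byte little-endian header length at offset 8, version 2+ a 4-byte one); the function and error names are hypothetical, not the actual Qwen3TtsModelStore code:

```swift
import Foundation

// Illustrative error type; the real code throws TTSError.
enum NpyParseError: Error { case truncated }

// Bounds-checked NPY header-length read: v1 needs >= 10 bytes, v2+ needs >= 12.
func npyHeaderLength(_ data: Data, majorVersion: UInt8) throws -> Int {
    if majorVersion >= 2 {
        guard data.count >= 12 else { throw NpyParseError.truncated }
        // 4-byte little-endian header length at offset 8.
        return Int(data[8]) | (Int(data[9]) << 8) | (Int(data[10]) << 16) | (Int(data[11]) << 24)
    } else {
        guard data.count >= 10 else { throw NpyParseError.truncated }
        // 2-byte little-endian header length at offset 8.
        return Int(data[8]) | (Int(data[9]) << 8)
    }
}
```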
```swift
let embedArray = try createEmbeddingFromTable(
    cpEmbeddings: cpEmbeddings,
    tableIndex: step - 1,
    tokenId: tokens.last!
```
🔴 Force unwrap tokens.last! violates repo rule against force unwrapping in production
AGENTS.md mandates: "no force unwrapping in production." Line 354 uses tokens.last!. While tokens is guaranteed non-empty in this context (initialized with [cb1] at line 345 and only appended to), this still violates the explicit repository rule.
Suggested change:

```diff
- tokenId: tokens.last!
+ tokenId: tokens[tokens.count - 1]
```
```swift
///
/// NOTE: This implementation requires pre-tokenized input. The text must be
/// tokenized using the Qwen3 tokenizer externally (e.g., in Python).
public actor Qwen3TtsManager {
```
🔴 No unit tests for new Qwen3-TTS code violates mandatory repo rule
AGENTS.md mandates: "Add unit tests when writing new code." This PR adds 5 new Swift files (Qwen3TtsConstants, Qwen3TtsManager, Qwen3TtsModelStore, Qwen3TtsResourceDownloader, Qwen3TtsSynthesizer) with no corresponding test files. The Tests/ directory has no Qwen3-TTS test coverage.
PocketTTS Smoke Test ✅
Runtime: 0m33s
Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality may differ from Apple Silicon.
Qwen3-ASR int8 Smoke Test ✅
Runtime: 3m12s
Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
English synthesis was robotic due to incorrect hardcoded token IDs (verified correct tokens from mobius test files). Chinese audio had ~3.5 seconds of leading silence from conservative trimming thresholds.

Changes:
- Fix English token IDs: corrected 3 tokens (311,8806 → 4686,1331,39586)
- More aggressive silence trimming: threshold 0.02→0.005, window 20ms→10ms
- Clean up unused CB0 sampling code (already using greedy decoding)

Both languages now produce natural speech with no leading silence.
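The more aggressive trim described above can be sketched as a windowed RMS scan over the synthesized waveform; the function name and default parameter values below are illustrative, not the actual synthesizer code:

```swift
// Drop everything before the first short window whose RMS energy crosses the
// threshold (the commit lowers the threshold 0.02 -> 0.005 and the window
// 20 ms -> 10 ms). Illustrative sketch only.
func trimLeadingSilence(
    _ samples: [Float], sampleRate: Int, threshold: Float = 0.005, windowMs: Int = 10
) -> [Float] {
    let window = max(1, sampleRate * windowMs / 1000)
    var start = 0
    while start + window <= samples.count {
        let chunk = samples[start..<start + window]
        // Root-mean-square energy of this window.
        let rms = (chunk.reduce(Float(0)) { $0 + $1 * $1 } / Float(window)).squareRoot()
        if rms >= threshold { break }  // first non-silent window found
        start += window
    }
    return Array(samples[start...])
}
```

Lowering the threshold makes quiet speech onsets count as non-silent sooner, and the shorter window reduces how much audio is cut past the true onset.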
Newer Kokoro CoreML models require a source_noise feature that wasn't being provided, causing CI failures with "Feature source_noise is required but not specified" errors.

Changes:
- Add source_noise tensor [1, sampleRate*duration, 9] with random Float16 values
- Update both synthesis pipeline and warm-up prediction
- Size adapts to model variant: 5s (120k samples) or 15s (360k samples)
- Use multiarray pooling for memory efficiency

Fixes #290 CI test-tts workflow failure.
Summary
New files
- Qwen3TtsSynthesizer.swift — Full inference pipeline: KV-cache prefill, CB0 sampling with EOS masking, CB1-15 code prediction, audio decoding, and silence trimming
- Qwen3TtsModelStore.swift — CoreML model loading for prefill, decode, code predictor, and audio decoder
- Qwen3TtsManager.swift — High-level API for model loading and synthesis
- Qwen3TtsConstants.swift — Model dimensions, special token IDs, and generation parameters

Modified files
- TtsBackend.swift — Add qwen3Tts case
- TTSCommand.swift — CLI support via --backend qwen3 with bilingual test sentences

Validation
Test plan
- swift build
- swift run fluidaudio tts --backend qwen3 "Hello world, this is a test of the text to speech system."
- swift run fluidaudio tts --backend qwen3 "你好世界，这是一个文字转语音系统的测试。"

🤖 Generated with Claude Code