Conversation
Claude finished @Alex-Wengg's task: KittenTTS Backend PR Review ✅

I've completed a comprehensive review of this KittenTTS integration PR and submitted my feedback through GitHub's review system.

Final Assessment

✅ APPROVED - This is a high-quality implementation that adds KittenTTS as a third TTS backend alongside Kokoro and PocketTTS.

Key Strengths

Minor Considerations Noted

The PR adds valuable functionality without technical debt or breaking changes. The single-shot StyleTTS2 approach with two model variants (15M Nano, 82M Mini) provides good options for different use cases.
Overall Assessment
This is a well-architected and high-quality implementation that adds KittenTTS as a third TTS backend. The code follows FluidAudio's established patterns and maintains consistency with existing TTS backends.
Strengths
🏗️ Architecture
- Clean separation of concerns with proper actor-based concurrency
- Smart reuse of Kokoro's G2P pipeline (no espeak dependency)
- Efficient model selection based on token count (5s vs 10s variants)
🚀 Performance
- Strong benchmarks: 1.12x RTFx (Nano), 3.13x RTFx (Mini)
- CPU+GPU configuration avoids ANE float16 precision artifacts
- Single-shot synthesis for low latency
🎛️ API Design
- Consistent interface matching Kokoro/PocketTTS patterns
- Speed control for Mini variant (nice feature addition)
- Comprehensive error handling with descriptive messages
✅ Code Quality
- Proper Swift documentation with examples
- 21 unit tests covering tokenizer and manager
- Follows project's coding standards and patterns
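The token-count-based model selection praised above can be sketched in isolation. This is a hypothetical illustration, not the PR's actual code; the 70- and 140-token limits are taken from the truncation discussion elsewhere in this thread, and all names here are invented:

```swift
import Foundation

// Hypothetical sketch of variant selection by token count.
// The 70/140 limits correspond to the 5s and 10s model variants
// mentioned in this review thread; names are illustrative only.
enum KittenModelLength {
    case fiveSecond   // up to 70 tokens
    case tenSecond    // up to 140 tokens

    var maxTokens: Int {
        switch self {
        case .fiveSecond: return 70
        case .tenSecond: return 140
        }
    }
}

func selectModelLength(tokenCount: Int) -> KittenModelLength? {
    if tokenCount <= KittenModelLength.fiveSecond.maxTokens {
        return .fiveSecond
    }
    if tokenCount <= KittenModelLength.tenSecond.maxTokens {
        return .tenSecond
    }
    return nil  // caller must chunk or reject longer inputs
}
```

Returning `nil` above the 140-token ceiling forces the caller to handle over-long input explicitly rather than truncating it.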
Minor Considerations
Logging Subsystem:
KittenTtsModelStore.swift:10 - Consider using com.fluidaudio.tts.kittentts instead of com.fluidaudio.tts for better log filtering
Error Handling:
Speed parameter validation could be added for Mini variant to prevent extreme values
CLI Integration:
The --speed flag is nicely integrated but only applies to Mini - this is clearly documented
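The suggested speed validation for the Mini variant could look like the sketch below. The valid range (0.5x to 2.0x) and the names are assumptions for illustration; the PR does not specify a range:

```swift
import Foundation

// Hypothetical speed validation for the Mini variant.
// The 0.5...2.0 range is an assumed example, not from the PR.
enum KittenTtsError: Error {
    case invalidSpeed(Float)
}

func validatedSpeed(
    _ speed: Float,
    range: ClosedRange<Float> = 0.5...2.0
) throws -> Float {
    // Reject extreme values instead of passing them to the model.
    guard range.contains(speed) else {
        throw KittenTtsError.invalidSpeed(speed)
    }
    return speed
}
```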
Recommendation
✅ APPROVE - This PR is ready to merge. The implementation is solid, well-tested, and follows all established patterns. It adds valuable functionality without introducing technical debt or breaking changes.
```swift
wordToPhonemes: lexicons.word,
caseSensitiveLexicon: lexicons.caseSensitive,
customLexicon: nil,
targetTokens: 500,
```
🔴 Long text silently truncated: phonemize flattens all chunks but inference drops tokens beyond maxTokens
The phonemize function at Sources/FluidAudio/TTS/KittenTTS/Pipeline/KittenTtsSynthesizer.swift:193-210 uses KokoroChunker.chunk with targetTokens: 70 to split text into chunks, then flattens ALL chunk phonemes into a single [String] array. For any text that produces multiple chunks (e.g., a paragraph), the flattened phoneme count can easily exceed 140 tokens.

However, the inference functions runNanoInference and runMiniInference allocate a fixed-size input of n = maxTokens (70 or 140) and silently drop all tokens beyond that limit via inputIdsPtr[i] = i < tokenIds.count ? tokenIds[i] : padTokenId (lines 236, 303). This means the user gets truncated audio with no error or warning.

By contrast, Kokoro's synthesizer (KokoroSynthesizer.swift:499-508) correctly synthesizes each chunk independently and concatenates the results. KittenTTS should either synthesize each chunk separately and concatenate, or reject/warn when the token count exceeds the model's capacity.
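The chunk-wise fix can be sketched in isolation. All names here are hypothetical; real code would invoke the CoreML inference call per chunk instead of the injected closure:

```swift
import Foundation

// Hypothetical sketch of the suggested fix: instead of flattening all
// chunk phonemes into one array (which a fixed-size model input then
// silently truncates at maxTokens), run inference per chunk and
// concatenate the resulting audio samples.
func synthesizeChunked(
    chunks: [[Int]],                   // token IDs per chunk, from the chunker
    maxTokens: Int,                    // 70 (5s) or 140 (10s) variant limit
    runInference: ([Int]) -> [Float]   // stand-in for the per-chunk model call
) throws -> [Float] {
    var audio: [Float] = []
    for chunk in chunks {
        // Fail loudly rather than silently dropping tokens.
        guard chunk.count <= maxTokens else {
            throw NSError(domain: "KittenTts", code: 1, userInfo: [
                NSLocalizedDescriptionKey:
                    "Chunk of \(chunk.count) tokens exceeds model limit \(maxTokens)"
            ])
        }
        audio.append(contentsOf: runInference(chunk))
    }
    return audio
}
```

This mirrors the per-chunk synthesize-and-concatenate approach the review attributes to KokoroSynthesizer, while the guard covers the reject/warn alternative for oversized chunks.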
Qwen3-ASR int8 Smoke Test ✅
Runtime: 3m8s

Note: CI VM lacks physical GPU; CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
VAD Benchmark Results

Performance Comparison

Dataset Details

✅: Average F1-Score above 70%
Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Sortformer High-Latency • ES2004a • Runtime: 5m 9s • 2026-03-22T17:10:17.191Z
Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)
Optimal clustering with Hungarian algorithm for maximum accuracy

Offline VBx Pipeline Timing Breakdown
Time spent in each stage of batch diarization

Speaker Diarization Research Comparison
Offline VBx achieves competitive accuracy with batch processing

Pipeline Details:

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 238.6s processing • Test runtime: 4m 9s • 03/22/2026, 01:07 PM EST
PocketTTS Smoke Test ✅
Runtime: 0m24s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU; audio quality may differ from Apple Silicon.
Parakeet EOU Benchmark Results ✅

Status: Benchmark passed

Performance Metrics

Streaming Metrics

Test runtime: 0m17s • 03/22/2026, 12:56 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O
Speaker Diarization Benchmark Results

Speaker Diarization Performance
Evaluating "who spoke when" detection accuracy

Diarization Pipeline Timing Breakdown
Time spent in each stage of speaker diarization

Speaker Diarization Research Comparison
Research baselines typically achieve 18-30% DER on standard datasets

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 42.8s diarization time • Test runtime: 5m 33s • 03/22/2026, 01:10 PM EST
ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

Parakeet v2 (English-optimized)

Streaming (v3)

Streaming (v2)

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 6m19s • 03/22/2026, 01:00 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time

Expected RTFx Performance on Physical M1 Hardware:
• M1 Mac: ~28x (clean), ~25x (other)

Testing methodology follows HuggingFace Open ASR Leaderboard
Hi, I tested it, and it failed with the following:

Building for debugging...
KittenTTS reuses Kokoro's G2P pipeline for phonemization, which requires us_lexicon_cache.json. The loadSimplePhonemeDictionary() method was attempting to load the cache without first downloading it, causing a "Missing lexicon cache" error on first use.

Changes:
- Add TtsResourceDownloader.ensureLexiconFile() call before loading cache
- Auto-downloads us_lexicon_cache.json from HuggingFace on first use
- Add kitten-tts-test.yml workflow to verify both Nano/Mini variants

Fixes issue reported by @Josscii in PR #409 comment
Force-pushed a700ec3 to 9ca3c3f
KittenTTS Smoke Test

Test Results

Dependencies

Note: KittenTTS reuses Kokoro's G2P pipeline for phonemization. This test verifies the lexicon cache auto-downloads correctly and both Nano/Mini variants can synthesize audio.
Force-pushed c4806b0 to 9ca3c3f
## Summary

Fixes CI failure in `test-tts` workflow caused by missing `source_noise` input after PR #411 merged.

PR #411 (Kokoro ANE optimization) updated the Kokoro CoreML models to fp16, which introduced a new required input `source_noise` that the inference code wasn't providing.

## Changes

- Add `source_noise` tensor [1, sampleRate*duration, 9] with random Float16 values
- Update both synthesis pipeline and warm-up prediction
- Size adapts to model variant: 5s (120k samples) or 15s (360k samples)
- Use multiarray pooling for memory efficiency

## Error Fixed

```
Feature source_noise is required but not specified.
```

## Test Plan

- [x] Cherry-picked from commit c8a5056 (originally on feature/qwen3-tts-coreml)
- [ ] CI `test-tts` workflow should pass
- [ ] Verify Kokoro TTS synthesis completes successfully

Fixes the CI failure blocking PR #409 and other PRs.
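Setting aside the MLMultiArray plumbing, the source_noise sizing described in this PR can be sketched in plain Swift. The 24 kHz sample rate is inferred from the stated 5s/120k and 15s/360k sample figures; everything else here is illustrative:

```swift
import Foundation

// Sketch of the source_noise tensor sizing from the PR description:
// shape [1, sampleRate * duration, 9] filled with random values.
// The real code fills a Float16 MLMultiArray; this shows only the
// shape math and the random fill, with an inferred 24 kHz rate.
let sampleRate = 24_000  // inferred from 5s -> 120k samples

func sourceNoiseShape(durationSeconds: Int) -> [Int] {
    [1, sampleRate * durationSeconds, 9]
}

func makeSourceNoise(durationSeconds: Int) -> [Float] {
    let count = sourceNoiseShape(durationSeconds: durationSeconds)
        .reduce(1, *)
    return (0..<count).map { _ in Float.random(in: -1...1) }
}
```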
Force-pushed 9ca3c3f to 817ed87
Another issue I found is that some simple words are spoken weirdly. For example, the test phrase
Force-pushed f6c4521 to 817ed87
Could you give me the wav file for examination, and which model type? Mini has more parameters than Nano.
Add 'kitten' backend option that defaults to Mini (82M params) instead of requiring explicit 'kitten-mini' flag. Users can still use 'kitten-nano' for the smaller 15M model.

Rationale:
- Mini has better quality (3.13x RTF vs 1.12x for Nano)
- Mini supports speed control, Nano does not
- 82M is still relatively small and runs well on Apple Silicon

Changes:
- Add 'kitten' and 'kittentts' backend options → .kittenTts(.mini)
- Update help text to show 'kitten (Mini 82M)' option
- KittenTtsManager already defaults to .mini in its initializer
Force-pushed 817ed87 to 762a3de
I'm using the following text:

> Hello world. One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp.
Fixes four issues identified in PR #409 review:

1. Token truncation: Reduce targetTokens from 500 to 70
   - KittenTTS models support max 70 tokens (5s) or 140 tokens (10s)
   - Using 500 caused silent audio cutoff for longer inputs
   - Now uses conservative 70 token limit to fit all variants

2. Missing exit code: Add exit(1) on synthesis failure
   - runKittenTts() was logging errors but not exiting
   - CI smoke tests were reporting PASSED even on failures
   - Now properly exits with code 1 on error

3. Cache path mismatch: Fix CI workflow cache path
   - Workflow specified 'kittentts' but models store under 'kittentts-coreml'
   - Prevented effective caching across CI runs
   - Updated to correct path: ~/.cache/fluidaudio/Models/kittentts-coreml

4. Code style: Replace nested if-statements with guard
   - tokenize() used nested if-statements violating project guidelines
   - Replaced with early-exit guard statements per style guide
   - Cleaner control flow, consistent with codebase patterns

Addresses feedback from Devin review comment #4106567814
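The guard refactor described in item 4 follows the pattern below. This is a generic sketch of the nested-if to guard conversion, not the actual tokenize() body; the vocabulary lookup and limit check are stand-ins:

```swift
import Foundation

// Generic sketch of the nested-if -> guard refactor: each precondition
// becomes an early exit, keeping the happy path unindented.
// Signature and logic are illustrative, not the real tokenize().
func tokenize(
    _ phonemes: [String],
    vocabulary: [String: Int],
    maxTokens: Int
) -> [Int]? {
    guard !phonemes.isEmpty else { return nil }
    guard phonemes.count <= maxTokens else { return nil }

    var ids: [Int] = []
    for phoneme in phonemes {
        // Unknown phonemes fail the whole tokenization.
        guard let id = vocabulary[phoneme] else { return nil }
        ids.append(id)
    }
    return ids
}
```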
Summary

- Models download from alexwengg/kittentts-coreml on first use
- Voices: expr-voice-{2,3,4,5}-{m,f}

Usage

Benchmarks (M2, warm cache, longer text)

New files

Modified files

- ModelNames.swift - repos, filenames, voices for KittenTTS
- TtsBackend.swift - .kittenTts(KittenTtsVariant) case
- TTSCommand.swift - CLI dispatch + --speed flag

Closes #49 (requested by @Josscii)