Fix missing source_noise input in Kokoro TTS models #412
Newer Kokoro CoreML models require a source_noise feature that wasn't being provided, causing CI failures with "Feature source_noise is required but not specified" errors.
Changes:
- Add source_noise tensor [1, sampleRate*duration, 9] with random Float16 values
- Update both synthesis pipeline and warm-up prediction
- Size adapts to model variant: 5s (120k samples) or 15s (360k samples)
- Use multiarray pooling for memory efficiency
Fixes #290 CI test-tts workflow failure.
Claude finished @Alex-Wengg's task.
Code Review Complete ✅
I've successfully reviewed PR #412 that fixes the missing source_noise input.
Summary
✅ Approval: This PR correctly fixes the CI failure by adding the required source_noise input.
The change addresses the immediate blocker and follows existing patterns in the codebase. Performance improvements and code deduplication can be addressed in follow-up work.
Code Review Summary
This PR correctly addresses the missing source_noise input that was causing CI failures after the Kokoro fp16 model updates. The implementation is functionally sound but has several areas for improvement:
✅ Positives:
- Fixes the immediate issue: Adds required source_noise input to prevent CoreML errors
- Proper memory management: Uses multiarray pooling and correctly recycles arrays
- Adaptive sizing: Correctly adapts tensor size based on model variant (5s vs 15s)
- Consistent implementation: Applied to both synthesis pipeline and warm-up prediction
⚠️ Issues Identified:
1. Performance Concern (KokoroSynthesizer.swift:315, TtsModels.swift:162)
The random noise generation loop could be expensive for larger models:
- 15s variant creates 360k × 9 = 3.24M Float16 values
- Current approach uses pointer arithmetic in a tight loop
- Suggestion: Use vectorized operations with vDSP for better performance
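The review does not say which Accelerate routine it has in mind, so the sketch below takes a different route and does not use vDSP. It keeps the same per-element Float16 write as the PR but swaps the system generator for a small seeded xorshift64* generator. The type XorShift64Star and the function fillNoise are hypothetical names, not part of FluidAudio, and any speed-up is an assumption that needs measuring on the target device.

```swift
// Hypothetical helper, not part of FluidAudio: a seeded xorshift64* generator.
struct XorShift64Star: RandomNumberGenerator {
    private var state: UInt64

    init(seed: UInt64) {
        // xorshift state must never be zero, so substitute a fixed non-zero value.
        state = seed == 0 ? 0x9E37_79B9_7F4A_7C15 : seed
    }

    mutating func next() -> UInt64 {
        state ^= state >> 12
        state ^= state << 25
        state ^= state >> 27
        return state &* 0x2545_F491_4F6C_DD1D
    }
}

// Fills a buffer of Float16 bit patterns with noise in -1...1.
// The buffer stands in for the memory behind the source_noise array.
func fillNoise(_ buffer: UnsafeMutableBufferPointer<UInt16>, seed: UInt64) {
    var rng = XorShift64Star(seed: seed)
    for i in buffer.indices {
        let value = Float.random(in: -1...1, using: &rng)
        buffer[i] = Float16(value).bitPattern
    }
}

// Usage: fill a small stand-in buffer.
var storage = [UInt16](repeating: 0, count: 1_000)
storage.withUnsafeMutableBufferPointer { fillNoise($0, seed: 42) }
```

A fixed seed also makes the noise reproducible between runs, which may or may not be desirable for synthesis; that choice is outside what the PR discusses.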
2. Code Duplication
Identical noise generation logic appears in both files:
- KokoroSynthesizer.swift (lines 307-319)
- TtsModels.swift (lines 155-166)
- Suggestion: Extract to a shared utility function in TtsConstants or similar
3. Minor: Magic Numbers
The tensor shape [1, noiseLength, 9] uses hardcoded 9 - consider making this a named constant for clarity.
🔧 Recommended Improvements:
// In TtsConstants or similar shared location
static func generateSourceNoise(variant: ModelNames.TTS.Variant, pool: MultiArrayPool? = nil) async throws -> MLMultiArray {
    let maxSeconds = variant.maxDurationSeconds
    let noiseLength = TtsConstants.audioSampleRate * maxSeconds
    let channels = 9 // Named constant
    let sourceNoise = if let pool = pool {
        try await pool.rent(shape: [1, noiseLength, channels], dataType: .float16, zeroFill: false)
    } else {
        try MLMultiArray(shape: [1, NSNumber(value: noiseLength), NSNumber(value: channels)], dataType: .float16)
    }
    // Use vDSP for better performance on large arrays
    // ... vectorized implementation
    return sourceNoise
}
✅ Approval Status:
This change is safe to merge as it fixes a critical CI blocker. The performance optimizations can be addressed in a follow-up PR.
// Source noise for newer Kokoro models
let maxSeconds = variant.maxDurationSeconds
let noiseLength = TtsConstants.audioSampleRate * maxSeconds
let sourceNoise = try await multiArrayPool.rent(
    shape: [1, noiseLength, 9],
    dataType: .float16,
    zeroFill: false
)
let noisePointer = sourceNoise.dataPointer.bindMemory(to: UInt16.self, capacity: noiseLength * 9)
for i in 0..<(noiseLength * 9) {
    let randomValue = Float.random(in: -1...1)
    noisePointer[i] = Float16(randomValue).bitPattern
}
🟡 Missing pool preallocation for sourceNoise arrays
In synthesizeDetailed, the multiArrayPool is preallocated for phasesArray, inputArray/attentionMask, and refStyle (lines 533–554), but the new sourceNoise array is not preallocated. Since synthesizeChunk is called concurrently via withThrowingTaskGroup at KokoroSynthesizer.swift:564, each concurrent task will individually allocate a fresh sourceNoise MLMultiArray through the actor-serialized rent() call instead of reusing preallocated pooled arrays. This is especially impactful because sourceNoise is by far the largest array in the inference pipeline — shape [1, noiseLength, 9] where noiseLength is 120,000 (5s) or 360,000 (15s), yielding ~2.16 MB or ~6.48 MB per array in float16. The omission breaks the established preallocation pattern and causes unnecessary heap allocation pressure during the latency-sensitive prediction phase.
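The per-array sizes quoted above follow from simple arithmetic, sketched here. The 24,000 Hz sample rate is an assumption inferred from 120,000 samples for the 5s variant; the function name sourceNoiseByteCount is hypothetical.

```swift
// Assumption: 24 kHz sample rate, inferred from 120,000 samples over 5 seconds.
let sampleRate = 24_000
let channels = 9
// Float16 occupies two bytes, the same as UInt16.
let bytesPerElement = MemoryLayout<UInt16>.size

// Byte size of a [1, sampleRate * seconds, 9] float16 tensor.
func sourceNoiseByteCount(seconds: Int) -> Int {
    sampleRate * seconds * channels * bytesPerElement
}

print(sourceNoiseByteCount(seconds: 5))   // 2160000 bytes, about 2.16 MB
print(sourceNoiseByteCount(seconds: 15))  // 6480000 bytes, about 6.48 MB
```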
Prompt for agents
In Sources/FluidAudio/TTS/Kokoro/Pipeline/Synthesize/KokoroSynthesizer.swift, in the synthesizeDetailed method around line 554 (after the refStyle preallocation block), add preallocation for the sourceNoise arrays. Group entries by variant (to get the correct noiseLength per variant), then preallocate for each group. For example, after the refShape preallocation (line 554), add something like:
    let groupedByVariant = Dictionary(grouping: entries, by: { $0.template.variant })
    for (variant, group) in groupedByVariant {
        let maxSeconds = variant.maxDurationSeconds
        let noiseLength = TtsConstants.audioSampleRate * maxSeconds
        let noiseShape: [NSNumber] = [1, NSNumber(value: noiseLength), 9]
        try await multiArrayPool.preallocate(
            shape: noiseShape,
            dataType: .float16,
            count: max(1, group.count),
            zeroFill: false
        )
    }
This follows the same pattern used for inputArray/attentionMask preallocation (grouped by targetTokens) and ensures the large sourceNoise arrays are ready before concurrent chunk synthesis begins.
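To illustrate why grouping by variant before preallocating helps, here is a minimal, self-contained sketch with a toy pool. ToyBufferPool is a hypothetical stand-in and not FluidAudio's MultiArrayPool (which, per the comment above, is an actor); it uses plain arrays instead of MLMultiArray, and the 24,000 Hz sample rate and the chunk list are assumptions.

```swift
/// Hypothetical stand-in for a buffer pool; not FluidAudio's MultiArrayPool.
final class ToyBufferPool {
    private var free: [Int: [[UInt16]]] = [:]  // free buffers keyed by element count
    private(set) var freshAllocations = 0      // rents that missed the pool

    func preallocate(elementCount: Int, buffers: Int) {
        for _ in 0..<buffers {
            free[elementCount, default: []].append([UInt16](repeating: 0, count: elementCount))
        }
    }

    func rent(elementCount: Int) -> [UInt16] {
        if let buffer = free[elementCount]?.popLast() { return buffer }
        freshAllocations += 1
        return [UInt16](repeating: 0, count: elementCount)
    }

    func recycle(_ buffer: [UInt16]) {
        free[buffer.count, default: []].append(buffer)
    }
}

// Assumed inputs: the max duration (seconds) of the variant each chunk uses.
let chunkVariantSeconds = [5, 5, 15, 5]
let pool = ToyBufferPool()

// Preallocate one buffer per chunk, grouped by variant so each group gets its own size.
let grouped = Dictionary(grouping: chunkVariantSeconds, by: { $0 })
for (seconds, group) in grouped {
    pool.preallocate(elementCount: 24_000 * seconds * 9, buffers: max(1, group.count))
}

// Every later rent is served from the pool, so no fresh allocation happens.
for seconds in chunkVariantSeconds {
    _ = pool.rent(elementCount: 24_000 * seconds * 9)
}
print(pool.freshAllocations)  // 0
```

Without the preallocation loop, the same four rents would each count as a fresh allocation, which is the situation the comment describes for sourceNoise.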
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode): Optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown: Time spent in each stage of batch diarization
Speaker Diarization Research Comparison: Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs processing • Test runtime: N/A • 03/22/2026, 12:15 PM EST
Summary
Fixes CI failure in the test-tts workflow caused by a missing source_noise input after PR #411 merged.
PR #411 (Kokoro ANE optimization) updated the Kokoro CoreML models to fp16, which introduced a new required input, source_noise, that the inference code wasn't providing.
Changes
- Add source_noise tensor [1, sampleRate*duration, 9] with random Float16 values
Error Fixed
Test Plan
- test-tts workflow should pass
Fixes the CI failure blocking PR #409 and other PRs.