
Fix missing source_noise input in Kokoro TTS models #412

Merged
Alex-Wengg merged 1 commit into main from fix/kokoro-source-noise on Mar 22, 2026
Conversation


@Alex-Wengg (Member) commented Mar 22, 2026

Summary

Fixes the CI failure in the test-tts workflow caused by a missing source_noise input after PR #411 merged.

PR #411 (Kokoro ANE optimization) updated the Kokoro CoreML models to fp16, which introduced a new required input source_noise that the inference code wasn't providing.

Changes

  • Add source_noise tensor [1, sampleRate*duration, 9] with random Float16 values
  • Update both synthesis pipeline and warm-up prediction
  • Size adapts to model variant: 5s (120k samples) or 15s (360k samples)
  • Use multiarray pooling for memory efficiency
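
The tensor sizing described above can be sketched in plain Swift. This is a sketch only: a flat `[Float]` stands in for the CoreML `MLMultiArray` used in the actual change, and the 24 kHz sample rate is inferred from the stated "5s (120k samples)" figure rather than confirmed by the PR.

```swift
import Foundation

// Sketch only: a flat buffer stands in for the MLMultiArray from the PR.
// 24_000 Hz is inferred from "5s (120k samples)"; not confirmed elsewhere.
let sampleRate = 24_000
let maxSeconds = 5                  // 5s model variant
let channels = 9                    // last dimension of [1, samples, 9]
let noiseLength = sampleRate * maxSeconds
var sourceNoise = [Float](repeating: 0, count: noiseLength * channels)
for i in sourceNoise.indices {
    sourceNoise[i] = Float.random(in: -1...1)   // uniform noise per element
}
```

The 15s variant follows the same arithmetic with `maxSeconds = 15`, giving 360k samples.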

Error Fixed

Feature source_noise is required but not specified.

Test Plan

  • Cherry-picked from commit c8a5056 (originally on feature/qwen3-tts-coreml)
  • CI test-tts workflow should pass
  • Verify Kokoro TTS synthesis completes successfully

Fixes the CI failure blocking PR #409 and other PRs.



Newer Kokoro CoreML models require a source_noise feature that wasn't
being provided, causing CI failures with "Feature source_noise is
required but not specified" errors.

Changes:
- Add source_noise tensor [1, sampleRate*duration, 9] with random Float16 values
- Update both synthesis pipeline and warm-up prediction
- Size adapts to model variant: 5s (120k samples) or 15s (360k samples)
- Use multiarray pooling for memory efficiency

Fixes #290 CI test-tts workflow failure.

claude bot commented Mar 22, 2026

Claude finished @Alex-Wengg's task.


Code Review Complete ✅

I've successfully reviewed PR #412 that fixes the missing source_noise input in Kokoro TTS models.

Todo List

  • Create pending GitHub review
  • Get PR diff to understand specific changes
  • Read modified files for context
  • Analyze code changes for potential issues
  • Submit comprehensive review

Summary

✅ Approval: This PR correctly fixes the CI failure by adding the required source_noise input to Kokoro TTS models. The implementation is functionally sound and safe to merge.

Key Findings:

  • Fixes critical issue: Resolves CoreML "Feature source_noise is required" error
  • Proper memory management: Uses multiarray pooling correctly
  • Performance opportunity: Random noise generation could be optimized with vectorized operations
  • Code duplication: Identical logic appears in both files; it could be refactored into a shared utility

The change addresses the immediate blocker and follows existing patterns in the codebase. Performance improvements and code deduplication can be addressed in follow-up work.


@Alex-Wengg merged commit 4b03d1f into main on Mar 22, 2026
2 of 14 checks passed
@Alex-Wengg deleted the fix/kokoro-source-noise branch on March 22, 2026 at 16:11
@claude (Contributor) bot left a comment


Code Review Summary

This PR correctly addresses the missing source_noise input that was causing CI failures after the Kokoro fp16 model updates. The implementation is functionally sound but has several areas for improvement:

Positives:

  • Fixes the immediate issue: Adds required source_noise input to prevent CoreML errors
  • Proper memory management: Uses multiarray pooling and correctly recycles arrays
  • Adaptive sizing: Correctly adapts tensor size based on model variant (5s vs 15s)
  • Consistent implementation: Applied to both synthesis pipeline and warm-up prediction

⚠️ Issues Identified:

1. Performance Concern (KokoroSynthesizer.swift:315, TtsModels.swift:162)

The random noise generation loop could be expensive for larger models:

  • 15s variant creates 360k × 9 = 3.24M Float16 values
  • Current approach uses pointer arithmetic in a tight loop
  • Suggestion: Use vectorized operations with vDSP for better performance
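
The vDSP route suggested above is Apple-only, but the underlying idea also works portably: reduce per-element RNG overhead by drawing one 64-bit random word and splitting it into several values, instead of calling `Float.random` 3.24M times. This is a hedged sketch of that idea, not the PR's implementation; the function name and mapping are illustrative.

```swift
import Foundation

// Portable sketch of the optimization idea: one 64-bit draw yields four
// 16-bit values, cutting RNG calls by 4x for large noise buffers.
func fillUniformNoise(_ buffer: inout [Float]) {
    var rng = SystemRandomNumberGenerator()
    var i = 0
    while i < buffer.count {
        let bits = rng.next()                    // 64 fresh random bits
        var shift: UInt64 = 0
        while shift < 64 && i < buffer.count {
            let u16 = UInt16(truncatingIfNeeded: bits >> shift)
            buffer[i] = Float(u16) / 32768.0 - 1.0   // map to [-1, 1)
            i += 1
            shift += 16
        }
    }
}

var noise = [Float](repeating: 0, count: 120_000 * 9)
fillUniformNoise(&noise)
```

On Apple platforms, the final Float-to-Float16 conversion could then be done in one vectorized pass rather than per element.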

2. Code Duplication

Identical noise generation logic appears in both files:

  • KokoroSynthesizer.swift (lines 307-319)
  • TtsModels.swift (lines 155-166)
  • Suggestion: Extract to a shared utility function in TtsConstants or similar

3. Minor: Magic Numbers

The tensor shape [1, noiseLength, 9] uses a hardcoded 9; consider making this a named constant for clarity.
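
A minimal sketch of lifting the hardcoded 9 into a named constant. Both the constant name and its placement in `TtsConstants` are hypothetical; they only illustrate the suggestion.

```swift
// Hypothetical: neither the name sourceNoiseChannels nor this placement
// in TtsConstants is confirmed by the PR.
enum TtsConstants {
    static let sourceNoiseChannels = 9   // last dim of [1, samples, channels]
}

let noiseLength = 120_000               // 5s variant
let shape = [1, noiseLength, TtsConstants.sourceNoiseChannels]
```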

🔧 Recommended Improvements:

// In TtsConstants or similar shared location
static func generateSourceNoise(variant: ModelNames.TTS.Variant, pool: MultiArrayPool? = nil) async throws -> MLMultiArray {
    let maxSeconds = variant.maxDurationSeconds
    let noiseLength = TtsConstants.audioSampleRate * maxSeconds
    let channels = 9 // Named constant
    
    let sourceNoise = if let pool = pool {
        try await pool.rent(shape: [1, noiseLength, channels], dataType: .float16, zeroFill: false)
    } else {
        try MLMultiArray(shape: [1, NSNumber(value: noiseLength), NSNumber(value: channels)], dataType: .float16)
    }
    
    // Use vDSP for better performance on large arrays
    // ... vectorized implementation
    
    return sourceNoise
}

Approval Status:

This change is safe to merge as it fixes a critical CI blocker. The performance optimizations can be addressed in a follow-up PR.


@devin-ai-integration (Contributor) bot left a comment


Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.


Comment on lines +307 to +319
// Source noise for newer Kokoro models
let maxSeconds = variant.maxDurationSeconds
let noiseLength = TtsConstants.audioSampleRate * maxSeconds
let sourceNoise = try await multiArrayPool.rent(
    shape: [1, noiseLength, 9],
    dataType: .float16,
    zeroFill: false
)
let noisePointer = sourceNoise.dataPointer.bindMemory(to: UInt16.self, capacity: noiseLength * 9)
for i in 0..<(noiseLength * 9) {
    let randomValue = Float.random(in: -1...1)
    noisePointer[i] = Float16(randomValue).bitPattern
}
The reason will be displayed to describe this comment to others. Learn more.

🟡 Missing pool preallocation for sourceNoise arrays

In synthesizeDetailed, the multiArrayPool is preallocated for phasesArray, inputArray/attentionMask, and refStyle (lines 533–554), but the new sourceNoise array is not preallocated. Since synthesizeChunk is called concurrently via withThrowingTaskGroup at KokoroSynthesizer.swift:564, each concurrent task will individually allocate a fresh sourceNoise MLMultiArray through the actor-serialized rent() call instead of reusing preallocated pooled arrays. This is especially impactful because sourceNoise is by far the largest array in the inference pipeline — shape [1, noiseLength, 9] where noiseLength is 120,000 (5s) or 360,000 (15s), yielding ~2.16 MB or ~6.48 MB per array in float16. The omission breaks the established preallocation pattern and causes unnecessary heap allocation pressure during the latency-sensitive prediction phase.

Prompt for agents
In Sources/FluidAudio/TTS/Kokoro/Pipeline/Synthesize/KokoroSynthesizer.swift, in the synthesizeDetailed method around line 554 (after the refStyle preallocation block), add preallocation for the sourceNoise arrays. Group entries by variant (to get the correct noiseLength per variant), then preallocate for each group. For example, after the refShape preallocation (line 554), add something like:

let groupedByVariant = Dictionary(grouping: entries, by: { $0.template.variant })
for (variant, group) in groupedByVariant {
    let maxSeconds = variant.maxDurationSeconds
    let noiseLength = TtsConstants.audioSampleRate * maxSeconds
    let noiseShape: [NSNumber] = [1, NSNumber(value: noiseLength), 9]
    try await multiArrayPool.preallocate(
        shape: noiseShape,
        dataType: .float16,
        count: max(1, group.count),
        zeroFill: false
    )
}

This follows the same pattern used for inputArray/attentionMask preallocation (grouped by targetTokens) and ensures the large sourceNoise arrays are ready before concurrent chunk synthesis begins.

@github-actions bot commented

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | NaN% | <20% | ⚠️ | Diarization Error Rate (lower is better) |
| RTFx | NaNx | >1.0x | ⚠️ | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | NaN | NaN | Fetching diarization models |
| Model Compile | NaN | NaN | CoreML compilation |
| Audio Load | NaN | NaN | Loading audio file |
| Segmentation | NaN | NaN | VAD + speech detection |
| Embedding | NaN | NaN | Speaker embedding extraction |
| Clustering (VBx) | NaN | NaN | Hungarian algorithm + VBx clustering |
| Total | NaN | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
| --- | --- | --- | --- |
| FluidAudio (Offline) | NaN% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs processing • Test runtime: N/A • 03/22/2026, 12:15 PM EST
