
Fix missing source_noise input in Kokoro TTS models #412

Merged
Alex-Wengg merged 1 commit into main from fix/kokoro-source-noise on Mar 22, 2026
Conversation


@Alex-Wengg (Member) commented Mar 22, 2026

Summary

Fixes the CI failure in the test-tts workflow caused by a missing source_noise input after PR #411 merged.

PR #411 (Kokoro ANE optimization) updated the Kokoro CoreML models to fp16, which introduced a new required input source_noise that the inference code wasn't providing.

Changes

  • Add source_noise tensor [1, sampleRate*duration, 9] with random Float16 values
  • Update both synthesis pipeline and warm-up prediction
  • Size adapts to model variant: 5s (120k samples) or 15s (360k samples)
  • Use multiarray pooling for memory efficiency
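
The tensor sizing described above can be sketched in plain Swift. This is a sketch only: a flat `[Float]` stands in for the CoreML `MLMultiArray` used in the actual change, and the 24 kHz sample rate is inferred from the stated "5s (120k samples)" figure rather than confirmed by the PR.

```swift
import Foundation

// Sketch only: a flat buffer stands in for the MLMultiArray from the PR.
// 24_000 Hz is inferred from "5s (120k samples)"; not confirmed elsewhere.
let sampleRate = 24_000
let maxSeconds = 5                  // 5s model variant
let channels = 9                    // last dimension of [1, samples, 9]
let noiseLength = sampleRate * maxSeconds
var sourceNoise = [Float](repeating: 0, count: noiseLength * channels)
for i in sourceNoise.indices {
    sourceNoise[i] = Float.random(in: -1...1)   // uniform noise per element
}
```

The 15s variant follows the same arithmetic with `maxSeconds = 15`, giving 360k samples.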

Error Fixed

Feature source_noise is required but not specified.

Test Plan

  • Cherry-picked from commit c8a5056 (originally on feature/qwen3-tts-coreml)
  • CI test-tts workflow should pass
  • Verify Kokoro TTS synthesis completes successfully

Fixes the CI failure blocking PR #409 and other PRs.



Newer Kokoro CoreML models require a source_noise feature that wasn't
being provided, causing CI failures with "Feature source_noise is
required but not specified" errors.

Changes:
- Add source_noise tensor [1, sampleRate*duration, 9] with random Float16 values
- Update both synthesis pipeline and warm-up prediction
- Size adapts to model variant: 5s (120k samples) or 15s (360k samples)
- Use multiarray pooling for memory efficiency

Fixes #290 CI test-tts workflow failure.

claude bot commented Mar 22, 2026

Claude finished @Alex-Wengg's task.


Code Review Complete ✅

I've successfully reviewed PR #412 that fixes the missing source_noise input in Kokoro TTS models.

Todo List

  • Create pending GitHub review
  • Get PR diff to understand specific changes
  • Read modified files for context
  • Analyze code changes for potential issues
  • Submit comprehensive review

Summary

✅ Approval: This PR correctly fixes the CI failure by adding the required source_noise input to Kokoro TTS models. The implementation is functionally sound and safe to merge.

Key Findings:

  • Fixes critical issue: Resolves CoreML "Feature source_noise is required" error
  • Proper memory management: Uses multiarray pooling correctly
  • Performance opportunity: Random noise generation could be optimized with vectorized operations
  • Code duplication: Identical logic appears in both files; it could be refactored into a shared utility

The change addresses the immediate blocker and follows existing patterns in the codebase. Performance improvements and code deduplication can be addressed in follow-up work.


@Alex-Wengg merged commit 4b03d1f into main on Mar 22, 2026
2 of 14 checks passed
@Alex-Wengg deleted the fix/kokoro-source-noise branch on March 22, 2026 at 16:11
@claude (Contributor) bot left a comment


Code Review Summary

This PR correctly addresses the missing source_noise input that was causing CI failures after the Kokoro fp16 model updates. The implementation is functionally sound but has several areas for improvement:

Positives:

  • Fixes the immediate issue: Adds required source_noise input to prevent CoreML errors
  • Proper memory management: Uses multiarray pooling and correctly recycles arrays
  • Adaptive sizing: Correctly adapts tensor size based on model variant (5s vs 15s)
  • Consistent implementation: Applied to both synthesis pipeline and warm-up prediction

⚠️ Issues Identified:

1. Performance Concern (KokoroSynthesizer.swift:315, TtsModels.swift:162)

The random noise generation loop could be expensive for larger models:

  • 15s variant creates 360k × 9 = 3.24M Float16 values
  • Current approach uses pointer arithmetic in a tight loop
  • Suggestion: Use vectorized operations with vDSP for better performance
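
The vDSP route suggested above is Apple-only, but the underlying idea also works portably: reduce per-element RNG overhead by drawing one 64-bit random word and splitting it into several values, instead of calling `Float.random` 3.24M times. This is a hedged sketch of that idea, not the PR's implementation; the function name and mapping are illustrative.

```swift
import Foundation

// Portable sketch of the optimization idea: one 64-bit draw yields four
// 16-bit values, cutting RNG calls by 4x for large noise buffers.
func fillUniformNoise(_ buffer: inout [Float]) {
    var rng = SystemRandomNumberGenerator()
    var i = 0
    while i < buffer.count {
        let bits = rng.next()                    // 64 fresh random bits
        var shift: UInt64 = 0
        while shift < 64 && i < buffer.count {
            let u16 = UInt16(truncatingIfNeeded: bits >> shift)
            buffer[i] = Float(u16) / 32768.0 - 1.0   // map to [-1, 1)
            i += 1
            shift += 16
        }
    }
}

var noise = [Float](repeating: 0, count: 120_000 * 9)
fillUniformNoise(&noise)
```

On Apple platforms, the final Float-to-Float16 conversion could then be done in one vectorized pass rather than per element.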

2. Code Duplication

Identical noise generation logic appears in both files:

  • KokoroSynthesizer.swift (lines 307-319)
  • TtsModels.swift (lines 155-166)
  • Suggestion: Extract to a shared utility function in TtsConstants or similar

3. Minor: Magic Numbers

The tensor shape [1, noiseLength, 9] uses a hardcoded 9; consider making this a named constant for clarity.
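
A minimal sketch of lifting the hardcoded 9 into a named constant. Both the constant name and its placement in `TtsConstants` are hypothetical; they only illustrate the suggestion.

```swift
// Hypothetical: neither the name sourceNoiseChannels nor this placement
// in TtsConstants is confirmed by the PR.
enum TtsConstants {
    static let sourceNoiseChannels = 9   // last dim of [1, samples, channels]
}

let noiseLength = 120_000               // 5s variant
let shape = [1, noiseLength, TtsConstants.sourceNoiseChannels]
```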

🔧 Recommended Improvements:

// In TtsConstants or similar shared location
static func generateSourceNoise(variant: ModelNames.TTS.Variant, pool: MultiArrayPool? = nil) async throws -> MLMultiArray {
    let maxSeconds = variant.maxDurationSeconds
    let noiseLength = TtsConstants.audioSampleRate * maxSeconds
    let channels = 9 // Named constant
    
    let sourceNoise = if let pool = pool {
        try await pool.rent(shape: [1, noiseLength, channels], dataType: .float16, zeroFill: false)
    } else {
        try MLMultiArray(shape: [1, NSNumber(value: noiseLength), NSNumber(value: channels)], dataType: .float16)
    }
    
    // Use vDSP for better performance on large arrays
    // ... vectorized implementation
    
    return sourceNoise
}

Approval Status:

This change is safe to merge as it fixes a critical CI blocker. The performance optimizations can be addressed in a follow-up PR.


@devin-ai-integration (Contributor) bot left a comment


Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.


Comment on lines +307 to +319
// Source noise for newer Kokoro models
let maxSeconds = variant.maxDurationSeconds
let noiseLength = TtsConstants.audioSampleRate * maxSeconds
let sourceNoise = try await multiArrayPool.rent(
    shape: [1, noiseLength, 9],
    dataType: .float16,
    zeroFill: false
)
let noisePointer = sourceNoise.dataPointer.bindMemory(to: UInt16.self, capacity: noiseLength * 9)
for i in 0..<(noiseLength * 9) {
    let randomValue = Float.random(in: -1...1)
    noisePointer[i] = Float16(randomValue).bitPattern
}
The reason will be displayed to describe this comment to others. Learn more.

🟡 Missing pool preallocation for sourceNoise arrays

In synthesizeDetailed, the multiArrayPool is preallocated for phasesArray, inputArray/attentionMask, and refStyle (lines 533–554), but the new sourceNoise array is not preallocated. Since synthesizeChunk is called concurrently via withThrowingTaskGroup at KokoroSynthesizer.swift:564, each concurrent task will individually allocate a fresh sourceNoise MLMultiArray through the actor-serialized rent() call instead of reusing preallocated pooled arrays. This is especially impactful because sourceNoise is by far the largest array in the inference pipeline — shape [1, noiseLength, 9] where noiseLength is 120,000 (5s) or 360,000 (15s), yielding ~2.16 MB or ~6.48 MB per array in float16. The omission breaks the established preallocation pattern and causes unnecessary heap allocation pressure during the latency-sensitive prediction phase.

Prompt for agents
In Sources/FluidAudio/TTS/Kokoro/Pipeline/Synthesize/KokoroSynthesizer.swift, in the synthesizeDetailed method around line 554 (after the refStyle preallocation block), add preallocation for the sourceNoise arrays. Group entries by variant (to get the correct noiseLength per variant), then preallocate for each group. For example, after the refShape preallocation (line 554), add something like:

let groupedByVariant = Dictionary(grouping: entries, by: { $0.template.variant })
for (variant, group) in groupedByVariant {
    let maxSeconds = variant.maxDurationSeconds
    let noiseLength = TtsConstants.audioSampleRate * maxSeconds
    let noiseShape: [NSNumber] = [1, NSNumber(value: noiseLength), 9]
    try await multiArrayPool.preallocate(
        shape: noiseShape,
        dataType: .float16,
        count: max(1, group.count),
        zeroFill: false
    )
}

This follows the same pattern used for inputArray/attentionMask preallocation (grouped by targetTokens) and ensures the large sourceNoise arrays are ready before concurrent chunk synthesis begins.

@github-actions bot commented

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | NaN% | <20% | ⚠️ | Diarization Error Rate (lower is better) |
| RTFx | NaNx | >1.0x | ⚠️ | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | NaN | NaN | Fetching diarization models |
| Model Compile | NaN | NaN | CoreML compilation |
| Audio Load | NaN | NaN | Loading audio file |
| Segmentation | NaN | NaN | VAD + speech detection |
| Embedding | NaN | NaN | Speaker embedding extraction |
| Clustering (VBx) | NaN | NaN | Hungarian algorithm + VBx clustering |
| Total | NaN | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
| --- | --- | --- | --- |
| FluidAudio (Offline) | NaN% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs processing • Test runtime: N/A • 03/22/2026, 12:15 PM EST
