Merged
@@ -304,7 +304,22 @@ public struct KokoroSynthesizer {
            zeroFill: true
        )

        // Source noise for newer Kokoro models
        let maxSeconds = variant.maxDurationSeconds
        let noiseLength = TtsConstants.audioSampleRate * maxSeconds
        let sourceNoise = try await multiArrayPool.rent(
            shape: [1, noiseLength, 9],
            dataType: .float16,
            zeroFill: false
        )
        let noisePointer = sourceNoise.dataPointer.bindMemory(to: UInt16.self, capacity: noiseLength * 9)
        for i in 0..<(noiseLength * 9) {
            let randomValue = Float.random(in: -1...1)
            noisePointer[i] = Float16(randomValue).bitPattern
        }
Comment on lines +307 to +319
Contributor

🟡 Missing pool preallocation for sourceNoise arrays

In synthesizeDetailed, the multiArrayPool is preallocated for phasesArray, inputArray/attentionMask, and refStyle (lines 533–554), but the new sourceNoise array is not preallocated. Since synthesizeChunk is called concurrently via withThrowingTaskGroup at KokoroSynthesizer.swift:564, each concurrent task will individually allocate a fresh sourceNoise MLMultiArray through the actor-serialized rent() call instead of reusing preallocated pooled arrays. This is especially impactful because sourceNoise is by far the largest array in the inference pipeline — shape [1, noiseLength, 9] where noiseLength is 120,000 (5s) or 360,000 (15s), yielding ~2.16 MB or ~6.48 MB per array in float16. The omission breaks the established preallocation pattern and causes unnecessary heap allocation pressure during the latency-sensitive prediction phase.

Prompt for agents
In Sources/FluidAudio/TTS/Kokoro/Pipeline/Synthesize/KokoroSynthesizer.swift, in the synthesizeDetailed method around line 554 (after the refStyle preallocation block), add preallocation for the sourceNoise arrays. Group entries by variant (to get the correct noiseLength per variant), then preallocate for each group. For example, after the refShape preallocation (line 554), add something like:

let groupedByVariant = Dictionary(grouping: entries, by: { $0.template.variant })
for (variant, group) in groupedByVariant {
    let maxSeconds = variant.maxDurationSeconds
    let noiseLength = TtsConstants.audioSampleRate * maxSeconds
    let noiseShape: [NSNumber] = [1, NSNumber(value: noiseLength), 9]
    try await multiArrayPool.preallocate(
        shape: noiseShape,
        dataType: .float16,
        count: max(1, group.count),
        zeroFill: false
    )
}

This follows the same pattern used for inputArray/attentionMask preallocation (grouped by targetTokens) and ensures the large sourceNoise arrays are ready before concurrent chunk synthesis begins.
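The review comment's core claim — rent() calls serialize on the pool actor, so preallocation moves allocation cost off the concurrent hot path — can be sketched with a minimal pool. Everything below (ArrayPool, plain [Float] buffers, the method signatures) is illustrative only; FluidAudio's actual multiArrayPool manages MLMultiArray instances and has a different API.

```swift
// Minimal sketch of an actor-backed buffer pool (illustrative, not the real
// multiArrayPool API). The point: rent() runs serialized on the actor, so a
// fresh allocation inside rent() stalls every concurrent caller, while
// preallocate() pays that cost once, before the latency-sensitive phase.
actor ArrayPool {
    private var free: [String: [[Float]]] = [:]

    private func key(for shape: [Int]) -> String {
        shape.map(String.init).joined(separator: "x")
    }

    /// Allocate `count` buffers up front so later rent() calls are cheap pops.
    func preallocate(shape: [Int], count: Int) {
        let size = shape.reduce(1, *)
        free[key(for: shape), default: []].append(
            contentsOf: (0..<count).map { _ in [Float](repeating: 0, count: size) })
    }

    /// Reuse a pooled buffer if one exists; otherwise allocate on the spot.
    func rent(shape: [Int]) -> [Float] {
        let k = key(for: shape)
        if var list = free[k], let buffer = list.popLast() {
            free[k] = list
            return buffer  // fast path: pooled buffer
        }
        return [Float](repeating: 0, count: shape.reduce(1, *))  // slow path
    }

    func recycle(_ buffer: [Float], shape: [Int]) {
        free[key(for: shape), default: []].append(buffer)
    }
}
```

With preallocation, concurrent renters only pop from the free list while holding the actor; without it, each task pays a full allocation inside the serialized region while every other task waits its turn.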


    func recycleModelArrays() async {
        await multiArrayPool.recycle(sourceNoise, zeroFill: false)
        await multiArrayPool.recycle(phasesArray, zeroFill: true)
        await multiArrayPool.recycle(attentionMask, zeroFill: false)
        await multiArrayPool.recycle(inputArray, zeroFill: false)
@@ -338,6 +353,7 @@ public struct KokoroSynthesizer {
            "attention_mask": attentionMask,
            "ref_s": refStyle,
            "random_phases": phasesArray,
            "source_noise": sourceNoise,
        ])

let predictionStart = Date()
14 changes: 14 additions & 0 deletions Sources/FluidAudio/TTS/TtsModels.swift
@@ -152,11 +152,25 @@ public struct TtsModels: Sendable {
            randomPhases[index] = NSNumber(value: Float(0))
        }

        // Source noise for newer Kokoro models
        let maxSeconds = variant.maxDurationSeconds
        let noiseLength = TtsConstants.audioSampleRate * maxSeconds
        let sourceNoise = try MLMultiArray(
            shape: [1, NSNumber(value: noiseLength), 9],
            dataType: .float16
        )
        let noisePointer = sourceNoise.dataPointer.bindMemory(to: UInt16.self, capacity: noiseLength * 9)
        for i in 0..<(noiseLength * 9) {
            let randomValue = Float.random(in: -1...1)
            noisePointer[i] = Float16(randomValue).bitPattern
        }

        let features = try MLDictionaryFeatureProvider(dictionary: [
            "input_ids": inputIds,
            "attention_mask": attentionMask,
            "ref_s": refStyle,
            "random_phases": randomPhases,
            "source_noise": sourceNoise,
        ])

        let options: MLPredictionOptions = optimizedPredictionOptions()
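Both diffs fill the noise array by writing Float16 bit patterns through a UInt16-bound pointer, since MLMultiArray exposes raw memory rather than a typed Float16 buffer. The round trip can be sketched outside CoreML; this sketch uses Float/UInt32 only because Float16 is unavailable on Intel macOS — the bit-pattern technique is the same.

```swift
// Sketch of the bit-pattern store used in the diffs, with Float/UInt32 standing
// in for Float16/UInt16 (Float16 is unavailable on Intel macOS; same idea).
var bits = [UInt32](repeating: 0, count: 8)
for i in bits.indices {
    // Store the IEEE 754 bits of a random value directly, as the diff does
    // with Float16(randomValue).bitPattern into a UInt16 slot.
    bits[i] = Float.random(in: -1...1).bitPattern
}
// Reinterpreting the stored bits recovers the original values exactly.
let recovered = bits.map { Float(bitPattern: $0) }
```

Because bitPattern and init(bitPattern:) are exact inverses, no precision is lost in the store/load pair itself; in the real code the only rounding happens in the Float → Float16 conversion before the bits are written.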