Labels: enhancement (New feature or request), speech-to-text (issues related to transcription/asr)
Summary
The FluidInference/nemotron-speech-streaming-en-0.6b-coreml model card references NemotronStreamingAsrManager as a FluidAudio class, but it doesn't exist in the current release (v0.12.2). We've implemented a working RNNT streaming inference pipeline in our app (Muesli) and would like to contribute it upstream as an official FluidAudio class.
What We Built
A complete RNNT streaming pipeline for the Nemotron 560ms variant:
- Preprocessor — audio `[1, N]` → mel spectrogram `[1, 128, 56]`
- Encoder (with cache) — mel `[1, 128, 65]` + cache → encoded `[1, 1024, 7]` + new cache
- Decoder + Joint (greedy loop) — RNNT greedy decode per encoder frame
- Tokenizer — 1024-token vocab with `▁` → space replacement
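The decoder + joint greedy loop can be sketched in pure Swift. The `jointLogits` closure below stands in for the real CoreML decoder/joint calls, and `maxSymbolsPerFrame` is a safety guard we chose, so treat this as an illustration of the loop structure rather than the exact backend code:

```swift
let blankId = 1024  // blank token = vocab_size, per the model

// Greedy RNNT decode over one encoder chunk of `frames` frames.
// `jointLogits(frame, lastToken)` returns logits over vocab + blank.
func greedyDecodeChunk(
    frames: Int,
    startToken: Int,
    maxSymbolsPerFrame: Int = 5,
    jointLogits: (Int, Int) -> [Float]
) -> [Int] {
    var tokens: [Int] = []
    var lastToken = startToken
    for t in 0..<frames {
        var emitted = 0
        while emitted < maxSymbolsPerFrame {
            let logits = jointLogits(t, lastToken)
            // Argmax over the 1024-token vocab plus blank (ID 1024).
            let best = logits.indices.max(by: { logits[$0] < logits[$1] })!
            if best == blankId { break }  // blank: advance to the next frame
            tokens.append(best)           // non-blank: emit and keep decoding
            lastToken = best              // emitted token feeds the prediction net
            emitted += 1
        }
    }
    return tokens
}
```

Token IDs are then mapped through the tokenizer, with `▁` replaced by a space.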
Key Implementation Details
- Encoder cache: `cache_channel [1, 24, 70, 1024]`, `cache_time [1, 24, 1024, 8]`, `cache_len [1]`
- Decoder LSTM state: `h [2, 1, 640]`, `c [2, 1, 640]`
- Critical stride handling: the encoder output shape `[1, 1024, 7]` has strides `[7168, 7, 1]`. Frame `t` at dimension `d` must be accessed as `ptr[d * stride1 + t]`, NOT `ptr[t * dim + d]`. This non-obvious bug produced all-blank output until fixed.
- Blank token: ID 1024 (= vocab_size)
- macOS 15+ required for CoreML stateful model support
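The stride pitfall can be shown with a small plain-Swift helper. It operates on a flat buffer with explicit strides so it is self-contained; in the actual backend the same arithmetic runs over the MLMultiArray's data pointer and its reported strides:

```swift
// Extract frame `t` (a `dim`-element vector) from a flat encoder-output
// buffer. For shape [1, 1024, 7] CoreML reports strides [7168, 7, 1], so
// element (d, t) lives at d * 7 + t * 1, NOT at t * 1024 + d.
func encoderFrame(_ buffer: [Float], dim: Int, dStride: Int, tStride: Int, t: Int) -> [Float] {
    (0..<dim).map { d in buffer[d * dStride + t * tStride] }
}
```

Indexing the buffer as if frames were contiguous (`t * dim + d`) reads interleaved garbage, which is why the decoder saw only blanks before the fix.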
Performance (Apple M4 Pro)
| Metric | Value |
|---|---|
| Cold start (CoreML compile) | ~6s |
| Warm latency per chunk | ~0.2-0.3s |
| WER (long utterances) | Comparable to benchmarks |
Limitations Discovered
- Short utterances (< 2s) produce poor results — encoder needs a few chunks of context to warm up
- Best suited for continuous streaming (meetings, hands-free dictation), not hold-to-talk
- Model download is ~600MB (int8 encoder + float32 decoder/joint/preprocessor)
What We'd Like to See in FluidAudio
An official NemotronStreamingAsrManager actor (similar to Qwen3AsrManager) with:
- Model download via `DownloadUtils` (like other FluidAudio models)
- Batch transcription — `transcribe(wavURL:)` or `transcribe(audioSamples:)`
- Chunk-level streaming API — `makeStreamState()` + `transcribeChunk(samples:state:)` for real-time incremental output
- Multi-variant support — 80ms, 160ms, 560ms, 1120ms chunk sizes
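A rough sketch of the call pattern we have in mind. The type names and stub bodies below are suggestions only, not existing FluidAudio symbols; in FluidAudio proper the manager would be an actor (like Qwen3AsrManager) with the CoreML models behind these calls:

```swift
struct NemotronStreamState {
    // The real state would carry cache_channel / cache_time / cache_len,
    // the decoder LSTM h/c tensors, and the last emitted token.
    var chunksSeen = 0
    var transcript = ""
}

struct NemotronStreamingSketch {
    func makeStreamState() -> NemotronStreamState {
        NemotronStreamState()
    }

    // Real implementation: preprocessor -> cached encoder -> greedy RNNT
    // decode; returns only the newly emitted text for incremental display.
    func transcribeChunk(samples: [Float], state: inout NemotronStreamState) -> String {
        state.chunksSeen += 1
        let newText = "[chunk \(state.chunksSeen): \(samples.count) samples]"  // stub output
        state.transcript += newText
        return newText
    }
}
```

For the 560ms variant at 16 kHz (our assumed sample rate), each chunk would be 8960 samples; callers feed chunks as they arrive and append the returned text to the live transcript.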
Our Implementation
Source: NemotronStreamingBackend.swift (~340 lines)
Happy to contribute this as a PR if you'd like — would just need guidance on:
- Where it should live in the FluidAudio source tree (e.g., `Sources/FluidAudio/ASR/Nemotron/`)
- Whether it should integrate with the existing `RnntDecoder.swift` or be standalone
- Naming conventions and code style (we saw the `.swift-format` config)