feat: Add NemotronStreamingAsrManager for RNNT streaming inference #389

@pHequals7

Description

Summary

The FluidInference/nemotron-speech-streaming-en-0.6b-coreml model card references NemotronStreamingAsrManager as a FluidAudio class, but it doesn't exist in the current release (v0.12.2). We've implemented a working RNNT streaming inference pipeline in our app (Muesli) and would like to contribute it upstream as an official FluidAudio class.

What We Built

A complete RNNT streaming pipeline for the Nemotron 560ms variant:

  1. Preprocessor — audio [1, N] → mel spectrogram [1, 128, 56]
  2. Encoder (with cache) — mel [1, 128, 65] + cache → encoded [1, 1024, 7] + new cache
  3. Decoder + Joint (greedy loop) — RNNT greedy decode per encoder frame
  4. Tokenizer — 1024-token vocab with `▁` → space replacement
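
Step 3 above can be sketched in isolation. This is a minimal, hypothetical version of the per-frame greedy loop: the `decode` and `joint` closures stand in for the real CoreML decoder and joint network calls, and `maxSymbolsPerFrame` is an assumed safety cap, not a parameter from our implementation.

```swift
// Blank = vocab_size, per the model's tokenizer (see below).
let blankId = 1024

/// Greedy RNNT decode: for each encoder frame, emit non-blank tokens
/// (feeding each back into the decoder) until the joint predicts blank.
func greedyDecode(
    encoderFrames: [[Float]],               // one 1024-dim vector per frame
    decode: (Int) -> [Float],               // decoder: last token -> prediction vector
    joint: ([Float], [Float]) -> [Float],   // joint: (enc, pred) -> logits over vocab + blank
    maxSymbolsPerFrame: Int = 10            // assumed cap to avoid infinite loops
) -> [Int] {
    var tokens: [Int] = []
    var lastToken = blankId
    for frame in encoderFrames {
        var emitted = 0
        while emitted < maxSymbolsPerFrame {
            let logits = joint(frame, decode(lastToken))
            let best = logits.indices.max(by: { logits[$0] < logits[$1] })!
            if best == blankId { break }    // blank: advance to the next encoder frame
            tokens.append(best)
            lastToken = best                // non-blank: feed back, retry same frame
            emitted += 1
        }
    }
    return tokens
}
```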

Key Implementation Details

  • Encoder cache: cache_channel [1, 24, 70, 1024], cache_time [1, 24, 1024, 8], cache_len [1]
  • Decoder LSTM state: h [2, 1, 640], c [2, 1, 640]
  • Critical stride handling: Encoder output shape [1, 1024, 7] has strides [7168, 7, 1]. Frame t at dimension d must be accessed as ptr[d * stride1 + t], NOT ptr[t * dim + d]. This was a non-obvious bug that caused all-blank output until fixed.
  • Blank token: ID 1024 (= vocab_size)
  • macOS 15+ required for CoreML stateful model support
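
The stride pitfall is easiest to see in miniature. The `frame(_:t:dim:strides:)` helper below is hypothetical (the real code reads `MLMultiArray.dataPointer` with `MLMultiArray.strides`), but the index arithmetic is the point: with strides `[7168, 7, 1]` on shape `[1, 1024, 7]`, the elements of one frame are 7 floats apart, so a frame is not contiguous in memory.

```swift
/// Extracts encoder frame `t` as a `dim`-element vector, honoring the
/// array's strides: element (0, d, t) lives at flat offset
/// d * strides.0 + t * strides.1 — NOT at the row-major t * dim + d.
func frame(_ flat: [Float], t: Int, dim: Int, strides: (Int, Int)) -> [Float] {
    (0..<dim).map { d in flat[d * strides.0 + t * strides.1] }
}
```

Reading `flat[t * dim + d]` instead silently interleaves values from different feature dimensions, which is why the symptom was all-blank output rather than a crash.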

Performance (Apple M4 Pro)

| Metric | Value |
| --- | --- |
| Cold start (CoreML compile) | ~6 s |
| Warm latency per chunk | ~0.2–0.3 s |
| WER (long utterances) | Comparable to benchmarks |

Limitations Discovered

  • Short utterances (< 2s) produce poor results — encoder needs a few chunks of context to warm up
  • Best suited for continuous streaming (meetings, hands-free dictation), not hold-to-talk
  • Model download is ~600MB (int8 encoder + float32 decoder/joint/preprocessor)

What We'd Like to See in FluidAudio

An official NemotronStreamingAsrManager actor (similar to Qwen3AsrManager) with:

  1. Model download via DownloadUtils (like other FluidAudio models)
  2. Batch transcription — transcribe(wavURL:) or transcribe(audioSamples:)
  3. Chunk-level streaming API — makeStreamState() + transcribeChunk(samples:state:) for real-time incremental output
  4. Multi-variant support — 80ms, 160ms, 560ms, 1120ms chunk sizes
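
To make the wish list concrete, here is a hypothetical sketch of that surface. None of this is FluidAudio's actual code: the real type would presumably be an actor with async throwing methods and CoreML-backed internals; a plain class with stub bodies is used here only to show the shape of the API.

```swift
/// Opaque per-stream state the caller carries across chunks.
/// In a real implementation this would hold the encoder caches
/// (cache_channel, cache_time, cache_len), the decoder LSTM h/c
/// state, and the partial transcript.
struct NemotronStreamState {
    var transcript = ""
}

final class NemotronStreamingAsrManager {
    /// Batch path: one-shot transcription of a full buffer.
    func transcribe(audioSamples: [Float]) -> String {
        var state = makeStreamState()
        return transcribeChunk(samples: audioSamples, state: &state)
    }

    /// Streaming path: the caller owns the state across chunks.
    func makeStreamState() -> NemotronStreamState { NemotronStreamState() }

    func transcribeChunk(samples: [Float], state: inout NemotronStreamState) -> String {
        // Preprocessor -> cached encoder -> greedy decoder/joint would run
        // here, appending newly emitted tokens to state.transcript.
        return state.transcript
    }
}
```

Keeping the stream state as a value the caller owns (rather than hiding it inside the manager) would let one manager serve several concurrent streams, which matters for the meeting use case above.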

Our Implementation

Source: NemotronStreamingBackend.swift (~340 lines)

Happy to contribute this as a PR if you'd like — would just need guidance on:

  • Where it should live in the FluidAudio source tree (e.g., Sources/FluidAudio/ASR/Nemotron/)
  • Whether it should integrate with the existing RnntDecoder.swift or be standalone
  • Naming conventions and code style (we saw the .swift-format config)
