feat: Add NemotronStreamingAsrManager for RNNT streaming inference #389

@pHequals7

Description

Summary

The FluidInference/nemotron-speech-streaming-en-0.6b-coreml model card references NemotronStreamingAsrManager as a FluidAudio class, but it doesn't exist in the current release (v0.12.2). We've implemented a working RNNT streaming inference pipeline in our app (Muesli) and would like to contribute it upstream as an official FluidAudio class.

What We Built

A complete RNNT streaming pipeline for the Nemotron 560ms variant:

  1. Preprocessor — audio [1, N] → mel spectrogram [1, 128, 56]
  2. Encoder (with cache) — mel [1, 128, 65] + cache → encoded [1, 1024, 7] + new cache
  3. Decoder + Joint (greedy loop) — RNNT greedy decode per encoder frame
  4. Tokenizer — 1024-token vocab with `▁` → space replacement
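
Step 3 above can be sketched in isolation. This is a minimal, hypothetical version of the per-frame greedy loop: the `decode` and `joint` closures stand in for the real CoreML decoder and joint network calls, and `maxSymbolsPerFrame` is an assumed safety cap, not a parameter from our implementation.

```swift
// Blank = vocab_size, per the model's tokenizer (see below).
let blankId = 1024

/// Greedy RNNT decode: for each encoder frame, emit non-blank tokens
/// (feeding each back into the decoder) until the joint predicts blank.
func greedyDecode(
    encoderFrames: [[Float]],               // one 1024-dim vector per frame
    decode: (Int) -> [Float],               // decoder: last token -> prediction vector
    joint: ([Float], [Float]) -> [Float],   // joint: (enc, pred) -> logits over vocab + blank
    maxSymbolsPerFrame: Int = 10            // assumed cap to avoid infinite loops
) -> [Int] {
    var tokens: [Int] = []
    var lastToken = blankId
    for frame in encoderFrames {
        var emitted = 0
        while emitted < maxSymbolsPerFrame {
            let logits = joint(frame, decode(lastToken))
            let best = logits.indices.max(by: { logits[$0] < logits[$1] })!
            if best == blankId { break }    // blank: advance to the next encoder frame
            tokens.append(best)
            lastToken = best                // non-blank: feed back, retry same frame
            emitted += 1
        }
    }
    return tokens
}
```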

Key Implementation Details

  • Encoder cache: cache_channel [1, 24, 70, 1024], cache_time [1, 24, 1024, 8], cache_len [1]
  • Decoder LSTM state: h [2, 1, 640], c [2, 1, 640]
  • Critical stride handling: Encoder output shape [1, 1024, 7] has strides [7168, 7, 1]. Frame t at dimension d must be accessed as ptr[d * stride1 + t], NOT ptr[t * dim + d]. This was a non-obvious bug that caused all-blank output until fixed.
  • Blank token: ID 1024 (= vocab_size)
  • macOS 15+ required for CoreML stateful model support
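
The stride pitfall is easiest to see in miniature. The `frame(_:t:dim:strides:)` helper below is hypothetical (the real code reads `MLMultiArray.dataPointer` with `MLMultiArray.strides`), but the index arithmetic is the point: with strides `[7168, 7, 1]` on shape `[1, 1024, 7]`, the elements of one frame are 7 floats apart, so a frame is not contiguous in memory.

```swift
/// Extracts encoder frame `t` as a `dim`-element vector, honoring the
/// array's strides: element (0, d, t) lives at flat offset
/// d * strides.0 + t * strides.1 — NOT at the row-major t * dim + d.
func frame(_ flat: [Float], t: Int, dim: Int, strides: (Int, Int)) -> [Float] {
    (0..<dim).map { d in flat[d * strides.0 + t * strides.1] }
}
```

Reading `flat[t * dim + d]` instead silently interleaves values from different feature dimensions, which is why the symptom was all-blank output rather than a crash.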

Performance (Apple M4 Pro)

| Metric | Value |
| --- | --- |
| Cold start (CoreML compile) | ~6 s |
| Warm latency per chunk | ~0.2–0.3 s |
| WER (long utterances) | Comparable to benchmarks |

Limitations Discovered

  • Short utterances (< 2s) produce poor results — encoder needs a few chunks of context to warm up
  • Best suited for continuous streaming (meetings, hands-free dictation), not hold-to-talk
  • Model download is ~600MB (int8 encoder + float32 decoder/joint/preprocessor)

What We'd Like to See in FluidAudio

An official NemotronStreamingAsrManager actor (similar to Qwen3AsrManager) with:

  1. Model download via DownloadUtils (like other FluidAudio models)
  2. Batch transcription — transcribe(wavURL:) or transcribe(audioSamples:)
  3. Chunk-level streaming API — makeStreamState() + transcribeChunk(samples:state:) for real-time incremental output
  4. Multi-variant support — 80ms, 160ms, 560ms, 1120ms chunk sizes
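
To make the wish list concrete, here is a hypothetical sketch of that surface. None of this is FluidAudio's actual code: the real type would presumably be an actor with async throwing methods and CoreML-backed internals; a plain class with stub bodies is used here only to show the shape of the API.

```swift
/// Opaque per-stream state the caller carries across chunks.
/// In a real implementation this would hold the encoder caches
/// (cache_channel, cache_time, cache_len), the decoder LSTM h/c
/// state, and the partial transcript.
struct NemotronStreamState {
    var transcript = ""
}

final class NemotronStreamingAsrManager {
    /// Batch path: one-shot transcription of a full buffer.
    func transcribe(audioSamples: [Float]) -> String {
        var state = makeStreamState()
        return transcribeChunk(samples: audioSamples, state: &state)
    }

    /// Streaming path: the caller owns the state across chunks.
    func makeStreamState() -> NemotronStreamState { NemotronStreamState() }

    func transcribeChunk(samples: [Float], state: inout NemotronStreamState) -> String {
        // Preprocessor -> cached encoder -> greedy decoder/joint would run
        // here, appending newly emitted tokens to state.transcript.
        return state.transcript
    }
}
```

Keeping the stream state as a value the caller owns (rather than hiding it inside the manager) would let one manager serve several concurrent streams, which matters for the meeting use case above.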

Our Implementation

Source: NemotronStreamingBackend.swift (~340 lines)

Happy to contribute this as a PR if you'd like — would just need guidance on:

  • Where it should live in the FluidAudio source tree (e.g., Sources/FluidAudio/ASR/Nemotron/)
  • Whether it should integrate with the existing RnntDecoder.swift or be standalone
  • Naming conventions and code style (we saw the .swift-format config)
