Add live audio transcription streaming support to Foundry Local C# SDK#485
Conversation
Pull request overview
Adds a new C# SDK API for live/streaming audio transcription sessions (push PCM chunks, receive incremental/final text results) and includes a Windows microphone demo sample.
Changes:
- Introduces `LiveAudioTranscriptionSession` plus result/error types for streaming ASR over Core interop.
- Extends Core interop to support audio stream start/push/stop (including binary payload routing).
- Adds a `samples/cs/LiveAudioTranscription` demo project and updates the audio client factory API.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| sdk_v2/cs/test/FoundryLocal.Tests/Utils.cs | Replaced prior test utilities with ad-hoc top-level streaming harness code (currently breaks test build). |
| sdk_v2/cs/test/FoundryLocal.Tests/ModelTests.cs | Adds trailing blank lines (formatting noise). |
| sdk_v2/cs/src/OpenAI/LiveAudioTranscriptionTypes.cs | Adds LiveAudioTranscriptionResult and a structured Core error type. |
| sdk_v2/cs/src/OpenAI/LiveAudioTranscriptionClient.cs | Adds LiveAudioTranscriptionSession implementation (channels, retry, stop semantics). |
| sdk_v2/cs/src/OpenAI/AudioClient.cs | Adds CreateLiveTranscriptionSession() and removes the public file streaming transcription API. |
| sdk_v2/cs/src/Detail/JsonSerializationContext.cs | Registers new audio streaming types for source-gen JSON. |
| sdk_v2/cs/src/Detail/ICoreInterop.cs | Adds interop structs + methods for audio stream start/push/stop. |
| sdk_v2/cs/src/Detail/CoreInterop.cs | Implements binary command routing via execute_command_with_binary and start/stop routing via execute_command. |
| sdk_v2/cs/src/AssemblyInfo.cs | Adds InternalsVisibleTo("AudioStreamTest"). |
| samples/cs/LiveAudioTranscription/README.md | Documentation for the live transcription demo sample. |
| samples/cs/LiveAudioTranscription/Program.cs | Windows microphone demo using NAudio + new session API. |
| samples/cs/LiveAudioTranscription/LiveAudioTranscription.csproj | Adds sample project dependencies and references the SDK project (path currently incorrect). |
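The sample's microphone capture presumably looks something like the following sketch. It uses NAudio's `WaveInEvent` (named in the file table above); the audio format, buffer size, and the hand-off to the session are assumptions, not the sample's actual code.

```csharp
using System;
using NAudio.Wave;

// 16 kHz, 16-bit, mono PCM: a common input format for ASR models (assumed here).
using var waveIn = new WaveInEvent
{
    WaveFormat = new WaveFormat(16000, 16, 1),
    BufferMilliseconds = 100, // ~3200 bytes per callback at this format
};

waveIn.DataAvailable += (_, e) =>
{
    // e.Buffer may be larger than the valid region; copy only BytesRecorded.
    var chunk = new byte[e.BytesRecorded];
    Buffer.BlockCopy(e.Buffer, 0, chunk, 0, e.BytesRecorded);
    // Hypothetical hand-off to the transcription session, e.g. queue the
    // chunk for session.AppendAsync(chunk) (the callback itself is sync).
};

waveIn.StartRecording();
Console.ReadLine(); // record until Enter is pressed
waveIn.StopRecording();
```

The callback-plus-queue shape matters because `DataAvailable` fires on NAudio's capture thread, so pushing into an async session should not block it.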
Force-pushed a061087 to 5678587 (Compare). Merge commit: …g-support-sdk # Conflicts: # sdk/js/test/openai/chatClient.test.ts
…ionItem pattern (#561)

### Description

Redesigns `LiveAudioTranscriptionResponse` to follow the OpenAI Realtime API's `ConversationItem` shape, enabling forward compatibility with a future WebSocket-based architecture.

**Motivation:**
- Customers using OpenAI's Realtime API access transcription via `result.content[0].transcript`
- By adopting this pattern now, customers who write `result.Content[0].Text` won't need to change their code when we migrate to WebSocket transport
- Aligns with the team's plan to move toward OpenAI Realtime API compatibility

**Before:**

```csharp
// Extended AudioCreateTranscriptionResponse from Betalgo
await foreach (var result in session.GetTranscriptionStream())
{
    Console.Write(result.Text);      // inherited from base
    bool final = result.IsFinal;     // custom field
    var segments = result.Segments;  // inherited from base
}
```

**After:**

```csharp
// Own type shaped like OpenAI Realtime ConversationItem
await foreach (var result in session.GetTranscriptionStream())
{
    Console.Write(result.Content[0].Text);       // ConversationItem pattern
    Console.Write(result.Content[0].Transcript); // alias for Text (Realtime compat)
    bool final = result.IsFinal;
    double? start = result.StartTime;
}
```

**Changes:**

| File | Change |
|------|--------|
| LiveAudioTranscriptionTypes.cs | Removed `AudioCreateTranscriptionResponse` inheritance. New standalone `LiveAudioTranscriptionResponse` with `Content` list + new `TranscriptionContentPart` type |
| LiveAudioTranscriptionClient.cs | Updated text checks: `.Text` → `.Content?[0]?.Text` |
| JsonSerializationContext.cs | Registered `TranscriptionContentPart`, removed `AudioCreateTranscriptionResponse.Segment` |
| LiveAudioTranscriptionTests.cs | Updated assertions to match new type shape |
| Program.cs (sample) | Updated result reading to `result.Content?[0]?.Text` |
| README.md | Updated docs and output type table |

**Key design decisions:**
- `TranscriptionContentPart` has both `Text` and `Transcript` (set to the same value) for maximum compatibility with both Whisper and Realtime API patterns
- `StartTime`/`EndTime` are top-level on the response (not nested in Segments): simpler access, maps to Realtime's `audio_start_ms`/`audio_end_ms`
- No dependency on Betalgo's `ConversationItem`; we own the type to avoid carrying unused chat/tool-calling fields
- `LiveAudioTranscriptionRaw` (Core JSON deserialization) is unchanged; this is purely an SDK presentation change, no Core/neutron-server impact

**No breaking changes to:** Core API, native interop, audio pipeline, session lifecycle

Co-authored-by: ruiren_microsoft <ruiren@microsoft.com>
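For reference, the reshaped types described above could be sketched roughly like this. The property and type names come from the PR text; nullability, defaults, and everything else are assumptions, not the actual SDK source.

```csharp
using System.Collections.Generic;

// Illustrative sketch only: names from the PR description above.
public sealed class TranscriptionContentPart
{
    public string? Text { get; set; }

    // Alias for Text, mirroring the Realtime API's `transcript` field;
    // per the design notes, both are set to the same value.
    public string? Transcript { get; set; }
}

public sealed class LiveAudioTranscriptionResponse
{
    // ConversationItem-style content list: result.Content[0].Text
    public List<TranscriptionContentPart> Content { get; set; } = new();

    public bool IsFinal { get; set; }

    // Top-level timestamps, mapping to Realtime's audio_start_ms/audio_end_ms.
    public double? StartTime { get; set; }
    public double? EndTime { get; set; }
}
```

Owning the type (rather than reusing Betalgo's `ConversationItem`) keeps the surface minimal while leaving the JSON shape free to converge on the Realtime API later.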
README excerpt under review:

> For real-time microphone-to-text transcription, use `CreateLiveTranscriptionSession()`. Audio is pushed as raw PCM chunks and transcription results stream back as an `IAsyncEnumerable`.
>
> The streaming result type (`LiveAudioTranscriptionResponse`) extends `AudioCreateTranscriptionResponse` from the Betalgo OpenAI SDK, so it's compatible with the file-based transcription output format while adding streaming-specific fields.
This line needs to be updated as it does not extend that class anymore.
```typescript
let response = await client.completeChat(messages, tools);

// Check that a tool call was generated
// Check response is valid
```
These unit tests are specifically for tool calling. Ex: Foundry-Local/sdk/js/test/openai/chatClient.test.ts, lines 185 to 224 in b247611.
These tests should not be modified. If they are failing on the PR, then there is a true failure somewhere that needs to be resolved. These tests are already passing on the main branch.
```typescript
const content = chunk.choices?.[0]?.message?.content ?? chunk.choices?.[0]?.delta?.content;
if (content) {
  fullResponse += content;
// The model may either call the tool or respond directly.
```
```rust
"arguments": tool_call.function.arguments,
    .is_some_and(|tc| !tc.is_empty());

if has_tool_calls {
    "name": tool_call_name,
    "arguments": tool_call_args
// The model may either call the tool or respond directly.
if !tool_call_name.is_empty() {
```
```csharp
internal sealed class LiveAudioTranscriptionTests
{
    // --- LiveAudioTranscriptionResponse.FromJson tests ---
```
Can we add an E2E test here that takes in pre-saved bytes, appends them to a session, transcribes the result, and checks the result object's attributes?
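Such an E2E test could be sketched roughly as follows (xUnit-style). The fixture path, chunk size, expected phrase, `audioClient` setup, and `AppendAsync`'s exact signature are all placeholders/assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Xunit;

[Fact]
public async Task Session_TranscribesPresavedPcmBytes()
{
    // Placeholder fixture: raw 16 kHz mono PCM captured ahead of time.
    byte[] pcm = await File.ReadAllBytesAsync("TestData/hello_world_16k.pcm");

    var session = audioClient.CreateLiveTranscriptionSession(); // client setup assumed
    await session.StartAsync();

    // Append in mic-sized chunks rather than one large buffer.
    for (int offset = 0; offset < pcm.Length; offset += 3200)
    {
        int len = Math.Min(3200, pcm.Length - offset);
        await session.AppendAsync(pcm.AsMemory(offset, len)); // signature assumed
    }
    await session.StopAsync();

    var results = new List<LiveAudioTranscriptionResponse>();
    await foreach (var r in session.GetTranscriptionStream())
        results.Add(r);

    // Check the result object's attributes, not just non-emptiness.
    Assert.NotEmpty(results);
    Assert.True(results[^1].IsFinal);
    string finalText = results[^1].Content?[0]?.Text ?? "";
    Assert.Contains("hello", finalText, StringComparison.OrdinalIgnoreCase);
}
```

A fixed PCM fixture keeps the test deterministic while still exercising the full native start/push/stop path end to end.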
```rust
let is_linux = rid.starts_with("linux");

let core_version = if nightly {
    resolve_latest_version("Microsoft.AI.Foundry.Local.Core", ORT_NIGHTLY_FEED)
```
Why do we need to make changes to the Rust SDK for getting a package version? I believe the only changes needed in this file are the ORT GenAI version used and the feed URLs for any packages. The remaining changes can be undone.
Description:
Adds real-time audio streaming support to the Foundry Local C# SDK, enabling live microphone-to-text transcription via ONNX Runtime GenAI's StreamingProcessor API (Nemotron ASR).
The existing `OpenAIAudioClient` only supports file-based transcription. This PR introduces `LiveAudioTranscriptionSession`, which accepts continuous PCM audio chunks (e.g., from a microphone) and returns partial/final transcription results as an async stream.

What's included
New files
- `src/OpenAI/LiveAudioTranscriptionClient.cs`: streaming session with `StartAsync()`, `AppendAsync()`, `GetTranscriptionStream()`, `StopAsync()`
- `src/OpenAI/LiveAudioTranscriptionTypes.cs`: `LiveAudioTranscriptionResponse` (extends `AudioCreateTranscriptionResponse`) and `CoreErrorResponse` types
- `test/FoundryLocal.Tests/LiveAudioTranscriptionTests.cs`: unit tests for deserialization, settings, state guards

Modified files
- `src/OpenAI/AudioClient.cs`: added `CreateLiveTranscriptionSession()` factory method
- `src/Detail/ICoreInterop.cs`: added `StreamingRequestBuffer` struct and `StartAudioStream`, `PushAudioData`, `StopAudioStream` interface methods
- `src/Detail/CoreInterop.cs`: routes audio commands through existing `execute_command` / `execute_command_with_binary` native entry points
- `src/Detail/JsonSerializationContext.cs`: registered `LiveAudioTranscriptionResponse` for AOT compatibility

API surface
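As a rough illustration of how these pieces fit together from the caller's side (the session methods are named in this PR; the `audioClient` setup, the chunk source, and exact signatures are assumptions):

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical end-to-end flow; `audioClient` and `pcmChunks` are placeholders.
var session = audioClient.CreateLiveTranscriptionSession();
await session.StartAsync();

// Producer: push PCM chunks (e.g. mic callback output) while reading results.
var pushTask = Task.Run(async () =>
{
    foreach (byte[] chunk in pcmChunks)  // placeholder chunk source
        await session.AppendAsync(chunk);
    await session.StopAsync();           // signals end of audio
});

// Consumer: partial/final results arrive as an async stream.
await foreach (var result in session.GetTranscriptionStream())
{
    Console.Write(result.Text);          // shape before the #561 redesign
    if (result.IsFinal) Console.WriteLine();
}
await pushTask;
```

Pushing and reading run concurrently, which is the point of the session design: transcription results stream back while audio is still being appended.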
Design highlights
- `LiveAudioTranscriptionResponse` extends `AudioCreateTranscriptionResponse` for a consistent output format with file-based transcription
- `Channel<T>` serializes audio pushes from any thread (safe for mic callbacks) with backpressure
- Session settings are fixed at `StartAsync()` and immutable during the session
- `StopAsync` always calls native stop even if cancelled, preventing native session leaks
- Internal `CancellationTokenSource`, decoupled from the caller's token
- `StartAudioStream` and `StopAudioStream` route through `execute_command`; `PushAudioData` routes through `execute_command_with_binary`: no new native entry points required

Core integration (neutron-server)
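The `Channel<T>` backpressure point above can be illustrated with a self-contained sketch. The capacity, chunk size, and console output are arbitrary stand-ins, not the SDK's actual internals.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Bounded channel: writers wait when the buffer is full, giving
// backpressure between a fast producer (mic callback) and a slower
// consumer (the native push loop).
var channel = Channel.CreateBounded<byte[]>(new BoundedChannelOptions(8)
{
    FullMode = BoundedChannelFullMode.Wait,
    SingleReader = true,
});

var consumer = Task.Run(async () =>
{
    await foreach (var chunk in channel.Reader.ReadAllAsync())
        Console.WriteLine($"push {chunk.Length} bytes"); // stand-in for PushAudioData
});

for (int i = 0; i < 4; i++)
    await channel.Writer.WriteAsync(new byte[3200]); // awaits if 8 chunks are pending
channel.Writer.Complete();
await consumer;
```

`WriteAsync` is safe to call from any thread, which is what makes the pattern suitable for audio device callbacks.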
The Core side (`AudioStreamingSession.cs`) uses `StreamingProcessor` + `Generator` + `Tokenizer` + `TokenizerStream` from onnxruntime-genai to perform real-time RNNT decoding. The native commands (`audio_stream_start`/`push`/`stop`) are handled as cases in `NativeInterop.ExecuteCommandManaged` / `ExecuteCommandWithBinaryManaged`.

Verified working
- `StreamingProcessor` pipeline verified with WAV file (correct transcript)
- `TranscribeChunk` byte[] PCM path matches reference float[] path exactly