feat(gateway): shared speech-engine control protocol + session helper #16286

Draft
kkdawkins wants to merge 1 commit into main from kdawkins.gateway-speech-engine
@kkdawkins

Summary

Adds a vendor-neutral wire contract in @ai-sdk/gateway for Gateway-owned realtime voice control — the model where the Gateway owns the audio loop (STT, TTS, turn-taking) and a controller ("bring your own brain") drives turns over a server-side control socket. This makes @ai-sdk/gateway the single source of truth for the protocol so the Gateway and any controller can't drift.

What's added

  • gateway-speech-engine — the shared contract:
    • GATEWAY_SPEECH_ENGINE_SUBPROTOCOL (control-socket subprotocol)
    • engine↔controller event schemas (SpeechEngineServerEvent / SpeechEngineClientEvent, with turnId on client events for post-barge-in stale-frame dropping)
    • SpeechEngineCapabilities + DEFAULT_SPEECH_ENGINE_CAPABILITIES, SpeechEngineDescriptor
    • envelope codec: encodeSpeechEngineEvent / parseSpeechEngineServerEvent / parseSpeechEngineClientEvent
  • GatewaySpeechEngineSession — a controller-side helper (the AI-SDK analogue of ElevenLabs' session): surface finalized transcripts, stream a reply back for TTS (string or LLM-chunk stream, with auto text extraction), and implicit barge-in (a new transcript aborts the prior turn's AbortSignal and cancels it on the wire by turnId).
  • Thread an optional control config through experimental_realtime.getToken (sealed into the minted client secret; server-side mint only) and re-export GatewayRealtimeControlConfig from ai.
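The implicit barge-in and stale-frame dropping described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `TurnTracker`, `isStale`, the `ClientEvent` shape, and the `reply.delta` / `reply.cancel` event names are all hypothetical stand-ins, and the real schemas and `GatewaySpeechEngineSession` behavior are defined in @ai-sdk/gateway.

```typescript
// Hypothetical event shapes for illustration only; the real schemas
// (SpeechEngineClientEvent etc.) live in @ai-sdk/gateway and may differ.
type ClientEvent =
  | { type: "reply.delta"; turnId: string; text: string }
  | { type: "reply.cancel"; turnId: string };

// Controller side: tracks the active turn and supersedes it on a new transcript.
class TurnTracker {
  private current: { turnId: string; controller: AbortController } | undefined;
  private counter = 0;
  private readonly supportsCancel: boolean;
  // Frames that would be written to the control socket.
  readonly sent: ClientEvent[] = [];

  constructor(supportsCancel: boolean) {
    this.supportsCancel = supportsCancel;
  }

  // Called when a finalized transcript arrives. Aborts the prior turn's
  // signal and, if the engine advertised cancel support, cancels it by turnId.
  beginTurn(): { turnId: string; signal: AbortSignal } {
    if (this.current) {
      this.current.controller.abort();
      if (this.supportsCancel) {
        this.sent.push({ type: "reply.cancel", turnId: this.current.turnId });
      }
    }
    this.counter += 1;
    const turnId = `turn-${this.counter}`;
    const controller = new AbortController();
    this.current = { turnId, controller };
    return { turnId, signal: controller.signal };
  }

  // Sends a reply chunk unless the turn has been superseded.
  sendDelta(turnId: string, signal: AbortSignal, text: string): boolean {
    if (signal.aborted) return false;
    this.sent.push({ type: "reply.delta", turnId, text });
    return true;
  }
}

// Engine side: a frame is stale when its turnId is not the active turn's.
function isStale(activeTurnId: string, event: ClientEvent): boolean {
  return event.turnId !== activeTurnId;
}
```

Carrying `turnId` on every client event is what lets the receiving side discard late frames from a superseded turn without relying on message ordering alone.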

Why

The protocol was being hand-mirrored across consumers. Centralizing it here lets the AI Gateway server and Eve (and any future client, including ElevenLabs-as-a-provider use cases) import one contract.

Tests

gateway-speech-engine.test.ts covers the codec (round-trips, capability handshake, malformed/turnId-required rejection) and the session helper (transcript→reply with a consistent turnId, LLM-chunk extraction, supersede/cancel-by-id, no-cancel-without-capability). 8 tests, pnpm build clean (ESM + DTS).
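The codec cases listed above (round-trip, malformed input, and a required turnId) can be illustrated with a small self-contained sketch. The `ReplyDelta` shape and the `encodeEvent` / `parseClientEvent` functions below are hypothetical; they are not the signatures of `encodeSpeechEngineEvent` or `parseSpeechEngineClientEvent`, and returning `undefined` on rejection is an assumption made for the example.

```typescript
// Hypothetical client-event shape; the real schema is defined in @ai-sdk/gateway.
interface ReplyDelta {
  type: "reply.delta";
  turnId: string;
  text: string;
}

// Serialize an event to a JSON text frame.
function encodeEvent(event: ReplyDelta): string {
  return JSON.stringify(event);
}

// Parse a frame. Returns undefined for malformed JSON, an unknown type,
// or a missing/empty turnId (assumed rejection behavior for this sketch).
function parseClientEvent(frame: string): ReplyDelta | undefined {
  let value: unknown;
  try {
    value = JSON.parse(frame);
  } catch {
    return undefined;
  }
  if (typeof value !== "object" || value === null) return undefined;
  const v = value as Record<string, unknown>;
  if (v.type !== "reply.delta") return undefined;
  if (typeof v.turnId !== "string" || v.turnId.length === 0) return undefined;
  if (typeof v.text !== "string") return undefined;
  return { type: "reply.delta", turnId: v.turnId, text: v.text };
}
```

Rejecting frames without a turnId at parse time means the stale-frame check never has to handle an event it cannot attribute to a turn.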

Consumers (separate PRs)

Draft until the consumers are validated end-to-end against this build.

…helper

Introduce a vendor-neutral wire contract for Gateway-owned realtime voice
control so the Gateway and any controller (Eve, or any "bring your own
brain" client) share one source of truth:

- gateway-speech-engine: subprotocol constant, event schemas
  (engine<->controller), capabilities + permissive defaults, engine
  descriptor, and the envelope encode/parse codec (turnId on client
  events for post-barge-in stale-frame dropping).
- GatewaySpeechEngineSession: controller-side helper that surfaces
  finalized transcripts, streams replies back for TTS (string or LLM
  chunk stream), and does implicit barge-in (new transcript aborts and
  cancels the prior turn by id).

Thread an optional `control` config through experimental_realtime.getToken
into the minted client secret, and re-export GatewayRealtimeControlConfig
from `ai`.