Summary
Add speech-to-text and text-to-speech support through OpenAI-compatible sidecar services. The agent server routes audio requests to sidecars, keeping the core text pipeline unchanged. This enables voice-driven agent interactions without modifying BaseAgent.
Requirements
- Audio content block support in
ChatCompletionRequest: {"type": "input_audio", "input_audio": {"data": "base64...", "format": "wav"}}
- STT sidecar configuration in
agent.yaml (endpoint URL for /v1/audio/transcriptions)
- TTS sidecar configuration in
agent.yaml (endpoint URL for /v1/audio/speech)
- Server-layer
MediaPreprocessorMiddleware that converts audio content blocks to text (via STT) before reaching BaseAgent
- Optional TTS post-processing that converts agent text response to audio
- Streaming support for real-time voice via WebSocket (future phase)
FIPS Considerations
No blockers. AI inference is unaffected by FIPS. Audio codecs (WAV, MP3, OPUS) are not cryptographic. TLS for sidecar communication uses system OpenSSL. Recommended STT: Granite Speech 3.3-8B on vLLM (Apache 2.0, Red Hat ecosystem) or Faster-Whisper. Recommended TTS: Kokoro-FastAPI (OpenAI-compatible, Apache 2.0).
Implementation Notes
MediaPreprocessorMiddleware is server-layer only — BaseAgent works exclusively with text. Sidecar deployment is infrastructure (Helm subchart or separate Deployment), not framework code. The middleware pattern keeps the audio concern cleanly separated from agent logic. Part of the multimodal initiative.
Companion Issues
Companion issues will be filed on fips-agents/gateway-template, fips-agents/ui-template, fips-agents/fips-agents-cli, and fips-agents/examples.
Size
M
Summary
Add speech-to-text and text-to-speech support through OpenAI-compatible sidecar services. The agent server routes audio requests to sidecars, keeping the core text pipeline unchanged. This enables voice-driven agent interactions without modifying BaseAgent.
Requirements
ChatCompletionRequest:{"type": "input_audio", "input_audio": {"data": "base64...", "format": "wav"}}agent.yaml(endpoint URL for/v1/audio/transcriptions)agent.yaml(endpoint URL for/v1/audio/speech)MediaPreprocessorMiddlewarethat converts audio content blocks to text (via STT) before reaching BaseAgentFIPS Considerations
No blockers. AI inference is unaffected by FIPS. Audio codecs (WAV, MP3, OPUS) are not cryptographic. TLS for sidecar communication uses system OpenSSL. Recommended STT: Granite Speech 3.3-8B on vLLM (Apache 2.0, Red Hat ecosystem) or Faster-Whisper. Recommended TTS: Kokoro-FastAPI (OpenAI-compatible, Apache 2.0).
Implementation Notes
MediaPreprocessorMiddlewareis server-layer only — BaseAgent works exclusively with text. Sidecar deployment is infrastructure (Helm subchart or separate Deployment), not framework code. The middleware pattern keeps the audio concern cleanly separated from agent logic. Part of the multimodal initiative.Companion Issues
Companion issues will be filed on fips-agents/gateway-template, fips-agents/ui-template, fips-agents/fips-agents-cli, and fips-agents/examples.
Size
M