feat(voice): full Yapper voice integration (STT + TTS + streaming)#110

Merged
dimakis merged 11 commits into main from feat/voice-streaming-stt on Apr 6, 2026

@dimakis dimakis commented Apr 5, 2026

Summary

  • Phase 1 — Batch STT: Audio capture with MediaRecorder format negotiation (WebM/Opus → MP4 fallback), useVoice hook with Yapper health polling, push-to-talk MicButton, wired into ChatInput/ChatView
  • Phase 2 — TTS Playback: Chunked text synthesis via /v1/synthesize, AudioContext singleton, sequential playback with AbortController cancellation, VoiceSettings component (toggle + voice picker), auto-speak on assistant message completion
  • Phase 3 — Streaming STT: WebSocket client for /v1/transcribe/stream, streaming MediaRecorder with timeslice chunks, live partial transcript overlay in ChatInput, batch fallback on WS failure
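The Phase 1 format negotiation (WebM/Opus with an MP4 fallback) could look roughly like the sketch below. The candidate list and the `pickMimeType` name are assumptions for illustration, not the PR's actual code; the support check is passed in so the logic stays testable outside a browser.

```typescript
// Candidate recording formats in order of preference.
// These exact strings are assumed for illustration.
const CANDIDATE_MIME_TYPES: readonly string[] = [
  "audio/webm;codecs=opus",
  "audio/webm",
  "audio/mp4",
];

// Hypothetical helper: return the first candidate the environment supports,
// or undefined so the caller can fall back to the browser default.
function pickMimeType(
  isSupported: (mime: string) => boolean,
  candidates: readonly string[] = CANDIDATE_MIME_TYPES,
): string | undefined {
  for (const mime of candidates) {
    if (isSupported(mime)) return mime;
  }
  return undefined;
}

// In a browser the predicate would typically be:
//   pickMimeType((mime) => MediaRecorder.isTypeSupported(mime))
```

Keeping the predicate injectable means the Safari path can be covered in unit tests by passing a predicate that only accepts `audio/mp4`.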

What's new

| Layer | Files | What |
|---|---|---|
| Audio capture | audio.ts | createRecorder() + createStreamingRecorder() with format negotiation |
| WS client | yapper-ws.ts | createYapperStreamClient(): queued sends, JSON transcript parsing |
| TTS | tts.ts | chunkText(), synthesize(), playAudio(), AudioContext singleton |
| Hook | useVoice.ts | Central voice state: health, recording (stream+batch), TTS, partials |
| Components | MicButton, VoiceSettings, ChatInput updates | Push-to-talk, TTS controls, partial transcript overlay |
| Page | ChatView.tsx | Owns useVoice(), auto-speak effect, stops speaking on send |
| Design docs | streaming-stt.md, tts-playback.md | Architecture decisions and protocol details |

Test plan

  • 206 tests pass (npm test — all 25 test files green)
  • Manual: verify Yapper health detection (start/stop Yapper, check mic button appears/disappears)
  • Manual: batch STT — hold mic, speak, release, verify transcript appears in input
  • Manual: streaming STT — hold mic, verify live partial transcript overlay updates in real time
  • Manual: TTS — enable in settings, send message, verify auto-speak on response
  • Manual: cancel mid-recording and mid-speak both clean up correctly
  • Manual: Safari — verify MP4 fallback works for MediaRecorder

🤖 Generated with Claude Code

@dimakis dimakis left a comment

PR Review: feat(voice): full Yapper voice integration (STT + TTS + streaming)

Full rollup of phases 1–3. Solid architecture and test coverage (587 tests). Three issues to address:

🐛 Dual-recorder stream sharing bug (useVoice.ts)

startRecording creates both a StreamingRecorder and a batch Recorder on the same MediaStream. Both call stopTracks(stream) on stop/cancel. Whichever stops first kills the mic for the other — so if the streaming recorder stops tracks before the batch fallback tries to use the stream, recorder.stop() produces an empty/truncated blob.

Fix: only stop tracks once, in a single cleanup path, rather than letting both recorders independently kill the stream.
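A minimal sketch of the single cleanup path, assuming the hook holds one owner object per stream. `createStreamOwner` is a hypothetical name, and the two interfaces only mirror the `getTracks()` and `stop()` members of the browser's MediaStream and MediaStreamTrack so the sketch runs without a DOM.

```typescript
// Structural stand-ins for MediaStreamTrack and MediaStream.
interface StoppableTrack {
  stop(): void;
}
interface TrackSource {
  getTracks(): StoppableTrack[];
}

// Hypothetical helper: the hook owns the stream and releases it exactly once.
// Recorders sharing the stream never stop tracks themselves.
function createStreamOwner(stream: TrackSource) {
  let released = false;
  return {
    // Returns true only on the call that actually stopped the tracks.
    release(): boolean {
      if (released) return false;
      released = true;
      for (const track of stream.getTracks()) track.stop();
      return true;
    },
    isReleased(): boolean {
      return released;
    },
  };
}
```

With this shape, both the streaming and the batch stop paths can call `release()` safely, and the batch recorder still sees live tracks until the hook decides the recording is finished.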

⚠️ Auto-speak fires during streaming, not on completion (ChatView.tsx)

The auto-speak useEffect triggers on msgState.messages changes but doesn't check msgState.running. This means TTS can fire before the assistant message is fully streamed — the messageId ref prevents duplicate speaks, but the message content may be incomplete when it first triggers.

Note: PR #108 already fixes this by extracting useAutoSpeak with a running guard, plus stripCodeForTts and truncateForTts. Merging #108 first and rebasing #110 resolves this and the stale tts.ts (which is missing those helpers and the playAudio idempotency fix).
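For reference, the running guard can be expressed as a pure decision function that the effect calls. This is a sketch only: the `ChatMessage` shape and the `nextMessageToSpeak` name are assumptions and are not taken from #108.

```typescript
// Assumed message shape for illustration.
interface ChatMessage {
  id: string;
  role: "user" | "assistant";
  text: string;
}

// Hypothetical helper: decide whether the latest message should be spoken.
// Returns null while the assistant is still streaming, when the last message
// is not from the assistant, or when it has already been spoken.
function nextMessageToSpeak(
  messages: readonly ChatMessage[],
  running: boolean,
  lastSpokenId: string | null,
): ChatMessage | null {
  if (running) return null; // content may still be incomplete
  const last = messages[messages.length - 1];
  if (!last || last.role !== "assistant") return null;
  if (last.id === lastSpokenId) return null; // avoid duplicate speaks
  if (last.text.trim() === "") return null;
  return last;
}
```

The effect would store the returned message's id in its ref before calling speak(), so a later re-render with the same message is ignored.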

💡 Use last partial as fallback on WS timeout (useVoice.ts)

When the 5s WS timeout fires without a final transcript, stopRecording silently returns an empty string, so the user loses their dictation with no indication. Consider returning partialTranscript as a best-effort result instead of ''.
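One way to express the suggested fallback, as a sketch. The `StopOutcome` type and `transcriptForOutcome` name are hypothetical; the real stopRecording may be structured differently.

```typescript
// Assumed result of waiting for the server after the stop signal is sent.
type StopOutcome =
  | { kind: "final"; text: string } // server sent a final transcript
  | { kind: "timeout" };            // timeout fired first

// Hypothetical helper: prefer the final transcript, otherwise fall back to
// the last partial rather than returning an empty string.
function transcriptForOutcome(outcome: StopOutcome, partialTranscript: string): string {
  if (outcome.kind === "final") return outcome.text;
  return partialTranscript.trim();
}
```

If the partial is also empty the result is still '', so callers can keep treating an empty string as "nothing was recognized".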

Minor notes

  • mimeToFormat() is a plain function recreated every render — could be module-level
  • tts-playback.md still references messages.length tracking but implementation uses messageId
  • WS send queue is unbounded (not a real concern at audio chunk sizes, but worth a comment)
  • No test for the dual-recorder stream conflict scenario

Recommended merge order

  1. Merge #108 into feat/voice-stt-batch
  2. Rebase #110 onto updated main (picks up #108 fixes)
  3. Merge #110

🤖 Generated with Claude Code

dimakis and others added 11 commits April 6, 2026 14:05
Text chunking at sentence boundaries with fragment merging, synthesis
fetch with AbortSignal support, singleton AudioContext management,
and WAV playback via AudioBufferSourceNode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
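As an illustration of sentence-boundary chunking with fragment merging, here is a minimal sketch. It is not the tts.ts implementation: the boundary rule (terminal punctuation followed by whitespace) and the default of 40 characters are assumptions.

```typescript
// Split on runs of . ! ? followed by whitespace, so "3.14" stays intact.
function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  const boundary = /[.!?]+\s+/g;
  let start = 0;
  let match: RegExpExecArray | null;
  while ((match = boundary.exec(text)) !== null) {
    const end = match.index + match[0].length;
    sentences.push(text.slice(start, end).trim());
    start = end;
  }
  const rest = text.slice(start).trim();
  if (rest) sentences.push(rest);
  return sentences;
}

// Merge short sentences forward until a chunk reaches minChars; a short
// trailing fragment is appended to the previous chunk instead of being
// synthesized on its own.
function chunkText(text: string, minChars: number = 40): string[] {
  const chunks: string[] = [];
  let buffer = "";
  for (const sentence of splitSentences(text)) {
    buffer = buffer ? `${buffer} ${sentence}` : sentence;
    if (buffer.length >= minChars) {
      chunks.push(buffer);
      buffer = "";
    }
  }
  if (buffer) {
    const last = chunks.pop();
    chunks.push(last === undefined ? buffer : `${last} ${buffer}`);
  }
  return chunks;
}
```

Merging fragments avoids a synthesis round trip for very short pieces such as "Ok." while still letting playback start after the first full chunk.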
Adds ttsAvailable (from health poll), ttsEnabled/selectedVoice with
localStorage persistence, lazy voice list fetch, speak() with
sequential chunk synthesis and AbortController cancellation,
stopSpeaking(), and AudioContext lifecycle cleanup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
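The sequential synthesis loop with cancellation might be sketched as follows. The `speakSequentially` name and its callback signatures are assumptions; `AbortLike` is a structural stand-in that a real AbortSignal satisfies, which keeps the sketch free of DOM types.

```typescript
// Anything with an `aborted` flag works here, including AbortSignal.
interface AbortLike {
  readonly aborted: boolean;
}

// Hypothetical helper: synthesize and play chunks one at a time, checking
// for cancellation before and after each network step. Returns how many
// chunks were fully played.
async function speakSequentially<Audio>(
  chunks: readonly string[],
  synthesize: (chunk: string) => Promise<Audio>,
  play: (audio: Audio) => Promise<void>,
  signal: AbortLike,
): Promise<number> {
  let played = 0;
  for (const chunk of chunks) {
    if (signal.aborted) break;
    const audio = await synthesize(chunk);
    if (signal.aborted) break; // cancelled while the request was in flight
    await play(audio);
    played++;
  }
  return played;
}
```

In the hook, stopSpeaking() would call abort() on the controller whose signal was passed in; the loop then exits at the next check without starting another chunk.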
Speaker icon toggle for TTS on/off, voice selector dropdown grouped
by language. Hidden when Yapper TTS is unavailable, voice picker
shown only when TTS is enabled.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Auto-speak assistant text on message completion (tracked by messageId
ref). Stop playback on user send. Render VoiceSettings in chat header.
Update ChatInput voice mock with TTS fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Speaker toggle with active/speaking states, voice picker dropdown,
pulse animation during playback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Covers: WebSocket client for Yapper /v1/transcribe/stream, streaming
MediaRecorder with timeslice, live partial transcript preview in
ChatInput, batch fallback, and protocol gotchas.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
createStreamingRecorder uses MediaRecorder.start(timeslice) to emit
audio chunks during recording via onChunk callback. Supports cancel,
auto-stop timer, and onStop notification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WebSocket wrapper for /v1/transcribe/stream with format negotiation,
binary audio send, END signal, and partial/final transcript callbacks.
Queues messages until connection is open.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
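The "queue until open" behaviour can be sketched independently of WebSocket itself. `createQueuedSender` and `SocketLike` are hypothetical names; in the real client, markOpen() would presumably be driven by the socket's open event.

```typescript
type Payload = string | ArrayBuffer;

// Minimal stand-in for the part of WebSocket used here.
interface SocketLike {
  send(data: Payload): void;
}

// Hypothetical helper: buffer sends until the connection is open, then flush
// in order. The queue is unbounded, matching the note in the review above.
function createQueuedSender(socket: SocketLike) {
  const pending: Payload[] = [];
  let open = false;
  return {
    send(data: Payload): void {
      if (open) socket.send(data);
      else pending.push(data);
    },
    markOpen(): void {
      open = true;
      // splice(0) empties the queue and returns the items in arrival order.
      for (const data of pending.splice(0)) socket.send(data);
    },
    pendingCount(): number {
      return pending.length;
    },
  };
}
```

Flushing in arrival order matters here because the format message has to reach the server before the first audio chunk.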
Extend useVoice with streaming transcription via WebSocket client and
streaming recorder. Adds partialTranscript state, sends audio chunks
over WS for live partials, and falls back to batch transcription on
WS error. Updates existing tests for streaming-first flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Show streaming transcription preview above the input row during
recording. Overlay appears only when recording with a non-empty
partial transcript, and disappears on stop or cancel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…back

- Add ownsStream option to createRecorder and createStreamingRecorder
  so multiple recorders can share a MediaStream without racing to kill
  tracks. The hook now owns stream cleanup via releaseStream().
- Return partialTranscript as best-effort fallback when WS timeout
  fires without a final transcript, instead of silently losing input.
- Move mimeToFormat() to module level (pure function, no hook state).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dimakis dimakis force-pushed the feat/voice-streaming-stt branch from 63d7fad to afe2c57 on April 6, 2026 13:08
@dimakis dimakis merged commit 88d8723 into main Apr 6, 2026
1 check passed
dimakis added a commit that referenced this pull request Apr 6, 2026
#110's squash merge overwrote #108's review fixes in ChatView and
tts.ts. This restores:

- stripCodeForTts() and truncateForTts() so TTS doesn't read code
- TTS_MAX_SPEAK_CHARS (2000) length guard
- TTS_CHUNK_MIN_CHARS moved to constants
- useAutoSpeak hook extraction with running guard
- playAudio re-entrancy guard (started flag)
- Tests for all of the above

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
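For readers without the diff at hand, the two restored helpers might behave roughly like this. Only the names and the 2000-character limit come from the commit message; the bodies are an assumed sketch, not the code in tts.ts.

```typescript
const TTS_MAX_SPEAK_CHARS = 2000; // value stated in the commit message

// Assumed behaviour: drop fenced blocks and inline code spans, then collapse
// whitespace. The fence pattern is built from a string so this sample does
// not contain a literal fence.
const FENCED_BLOCK = new RegExp("`{3}[\\s\\S]*?`{3}", "g");
const INLINE_CODE = /`[^`\n]*`/g;

function stripCodeForTts(text: string): string {
  return text
    .replace(FENCED_BLOCK, " ")
    .replace(INLINE_CODE, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Assumed behaviour: cut at the last space before the limit so playback does
// not end in the middle of a word.
function truncateForTts(text: string, maxChars: number = TTS_MAX_SPEAK_CHARS): string {
  if (text.length <= maxChars) return text;
  const cut = text.slice(0, maxChars);
  const lastSpace = cut.lastIndexOf(" ");
  const kept = lastSpace > 0 ? cut.slice(0, lastSpace) : cut;
  return kept.replace(/\s+$/, "");
}
```

Stripping before truncating means the length guard applies to the text that will actually be spoken.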