feat(voice): full Yapper voice integration (STT + TTS + streaming) by dimakis · Pull Request #110 · dimakis/mitzo

dimakis · 2026-04-05T20:09:47Z

Summary

Phase 1 — Batch STT: Audio capture with MediaRecorder format negotiation (WebM/Opus → MP4 fallback), useVoice hook with Yapper health polling, push-to-talk MicButton, wired into ChatInput/ChatView
Phase 2 — TTS Playback: Chunked text synthesis via /v1/synthesize, AudioContext singleton, sequential playback with AbortController cancellation, VoiceSettings component (toggle + voice picker), auto-speak on assistant message completion
Phase 3 — Streaming STT: WebSocket client for /v1/transcribe/stream, streaming MediaRecorder with timeslice chunks, live partial transcript overlay in ChatInput, batch fallback on WS failure
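
The WebM/Opus → MP4 fallback in Phase 1 can be pictured as a preference-ordered probe. This is a minimal sketch, not the code in `audio.ts`: the name `pickMimeType`, the candidate list, and the injected `isSupported` predicate are all assumptions made for illustration. In a browser the predicate would typically be `(t) => MediaRecorder.isTypeSupported(t)`.

```typescript
// Candidate MIME types in preference order: WebM/Opus first, MP4 as the
// fallback for browsers such as Safari. The exact list is an assumption.
const CANDIDATE_MIME_TYPES: readonly string[] = [
  "audio/webm;codecs=opus",
  "audio/webm",
  "audio/mp4",
];

// Hypothetical helper. `isSupported` is injected so the function can be
// exercised outside a browser; in a browser, pass
// (t) => MediaRecorder.isTypeSupported(t).
function pickMimeType(
  isSupported: (mimeType: string) => boolean,
): string | undefined {
  // Returns the first supported candidate, or undefined if none match.
  return CANDIDATE_MIME_TYPES.find((t) => isSupported(t));
}
```

Returning `undefined` instead of throwing leaves the caller free to construct a `MediaRecorder` with the browser default, or to hide the mic button entirely.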

What's new

| Layer | Files | What |
| --- | --- | --- |
| Audio capture | `audio.ts` | `createRecorder()` + `createStreamingRecorder()` with format negotiation |
| WS client | `yapper-ws.ts` | `createYapperStreamClient()`: queued sends, JSON transcript parsing |
| TTS | `tts.ts` | `chunkText()`, `synthesize()`, `playAudio()`, AudioContext singleton |
| Hook | `useVoice.ts` | Central voice state: health, recording (stream+batch), TTS, partials |
| Components | `MicButton`, `VoiceSettings`, `ChatInput` updates | Push-to-talk, TTS controls, partial transcript overlay |
| Page | `ChatView.tsx` | Owns `useVoice()`, auto-speak effect, stops speaking on send |
| Design docs | `streaming-stt.md`, `tts-playback.md` | Architecture decisions and protocol details |

Test plan

206 tests pass (npm test — all 25 test files green)
Manual: verify Yapper health detection (start/stop Yapper, check mic button appears/disappears)
Manual: batch STT — hold mic, speak, release, verify transcript appears in input
Manual: streaming STT — hold mic, verify live partial transcript overlay updates in real time
Manual: TTS — enable in settings, send message, verify auto-speak on response
Manual: cancel mid-recording and mid-speak both clean up correctly
Manual: Safari — verify MP4 fallback works for MediaRecorder

🤖 Generated with Claude Code

dimakis

PR Review: feat(voice): full Yapper voice integration (STT + TTS + streaming)

Full rollup of phases 1–3. Solid architecture and test coverage (587 tests). Three issues to address:

🐛 Dual-recorder stream sharing bug (useVoice.ts)

startRecording creates both a StreamingRecorder and a batch Recorder on the same MediaStream. Both call stopTracks(stream) on stop/cancel. Whichever stops first kills the mic for the other — so if the streaming recorder stops tracks before the batch fallback tries to use the stream, recorder.stop() produces an empty/truncated blob.

Fix: only stop tracks once, in a single cleanup path, rather than letting both recorders independently kill the stream.
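
A sketch of what a single cleanup path could look like. The structural types `TrackLike` and `StreamLike` stand in for `MediaStreamTrack` and `MediaStream` so the example runs without browser globals, and `createStreamOwner` and `createRecorderStub` are hypothetical names, not functions from this PR.

```typescript
// Minimal structural stand-ins for MediaStreamTrack / MediaStream.
interface TrackLike {
  stop(): void;
}
interface StreamLike {
  getTracks(): TrackLike[];
}

// Hypothetical helper: the caller (e.g. the hook) owns the stream and gets
// back an idempotent release function, so tracks are stopped exactly once.
function createStreamOwner(stream: StreamLike): () => void {
  let released = false;
  return () => {
    if (released) return; // second and later calls are no-ops
    released = true;
    stream.getTracks().forEach((track) => track.stop());
  };
}

// Hypothetical recorder stub: it only stops tracks when it owns the stream.
function createRecorderStub(
  stream: StreamLike,
  opts: { ownsStream: boolean },
): { stop(): void } {
  return {
    stop(): void {
      if (opts.ownsStream) {
        stream.getTracks().forEach((track) => track.stop());
      }
    },
  };
}
```

With both recorders created as non-owners, either can stop first without affecting the other, and the owner releases the mic once both are done.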

⚠️ Auto-speak fires during streaming, not on completion (ChatView.tsx)

The auto-speak useEffect triggers on msgState.messages changes but doesn't check msgState.running. This means TTS can fire before the assistant message is fully streamed — the messageId ref prevents duplicate speaks, but the message content may be incomplete when it first triggers.

Note: PR #108 already fixes this by extracting useAutoSpeak with a running guard, plus stripCodeForTts and truncateForTts. Merging #108 first and rebasing #110 resolves this and the stale tts.ts (which is missing those helpers and the playAudio idempotency fix).
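
The running guard can be expressed as a pure decision function that an effect calls on every state change. This is an illustrative sketch only: `ChatMessage` and `nextMessageToSpeak` are assumed names and shapes, and the real `useAutoSpeak` in #108 may be structured differently.

```typescript
// Assumed message shape for the sketch.
interface ChatMessage {
  id: string;
  role: "user" | "assistant";
  text: string;
}

// Hypothetical helper: returns the message that should be spoken now,
// or undefined if nothing should be spoken.
function nextMessageToSpeak(
  messages: readonly ChatMessage[],
  running: boolean,
  lastSpokenId: string | undefined,
): ChatMessage | undefined {
  // While the assistant is still streaming, the text may be incomplete.
  if (running) return undefined;
  const last = messages[messages.length - 1];
  if (!last || last.role !== "assistant") return undefined;
  // The id check prevents speaking the same message twice.
  if (last.id === lastSpokenId) return undefined;
  return last;
}
```

Keeping the decision pure makes the "incomplete content" case easy to unit test without rendering a component.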

💡 Use last partial as fallback on WS timeout (useVoice.ts)

When the 5s WS timeout fires without a final transcript, stopRecording silently returns an empty string, so the user loses their dictation with no indication. Consider returning partialTranscript as a best-effort result instead of ''.
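
One way to express the suggestion is a race between the final transcript and a timer that resolves with the latest partial. The helper name `finalOrPartial` and its parameters are assumptions for this sketch; falling back to the partial on rejection is also an assumption, since the hook may prefer its batch fallback in that case.

```typescript
// Hypothetical helper: resolves with the final transcript if it arrives in
// time, otherwise with whatever partial is available when the timer fires.
function finalOrPartial(
  finalTranscript: Promise<string>,
  getPartial: () => string,
  timeoutMs: number,
): Promise<string> {
  return new Promise<string>((resolve) => {
    // getPartial is read at timeout, so it reflects the newest partial.
    const timer = setTimeout(() => resolve(getPartial()), timeoutMs);
    finalTranscript.then(
      (text) => {
        clearTimeout(timer);
        resolve(text);
      },
      () => {
        // Assumption for the sketch: on error, keep the partial as well.
        clearTimeout(timer);
        resolve(getPartial());
      },
    );
  });
}
```

Passing a getter instead of a string avoids capturing a stale partial at the moment recording stops.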

Minor notes

mimeToFormat() is a plain function recreated every render — could be module-level
tts-playback.md still references messages.length tracking but implementation uses messageId
WS send queue is unbounded (not a real concern at audio chunk sizes, but worth a comment)
No test for the dual-recorder stream conflict scenario

Recommended merge order

Merge #108 into feat/voice-stt-batch
Rebase #110 onto updated main (picks up #108 fixes)
Merge #110

🤖 Generated with Claude Code

Text chunking at sentence boundaries with fragment merging, synthesis fetch with AbortSignal support, singleton AudioContext management, and WAV playback via AudioBufferSourceNode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
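
Sentence-boundary chunking with fragment merging might look roughly like the sketch below. It is not the `chunkText()` from `tts.ts`: the name `chunkBySentence`, the default `minChars` of 40, and the punctuation rule are all assumptions, and the simple regex will split on abbreviations and decimals.

```typescript
// Hypothetical sketch: split on ., ! or ? and merge fragments shorter than
// `minChars` into their neighbours so the synthesizer gets fewer tiny chunks.
function chunkBySentence(text: string, minChars = 40): string[] {
  // Each match is a run of non-terminator characters plus any terminators.
  const matches: string[] = text.match(/[^.!?]+[.!?]*/g) ?? [];
  const sentences = matches.map((s) => s.trim()).filter((s) => s.length > 0);

  const chunks: string[] = [];
  let pending = "";
  for (const sentence of sentences) {
    // Accumulate sentences until the pending chunk is long enough.
    pending = pending ? `${pending} ${sentence}` : sentence;
    if (pending.length >= minChars) {
      chunks.push(pending);
      pending = "";
    }
  }
  if (pending) {
    // A short trailing fragment is merged into the previous chunk if any.
    const last = chunks.pop();
    chunks.push(last === undefined ? pending : `${last} ${pending}`);
  }
  return chunks;
}
```

Merging short fragments trades a little latency on the first chunk for fewer round trips to the synthesis endpoint.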

Adds ttsAvailable (from health poll), ttsEnabled/selectedVoice with localStorage persistence, lazy voice list fetch, speak() with sequential chunk synthesis and AbortController cancellation, stopSpeaking(), and AudioContext lifecycle cleanup. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
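
The sequential synthesis loop with cancellation can be sketched as follows. `speakSequentially`, `Synthesize`, and `Play` are hypothetical names; the real `speak()` lives in the hook and also manages state that is omitted here. The synthesizer and player are injected so the loop can be tested without a network or an `AudioContext`.

```typescript
// Assumed function shapes for the sketch.
type Synthesize = (chunk: string, signal: AbortSignal) => Promise<ArrayBuffer>;
type Play = (audio: ArrayBuffer, signal: AbortSignal) => Promise<void>;

// Hypothetical helper: synthesizes and plays chunks one at a time and stops
// as soon as the signal is aborted. Returns how many chunks were played.
async function speakSequentially(
  chunks: readonly string[],
  synthesize: Synthesize,
  play: Play,
  signal: AbortSignal,
): Promise<number> {
  let played = 0;
  for (const chunk of chunks) {
    if (signal.aborted) break; // cancelled before this chunk started
    const audio = await synthesize(chunk, signal);
    if (signal.aborted) break; // cancelled while synthesizing
    await play(audio, signal);
    played++;
  }
  return played;
}
```

A caller would create one `AbortController` per `speak()` invocation and call `controller.abort()` from `stopSpeaking()`; passing the same signal to `fetch` lets an in-flight synthesis request be cancelled too.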

Speaker icon toggle for TTS on/off, voice selector dropdown grouped by language. Hidden when Yapper TTS is unavailable, voice picker shown only when TTS is enabled. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Auto-speak assistant text on message completion (tracked by messageId ref). Stop playback on user send. Render VoiceSettings in chat header. Update ChatInput voice mock with TTS fields. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Speaker toggle with active/speaking states, voice picker dropdown, pulse animation during playback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Covers: WebSocket client for Yapper /v1/transcribe/stream, streaming MediaRecorder with timeslice, live partial transcript preview in ChatInput, batch fallback, and protocol gotchas. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

createStreamingRecorder uses MediaRecorder.start(timeslice) to emit audio chunks during recording via onChunk callback. Supports cancel, auto-stop timer, and onStop notification. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

WebSocket wrapper for /v1/transcribe/stream with format negotiation, binary audio send, END signal, and partial/final transcript callbacks. Queues messages until connection is open. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
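
"Queues messages until connection is open" could be implemented along these lines. The sketch uses a structural `SocketLike` type in place of a real `WebSocket` and a hypothetical `createQueuedSender` helper; it does not reproduce the Yapper protocol messages, and the actual `yapper-ws.ts` may differ.

```typescript
// Structural stand-in for the parts of WebSocket the sketch needs.
interface SocketLike {
  readyState: number;
  send(data: string | ArrayBuffer): void;
}

const SOCKET_OPEN = 1; // same value as WebSocket.OPEN

// Hypothetical helper: sends immediately when the socket is open, otherwise
// buffers in FIFO order until flush() is called (e.g. from the open handler).
function createQueuedSender(socket: SocketLike) {
  const queue: Array<string | ArrayBuffer> = [];
  return {
    send(data: string | ArrayBuffer): void {
      if (socket.readyState === SOCKET_OPEN) socket.send(data);
      else queue.push(data);
    },
    flush(): void {
      while (queue.length > 0 && socket.readyState === SOCKET_OPEN) {
        const next = queue.shift();
        if (next !== undefined) socket.send(next);
      }
    },
    pending(): number {
      return queue.length;
    },
  };
}
```

The queue here is unbounded, which matches the review note above; a cap or a dropped-chunk counter would be a small addition if that ever became a concern.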

Extend useVoice with streaming transcription via WebSocket client and streaming recorder. Adds partialTranscript state, sends audio chunks over WS for live partials, and falls back to batch transcription on WS error. Updates existing tests for streaming-first flow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Show streaming transcription preview above the input row during recording. Overlay appears only when recording with a non-empty partial transcript, and disappears on stop or cancel. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…back

- Add ownsStream option to createRecorder and createStreamingRecorder so multiple recorders can share a MediaStream without racing to kill tracks. The hook now owns stream cleanup via releaseStream().
- Return partialTranscript as best-effort fallback when WS timeout fires without a final transcript, instead of silently losing input.
- Move mimeToFormat() to module level (pure function, no hook state).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

#110's squash merge overwrote #108's review fixes in ChatView and tts.ts. This restores:

- stripCodeForTts() and truncateForTts() so TTS doesn't read code
- TTS_MAX_SPEAK_CHARS (2000) length guard
- TTS_CHUNK_MIN_CHARS moved to constants
- useAutoSpeak hook extraction with running guard
- playAudio re-entrancy guard (started flag)
- Tests for all of the above

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
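
For readers unfamiliar with the helpers being restored, here is one plausible shape for code stripping and length limiting. The function names `stripCode` and `truncate` and their bodies are guesses for illustration, not the repository's `stripCodeForTts()` or `truncateForTts()`; only the 2000-character limit comes from the commit message.

```typescript
// Mirrors the TTS_MAX_SPEAK_CHARS value quoted in the commit message.
const MAX_SPEAK_CHARS = 2000;

// Hypothetical sketch: drop fenced and inline code so TTS does not read it.
function stripCode(markdown: string): string {
  return markdown
    .replace(/`{3}[\s\S]*?`{3}/g, " ") // fenced code blocks
    .replace(/`[^`\n]*`/g, " ") // inline code spans
    .replace(/\s+/g, " ") // collapse leftover whitespace
    .trim();
}

// Hypothetical sketch: cap the text length, cutting at the last space so a
// word is not split in the middle when that is possible.
function truncate(text: string, maxChars: number = MAX_SPEAK_CHARS): string {
  if (text.length <= maxChars) return text;
  const cut = text.slice(0, maxChars);
  const lastSpace = cut.lastIndexOf(" ");
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut).trimEnd();
}
```

Stripping before truncating means the character budget is spent on prose and not on code that would be removed anyway.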


dimakis commented Apr 6, 2026


dimakis and others added 11 commits April 6, 2026 14:05

style(voice): add VoiceSettings css with toggle and picker styling

fe8e1c0

Speaker toggle with active/speaking states, voice picker dropdown, pulse animation during playback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dimakis force-pushed the feat/voice-streaming-stt branch from 63d7fad to afe2c57 on April 6, 2026 13:08

dimakis merged commit 88d8723 into main Apr 6, 2026
1 check passed

dimakis mentioned this pull request Apr 6, 2026

fix(voice): restore TTS review fixes lost in #110 squash merge #114

Merged

3 tasks

dimakis merged 11 commits into main from feat/voice-streaming-stt
