docs(design): voice integration with Yapper by dimakis · Pull Request #104 · dimakis/mitzo

dimakis · 2026-04-05T18:32:15Z

Summary

Design doc for Mitzo ↔ Yapper voice integration
Client-direct architecture: frontend talks to Yapper for STT/TTS, server stays text-only
Push-to-talk input, optional TTS output, graceful degradation when Yapper is down
No v2 protocol changes — voice is a preprocessing/postprocessing layer

Implementation phases

Batch STT — mic button, record, POST to Yapper, insert transcript
TTS playback — toggle to hear responses via Yapper synthesize
Streaming STT — live partial transcripts via WebSocket

Test plan

Review design doc for completeness and feasibility
Confirm architecture aligns with existing Mitzo patterns

🤖 Generated with Claude Code

dimakis

Design review — looks solid. A few gaps to address before or during implementation:

Browser permissions

navigator.mediaDevices.getUserMedia requires explicit user permission. Need a permission flow: prompt on first mic tap, handle denial gracefully (show "mic blocked" state, not just hidden button).
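
A minimal sketch of what that permission flow could look like, assuming the mic button tracks a small set of UI states. The `MicState` names, `classifyMicError`, `requestMic`, and the `MediaDevicesLike` interface are hypothetical and not taken from the design doc; only the `getUserMedia` call and the `DOMException` names it rejects with are standard browser behavior.

```typescript
// Hypothetical UI states for the mic button; the names are illustrative only.
type MicState = "ready" | "blocked" | "no-device" | "unsupported" | "error";

// Map a getUserMedia rejection to a UI state using the DOMException name.
function classifyMicError(err: unknown): MicState {
  const name =
    typeof err === "object" && err !== null && "name" in err
      ? String((err as { name: unknown }).name)
      : "";
  // NotAllowedError: the user (or a browser policy) denied the prompt.
  if (name === "NotAllowedError" || name === "SecurityError") return "blocked";
  // NotFoundError: no audio input device is attached.
  if (name === "NotFoundError") return "no-device";
  return "error";
}

// Minimal structural type so the flow does not depend on browser globals.
interface MediaDevicesLike<S> {
  getUserMedia(constraints: { audio: boolean }): Promise<S>;
}

// Intended to run on the first mic tap, so the browser prompt has context.
async function requestMic<S>(
  devices: MediaDevicesLike<S> | undefined,
): Promise<{ state: MicState; stream?: S }> {
  // navigator.mediaDevices is undefined when the page is not a secure context.
  if (!devices) return { state: "unsupported" };
  try {
    const stream = await devices.getUserMedia({ audio: true });
    return { state: "ready", stream };
  } catch (err) {
    return { state: classifyMicError(err) };
  }
}
```

In the browser this would be called as `requestMic(navigator.mediaDevices)`; a `"blocked"` result keeps the button visible in a disabled "mic blocked" state rather than hiding it.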

Safari MediaRecorder compatibility

Safari's MediaRecorder support for video/webm / audio/webm;codecs=opus is inconsistent. iOS Safari may require audio/mp4 fallback. The audio.ts module should negotiate mimeType at runtime (MediaRecorder.isTypeSupported()), and Yapper's format negotiation frame needs to handle whatever format the browser actually produces.
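
Runtime negotiation could be as small as the sketch below. The candidate list and its ordering are assumptions, and `pickMimeType` is a hypothetical helper name; the support check is injected so the function can be tested without a browser.

```typescript
// Candidate container/codec pairs in preference order (an assumed ordering).
const CANDIDATE_MIME_TYPES: readonly string[] = [
  "audio/webm;codecs=opus",
  "audio/webm",
  "audio/mp4",
];

// Return the first candidate the browser reports as supported, or undefined
// when none match, in which case the caller lets MediaRecorder pick its default.
function pickMimeType(
  isTypeSupported: (mimeType: string) => boolean,
  candidates: readonly string[] = CANDIDATE_MIME_TYPES,
): string | undefined {
  for (const candidate of candidates) {
    if (isTypeSupported(candidate)) return candidate;
  }
  return undefined;
}
```

In `audio.ts` this would be wired up as `pickMimeType((t) => MediaRecorder.isTypeSupported(t))`, with the result passed to `new MediaRecorder(stream, { mimeType })` and echoed to Yapper so it knows which format it is receiving.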

Yapper model readiness (resolved)

Open question #5 is addressed by dimakis/yapper#7 — /health now returns {"status": "ready"|"loading", "models": {"stt": bool, "tts": bool}} with 503 while loading. Mitzo can use this to show "loading models..." instead of hiding the mic.
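
A sketch of how the frontend might turn that response into a mic-button state. The payload shape follows the `/health` description above; the `VoiceAvailability` states and both function names are hypothetical, and treating "ready but STT model missing" as down is an assumption.

```typescript
// Shape of the /health payload as described in the review comment.
interface YapperHealth {
  status: "ready" | "loading";
  models: { stt: boolean; tts: boolean };
}

// Hypothetical frontend states derived from the health check.
type VoiceAvailability = "ready" | "loading" | "down";

// Runtime guard: the body comes off the network, so validate before trusting it.
function isYapperHealth(value: unknown): value is YapperHealth {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { status?: unknown; models?: unknown };
  if (v.status !== "ready" && v.status !== "loading") return false;
  if (typeof v.models !== "object" || v.models === null) return false;
  const m = v.models as { stt?: unknown; tts?: unknown };
  return typeof m.stt === "boolean" && typeof m.tts === "boolean";
}

// Decide what the mic button should show for speech-to-text.
function sttAvailability(httpStatus: number, body: unknown): VoiceAvailability {
  if (!isYapperHealth(body)) return "down"; // unreachable or unexpected payload
  if (body.status === "loading") return "loading"; // served with a 503 while loading
  return httpStatus === 200 && body.models.stt ? "ready" : "down";
}
```

A `"loading"` result maps to the "loading models..." state, and `"down"` to the degraded text-only mode.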

CORS dependency

Client-direct architecture requires Yapper to have permissive CORS (dimakis/yapper#5 adds allow_origins=["*"]). Worth noting as a hard dependency in the doc.

Mixed content (future)

If Mitzo is ever served over HTTPS, plain-HTTP calls to Yapper would be blocked as mixed content (getUserMedia itself also requires a secure context). Not a blocker now (LAN-only), but worth a "Future Considerations" note.
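
For that note, a simplified check the frontend could run at startup to warn about this case. `wouldBeMixedContent` is a hypothetical helper; it compares URL schemes only and ignores the loopback exemptions that some browsers apply.

```typescript
// True when a page served over HTTPS would try to reach Yapper over an
// insecure scheme (http: for REST calls, ws: for streaming STT).
// Simplification: scheme comparison only, no loopback special-casing.
function wouldBeMixedContent(pageUrl: string, yapperUrl: string): boolean {
  const page = new URL(pageUrl);
  const target = new URL(yapperUrl);
  const insecureTarget =
    target.protocol === "http:" || target.protocol === "ws:";
  return page.protocol === "https:" && insecureTarget;
}
```

In the browser the first argument would be `window.location.href`; the second is whatever base URL Mitzo is configured to use for Yapper.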

Minor gaps

MAX_RECORDING_DURATION_MS mentioned as a constant but no value or auto-stop behavior defined.
Error UX for Yapper 500s or empty/too-short recordings not specified.
Doc says useLongPress "already exists" — verify before assuming reuse.
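
One possible way to pin down the first two gaps, as a sketch only. Both duration values are placeholders (the doc defines neither), and `armAutoStop`, `checkRecording`, and the `Scheduler` type are hypothetical names. The scheduler is injected so the auto-stop can be tested without real timers.

```typescript
// Placeholder limits; the design doc does not specify either value.
const MAX_RECORDING_DURATION_MS = 60_000;
const MIN_RECORDING_DURATION_MS = 300;

type Cancel = () => void;
type Scheduler = (fn: () => void, delayMs: number) => Cancel;

// Default scheduler backed by setTimeout.
const timeoutScheduler: Scheduler = (fn, delayMs) => {
  const id = setTimeout(fn, delayMs);
  return () => clearTimeout(id);
};

// Arm a timer that stops the recording at the cap. Call the returned
// function when the user releases the button first.
function armAutoStop(
  stop: () => void,
  maxMs: number = MAX_RECORDING_DURATION_MS,
  schedule: Scheduler = timeoutScheduler,
): Cancel {
  return schedule(stop, maxMs);
}

type RecordingCheck = "ok" | "empty" | "too-short";

// Validate a finished recording before uploading it to Yapper.
function checkRecording(durationMs: number, sizeBytes: number): RecordingCheck {
  if (sizeBytes === 0) return "empty";
  if (durationMs < MIN_RECORDING_DURATION_MS) return "too-short";
  return "ok";
}
```

With a real recorder, `stop` would be `() => recorder.stop()`. Anything other than `"ok"` from `checkRecording` can surface a short inline hint instead of a failed request.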

Client-direct architecture: frontend talks to Yapper for STT/TTS, server stays text-only. Three phases: batch STT, TTS playback, streaming STT. Each phase ships as a separate PR with tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dimakis mentioned this pull request Apr 5, 2026

feat(voice): batch STT via Yapper (Phase 1) #105

Merged

6 tasks

dimakis force-pushed the docs/voice-integration branch from 3ae8cf3 to 0a9aba2 Compare April 6, 2026 20:43

dimakis merged commit d7b18dc into main Apr 6, 2026
1 check passed

dimakis merged 1 commit into main from docs/voice-integration

dimakis commented Apr 5, 2026

Uh oh!

dimakis left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dimakis commented Apr 5, 2026

Summary

Implementation phases

Test plan

Uh oh!

dimakis left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant