
docs(design): voice integration with Yapper#104

Merged
dimakis merged 1 commit into main from docs/voice-integration
Apr 6, 2026

@dimakis dimakis commented Apr 5, 2026

Summary

  • Design doc for Mitzo ↔ Yapper voice integration
  • Client-direct architecture: frontend talks to Yapper for STT/TTS, server stays text-only
  • Push-to-talk input, optional TTS output, graceful degradation when Yapper is down
  • No v2 protocol changes — voice is a preprocessing/postprocessing layer

Implementation phases

  1. Batch STT — mic button, record, POST to Yapper, insert transcript
  2. TTS playback — toggle to hear responses via Yapper synthesize
  3. Streaming STT — live partial transcripts via WebSocket
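For phase 1, the last step ("insert transcript") can be kept independent of the recording and network code. A minimal sketch, assuming the transcript is spliced into the message input at the current selection; `insertTranscript` and its return shape are hypothetical names, not taken from the design doc:

```typescript
// Splice a transcript into the input text at the current selection.
// Hypothetical helper: the design doc does not define where the transcript lands.
function insertTranscript(
  current: string,
  selectionStart: number,
  selectionEnd: number,
  transcript: string,
): { text: string; cursor: number } {
  const piece = transcript.trim();
  // Nothing recognised: leave the input untouched.
  if (piece.length === 0) {
    return { text: current, cursor: selectionStart };
  }
  const before = current.slice(0, selectionStart);
  const after = current.slice(selectionEnd);
  // Add a single separating space only where the neighbours lack whitespace.
  const lead = before.length > 0 && !/\s$/.test(before) ? " " : "";
  const trail = after.length > 0 && !/^\s/.test(after) ? " " : "";
  const head = before + lead + piece;
  return { text: head + trail + after, cursor: head.length };
}
```

Because the result is plain text, the server never sees anything voice-specific, which matches the "no v2 protocol changes" point in the summary.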

Test plan

  • Review design doc for completeness and feasibility
  • Confirm architecture aligns with existing Mitzo patterns

🤖 Generated with Claude Code


@dimakis dimakis left a comment


Design review — looks solid. A few gaps to address before or during implementation:

Browser permissions

  • navigator.mediaDevices.getUserMedia requires explicit user permission. Need a permission flow: prompt on first mic tap, handle denial gracefully (show "mic blocked" state, not just hidden button).
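One way the permission flow could look, as a sketch rather than the planned implementation. The `MicState` names and the `requestMic` helper are hypothetical; the error names are the standard `DOMException` names that `getUserMedia` rejects with. The stream getter is injected so the logic can be exercised outside a browser:

```typescript
// Hypothetical UI states for the mic button; not taken from the design doc.
type MicState = "ready" | "blocked" | "no-device" | "error";

// Map a getUserMedia rejection to a UI state by its DOMException name.
function micStateFromError(err: unknown): MicState {
  const name =
    typeof err === "object" && err !== null && "name" in err
      ? String((err as { name: unknown }).name)
      : "";
  switch (name) {
    case "NotAllowedError": // permission denied by the user or by policy
    case "SecurityError":
      return "blocked";
    case "NotFoundError": // no audio input device available
      return "no-device";
    default:
      return "error";
  }
}

// Ask for the mic on first tap. `getStream` is injected; in the app it would
// wrap navigator.mediaDevices.getUserMedia({ audio: true }).
async function requestMic<T>(
  getStream: () => Promise<T>,
): Promise<{ state: MicState; stream?: T }> {
  try {
    const stream = await getStream();
    return { state: "ready", stream };
  } catch (err) {
    return { state: micStateFromError(err) };
  }
}
```

With this shape the button can render a visible "mic blocked" state for `"blocked"` instead of disappearing.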

Safari MediaRecorder compatibility

  • Safari's MediaRecorder support for video/webm / audio/webm;codecs=opus is inconsistent. iOS Safari may require audio/mp4 fallback. The audio.ts module should negotiate mimeType at runtime (MediaRecorder.isTypeSupported()), and Yapper's format negotiation frame needs to handle whatever format the browser actually produces.
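A minimal sketch of the runtime negotiation, assuming a preference order that the design doc does not actually specify. `pickMimeType` and `CANDIDATE_MIME_TYPES` are illustrative names; the support check is passed in so the selection logic stays testable:

```typescript
// Preference order is an assumption, not something the design doc defines.
const CANDIDATE_MIME_TYPES: readonly string[] = [
  "audio/webm;codecs=opus",
  "audio/webm",
  "audio/mp4",
];

// Return the first supported type, or undefined to let the browser choose
// its own default container.
function pickMimeType(
  isSupported: (mimeType: string) => boolean,
  candidates: readonly string[] = CANDIDATE_MIME_TYPES,
): string | undefined {
  return candidates.find((type) => isSupported(type));
}
```

In the browser this would be called as `pickMimeType((t) => MediaRecorder.isTypeSupported(t))`, with the result passed as the `mimeType` option to `new MediaRecorder(stream, { mimeType })`. The recorder's own `mimeType` property reports what it ended up using, which is the value to send to Yapper.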

Yapper model readiness (resolved)

  • Open question #5 is addressed by dimakis/yapper#7 — /health now returns {"status": "ready"|"loading", "models": {"stt": bool, "tts": bool}} with 503 while loading. Mitzo can use this to show "loading models..." instead of hiding the mic.
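A sketch of how Mitzo might turn that response into a mic state. The body shape is the one quoted above; `sttAvailability` and the `VoiceAvailability` names are hypothetical, and treating any 503 as "loading" is an assumption:

```typescript
// Body shape of Yapper's /health as quoted in this review.
interface YapperHealth {
  status: "ready" | "loading";
  models: { stt: boolean; tts: boolean };
}

// Hypothetical availability states for the mic button.
type VoiceAvailability = "ready" | "loading" | "unavailable";

// Decide what to show for speech-to-text, given the HTTP status and parsed body.
// Pass httpStatus 0 (and body null) when the request itself failed.
function sttAvailability(httpStatus: number, body: unknown): VoiceAvailability {
  const health = body as Partial<YapperHealth> | null | undefined;
  // Assumption: a 503 always means "models still loading".
  if (httpStatus === 503 || health?.status === "loading") return "loading";
  if (httpStatus === 200 && health?.status === "ready" && health?.models?.stt === true) {
    return "ready";
  }
  return "unavailable";
}
```

`"loading"` maps to the "loading models..." label, while `"unavailable"` covers Yapper being down or reporting ready without an STT model.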

CORS dependency

  • Client-direct architecture requires Yapper to have permissive CORS (dimakis/yapper#5 adds allow_origins=["*"]). Worth noting as a hard dependency in the doc.

Mixed content (future)

  • If Mitzo is ever served over HTTPS, plain-HTTP calls to Yapper would be blocked as mixed content. Separately, it is navigator.mediaDevices.getUserMedia (rather than MediaRecorder) that requires a secure context, i.e. HTTPS or localhost. Not a blocker now (LAN-only), but worth a "Future Considerations" note.
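A conservative check that could back such a note, as a sketch. `isMixedContent` is a hypothetical helper, the host names in the usage line are made up, and the check deliberately ignores the loopback exemptions that browsers apply:

```typescript
// True when an HTTPS page would be calling Yapper over an insecure scheme.
// Simplification: loopback targets, which browsers may treat as trustworthy,
// are not special-cased here.
function isMixedContent(pageUrl: string, yapperUrl: string): boolean {
  const page = new URL(pageUrl);
  const target = new URL(yapperUrl);
  const insecureTarget = target.protocol === "http:" || target.protocol === "ws:";
  return page.protocol === "https:" && insecureTarget;
}
```

For example, `isMixedContent("https://mitzo.lan/", "http://192.168.1.10:8000")` is `true`, while the current HTTP-to-HTTP LAN setup returns `false`.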

Minor gaps

  • MAX_RECORDING_DURATION_MS mentioned as a constant but no value or auto-stop behavior defined.
  • Error UX for Yapper 500s or empty/too-short recordings not specified.
  • Doc says useLongPress "already exists" — verify before assuming reuse.
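To make the first two gaps concrete, here is a sketch of an auto-stop timer and a pre-upload sanity check. Both numeric values are placeholders chosen for illustration; `MIN_RECORDING_DURATION_MS`, `checkRecording`, and `armAutoStop` are hypothetical names that do not appear in the design doc:

```typescript
// Placeholder values: the doc names MAX_RECORDING_DURATION_MS but sets no value,
// and the minimum duration is an invented lower bound.
const MAX_RECORDING_DURATION_MS = 60_000;
const MIN_RECORDING_DURATION_MS = 300;

type RecordingVerdict = "ok" | "too-short" | "empty";

// Decide whether a finished recording is worth sending to Yapper.
function checkRecording(durationMs: number, sizeBytes: number): RecordingVerdict {
  if (sizeBytes === 0) return "empty";
  if (durationMs < MIN_RECORDING_DURATION_MS) return "too-short";
  return "ok";
}

// Arm a timer that calls `stop` once the cap is reached. Returns a cancel
// function to call when the user releases the button first.
function armAutoStop(
  stop: () => void,
  maxMs: number = MAX_RECORDING_DURATION_MS,
): () => void {
  const timer = setTimeout(stop, maxMs);
  return () => clearTimeout(timer);
}
```

In the app, `stop` would wrap the recorder's `stop()` call, and a `"too-short"` or `"empty"` verdict would surface a short hint instead of a request to Yapper.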

Client-direct architecture: frontend talks to Yapper for STT/TTS,
server stays text-only. Three phases: batch STT, TTS playback,
streaming STT. Each phase ships as a separate PR with tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dimakis dimakis force-pushed the docs/voice-integration branch from 3ae8cf3 to 0a9aba2 on April 6, 2026 20:43
@dimakis dimakis merged commit d7b18dc into main Apr 6, 2026
1 check passed