Proposal: optional local voice input via pi-web-voice sidecar package #1

@ashwin-pc

Summary

Add optional local speech-to-text support to pi-web without bundling voice models or ML dependencies into the core app.

Proposed packaging split:

@ashwin-pc/pi-web
  - Core local/Tailscale web UI
  - Composer, agent session, auth, WebSocket, settings
  - No bundled STT models, Python, CUDA, PyTorch, or ffmpeg dependency
  - Optional generic STT client that talks to a local HTTP endpoint

@ashwin-pc/pi-web-voice
  - Optional sidecar package
  - Owns voice model setup, health checks, model downloads, and sidecar launch
  - Provides a stable local HTTP transcription API for pi-web

The goal is to keep pi-web minimalist and installable on most machines while giving users who want local voice input an easy add-on path.

Motivation

Voice input is useful, but local STT brings heavy and variable dependencies:

  • Whisper / whisper.cpp model files
  • ffmpeg or audio conversion support
  • Docker or native binaries
  • Python / PyTorch / NeMo for Parakeet
  • Optional CUDA, Core ML, Vulkan, or other acceleration paths

Those should not be part of the default pi-web install. Users who do not want voice should not encounter model downloads, Docker errors, native build failures, GPU issues, or large install sizes.

Design principle

Voice input should be treated as a composer input method, not as an agent extension or core agent feature.

Browser microphone
  -> MediaRecorder audio blob
  -> pi-web /api/stt/transcribe
  -> local sidecar HTTP endpoint
  -> transcript text
  -> insert into composer for review

The transcript should be inserted into the textarea and not auto-submitted. Users can edit before sending.
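The insert-for-review step could be sketched as a pure function over the textarea's value and selection. The name insertTranscript and the spacing rule are illustrative assumptions, not part of the proposal:

```typescript
// Hypothetical helper: splice a transcript into the composer text at the selection.
// It only computes the new text and caret position; it never submits anything.
interface ComposerSelection {
  value: string;
  selectionStart: number;
  selectionEnd: number;
}

function insertTranscript(sel: ComposerSelection, transcript: string): ComposerSelection {
  const text = transcript.trim();
  if (text === "") return sel; // nothing to insert

  const before = sel.value.slice(0, sel.selectionStart);
  const after = sel.value.slice(sel.selectionEnd); // any selected text is replaced

  // Add a separating space only when the neighbouring character is not whitespace.
  const lead = before !== "" && !/\s$/.test(before) ? " " : "";
  const trail = after !== "" && !/^\s/.test(after) ? " " : "";

  const caret = before.length + lead.length + text.length;
  return {
    value: before + lead + text + trail + after,
    selectionStart: caret,
    selectionEnd: caret,
  };
}
```

In the browser, a textarea's value, selectionStart, and selectionEnd can be passed in and the result written back, leaving the send action to the user.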

Core pi-web responsibilities

@ashwin-pc/pi-web should only include a thin, generic STT bridge:

  • Add a microphone button in the composer only when STT is configured and reachable.
  • Record audio in the browser using MediaRecorder.
  • POST audio to a pi-web API route, e.g. POST /api/stt/transcribe.
  • Forward the audio from the pi-web server to the configured sidecar endpoint.
  • Insert the returned transcript into the composer at the cursor.
  • Show a helpful message in the UI when voice input is unavailable.
  • Do not install, download, or import any STT model implementation.
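A rough sketch of the browser-side recording step, assuming the standard MediaRecorder and getUserMedia APIs. The helper names (pickMimeType, startRecording) and the candidate MIME list are assumptions for illustration; which formats a given browser supports varies:

```typescript
// Candidate container/codec strings (an assumption; support depends on the browser).
const CANDIDATE_MIME_TYPES = ["audio/webm;codecs=opus", "audio/webm", "audio/mp4"];

// Pure helper so the format choice can be tested without a browser.
function pickMimeType(
  candidates: readonly string[],
  isSupported: (mime: string) => boolean,
): string | undefined {
  return candidates.find((mime) => isSupported(mime));
}

// Browser-only: start recording and return a stop() that resolves with one audio Blob.
async function startRecording(): Promise<{ stop: () => Promise<Blob> }> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mimeType = pickMimeType(CANDIDATE_MIME_TYPES, (m) => MediaRecorder.isTypeSupported(m));
  const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);

  const chunks: Blob[] = [];
  recorder.addEventListener("dataavailable", (event) => {
    if (event.data.size > 0) chunks.push(event.data);
  });
  recorder.start();

  return {
    stop: () =>
      new Promise<Blob>((resolve) => {
        recorder.addEventListener(
          "stop",
          () => {
            stream.getTracks().forEach((track) => track.stop()); // release the microphone
            resolve(new Blob(chunks, { type: recorder.mimeType }));
          },
          { once: true },
        );
        recorder.stop();
      }),
  };
}
```

The resulting Blob is what the composer would POST to /api/stt/transcribe. Any conversion to the format a provider expects stays in the sidecar.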

Suggested pi-web env vars:

PI_WEB_STT_ENDPOINT=http://127.0.0.1:8789/inference
PI_WEB_STT_TIMEOUT_MS=60000
PI_WEB_STT_MAX_BYTES=25000000

Optional:

PI_WEB_STT=1

Setting PI_WEB_STT_ENDPOINT alone may be enough to enable the bridge, which would make this flag unnecessary (see Open questions).
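One way the server could read these variables is sketched below. It assumes the endpoint alone enables the bridge, which is still an open question, and that the values shown above (60000 ms, 25000000 bytes) act as defaults. loadSttConfig and positiveInt are hypothetical names:

```typescript
interface SttConfig {
  endpoint: string;
  timeoutMs: number;
  maxBytes: number;
}

type Env = Record<string, string | undefined>;

// Parse a positive integer, falling back to a default when unset or invalid.
function positiveInt(raw: string | undefined, fallback: number): number {
  const n = Number(raw);
  return raw !== undefined && Number.isInteger(n) && n > 0 ? n : fallback;
}

// Returns null when voice is not configured.
// Assumption: PI_WEB_STT_ENDPOINT alone enables the bridge; PI_WEB_STT is not consulted.
function loadSttConfig(env: Env): SttConfig | null {
  const endpoint = env.PI_WEB_STT_ENDPOINT?.trim();
  if (!endpoint) return null;
  return {
    endpoint,
    timeoutMs: positiveInt(env.PI_WEB_STT_TIMEOUT_MS, 60_000),
    maxBytes: positiveInt(env.PI_WEB_STT_MAX_BYTES, 25_000_000),
  };
}
```

On the server this would be called once at startup with process.env. A null result maps to the "not configured" status.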

Sidecar package responsibilities

@ashwin-pc/pi-web-voice should own all STT-specific complexity:

  • Setup commands for supported providers.
  • Model download/cache management.
  • Docker or native backend selection.
  • ffmpeg/audio conversion handling.
  • Launching the local HTTP sidecar.
  • Health checks and diagnostics.
  • Provider-specific implementations for whisper.cpp, Parakeet, etc.

Suggested binary:

pi-web-voice

Suggested commands:

pi-web-voice doctor
pi-web-voice setup whisper
pi-web-voice setup whisper --model small.en
pi-web-voice serve
pi-web-voice serve --provider whisper
pi-web-voice serve --provider parakeet
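To make the command surface concrete, here is a minimal argument-parsing sketch for the commands listed above. The Command type, parseCommand, and flagValue are illustrative names, and a real CLI might well use an argument-parsing library instead:

```typescript
type Command =
  | { name: "doctor" }
  | { name: "setup"; provider: string; model?: string }
  | { name: "serve"; provider?: string };

// Read the value following a flag such as `--model small.en`, if present.
function flagValue(args: readonly string[], flag: string): string | undefined {
  const i = args.indexOf(flag);
  return i >= 0 && i + 1 < args.length ? args[i + 1] : undefined;
}

// Parse the arguments that follow the `pi-web-voice` binary name.
function parseCommand(argv: readonly string[]): Command {
  const [name, ...rest] = argv;
  switch (name) {
    case "doctor":
      return { name: "doctor" };
    case "setup": {
      const provider = rest[0];
      if (!provider || provider.startsWith("--")) {
        throw new Error("setup requires a provider, e.g. whisper");
      }
      return { name: "setup", provider, model: flagValue(rest, "--model") };
    }
    case "serve":
      // Without --provider, the provider would come from the saved config.
      return { name: "serve", provider: flagValue(rest, "--provider") };
    default:
      throw new Error(`Unknown command: ${name ?? "(none)"}`);
  }
}
```

In Node this would be fed process.argv.slice(2).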

Sidecar HTTP contract

Keep the contract deliberately small and provider-agnostic.

GET /health
POST /inference

GET /health response:

{
  "ok": true,
  "provider": "whisper-cpp",
  "model": "base.en",
  "backend": "docker"
}

POST /inference:

  • Accept multipart form data with an audio file field named file.
  • Return JSON with at least a text field:

{
  "text": "transcribed text here"
}

Optional future fields:

{
  "text": "transcribed text here",
  "language": "en",
  "durationMs": 1234,
  "segments": []
}
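On the pi-web side, the response could be validated defensively so that only text is required and the optional fields are tolerated but not relied on. This is a sketch; InferenceResult and parseInferenceResponse are assumed names:

```typescript
interface InferenceResult {
  text: string;
  language?: string;
  durationMs?: number;
}

// Validate an untrusted JSON body from POST /inference.
// Only `text` is required; malformed or unknown optional fields are dropped, not rejected.
function parseInferenceResponse(body: unknown): InferenceResult {
  if (typeof body !== "object" || body === null) {
    throw new Error("Sidecar response is not a JSON object");
  }
  const record = body as Record<string, unknown>;
  if (typeof record.text !== "string") {
    throw new Error('Sidecar response is missing a string "text" field');
  }
  const result: InferenceResult = { text: record.text };
  if (typeof record.language === "string") result.language = record.language;
  if (typeof record.durationMs === "number") result.durationMs = record.durationMs;
  return result;
}
```

Ignoring unknown fields keeps the bridge compatible if providers later add segments or other metadata.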

Recommended user flow

Install pi-web

npm install -g @ashwin-pc/pi-web
pi-web

No voice dependencies are installed by default.

Add voice

npm install -g @ashwin-pc/pi-web-voice
pi-web-voice setup whisper
pi-web-voice serve

Then start pi-web with:

PI_WEB_STT_ENDPOINT=http://127.0.0.1:8789/inference pi-web

When the sidecar is reachable, pi-web shows a microphone button in the composer.

Recommended default provider: whisper.cpp

Use whisper.cpp as the default setup path because it is more practical for broad local installs than Python-based Whisper or Parakeet:

  • CPU-friendly path for most machines.
  • Docker path avoids native build complexity.
  • Native path can be documented for users who prefer it.
  • Acceleration paths can be added later.

Default setup should probably use Docker first:

pi-web-voice setup whisper

That command can:

  1. Create a model cache directory, e.g. ~/.cache/pi-web-voice/models.
  2. Pull or verify the whisper.cpp Docker image.
  3. Download a default model such as base.en.
  4. Write a config file, e.g. ~/.config/pi-web-voice/config.json.

Example generated config:

{
  "provider": "whisper-cpp",
  "backend": "docker",
  "model": "base.en",
  "port": 8789
}

Then:

pi-web-voice serve

starts a local-only HTTP service on 127.0.0.1:8789.

Parakeet support

Parakeet should be treated as an advanced provider, likely best for users with NVIDIA GPU setups.

Possible setup:

pi-web-voice setup parakeet
pi-web-voice serve --provider parakeet

The Parakeet implementation should live entirely in @ashwin-pc/pi-web-voice, not in pi-web core. It may require Python, PyTorch, NeMo, CUDA, and audio format conversion to the model's expected input format.

pi-web UI states

The mic button should explain what is happening instead of failing silently.

Possible tooltip/status messages:

Voice input unavailable: set PI_WEB_STT_ENDPOINT and start a local STT sidecar.
Voice input unavailable: could not reach http://127.0.0.1:8789/inference.
Voice input ready: whisper-cpp at 127.0.0.1:8789.

During recording/transcription:

Recording… click to stop
Transcribing…
Transcript inserted into composer
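These messages map naturally onto a small state type. The sketch below uses a discriminated union; MicState, micStatusText, and micButtonEnabled are hypothetical names, and the enabled/disabled policy is one possible choice rather than something the proposal fixes:

```typescript
type MicState =
  | { kind: "unconfigured" }
  | { kind: "unreachable"; endpoint: string }
  | { kind: "ready"; provider: string; host: string }
  | { kind: "recording" }
  | { kind: "transcribing" }
  | { kind: "inserted" };

// Tooltip/status text for each state, using the messages listed above.
function micStatusText(state: MicState): string {
  switch (state.kind) {
    case "unconfigured":
      return "Voice input unavailable: set PI_WEB_STT_ENDPOINT and start a local STT sidecar.";
    case "unreachable":
      return `Voice input unavailable: could not reach ${state.endpoint}.`;
    case "ready":
      return `Voice input ready: ${state.provider} at ${state.host}.`;
    case "recording":
      return "Recording… click to stop";
    case "transcribing":
      return "Transcribing…";
    case "inserted":
      return "Transcript inserted into composer";
  }
}

// Assumed policy: clickable when idle-and-ready, while recording (to stop), or after an insert.
function micButtonEnabled(state: MicState): boolean {
  return state.kind === "ready" || state.kind === "recording" || state.kind === "inserted";
}
```

Because the switch is exhaustive over MicState, adding a new state without a message becomes a compile-time error.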

pi-web API shape

Suggested core routes:

GET /api/stt/status
POST /api/stt/transcribe

GET /api/stt/status examples:

Not configured:

{ "ok": true, "enabled": false, "reason": "PI_WEB_STT_ENDPOINT is not set" }

Configured and reachable:

{
  "ok": true,
  "enabled": true,
  "endpoint": "http://127.0.0.1:8789/inference",
  "provider": "whisper-cpp",
  "model": "base.en"
}

Configured but unreachable:

{
  "ok": false,
  "enabled": true,
  "error": "Could not reach STT sidecar"
}
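The three status shapes could be produced by a single pure function that takes the configured endpoint and the outcome of a health probe. This sketch assumes /health is served from the same origin as the inference endpoint. buildSttStatus, healthUrlFor, and HealthProbe are illustrative names:

```typescript
type HealthProbe =
  | { reachable: true; provider?: string; model?: string }
  | { reachable: false };

type SttStatus =
  | { ok: true; enabled: false; reason: string }
  | { ok: true; enabled: true; endpoint: string; provider?: string; model?: string }
  | { ok: false; enabled: true; error: string };

// Assumption: the sidecar's /health route shares an origin with the inference endpoint.
function healthUrlFor(endpoint: string): string {
  return new URL("/health", endpoint).toString();
}

// Map configuration plus a probe result onto the status payloads.
function buildSttStatus(endpoint: string | undefined, probe: HealthProbe): SttStatus {
  if (!endpoint) {
    return { ok: true, enabled: false, reason: "PI_WEB_STT_ENDPOINT is not set" };
  }
  if (!probe.reachable) {
    return { ok: false, enabled: true, error: "Could not reach STT sidecar" };
  }
  return { ok: true, enabled: true, endpoint, provider: probe.provider, model: probe.model };
}
```

Keeping the network probe separate from the mapping makes the disabled and unreachable cases easy to unit test, as the checklist suggests.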

POST /api/stt/transcribe should:

  • Require the normal pi-web auth/token behavior.
  • Enforce max request size.
  • Enforce timeout.
  • Write temp files only if necessary.
  • Clean up temp files.
  • Forward multipart audio to the sidecar.
  • Return { ok: true, text } or { ok: false, error }.
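A minimal sketch of the forwarding step, assuming a runtime with global fetch, FormData, Blob, and AbortController (recent Node versions and browsers). Auth and route wiring are left out. forwardToSidecar, the fetchImpl injection point, and the recording.webm filename are assumptions for illustration:

```typescript
type TranscribeResult = { ok: true; text: string } | { ok: false; error: string };

interface ForwardOptions {
  endpoint: string;
  timeoutMs: number;
  maxBytes: number;
  fetchImpl?: typeof fetch; // injectable so tests can avoid the network
}

async function forwardToSidecar(audio: Blob, opts: ForwardOptions): Promise<TranscribeResult> {
  // Enforce the size limit before doing any work.
  if (audio.size === 0) return { ok: false, error: "Empty audio upload" };
  if (audio.size > opts.maxBytes) return { ok: false, error: "Audio exceeds the configured size limit" };

  const form = new FormData();
  form.append("file", audio, "recording.webm"); // field name `file` per the sidecar contract

  // Enforce the timeout by aborting the request.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), opts.timeoutMs);
  try {
    const doFetch = opts.fetchImpl ?? fetch;
    const res = await doFetch(opts.endpoint, { method: "POST", body: form, signal: controller.signal });
    if (!res.ok) return { ok: false, error: `STT sidecar returned HTTP ${res.status}` };
    const body: unknown = await res.json();
    const text = (body as { text?: unknown } | null)?.text;
    return typeof text === "string"
      ? { ok: true, text }
      : { ok: false, error: "STT sidecar response has no text" };
  } catch {
    // Covers network failures, aborted requests, and unparseable response bodies.
    return {
      ok: false,
      error: controller.signal.aborted ? "STT request timed out" : "Could not reach STT sidecar or read its response",
    };
  } finally {
    clearTimeout(timer);
  }
}
```

Holding the audio in memory as a Blob means no temp files are written in this sketch, so there is nothing to clean up afterwards.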

Security considerations

  • Sidecar should bind to 127.0.0.1 by default.
  • pi-web should not expose arbitrary endpoint access without explicit config.
  • Limit upload size.
  • Limit request duration.
  • Do not auto-submit transcripts.
  • Do not persist audio by default.
  • Avoid running the sidecar as admin/root where possible.
  • Make it clear that the local sidecar processes microphone audio.
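As one possible guard for the endpoint-related points above, pi-web could check whether the configured endpoint is a loopback address and warn, or require an explicit opt-in, when it is not. The proposal does not prescribe this policy. isLoopbackEndpoint is a hypothetical helper:

```typescript
// Returns true for http(s) URLs whose host is localhost, 127.x.x.x, or ::1.
function isLoopbackEndpoint(endpoint: string): boolean {
  let url: URL;
  try {
    url = new URL(endpoint);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;

  // IPv6 hostnames are reported with brackets, e.g. "[::1]"; strip them before comparing.
  const host = url.hostname.replace(/^\[|\]$/g, "");
  return host === "localhost" || host === "::1" || /^127(\.\d{1,3}){3}$/.test(host);
}
```

A non-loopback endpoint would mean microphone audio leaves the machine, which is worth surfacing to the user even if it is allowed.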

Implementation checklist

Core pi-web:

  • Add server/stt.ts for config/status/transcribe forwarding.
  • Add GET /api/stt/status.
  • Add POST /api/stt/transcribe.
  • Add src/composer/voice.ts for MediaRecorder start/stop/transcribe flow.
  • Add mic button to composer markup.
  • Add DOM element wiring in src/app/elements.ts.
  • Add mic icon to src/app/icons.ts.
  • Insert transcript at cursor instead of auto-submitting.
  • Hide or disable mic when no sidecar is configured.
  • Add tests for disabled status, unreachable status, and transcript insertion.
  • Add README docs for optional voice setup.

Voice package:

  • Create @ashwin-pc/pi-web-voice package.
  • Add pi-web-voice binary.
  • Implement doctor.
  • Implement setup whisper.
  • Implement serve.
  • Implement /health.
  • Implement /inference.
  • Add Docker-based whisper.cpp backend.
  • Add native whisper.cpp backend if feasible.
  • Add Parakeet provider later or behind an advanced flag.
  • Document model cache paths and cleanup.

Open questions

  • Should pi-web --voice auto-discover and launch pi-web-voice serve, or should users run the sidecar explicitly first?
  • Should the sidecar package be JS/Node-only and shell out to Docker/native binaries, or should provider implementations live in separate subpackages?
  • Should PI_WEB_STT_ENDPOINT alone enable voice, or should PI_WEB_STT=1 also be required?
  • What should the default model be: base.en, small.en, or a user prompt during setup?
  • Should the sidecar API eventually support streaming partial transcripts, or is request/response enough for the first version?

Non-goals for first version

  • Bundling model weights in pi-web.
  • Adding Python/PyTorch/NeMo dependencies to pi-web.
  • Supporting real-time streaming transcription.
  • Auto-submitting voice transcripts to the agent.
  • Treating voice input as a Pi agent extension.
  • Supporting every STT backend in the first release.

Proposed first milestone

  1. Add the generic pi-web STT bridge and mic UI.
  2. Create @ashwin-pc/pi-web-voice with whisper.cpp Docker support.
  3. Document a simple path:

npm install -g @ashwin-pc/pi-web @ashwin-pc/pi-web-voice
pi-web-voice setup whisper
pi-web-voice serve
PI_WEB_STT_ENDPOINT=http://127.0.0.1:8789/inference pi-web

  4. Add Parakeet as a later advanced provider once the sidecar contract is stable.
