Summary
Add optional local speech-to-text support to pi-web without bundling voice models or ML dependencies into the core app.
Proposed packaging split:
@ashwin-pc/pi-web
- Core local/Tailscale web UI
- Composer, agent session, auth, WebSocket, settings
- No bundled STT models, Python, CUDA, PyTorch, or ffmpeg dependency
- Optional generic STT client that talks to a local HTTP endpoint
@ashwin-pc/pi-web-voice
- Optional sidecar package
- Owns voice model setup, health checks, model downloads, and sidecar launch
- Provides a stable local HTTP transcription API for pi-web
The goal is to keep pi-web minimalist and installable on most machines while giving users who want local voice input an easy add-on path.
Motivation
Voice input is useful, but local STT brings heavy and variable dependencies:
- Whisper / whisper.cpp model files
- ffmpeg or audio conversion support
- Docker or native binaries
- Python / PyTorch / NeMo for Parakeet
- Optional CUDA, Core ML, Vulkan, or other acceleration paths
None of these should be part of the default pi-web install. Users who do not want voice should not encounter model downloads, Docker errors, native build failures, GPU issues, or large install sizes.
Design principle
Voice input should be treated as a composer input method, not as an agent extension or core agent feature.
Browser microphone
-> MediaRecorder audio blob
-> pi-web /api/stt/transcribe
-> local sidecar HTTP endpoint
-> transcript text
-> insert into composer for review
The transcript should be inserted into the textarea and not auto-submitted. Users can edit before sending.
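The insert-for-review step could look roughly like the sketch below. It is a pure function so it can be tested without a DOM; the name insertTranscript and the spacing rule (add a space only when the neighbouring character is not whitespace) are assumptions for illustration, not part of the proposal.

```typescript
/** Result of splicing a transcript into the composer text. */
export interface InsertResult {
  value: string; // new textarea contents
  cursor: number; // caret position just after the inserted text
}

/**
 * Replace the current selection (or insert at the caret when the selection
 * is empty) with the transcript. Nothing is submitted; the caller only
 * updates the textarea so the user can review and edit.
 */
export function insertTranscript(
  value: string,
  selectionStart: number,
  selectionEnd: number,
  transcript: string,
): InsertResult {
  const text = transcript.trim();
  if (text === "") return { value, cursor: selectionEnd };

  const before = value.slice(0, selectionStart);
  const after = value.slice(selectionEnd);
  // Add a separating space only when the neighbouring character is not whitespace.
  const lead = before !== "" && !/\s$/.test(before) ? " " : "";
  const trail = after !== "" && !/^\s/.test(after) ? " " : "";

  const inserted = lead + text + trail;
  return {
    value: before + inserted + after,
    cursor: before.length + inserted.length,
  };
}
```

In the browser, the caller would pass textarea.value, textarea.selectionStart, and textarea.selectionEnd, then write the result back and restore the caret with setSelectionRange.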
Core pi-web responsibilities
@ashwin-pc/pi-web should only include a thin, generic STT bridge:
- Add a microphone button in the composer only when STT is configured and reachable.
- Record audio in the browser using MediaRecorder.
- POST audio to a pi-web API route, e.g. POST /api/stt/transcribe.
- Server forwards the audio to the configured sidecar endpoint.
- Insert returned transcript into the composer at the cursor.
- Show a helpful message in the UI when voice input is unavailable.
- Do not install, download, or import any STT model implementation.
Suggested pi-web env vars:
PI_WEB_STT_ENDPOINT=http://127.0.0.1:8789/inference
PI_WEB_STT_TIMEOUT_MS=60000
PI_WEB_STT_MAX_BYTES=25000000
Optional:
PI_WEB_STT=1
Though PI_WEB_STT_ENDPOINT alone may be enough to enable the bridge.
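A minimal sketch of how the server might read these variables. It assumes the endpoint alone enables the bridge (still an open question in this proposal) and uses the suggested values above as fallbacks; readSttConfig and SttConfig are hypothetical names.

```typescript
export interface SttConfig {
  endpoint: string;
  timeoutMs: number;
  maxBytes: number;
}

type Env = Record<string, string | undefined>;

/** Parse a positive integer, falling back when the value is missing or invalid. */
function positiveInt(raw: string | undefined, fallback: number): number {
  const n = Number(raw);
  return raw !== undefined && Number.isInteger(n) && n > 0 ? n : fallback;
}

/** Returns null when the bridge is disabled (no endpoint configured). */
export function readSttConfig(env: Env): SttConfig | null {
  const endpoint = env.PI_WEB_STT_ENDPOINT?.trim();
  if (!endpoint) return null;
  return {
    endpoint,
    // Fallbacks mirror the suggested values; treating them as defaults is an assumption.
    timeoutMs: positiveInt(env.PI_WEB_STT_TIMEOUT_MS, 60_000),
    maxBytes: positiveInt(env.PI_WEB_STT_MAX_BYTES, 25_000_000),
  };
}
```

On the server this would typically be called once at startup as readSttConfig(process.env).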
Sidecar package responsibilities
@ashwin-pc/pi-web-voice should own all STT-specific complexity:
- Setup commands for supported providers.
- Model download/cache management.
- Docker or native backend selection.
- ffmpeg/audio conversion handling.
- Launching the local HTTP sidecar.
- Health checks and diagnostics.
- Provider-specific implementations for whisper.cpp, Parakeet, etc.
Suggested binary:
pi-web-voice
Suggested commands:
pi-web-voice doctor
pi-web-voice setup whisper
pi-web-voice setup whisper --model small.en
pi-web-voice serve
pi-web-voice serve --provider whisper
pi-web-voice serve --provider parakeet
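As a rough illustration of the command surface above, the arguments could be parsed into a small tagged union. This is a hand-rolled sketch with hypothetical names (Command, parseCommand); a real package might use an existing argument-parsing library instead.

```typescript
export type Command =
  | { name: "doctor" }
  | { name: "setup"; provider: string; model: string | undefined }
  | { name: "serve"; provider: string | undefined };

/** Return the value following a flag such as --model, if present. */
function flag(args: string[], name: string): string | undefined {
  const i = args.indexOf(name);
  return i >= 0 && i + 1 < args.length ? args[i + 1] : undefined;
}

/** Parse the arguments that follow the binary name. */
export function parseCommand(argv: string[]): Command {
  const [name, ...rest] = argv;
  switch (name) {
    case "doctor":
      return { name: "doctor" };
    case "setup": {
      const provider = rest[0];
      if (!provider || provider.startsWith("--")) {
        throw new Error("usage: pi-web-voice setup <provider> [--model <name>]");
      }
      return { name: "setup", provider, model: flag(rest, "--model") };
    }
    case "serve":
      return { name: "serve", provider: flag(rest, "--provider") };
    default:
      throw new Error(`unknown command: ${name ?? "(none)"}`);
  }
}
```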
Sidecar HTTP contract
Keep the contract deliberately small and provider-agnostic.
GET /health
POST /inference
GET /health response:
{
"ok": true,
"provider": "whisper-cpp",
"model": "base.en",
"backend": "docker"
}
POST /inference:
- Accept multipart form data with an audio file field named file.
- Return JSON with at least text.
{
"text": "transcribed text here"
}
Optional future fields:
{
"text": "transcribed text here",
"language": "en",
"durationMs": 1234,
"segments": []
}
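On the pi-web side, the contract could be captured as types plus small runtime checks, so a misbehaving sidecar produces a clear error rather than an undefined transcript. The sketch below is one possible shape; the type and function names are assumptions, and the segments field is ignored because its structure is not defined yet.

```typescript
export interface HealthResponse {
  ok: boolean;
  provider: string;
  model: string;
  backend: string;
}

export interface InferenceResponse {
  text: string;
  language?: string;
  durationMs?: number;
}

function asRecord(body: unknown): Record<string, unknown> {
  if (typeof body !== "object" || body === null || Array.isArray(body)) {
    throw new Error("expected a JSON object");
  }
  return body as Record<string, unknown>;
}

/** Validate a GET /health body against the fields shown above. */
export function parseHealthResponse(body: unknown): HealthResponse {
  const { ok, provider, model, backend } = asRecord(body);
  if (
    typeof ok !== "boolean" ||
    typeof provider !== "string" ||
    typeof model !== "string" ||
    typeof backend !== "string"
  ) {
    throw new Error("health response is missing required fields");
  }
  return { ok, provider, model, backend };
}

/** Validate a POST /inference body; only text is required. */
export function parseInferenceResponse(body: unknown): InferenceResponse {
  const { text, language, durationMs } = asRecord(body);
  if (typeof text !== "string") {
    throw new Error('inference response is missing a string "text" field');
  }
  const out: InferenceResponse = { text };
  if (typeof language === "string") out.language = language;
  if (typeof durationMs === "number") out.durationMs = durationMs;
  return out;
}
```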
Recommended user flow
Install pi-web
npm install -g @ashwin-pc/pi-web
pi-web
No voice dependencies are installed by default.
Add voice
npm install -g @ashwin-pc/pi-web-voice
pi-web-voice setup whisper
pi-web-voice serve
Then start pi-web with:
PI_WEB_STT_ENDPOINT=http://127.0.0.1:8789/inference pi-web
When the sidecar is reachable, pi-web shows a microphone button in the composer.
Recommended default provider: whisper.cpp
Use whisper.cpp as the default setup path because it is more practical for broad local installs than Python-based Whisper or Parakeet:
- CPU-friendly path for most machines.
- Docker path avoids native build complexity.
- Native path can be documented for users who prefer it.
- Acceleration paths can be added later.
Default setup should probably use Docker first:
pi-web-voice setup whisper
That command can:
- Create a model cache directory, e.g. ~/.cache/pi-web-voice/models.
- Pull or verify the whisper.cpp Docker image.
- Download a default model such as base.en.
- Write a config file, e.g. ~/.config/pi-web-voice/config.json.
Example generated config:
{
"provider": "whisper-cpp",
"backend": "docker",
"model": "base.en",
"port": 8789
}
Then:
pi-web-voice serve
starts a local-only HTTP service on 127.0.0.1:8789.
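A sketch of how the sidecar might validate the generated config and derive the URL that pi-web is pointed at. The field set matches the example config; parseVoiceConfig and inferenceUrl are hypothetical helpers, and the fixed 127.0.0.1 host reflects the local-only default rather than a configurable option.

```typescript
export interface VoiceConfig {
  provider: string;
  backend: string;
  model: string;
  port: number;
}

/** Parse and validate the contents of config.json. */
export function parseVoiceConfig(json: string): VoiceConfig {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null) {
    throw new Error("config must be a JSON object");
  }
  const { provider, backend, model, port } = raw as Record<string, unknown>;
  if (typeof provider !== "string" || typeof backend !== "string" || typeof model !== "string") {
    throw new Error("provider, backend, and model must be strings");
  }
  if (typeof port !== "number" || !Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error("port must be an integer between 1 and 65535");
  }
  return { provider, backend, model, port };
}

/** The value a user would set as PI_WEB_STT_ENDPOINT for this config. */
export function inferenceUrl(cfg: VoiceConfig): string {
  return `http://127.0.0.1:${cfg.port}/inference`;
}
```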
Parakeet support
Parakeet should be treated as an advanced provider, likely best for users with NVIDIA GPU setups.
Possible setup:
pi-web-voice setup parakeet
pi-web-voice serve --provider parakeet
The Parakeet implementation should live entirely in @ashwin-pc/pi-web-voice, not in pi-web core. It may require Python, PyTorch, NeMo, CUDA, and conversion of recorded audio to the model's expected input format.
pi-web UI states
The mic button should explain what is happening instead of failing silently.
Possible tooltip/status messages:
Voice input unavailable: set PI_WEB_STT_ENDPOINT and start a local STT sidecar.
Voice input unavailable: could not reach http://127.0.0.1:8789/inference.
Voice input ready: whisper-cpp at 127.0.0.1:8789.
During recording/transcription:
Recording… click to stop
Transcribing…
Transcript inserted into composer
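One way to keep these messages consistent is to model the mic button as a small set of states and derive the text from the state. The sketch below uses the messages listed above verbatim; the state names and the micEnabled rule are assumptions.

```typescript
export type VoiceUiState =
  | { kind: "unconfigured" }
  | { kind: "unreachable"; endpoint: string }
  | { kind: "ready"; provider: string; host: string }
  | { kind: "recording" }
  | { kind: "transcribing" }
  | { kind: "inserted" };

/** Tooltip/status text for the mic button in each state. */
export function statusMessage(state: VoiceUiState): string {
  switch (state.kind) {
    case "unconfigured":
      return "Voice input unavailable: set PI_WEB_STT_ENDPOINT and start a local STT sidecar.";
    case "unreachable":
      return `Voice input unavailable: could not reach ${state.endpoint}.`;
    case "ready":
      return `Voice input ready: ${state.provider} at ${state.host}.`;
    case "recording":
      return "Recording… click to stop";
    case "transcribing":
      return "Transcribing…";
    case "inserted":
      return "Transcript inserted into composer";
  }
}

/** Assumption: the button is clickable only to start or stop a recording. */
export function micEnabled(state: VoiceUiState): boolean {
  return state.kind === "ready" || state.kind === "recording" || state.kind === "inserted";
}
```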
pi-web API shape
Suggested core routes:
GET /api/stt/status
POST /api/stt/transcribe
GET /api/stt/status examples:
{ "ok": true, "enabled": false, "reason": "PI_WEB_STT_ENDPOINT is not set" }
{
"ok": true,
"enabled": true,
"endpoint": "http://127.0.0.1:8789/inference",
"provider": "whisper-cpp",
"model": "base.en"
}
{
"ok": false,
"enabled": true,
"error": "Could not reach STT sidecar"
}
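The three response shapes above map naturally onto a discriminated union. In the sketch below, the health URL is derived from the configured inference endpoint on the assumption that /health lives on the same origin; buildStatus, healthUrl, and SidecarHealth are hypothetical names, and a null health value stands for a failed probe.

```typescript
export type SttStatus =
  | { ok: true; enabled: false; reason: string }
  | { ok: true; enabled: true; endpoint: string; provider: string; model: string }
  | { ok: false; enabled: true; error: string };

export interface SidecarHealth {
  provider: string;
  model: string;
}

/** Derive the health URL from the inference endpoint (assumes the same origin). */
export function healthUrl(endpoint: string): string {
  return new URL("/health", endpoint).href;
}

/** Build the GET /api/stt/status body from config and the health probe result. */
export function buildStatus(
  endpoint: string | undefined,
  health: SidecarHealth | null,
): SttStatus {
  if (!endpoint) {
    return { ok: true, enabled: false, reason: "PI_WEB_STT_ENDPOINT is not set" };
  }
  if (health === null) {
    return { ok: false, enabled: true, error: "Could not reach STT sidecar" };
  }
  return { ok: true, enabled: true, endpoint, provider: health.provider, model: health.model };
}
```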
POST /api/stt/transcribe should:
- Require the normal pi-web auth/token behavior.
- Enforce max request size.
- Enforce timeout.
- Write temp files only if necessary.
- Clean up temp files.
- Forward multipart audio to the sidecar.
- Return { ok: true, text } or { ok: false, error }.
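A sketch of the forwarding step, assuming a runtime with the standard fetch, FormData, Blob, and AbortController globals (for example Node 18 or later). It keeps the audio in memory, so no temp files are involved, and it leaves authentication to the surrounding route handler. The function name, the injectable fetchImpl option, and the recording.webm filename are assumptions for illustration.

```typescript
export type TranscribeResult =
  | { ok: true; text: string }
  | { ok: false; error: string };

export interface ForwardOptions {
  endpoint: string;
  timeoutMs: number;
  maxBytes: number;
  fetchImpl?: typeof fetch; // injectable so the function can be tested offline
}

export async function forwardAudio(
  audio: Blob,
  opts: ForwardOptions,
): Promise<TranscribeResult> {
  // Enforce the size limit before touching the network.
  if (audio.size > opts.maxBytes) {
    return { ok: false, error: `Audio exceeds ${opts.maxBytes} bytes` };
  }

  const form = new FormData();
  // "file" is the field name from the sidecar contract.
  form.append("file", audio, "recording.webm");

  // Abort the request if the sidecar takes longer than the configured timeout.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), opts.timeoutMs);
  try {
    const doFetch = opts.fetchImpl ?? fetch;
    const res = await doFetch(opts.endpoint, {
      method: "POST",
      body: form,
      signal: controller.signal,
    });
    if (!res.ok) {
      return { ok: false, error: `Sidecar returned HTTP ${res.status}` };
    }
    const body: unknown = await res.json();
    const text =
      typeof body === "object" && body !== null
        ? (body as { text?: unknown }).text
        : undefined;
    return typeof text === "string"
      ? { ok: true, text }
      : { ok: false, error: "Sidecar response had no text field" };
  } catch {
    return {
      ok: false,
      error: controller.signal.aborted
        ? "STT request timed out"
        : "Could not reach STT sidecar",
    };
  } finally {
    clearTimeout(timer);
  }
}
```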
Security considerations
- Sidecar should bind to 127.0.0.1 by default.
- pi-web should forward audio only to the explicitly configured endpoint and should not let clients choose an arbitrary endpoint.
- Limit upload size.
- Limit request duration.
- Do not auto-submit transcripts.
- Do not persist audio by default.
- Avoid running the sidecar as admin/root where possible.
- Make it clear that the local sidecar processes microphone audio.
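Since the sidecar is expected to listen on loopback, pi-web could check the configured endpoint at startup and warn, or require an explicit opt-in, when it points elsewhere. The check below is a sketch of that idea; the helper name and the exact host list are assumptions, and the warn-or-refuse policy is left open.

```typescript
// Hostnames treated as loopback. The URL API reports IPv6 literals with brackets.
const LOOPBACK_HOSTS = new Set(["127.0.0.1", "localhost", "[::1]"]);

/** True when the endpoint is an http(s) URL that targets the local machine. */
export function isLoopbackEndpoint(endpoint: string): boolean {
  let url: URL;
  try {
    url = new URL(endpoint);
  } catch {
    return false; // not a parseable absolute URL
  }
  const isHttp = url.protocol === "http:" || url.protocol === "https:";
  return isHttp && LOOPBACK_HOSTS.has(url.hostname);
}
```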
Implementation checklist
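Core pi-web:
- server/stt.ts for config/status/transcribe forwarding.
- GET /api/stt/status.
- POST /api/stt/transcribe.
- src/composer/voice.ts for MediaRecorder start/stop/transcribe flow.
- src/app/elements.ts.
- src/app/icons.ts.
Voice package:
- @ashwin-pc/pi-web-voice package.
- pi-web-voice binary.
- doctor.
- setup whisper.
- serve.
- /health.
- /inference.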
Open questions
- Should pi-web --voice auto-discover and launch pi-web-voice serve, or should users run the sidecar explicitly first?
- Should the sidecar package be JS/Node-only and shell out to Docker/native binaries, or should provider implementations live in separate subpackages?
- Should PI_WEB_STT_ENDPOINT alone enable voice, or should PI_WEB_STT=1 also be required?
- What should the default model be: base.en, small.en, or a user prompt during setup?
- Should the sidecar API eventually support streaming partial transcripts, or is request/response enough for the first version?
Non-goals for first version
- Bundling model weights in pi-web.
- Adding Python/PyTorch/NeMo dependencies to pi-web.
- Supporting real-time streaming transcription.
- Auto-submitting voice transcripts to the agent.
- Treating voice input as a Pi agent extension.
- Supporting every STT backend in the first release.
Proposed first milestone
- Add the generic pi-web STT bridge and mic UI.
- Create @ashwin-pc/pi-web-voice with whisper.cpp Docker support.
- Document a simple path:
npm install -g @ashwin-pc/pi-web @ashwin-pc/pi-web-voice
pi-web-voice setup whisper
pi-web-voice serve
PI_WEB_STT_ENDPOINT=http://127.0.0.1:8789/inference pi-web
- Add Parakeet as a later advanced provider once the sidecar contract is stable.