Proposal: optional local voice input via pi-web-voice sidecar package

## Summary

Add optional local speech-to-text support to pi-web without bundling voice models or ML dependencies into the core app.

Proposed packaging split:

```text
@ashwin-pc/pi-web
  - Core local/Tailscale web UI
  - Composer, agent session, auth, WebSocket, settings
  - No bundled STT models, Python, CUDA, PyTorch, or ffmpeg dependency
  - Optional generic STT client that talks to a local HTTP endpoint

@ashwin-pc/pi-web-voice
  - Optional sidecar package
  - Owns voice model setup, health checks, model downloads, and sidecar launch
  - Provides a stable local HTTP transcription API for pi-web
```

The goal is to keep pi-web minimalist and installable on most machines while giving users who want local voice input an easy add-on path.

## Motivation

Voice input is useful, but local STT brings heavy and variable dependencies:

- Whisper / whisper.cpp model files
- ffmpeg or audio conversion support
- Docker or native binaries
- Python / PyTorch / NeMo for Parakeet
- Optional CUDA, Core ML, Vulkan, or other acceleration paths

These dependencies should not be part of the default `pi-web` install. Users who do not want voice should not encounter model downloads, Docker errors, native build failures, GPU issues, or large install sizes.

## Design principle

Voice input should be treated as a composer input method, not as an agent extension or core agent feature.

```text
Browser microphone
  -> MediaRecorder audio blob
  -> pi-web /api/stt/transcribe
  -> local sidecar HTTP endpoint
  -> transcript text
  -> insert into composer for review
```

The transcript should be inserted into the textarea and not auto-submitted. Users can edit before sending.
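
As a rough illustration of "insert for review", the splice could be a pure function that the composer calls with the textarea's current value and selection. The name `insertAtCursor` and the whitespace handling are assumptions for this sketch, not part of the proposal.

```typescript
// Hypothetical helper: splice a transcript into the composer text at the
// current selection and report where the caret should land afterwards.
interface InsertResult {
  value: string;
  caret: number;
}

function insertAtCursor(
  current: string,
  selectionStart: number,
  selectionEnd: number,
  transcript: string,
): InsertResult {
  const text = transcript.trim();
  // Nothing usable came back: leave the composer untouched.
  if (text === "") return { value: current, caret: selectionEnd };

  const before = current.slice(0, selectionStart);
  const after = current.slice(selectionEnd);
  // Add a separating space only when the neighbouring character is not whitespace.
  const lead = before !== "" && !/\s$/.test(before) ? " " : "";
  const trail = after !== "" && !/^\s/.test(after) ? " " : "";

  const value = before + lead + text + trail + after;
  return { value, caret: (before + lead + text).length };
}
```

The function only returns new text and a caret position; it never submits anything, which keeps the review step in the user's hands.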

## Core pi-web responsibilities

`@ashwin-pc/pi-web` should only include a thin, generic STT bridge:

- Add a microphone button in the composer only when STT is configured and reachable.
- Record audio in the browser using `MediaRecorder`.
- POST audio to a pi-web API route, e.g. `POST /api/stt/transcribe`.
- Forward the audio from the pi-web server to the configured sidecar endpoint.
- Insert returned transcript into the composer at the cursor.
- Report helpful unavailable states in the UI.
- Do not install, download, or import any STT model implementation.

Suggested pi-web env vars:

```bash
PI_WEB_STT_ENDPOINT=http://127.0.0.1:8789/inference
PI_WEB_STT_TIMEOUT_MS=60000
PI_WEB_STT_MAX_BYTES=25000000
```

Optional:

```bash
PI_WEB_STT=1
```

Setting `PI_WEB_STT_ENDPOINT` alone may be enough to enable the bridge, in which case this flag is unnecessary.
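
A minimal sketch of how the server might read these variables. The function name `readSttConfig`, the fallback values (copied from the examples above), and the choice to treat the endpoint alone as the enabling switch are all assumptions; the last one is still an open question.

```typescript
interface SttConfig {
  enabled: boolean;
  endpoint: string | null;
  timeoutMs: number;
  maxBytes: number;
}

// Parse a positive integer env value, falling back when unset or invalid.
function positiveInt(raw: string | undefined, fallback: number): number {
  const n = Number(raw);
  return raw !== undefined && Number.isInteger(n) && n > 0 ? n : fallback;
}

// Hypothetical reader for the suggested PI_WEB_STT_* variables.
function readSttConfig(env: Record<string, string | undefined>): SttConfig {
  const endpoint = env.PI_WEB_STT_ENDPOINT?.trim() || null;
  return {
    // Assumption: a non-empty endpoint is enough to enable the bridge.
    enabled: endpoint !== null,
    endpoint,
    timeoutMs: positiveInt(env.PI_WEB_STT_TIMEOUT_MS, 60000),
    maxBytes: positiveInt(env.PI_WEB_STT_MAX_BYTES, 25000000),
  };
}
```

In a Node server this would typically be called once at startup as `readSttConfig(process.env)`.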

## Sidecar package responsibilities

`@ashwin-pc/pi-web-voice` should own all STT-specific complexity:

- Setup commands for supported providers.
- Model download/cache management.
- Docker or native backend selection.
- ffmpeg/audio conversion handling.
- Launching the local HTTP sidecar.
- Health checks and diagnostics.
- Provider-specific implementations for whisper.cpp, Parakeet, etc.

Suggested binary:

```bash
pi-web-voice
```

Suggested commands:

```bash
pi-web-voice doctor
pi-web-voice setup whisper
pi-web-voice setup whisper --model small.en
pi-web-voice serve
pi-web-voice serve --provider whisper
pi-web-voice serve --provider parakeet
```
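
The command surface above is small enough to parse by hand. The sketch below is one possible shape, assuming only the commands and flags listed here; `parseCommand` and the `Command` type are hypothetical names.

```typescript
type Command =
  | { kind: "doctor" }
  | { kind: "setup"; provider: string; model?: string }
  | { kind: "serve"; provider?: string };

// Return the value following a flag such as `--model`, if present.
function flagValue(args: string[], flag: string): string | undefined {
  const i = args.indexOf(flag);
  return i >= 0 && i + 1 < args.length ? args[i + 1] : undefined;
}

// Hypothetical parser for the arguments that follow `pi-web-voice`.
function parseCommand(argv: string[]): Command {
  const [name, ...rest] = argv;
  switch (name) {
    case "doctor":
      return { kind: "doctor" };
    case "setup": {
      const provider = rest[0];
      if (!provider || provider.startsWith("--")) {
        throw new Error("usage: pi-web-voice setup <provider> [--model <name>]");
      }
      return { kind: "setup", provider, model: flagValue(rest, "--model") };
    }
    case "serve":
      return { kind: "serve", provider: flagValue(rest, "--provider") };
    default:
      throw new Error(`unknown command: ${name ?? "(none)"}`);
  }
}
```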

## Sidecar HTTP contract

Keep the contract deliberately small and provider-agnostic.

```http
GET /health
POST /inference
```

`GET /health` response:

```json
{
  "ok": true,
  "provider": "whisper-cpp",
  "model": "base.en",
  "backend": "docker"
}
```

`POST /inference`:

- Accept multipart form data with an audio file field named `file`.
- Return JSON with at least `text`.

```json
{
  "text": "transcribed text here"
}
```

Optional future fields:

```json
{
  "text": "transcribed text here",
  "language": "en",
  "durationMs": 1234,
  "segments": []
}
```
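
On the pi-web side, the response can be validated before anything reaches the composer. This is a sketch under the contract above: only `text` is required, and unknown fields are ignored. `parseInferenceResponse` and the `Transcription` type are hypothetical names.

```typescript
interface Transcription {
  text: string;
  language?: string;
  durationMs?: number;
}

// Hypothetical validator for the JSON body returned by `POST /inference`.
function parseInferenceResponse(body: unknown): Transcription {
  if (typeof body !== "object" || body === null) {
    throw new Error("sidecar response is not a JSON object");
  }
  const record = body as Record<string, unknown>;

  const text = record["text"];
  if (typeof text !== "string") {
    throw new Error("sidecar response is missing a string `text` field");
  }

  // Optional fields are copied only when they have the expected type.
  const result: Transcription = { text };
  const language = record["language"];
  if (typeof language === "string") result.language = language;
  const durationMs = record["durationMs"];
  if (typeof durationMs === "number") result.durationMs = durationMs;
  return result;
}
```

Ignoring extra fields lets providers add data such as `segments` later without breaking older pi-web versions.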

## Recommended user flow

### Install pi-web

```bash
npm install -g @ashwin-pc/pi-web
pi-web
```

No voice dependencies are installed by default.

### Add voice

```bash
npm install -g @ashwin-pc/pi-web-voice
pi-web-voice setup whisper
pi-web-voice serve
```

Then start pi-web with:

```bash
PI_WEB_STT_ENDPOINT=http://127.0.0.1:8789/inference pi-web
```

When the sidecar is reachable, pi-web shows a microphone button in the composer.

## Recommended default provider: whisper.cpp

Use whisper.cpp as the default setup path because it is more practical for broad local installs than Python-based Whisper or Parakeet:

- CPU-friendly path for most machines.
- Docker path avoids native build complexity.
- Native path can be documented for users who prefer it.
- Acceleration paths can be added later.

Default setup should probably use Docker first:

```bash
pi-web-voice setup whisper
```

That command can:

1. Create a model cache directory, e.g. `~/.cache/pi-web-voice/models`.
2. Pull or verify the whisper.cpp Docker image.
3. Download a default model such as `base.en`.
4. Write a config file, e.g. `~/.config/pi-web-voice/config.json`.

Example generated config:

```json
{
  "provider": "whisper-cpp",
  "backend": "docker",
  "model": "base.en",
  "port": 8789
}
```

Then:

```bash
pi-web-voice serve
```

starts a local-only HTTP service on `127.0.0.1:8789`.
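
For illustration, the sidecar could merge the config file with defaults before serving. The defaults mirror the generated config above; the `"native"` backend value and the names `resolveVoiceConfig` and `listenAddress` are assumptions.

```typescript
interface VoiceConfig {
  provider: string;
  backend: "docker" | "native"; // "native" is an assumed value for the non-Docker path
  model: string;
  port: number;
}

const DEFAULT_CONFIG: VoiceConfig = {
  provider: "whisper-cpp",
  backend: "docker",
  model: "base.en",
  port: 8789,
};

// Hypothetical: fill missing fields from defaults and reject unusable ports.
function resolveVoiceConfig(raw: Partial<VoiceConfig>): VoiceConfig {
  const config = { ...DEFAULT_CONFIG, ...raw };
  if (!Number.isInteger(config.port) || config.port < 1 || config.port > 65535) {
    throw new Error(`invalid port: ${config.port}`);
  }
  return config;
}

// The service is local-only, so the host is fixed to the loopback address.
function listenAddress(config: VoiceConfig): string {
  return `127.0.0.1:${config.port}`;
}
```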

## Parakeet support

Parakeet should be treated as an advanced provider, likely best for users with NVIDIA GPU setups.

Possible setup:

```bash
pi-web-voice setup parakeet
pi-web-voice serve --provider parakeet
```

The Parakeet implementation should live entirely in `@ashwin-pc/pi-web-voice`, not in pi-web core. It may require Python, PyTorch, NeMo, CUDA, and audio format conversion to the model's expected input format.

## pi-web UI states

The mic button should explain what is happening instead of failing silently.

Possible tooltip/status messages:

```text
Voice input unavailable: set PI_WEB_STT_ENDPOINT and start a local STT sidecar.
```

```text
Voice input unavailable: could not reach http://127.0.0.1:8789/inference.
```

```text
Voice input ready: whisper-cpp at 127.0.0.1:8789.
```

During recording/transcription:

```text
Recording… click to stop
Transcribing…
Transcript inserted into composer
```
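
One way to keep these messages consistent is to derive them from a single state value. The state names below are hypothetical; the strings are the ones listed above.

```typescript
type VoiceUiState =
  | { kind: "unconfigured" }
  | { kind: "unreachable"; endpoint: string }
  | { kind: "ready"; provider: string; host: string }
  | { kind: "recording" }
  | { kind: "transcribing" }
  | { kind: "inserted" };

// Hypothetical mapping from mic state to the tooltip/status text.
function voiceStatusMessage(state: VoiceUiState): string {
  switch (state.kind) {
    case "unconfigured":
      return "Voice input unavailable: set PI_WEB_STT_ENDPOINT and start a local STT sidecar.";
    case "unreachable":
      return `Voice input unavailable: could not reach ${state.endpoint}.`;
    case "ready":
      return `Voice input ready: ${state.provider} at ${state.host}.`;
    case "recording":
      return "Recording… click to stop";
    case "transcribing":
      return "Transcribing…";
    case "inserted":
      return "Transcript inserted into composer";
  }
}
```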

## pi-web API shape

Suggested core routes:

```http
GET /api/stt/status
POST /api/stt/transcribe
```

`GET /api/stt/status` examples:

```json
{ "ok": true, "enabled": false, "reason": "PI_WEB_STT_ENDPOINT is not set" }
```

```json
{
  "ok": true,
  "enabled": true,
  "endpoint": "http://127.0.0.1:8789/inference",
  "provider": "whisper-cpp",
  "model": "base.en"
}
```

```json
{
  "ok": false,
  "enabled": true,
  "error": "Could not reach STT sidecar"
}
```
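
The three status shapes above can be produced from two inputs: the configured endpoint and the result of a sidecar `/health` probe. The sketch assumes the probe has already run and is passed in as `health` (`null` when the sidecar could not be reached); `buildSttStatus` is a hypothetical name.

```typescript
interface SidecarHealth {
  ok: boolean;
  provider?: string;
  model?: string;
}

type SttStatus =
  | { ok: true; enabled: false; reason: string }
  | { ok: true; enabled: true; endpoint: string; provider?: string; model?: string }
  | { ok: false; enabled: true; error: string };

// Hypothetical: derive the `GET /api/stt/status` body from config and health.
function buildSttStatus(endpoint: string | null, health: SidecarHealth | null): SttStatus {
  if (endpoint === null) {
    return { ok: true, enabled: false, reason: "PI_WEB_STT_ENDPOINT is not set" };
  }
  if (health === null || !health.ok) {
    return { ok: false, enabled: true, error: "Could not reach STT sidecar" };
  }
  return { ok: true, enabled: true, endpoint, provider: health.provider, model: health.model };
}
```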

`POST /api/stt/transcribe` should:

- Apply the same auth/token checks as other pi-web routes.
- Enforce max request size.
- Enforce timeout.
- Write temp files only if necessary.
- Clean up temp files.
- Forward multipart audio to the sidecar.
- Return `{ ok: true, text }` or `{ ok: false, error }`.
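
The size and timeout rules could be sketched as follows. To keep the example free of network code, the multipart POST to the sidecar is hidden behind a hypothetical `forward` parameter; auth and temp-file handling are out of scope here, and the error strings are placeholders.

```typescript
type TranscribeResult = { ok: true; text: string } | { ok: false; error: string };

// Hypothetical seam: performs the multipart POST and resolves with the transcript.
type Forward = (audio: Uint8Array) => Promise<string>;

async function transcribe(
  audio: Uint8Array,
  limits: { maxBytes: number; timeoutMs: number },
  forward: Forward,
): Promise<TranscribeResult> {
  // Enforce the upload limit before contacting the sidecar.
  if (audio.byteLength === 0) return { ok: false, error: "Empty audio upload" };
  if (audio.byteLength > limits.maxBytes) {
    return { ok: false, error: `Audio exceeds ${limits.maxBytes} bytes` };
  }

  // Race the sidecar call against a timer so a stuck sidecar cannot hang the request.
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("STT request timed out")), limits.timeoutMs);
  });

  try {
    const text = await Promise.race([forward(audio), timeout]);
    return { ok: true, text };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note that the race only stops waiting; a real implementation would probably also abort the outgoing request when the timer fires.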

## Security considerations

- Sidecar should bind to `127.0.0.1` by default.
- pi-web should forward audio only to the explicitly configured endpoint, not to arbitrary URLs.
- Limit upload size.
- Limit request duration.
- Do not auto-submit transcripts.
- Do not persist audio by default.
- Avoid running the sidecar as admin/root where possible.
- Make it clear that the local sidecar processes microphone audio.
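
Since the sidecar is meant to be local, pi-web could check whether the configured endpoint points at a loopback address. The helper below is a sketch using the standard `URL` class; whether a non-loopback endpoint should produce a warning or a refusal is not decided here, and `isLoopbackEndpoint` is a hypothetical name.

```typescript
// Hypothetical check: does the endpoint resolve to this machine only?
function isLoopbackEndpoint(endpoint: string): boolean {
  let url: URL;
  try {
    url = new URL(endpoint);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;

  const host = url.hostname;
  // The URL class keeps the brackets around IPv6 literals in `hostname`.
  return host === "localhost" || host === "[::1]" || /^127(\.\d{1,3}){3}$/.test(host);
}
```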

## Implementation checklist

Core pi-web:

- [ ] Add `server/stt.ts` for config/status/transcribe forwarding.
- [ ] Add `GET /api/stt/status`.
- [ ] Add `POST /api/stt/transcribe`.
- [ ] Add `src/composer/voice.ts` for MediaRecorder start/stop/transcribe flow.
- [ ] Add mic button to composer markup.
- [ ] Add DOM element wiring in `src/app/elements.ts`.
- [ ] Add mic icon to `src/app/icons.ts`.
- [ ] Insert transcript at cursor instead of auto-submitting.
- [ ] Hide or disable mic when no sidecar is configured.
- [ ] Add tests for disabled status, unreachable status, and transcript insertion.
- [ ] Add README docs for optional voice setup.

Voice package:

- [ ] Create `@ashwin-pc/pi-web-voice` package.
- [ ] Add `pi-web-voice` binary.
- [ ] Implement `doctor`.
- [ ] Implement `setup whisper`.
- [ ] Implement `serve`.
- [ ] Implement `/health`.
- [ ] Implement `/inference`.
- [ ] Add Docker-based whisper.cpp backend.
- [ ] Add native whisper.cpp backend if feasible.
- [ ] Add Parakeet provider later or behind an advanced flag.
- [ ] Document model cache paths and cleanup.

## Open questions

- Should `pi-web --voice` auto-discover and launch `pi-web-voice serve`, or should users run the sidecar explicitly first?
- Should the sidecar package be JS/Node-only and shell out to Docker/native binaries, or should provider implementations live in separate subpackages?
- Should `PI_WEB_STT_ENDPOINT` alone enable voice, or should `PI_WEB_STT=1` also be required?
- What should the default model be: `base.en`, `small.en`, or a user prompt during setup?
- Should the sidecar API eventually support streaming partial transcripts, or is request/response enough for the first version?

## Non-goals for first version

- Bundling model weights in pi-web.
- Adding Python/PyTorch/NeMo dependencies to pi-web.
- Supporting real-time streaming transcription.
- Auto-submitting voice transcripts to the agent.
- Treating voice input as a Pi agent extension.
- Supporting every STT backend in the first release.

## Proposed first milestone

1. Add the generic pi-web STT bridge and mic UI.
2. Create `@ashwin-pc/pi-web-voice` with whisper.cpp Docker support.
3. Document a simple path:

```bash
npm install -g @ashwin-pc/pi-web @ashwin-pc/pi-web-voice
pi-web-voice setup whisper
pi-web-voice serve
PI_WEB_STT_ENDPOINT=http://127.0.0.1:8789/inference pi-web
```

4. Add Parakeet as a later advanced provider once the sidecar contract is stable.
