Run Kyutai's streaming STT model with ~0.5 second latency. This repo ships a Python WebSocket proxy + Rust moshi-server and supports two deployment paths: Modal (serverless GPU) and local Proxmox LXC with GPU passthrough.
```
You: "Hello, how are you today?"
        ↓ ~500ms
Server: {"type": "token", "text": " Hello"}
        {"type": "token", "text": ","}
        {"type": "token", "text": " how"}
        ...
```
This repo is:
- A production-grade streaming STT service (Python proxy + Rust moshi-server)
- Deployable on Modal or locally in Proxmox LXC with GPU passthrough
- A simple WebSocket protocol (raw PCM in, JSON tokens out)
This repo is not:
- A hosted SaaS or web UI
- A general ASR benchmark suite
- A batch transcription pipeline
- Modal deployment: go to Quick Start (Modal)
- Local Proxmox deployment: go to Local Deployment
- Client usage (CLI / API): go to Usage
- Low latency: First token in ~0.5s using moshi streaming architecture
- Real-time: Token-by-token transcription over WebSocket
- Scalable: Auto-scales from 0 to handle concurrent sessions
- Cost-effective: Pay only for GPU time used (scales to zero when idle)
- Simple protocol: Send raw PCM float32 audio, receive JSON tokens
- Modal account (free tier available) if using Modal
- uv package manager
- Python 3.11+
```bash
uvx modal setup
git clone https://github.com/YOUR_USERNAME/kyutai-stt-modal.git
cd kyutai-stt-modal

# Deploy to Modal
uvx modal deploy src/stt/modal_app.py
```

Go to your Modal workspace settings and create proxy auth tokens. Then set environment variables:
```bash
# Your Modal workspace name (shown in the Modal dashboard URL)
export MODAL_WORKSPACE=your-workspace-name

# Proxy auth credentials
export MODAL_KEY=your-key
export MODAL_SECRET=your-secret

# Install dependencies and run
uv run scripts/transcribe_cli.py
```

Speak into your microphone and see real-time transcription.
Local deployment assets live in deploy/proxmox/:
- deploy/proxmox/PROGRESS.md: step-by-step runbook + status
- deploy/proxmox/DEPLOY_LOG.md: full build log
- deploy/proxmox/local_server.py: local proxy entrypoint
- deploy/proxmox/kyutai-stt.service: systemd unit
- deploy/proxmox/smoke-test.py: quick WebSocket test
If you already deployed the local service, it exposes the same WebSocket API. Use the CLI with --service local or override with --url:
```bash
uv run scripts/transcribe_cli.py --service local
```

You can also set a custom URL:

```bash
LOCAL_WS_URL=ws://192.168.1.101:8000/v1/stream uv run scripts/transcribe_cli.py --service local
```

Connect to:

```
wss://{MODAL_WORKSPACE}--kyutai-stt-rust-kyutaisttrustservice-serve.modal.run/v1/stream
```
Protocol:
- Client sends: Raw PCM float32 (little-endian) audio bytes (24kHz mono). Send in ~80ms chunks for low latency.
- Server sends: JSON messages
```
{"type": "token", "text": " Hello"}   // Transcription token
{"type": "vad_end"}                   // Voice activity ended (sentence boundary)
{"type": "ping"}                      // Keepalive (ignore)
{"type": "error", "message": "..."}   // Error occurred
```

Authentication: Include headers Modal-Key and Modal-Secret with your credentials.
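At 24 kHz mono float32, an 80 ms chunk works out to 1920 samples, or 7680 bytes. A minimal chunking sketch (the helper name is illustrative, not part of this repo):

```python
import struct

SAMPLE_RATE = 24_000                             # Hz, mono
CHUNK_MS = 80                                    # matches the server's 80 ms frames
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000   # 1920 samples per chunk
CHUNK_BYTES = CHUNK_SAMPLES * 4                  # float32 -> 7680 bytes

def iter_chunks(samples):
    """Yield little-endian float32 byte chunks of ~80 ms each."""
    for start in range(0, len(samples), CHUNK_SAMPLES):
        frame = samples[start:start + CHUNK_SAMPLES]
        yield struct.pack(f"<{len(frame)}f", *frame)

# One second of audio: 12 full chunks plus one partial trailing chunk
chunks = list(iter_chunks([0.0] * SAMPLE_RATE))
```

Each yielded chunk can be passed directly to `ws.send(...)`.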
```python
import asyncio
import json
import os

import soundfile
import websockets


async def transcribe():
    workspace = os.environ["MODAL_WORKSPACE"]
    uri = f"wss://{workspace}--kyutai-stt-rust-kyutaisttrustservice-serve.modal.run/v1/stream"
    headers = {
        "Modal-Key": os.environ["MODAL_KEY"],
        "Modal-Secret": os.environ["MODAL_SECRET"],
    }
    async with websockets.connect(uri, additional_headers=headers) as ws:
        # Send raw PCM float32 (little-endian) audio bytes sampled at 24kHz mono
        audio, _ = soundfile.read("audio.wav", dtype="float32")
        pcm_audio_bytes = audio.astype("float32").tobytes()
        await ws.send(pcm_audio_bytes)

        # Receive tokens as they arrive
        async for message in ws:
            data = json.loads(message)
            if data["type"] == "token":
                print(data["text"], end="", flush=True)
            elif data["type"] == "vad_end":
                print()  # New line after sentence


asyncio.run(transcribe())
```

```bash
# Real-time microphone transcription
uv run scripts/transcribe_cli.py

# List available audio devices
uv run scripts/transcribe_cli.py --list-devices

# Use specific microphone
uv run scripts/transcribe_cli.py --device 2

# Use local Proxmox service
uv run scripts/transcribe_cli.py --service local

# Latency benchmark
uv run scripts/latency_test.py -p 4
```

| Variable | Required | Description |
|---|---|---|
| MODAL_WORKSPACE | Yes | Your Modal workspace name |
| MODAL_KEY | Yes | Modal proxy auth key |
| MODAL_SECRET | Yes | Modal proxy auth secret |
| LOCAL_WS_URL | No | Local service WebSocket URL (used with --service local) |
| Variable | Default | Description |
|---|---|---|
| KYUTAI_GPU | L40S | GPU type (L4, A10G, L40S, A100, H100) |
| BATCH_SIZE | 8 | Max concurrent sessions per container |
| MODEL_NAME | kyutai/stt-1b-en_fr | Model to use (1B or 2.6B) |
Example:
```bash
KYUTAI_GPU=L4 uvx modal deploy src/stt/modal_app.py
```

Modal bills per-second. Choose a GPU based on your latency and cost requirements:
Default deployment targets L40S (fastest in our latest runs). Override with KYUTAI_GPU=... when deploying to test other cards.
| GPU | VRAM | Cost/Hour | First Token Latency (warm, rtf=3, 4 streams) |
|---|---|---|---|
| L40S (default) | 48GB | $— | ~0.58s (avg) |
| A10G | 24GB | $1.10 | ~0.65s (avg) |
| L4 | 24GB | $0.80 | ~1.06s (avg) |
| A100 | 80GB | $2.78 | ~0.5s* |
| T4 | 16GB | $0.59 | ~0.7s* |
*A100 and T4 are older baselines; latest run covered L4/L40S/A10G. Command used:
```bash
uv run scripts/latency_test.py --compare-gpus "L4,L40S,A10G" -p 4 --runs 1 --rtf 3
```
Warmups showed first-token times of ~1.22s (L4), ~0.76s (L40S), ~0.96s (A10G); test phase averages above reflect first-token latency with 4 parallel streams on warm containers.
Benchmarks: First-token latency measured with 8 seconds of audio on a warm container.
Scaling: Each container handles up to 8 concurrent sessions (configurable via BATCH_SIZE). Modal automatically scales containers to handle more concurrent users (up to 10 containers by default).
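With the defaults above, peak concurrency is simply sessions-per-container times the container cap:

```python
BATCH_SIZE = 8        # concurrent sessions per container (default)
MAX_CONTAINERS = 10   # default container cap for the app

# Maximum concurrent streams at full scale-out
peak_sessions = BATCH_SIZE * MAX_CONTAINERS  # 80
```

Raise BATCH_SIZE (more VRAM per container) or the container cap (more GPUs) to serve more users.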
```bash
# Deploy and test each GPU (creates separate apps)
uv run scripts/latency_test.py --compare-gpus "T4,L4,A10G,A100" -p 4

# Cleanup after benchmarking
uvx modal app stop kyutai-stt-t4
uvx modal app stop kyutai-stt-l4
uvx modal app stop kyutai-stt-a10g
uvx modal app stop kyutai-stt-a100
```

- Audio Capture: Client captures microphone audio at 24kHz mono and streams raw PCM float32
- WebSocket Streaming: PCM chunks (~80ms) are streamed to the Python proxy over WebSocket
- Rust Server: A Python proxy forwards audio to the internal Rust moshi-server (supports batched inference)
- Neural Codec: Audio is encoded with Mimi neural codec (80ms frames)
- Language Model: Each frame is processed by the streaming transformer for immediate token output
- Token Streaming: Tokens are sent back immediately as they're generated (~0.5s latency)
The key to low latency is the moshi streaming architecture: instead of waiting for complete utterances, the model processes audio frame-by-frame and outputs tokens incrementally. The Rust server enables efficient batched processing of multiple concurrent streams.
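The frame-by-frame flow can be sketched schematically. This is an illustration of the pipeline shape only; `encode_frame` and `lm_step` are hypothetical stand-ins for the Mimi codec and the streaming transformer, not moshi-server internals:

```python
FRAME_SAMPLES = 1920  # 80 ms at 24 kHz

def encode_frame(pcm_frame):
    """Hypothetical stand-in for Mimi encoding one 80 ms frame."""
    return ("codes", len(pcm_frame))

def lm_step(codes, state):
    """Hypothetical stand-in for one streaming-transformer step."""
    state.append(codes)
    return f"tok{len(state)}", state

def stream(frames):
    state, tokens = [], []
    for frame in frames:          # one model step per 80 ms frame
        token, state = lm_step(encode_frame(frame), state)
        tokens.append(token)      # emitted immediately, not buffered
    return tokens

tokens = stream([[0.0] * FRAME_SAMPLES] * 3)
```

The point is structural: output is produced inside the per-frame loop, so latency is bounded by the frame size plus model step time rather than utterance length.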
The deployment is configured for cost efficiency:
- Scale to zero: Containers shut down after 60s of no connections
- Idle timeout: WebSocket connections close after 10s without audio
- No buffer containers: Only spin up containers when needed
Typical behavior and costs:
- Cold start: ~20-30s (first request after scale-down)
- Warm request: ~0.5s latency
- GPU time: Only billed while processing
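As a rough per-session figure, using the A10G price from the GPU table and the default batch size:

```python
GPU_COST_PER_HOUR = 1.10  # A10G hourly price from the GPU table above
BATCH_SIZE = 8            # default concurrent sessions per container

# Effective per-session cost when a container runs at full batch
cost_per_session_hour = GPU_COST_PER_HOUR / BATCH_SIZE  # ~$0.14
```

Actual cost per session is higher when containers run below full batch, since the GPU is billed regardless of occupancy.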
401 Unauthorized: Check your MODAL_KEY and MODAL_SECRET are set correctly.
Connection timeout: The service may have scaled to zero. First request takes 20-30s for cold start.
No transcription: Ensure you're sending raw PCM float32 24kHz mono audio (little-endian).
High latency: Check your GPU selection. T4 is cheapest but slower than A10G/A100.
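A common cause of the no-transcription case is sending int16 PCM (the usual microphone capture format) instead of float32. A minimal conversion sketch, assuming little-endian int16 input; the helper name is illustrative:

```python
import struct

def int16_to_float32_le(pcm16: bytes) -> bytes:
    """Convert little-endian int16 PCM to little-endian float32 in [-1.0, 1.0)."""
    n = len(pcm16) // 2
    samples = struct.unpack(f"<{n}h", pcm16)
    return struct.pack(f"<{n}f", *(s / 32768.0 for s in samples))

# Full-scale negative and half-scale positive int16 samples
out = int16_to_float32_le(struct.pack("<2h", -32768, 16384))
```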
MIT