codemyriad/kyutai_modal

Kyutai STT — Real-time Streaming Speech-to-Text (Modal or Local)

Run Kyutai's streaming STT model with ~0.5 second latency. This repo ships a Python WebSocket proxy + Rust moshi-server and supports two deployment paths: Modal (serverless GPU) and local Proxmox LXC with GPU passthrough.

You: "Hello, how are you today?"
     ↓ ~500ms
Server: {"type": "token", "text": " Hello"}
        {"type": "token", "text": ","}
        {"type": "token", "text": " how"}
        ...

What this is / isn't

This repo is:

  • A production-grade streaming STT service (Python proxy + Rust moshi-server)
  • Deployable on Modal or locally in Proxmox LXC with GPU passthrough
  • A simple WebSocket protocol (raw PCM in, JSON tokens out)

This repo is not:

  • A hosted SaaS or web UI
  • A general ASR benchmark suite
  • A batch transcription pipeline

Choose your path

  • Modal deployment: go to Quick Start (Modal)
  • Local Proxmox deployment: go to Local Deployment
  • Client usage (CLI / API): go to Usage

Features

  • Low latency: First token in ~0.5s using moshi streaming architecture
  • Real-time: Token-by-token transcription over WebSocket
  • Scalable: Auto-scales from 0 to handle concurrent sessions
  • Cost-effective: Pay only for GPU time used (scales to zero when idle)
  • Simple protocol: Send raw PCM float32 audio, receive JSON tokens

Prerequisites

  • Modal account (free tier available) if using Modal
  • uv package manager
  • Python 3.11+

Quick Start (Modal)

1. Install Modal CLI and authenticate

uvx modal setup

2. Clone and deploy

git clone https://github.com/YOUR_USERNAME/kyutai-stt-modal.git
cd kyutai-stt-modal

# Deploy to Modal
uvx modal deploy src/stt/modal_app.py

3. Set up environment

Go to your Modal workspace settings and create proxy auth tokens. Then set environment variables:

# Your Modal workspace name (shown in Modal dashboard URL)
export MODAL_WORKSPACE=your-workspace-name

# Proxy auth credentials
export MODAL_KEY=your-key
export MODAL_SECRET=your-secret

4. Test with your microphone

# Install dependencies and run
uv run scripts/transcribe_cli.py

Speak into your microphone and see real-time transcription.

Local Deployment (Proxmox LXC)

Local deployment assets live in deploy/proxmox/:

  • deploy/proxmox/PROGRESS.md: step-by-step runbook + status
  • deploy/proxmox/DEPLOY_LOG.md: full build log
  • deploy/proxmox/local_server.py: local proxy entrypoint
  • deploy/proxmox/kyutai-stt.service: systemd unit
  • deploy/proxmox/smoke-test.py: quick WebSocket test

If you already deployed the local service, it exposes the same WebSocket API. Use the CLI with --service local or override with --url:

uv run scripts/transcribe_cli.py --service local

You can also set a custom URL:

LOCAL_WS_URL=ws://192.168.1.101:8000/v1/stream uv run scripts/transcribe_cli.py --service local

Usage

WebSocket API

Connect to wss://{MODAL_WORKSPACE}--kyutai-stt-rust-kyutaisttrustservice-serve.modal.run/v1/stream

Protocol:

  • Client sends: Raw PCM float32 (little-endian) audio bytes (24kHz mono). Send in ~80ms chunks for low latency.
  • Server sends: JSON messages
{"type": "token", "text": " Hello"}     // Transcription token
{"type": "vad_end"}                      // Voice activity ended (sentence boundary)
{"type": "ping"}                         // Keepalive (ignore)
{"type": "error", "message": "..."}      // Error occurred

Authentication: Include headers Modal-Key and Modal-Secret with your credentials.
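At 24kHz, an ~80ms chunk works out to 1920 samples (7680 bytes as float32). A minimal stdlib-only sketch of packing samples into little-endian float32 chunks — the sine wave is a stand-in for real microphone audio, not part of this repo:

```python
import math
import struct

SAMPLE_RATE = 24000  # Hz, required by the service
CHUNK_MS = 80        # recommended chunk duration
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 1920 samples per chunk

# Stand-in audio: one second of a 440 Hz sine wave as float samples
samples = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
           for n in range(SAMPLE_RATE)]

def to_chunks(samples):
    """Pack float32 little-endian PCM into ~80ms byte chunks."""
    for i in range(0, len(samples), CHUNK_SAMPLES):
        frame = samples[i:i + CHUNK_SAMPLES]
        yield struct.pack(f"<{len(frame)}f", *frame)

chunks = list(to_chunks(samples))
# 24000 samples / 1920 per chunk = 12.5 -> 13 chunks; 4 bytes per sample
print(len(chunks), len(chunks[0]))  # 13 7680
```

Each of these byte strings is a valid WebSocket message for the server; the final chunk may be shorter than 7680 bytes, which is fine.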

Python Client Example

import asyncio
import websockets
import json
import os

import soundfile  # any loader that yields float32 PCM works

async def transcribe():
    workspace = os.environ["MODAL_WORKSPACE"]
    uri = f"wss://{workspace}--kyutai-stt-rust-kyutaisttrustservice-serve.modal.run/v1/stream"
    headers = {
        "Modal-Key": os.environ["MODAL_KEY"],
        "Modal-Secret": os.environ["MODAL_SECRET"],
    }

    async with websockets.connect(uri, additional_headers=headers) as ws:
        # Send raw PCM float32 (LE) audio bytes sampled at 24kHz mono
        audio, sample_rate = soundfile.read("audio.wav", dtype="float32")
        assert sample_rate == 24000, "resample to 24kHz before sending"
        await ws.send(audio.tobytes())

        # Receive tokens as they are generated
        async for message in ws:
            data = json.loads(message)
            if data["type"] == "token":
                print(data["text"], end="", flush=True)
            elif data["type"] == "vad_end":
                print()  # New line after sentence

asyncio.run(transcribe())

CLI Tools

# Real-time microphone transcription
uv run scripts/transcribe_cli.py

# List available audio devices
uv run scripts/transcribe_cli.py --list-devices

# Use specific microphone
uv run scripts/transcribe_cli.py --device 2

# Use local Proxmox service
uv run scripts/transcribe_cli.py --service local

# Latency benchmark
uv run scripts/latency_test.py -p 4

Configuration

Client Environment Variables

Variable         Required  Description
MODAL_WORKSPACE  Yes       Your Modal workspace name
MODAL_KEY        Yes       Modal proxy auth key
MODAL_SECRET     Yes       Modal proxy auth secret
LOCAL_WS_URL     No        Local service WebSocket URL (used with --service local)

Server Configuration (set when deploying)

Variable    Default              Description
KYUTAI_GPU  L40S                 GPU type (L4, A10G, L40S, A100, H100)
BATCH_SIZE  8                    Max concurrent sessions per container
MODEL_NAME  kyutai/stt-1b-en_fr  Model to use (1B or 2.6B)

Example:

KYUTAI_GPU=L4 uvx modal deploy src/stt/modal_app.py

GPU Selection & Pricing (Modal)

Modal bills per-second. Choose based on your latency and cost requirements:

Default deployment targets L40S (fastest in our latest runs). Override with KYUTAI_GPU=... when deploying to test other cards.

GPU             VRAM  Cost/Hour  First Token Latency (warm, rtf=3, 4 streams)
L40S (default)  48GB  $—         ~0.58s (avg)
A10G            24GB  $1.10      ~0.65s (avg)
L4              24GB  $0.80      ~1.06s (avg)
A100            80GB  $2.78      ~0.5s*
T4              16GB  $0.59      ~0.7s*

*A100 and T4 are older baselines; latest run covered L4/L40S/A10G. Command used:

uv run scripts/latency_test.py --compare-gpus "L4,L40S,A10G" -p 4 --runs 1 --rtf 3

Warmups showed first-token times of ~1.22s (L4), ~0.76s (L40S), ~0.96s (A10G); test phase averages above reflect first-token latency with 4 parallel streams on warm containers.

Benchmarks: First token latency measured with 8 seconds of audio on warm container.

Scaling: Each container handles up to 8 concurrent sessions (configurable via BATCH_SIZE). Modal automatically scales containers to handle more concurrent users (up to 10 containers by default).

Benchmark Different GPUs

# Deploy and test each GPU (creates separate apps)
uv run scripts/latency_test.py --compare-gpus "T4,L4,A10G,A100" -p 4

# Cleanup after benchmarking
uvx modal app stop kyutai-stt-t4
uvx modal app stop kyutai-stt-l4
uvx modal app stop kyutai-stt-a10g
uvx modal app stop kyutai-stt-a100

How It Works

  1. Audio Capture: Client captures microphone audio at 24kHz mono and streams raw PCM float32
  2. WebSocket Streaming: PCM chunks (~80ms) are streamed to the Python proxy over WebSocket
  3. Rust Server: A Python proxy forwards audio to the internal Rust moshi-server (supports batched inference)
  4. Neural Codec: Audio is encoded with Mimi neural codec (80ms frames)
  5. Language Model: Each frame is processed by the streaming transformer for immediate token output
  6. Token Streaming: Tokens are sent back immediately as they're generated (~0.5s latency)

The key to low latency is the moshi streaming architecture - instead of waiting for complete utterances, the model processes audio frame-by-frame and outputs tokens incrementally. The Rust server enables efficient batched processing of multiple concurrent streams.
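Because the model consumes audio one 80ms frame at a time, a live client should also pace its sends at roughly that cadence rather than dumping the whole buffer at once. A sketch of real-time pacing with asyncio — no network involved; `fake_send` is a stand-in for `ws.send`, not code from this repo:

```python
import asyncio

CHUNK_SECONDS = 0.08  # one 80ms Mimi frame of audio per send

async def stream_paced(chunks, send):
    """Send chunks at real-time rate so the server sees a live stream."""
    loop = asyncio.get_running_loop()
    next_deadline = loop.time()
    for chunk in chunks:
        await send(chunk)
        next_deadline += CHUNK_SECONDS
        # Sleep until the next 80ms frame boundary (clamped at zero)
        await asyncio.sleep(max(0.0, next_deadline - loop.time()))

# Demo: five 80ms chunks of silence, recorded instead of sent over a socket
sent = []

async def fake_send(chunk):
    sent.append(chunk)

asyncio.run(stream_paced([bytes(7680)] * 5, fake_send))
print(len(sent))  # 5
```

Pacing against absolute deadlines (rather than sleeping a fixed 80ms after each send) keeps the stream from drifting when individual sends take a few milliseconds.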

Cost Optimization

The deployment is configured for cost efficiency:

  • Scale to zero: Containers shut down after 60s of no connections
  • Idle timeout: WebSocket connections close after 10s without audio
  • No buffer containers: Only spin up containers when needed

Typical behavior and costs:

  • Cold start: ~20-30s (first request after scale-down)
  • Warm request: ~0.5s latency
  • GPU time: Only billed while processing
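Since billing is per-second of active GPU time, a rough monthly estimate is just hours used times the hourly rate. An illustration (hypothetical usage numbers, not a quote) using the L4 rate from the GPU table above:

```python
# Hypothetical usage: 30 minutes of active GPU time per day on an L4
hours_per_day = 0.5
rate_per_hour = 0.80  # L4 rate from the GPU table above, USD

monthly_cost = hours_per_day * rate_per_hour * 30
print(f"${monthly_cost:.2f}/month")  # $12.00/month
```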

Troubleshooting

401 Unauthorized: Check your MODAL_KEY and MODAL_SECRET are set correctly.

Connection timeout: The service may have scaled to zero. First request takes 20-30s for cold start.

No transcription: Ensure you're sending raw PCM float32 24kHz mono audio (little-endian).

High latency: Check your GPU selection. T4 is cheapest but slower than A10G/A100.

License

MIT

Acknowledgments

  • Kyutai for the STT model and moshi library
  • Modal for serverless GPU infrastructure
