A FastAPI server that exposes WhisperX as an OpenAI-compatible audio transcription API. Supports both a simple single-server mode and a horizontally scalable distributed mode backed by Kafka and S3.
- OpenAI-compatible — drop-in replacement for `/v1/audio/transcriptions` and `/v1/audio/translations`
- Alignment & diarization — word-level timestamps and speaker labels out of the box
- Multiple output formats — `json`, `verbose_json`, `vtt_json`, `srt`, `vtt`, `aud`, `text`
- Distributed mode — offload GPU work to dedicated workers via Kafka + S3 (MinIO)
- Pluggable backends — swap transcription, alignment, and diarization implementations per stage
- API key auth — single key or a JSON key-map for multi-client setups
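For multi-client setups, `API_KEYS_FILE` points at a JSON file mapping each key to a client name. The exact schema isn't reproduced in this README; an illustrative file (keys and names invented) might look like:

```json
{
  "sk-client-a-key": "client-a",
  "sk-client-b-key": "client-b"
}
```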
```shell
# GPU (CUDA)
docker compose --profile cuda up

# CPU
docker compose --profile cpu up
```

The API is available at http://localhost:8000.
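Once the server is up, a request can be made against the OpenAI-compatible endpoint. A minimal sketch, assuming the third-party `requests` package, a local `audio.mp3`, and an API key if auth is configured:

```python
# Sketch: POST an audio file for transcription.
# Requires a running server on localhost:8000; file name and key are illustrative.
import requests

with open("audio.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={"model": "whisper-1", "response_format": "json"},
    )
resp.raise_for_status()
print(resp.json()["text"])
```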
```shell
# Copy and edit credentials before first run
cp .env.example .env

# CUDA workers
docker compose -f compose-kafka.yaml --profile cuda up

# CPU workers
docker compose -f compose-kafka.yaml --profile cpu up

# Both worker types simultaneously
docker compose -f compose-kafka.yaml --profile cuda --profile cpu up
```

Workers process one job at a time per container. Scale horizontally by running multiple worker replicas.
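The contents of `.env.example` aren't reproduced here; at a minimum you would override the MinIO credentials named in the configuration section below (variable names from this README, values illustrative):

```
# Never ship the minioadmin defaults
MINIO_ROOT_USER=change-me
MINIO_ROOT_PASSWORD=a-long-random-secret
```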
All settings are environment variables. Nested fields use `__` as a delimiter (e.g. `WHISPER__MODEL=large-v3`).

All available settings are defined in `config.py`. Variables you'll most likely need to set:
| Variable | Default | Description |
|---|---|---|
| `WHISPER__MODEL` | `large-v3` | Transcription model name |
| `WHISPER__COMPUTE_TYPE` | `default` | Quantization — `float16` for GPU, `float32` for CPU |
| `WHISPER__INFERENCE_DEVICE` | `auto` | `cpu`, `cuda`, or `auto` |
| `HF_TOKEN` | — | Hugging Face token (required for pyannote diarization) |
| `API_KEY` | — | Single API key for all requests |
| `API_KEYS_FILE` | — | Path to JSON file mapping key → client name |
| `MODE` | `direct` | `direct` or `kafka` |
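The real loader lives in `config.py`; conceptually, the `__` delimiter folds flat variables into nested settings, roughly like this sketch (illustrative, not the actual implementation):

```python
def nest_env(env: dict) -> dict:
    """Fold flat VAR__SUBVAR names into nested dicts (illustrative sketch)."""
    config: dict = {}
    for key, value in env.items():
        parts = key.lower().split("__")
        node = config
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return config

print(nest_env({"WHISPER__MODEL": "large-v3", "WHISPER__COMPUTE_TYPE": "float16", "MODE": "direct"}))
# {'whisper': {'model': 'large-v3', 'compute_type': 'float16'}, 'mode': 'direct'}
```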
Additional variables for Kafka mode:
| Variable | Default | Description |
|---|---|---|
| `KAFKA__BOOTSTRAP_SERVERS` | `localhost:9092` | Kafka broker address |
| `S3__ENDPOINT_URL` | `http://localhost:9000` | S3 / MinIO endpoint |
| `S3__BUCKET` | `whisperx-audio` | Bucket for audio uploads |
| `MINIO_ROOT_USER` | `minioadmin` | MinIO root user — change before deploying |
| `MINIO_ROOT_PASSWORD` | `minioadmin` | MinIO root password — change before deploying |
Transcribe an audio file. Compatible with the OpenAI transcription API.
Form parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `file` | file | — | Audio file (required) |
| `model` | string | config default | Model name. `whisper-1` is aliased to the configured default. |
| `language` | string | config default | ISO-639-1 language code. Auto-detected if omitted. |
| `prompt` | string | — | Optional context/hotwords hint |
| `response_format` | string | `json` | `text`, `json`, `verbose_json`, `vtt_json`, `srt`, `vtt`, `aud` |
| `temperature` | float | `0.0` | Sampling temperature |
| `timestamp_granularities[]` | list | `["segment"]` | `segment`, `word` |
| `align` | bool | `true` | Enable word-level alignment (required for subtitle formats) |
| `diarize` | bool | `false` | Enable speaker diarization (requires `align=true`) |
| `speaker_embeddings` | bool | `false` | Include speaker embeddings in diarization output |
| `highlight_words` | bool | `false` | Highlight words in `vtt`/`srt` output |
| `suppress_numerals` | bool | `true` | Spell out numbers |
| `hotwords` | string | — | Comma-separated hotwords to bias toward |
| `batch_size` | int | config default | Inference batch size |
| `chunk_size` | int | config default | VAD chunk size in seconds |
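These parameters compose: diarized subtitles need `align=true` plus `diarize=true` with a subtitle `response_format`. A hedged sketch (assumes the third-party `requests` package and a running server; URL, key, and file name are illustrative):

```python
# Sketch: request diarized WebVTT subtitles.
# Parameter names come from the table above.
import requests

with open("meeting.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={
            "response_format": "vtt",
            "align": "true",         # required for subtitle formats
            "diarize": "true",       # requires align=true
            "highlight_words": "true",
        },
    )
print(resp.text)
```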
Response formats
| Format | Content-Type | Body |
|---|---|---|
| `json` | `application/json` | `{"text": "..."}` |
| `verbose_json` | `application/json` | Full transcript with segments and timestamps |
| `vtt_json` | `application/json` | `verbose_json` + `"vtt_text"` field |
| `text` / `srt` / `aud` | `text/plain` | Raw text / subtitle file |
| `vtt` | `text/vtt` | WebVTT subtitle file |
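A `verbose_json` response carries segments with timestamps; with `align=true` and `timestamp_granularities[]=word`, segments also carry per-word timing. The exact schema isn't reproduced in this README, so the shape below (segments containing word lists, with speaker labels when diarization is on) is an assumption based on typical WhisperX output:

```python
# Sketch: pull word-level timestamps out of a verbose_json-style payload.
# `sample` is synthetic; field names are assumed, not taken from this README.
sample = {
    "text": "hello world",
    "segments": [
        {
            "start": 0.0,
            "end": 1.2,
            "text": "hello world",
            "words": [
                {"word": "hello", "start": 0.0, "end": 0.5, "speaker": "SPEAKER_00"},
                {"word": "world", "start": 0.6, "end": 1.2, "speaker": "SPEAKER_00"},
            ],
        }
    ],
}

def word_timeline(payload: dict) -> list:
    """Flatten (word, start, end) tuples across all segments."""
    return [
        (w["word"], w["start"], w["end"])
        for seg in payload.get("segments", [])
        for w in seg.get("words", [])
    ]

print(word_timeline(sample))
# [('hello', 0.0, 0.5), ('world', 0.6, 1.2)]
```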
Translate audio to English. Same parameters as `/v1/audio/transcriptions`, minus `language`, `align`, `diarize`, and diarization-related fields.
Returns `{"status": "healthy"}`. Not protected by API key auth.
| Endpoint | Description |
|---|---|
| `GET /models/list` | List loaded transcription models |
| `POST /models/load` | Load a model (`model` param) |
| `POST /models/unload` | Unload a model (`model` param) |
| `GET /align_models/list` | List loaded alignment models |
| `POST /align_models/load` | Load an alignment model (`language` param) |
| `POST /align_models/unload` | Unload an alignment model (`language` param) |
| `GET /diarize_models/list` | List loaded diarization models |
| `POST /diarize_models/load` | Load a diarization model (`model` param) |
| `POST /diarize_models/unload` | Unload a diarization model (`model` param) |
Each pipeline stage (transcription, alignment, diarization) can use a different backend. Set the active backend via environment variables:
```shell
BACKENDS__TRANSCRIPTION=whisperx
BACKENDS__ALIGNMENT=whisperx
BACKENDS__DIARIZATION=whisperx
```

Only the `whisperx` backend ships by default. Custom backends can be registered via the backend registry at `src/whisperx_api_server/backends/`.
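The registry's actual API isn't documented in this README; a decorator-based registry along these lines is one plausible shape (all names here are illustrative, not the project's real identifiers):

```python
# Sketch of a decorator-based backend registry.
TRANSCRIPTION_BACKENDS: dict = {}

def register_backend(name: str):
    """Register a backend class under a name usable in BACKENDS__* settings."""
    def decorator(cls):
        TRANSCRIPTION_BACKENDS[name] = cls
        return cls
    return decorator

@register_backend("whisperx")
class WhisperXBackend:
    def transcribe(self, audio_path: str) -> dict:
        raise NotImplementedError  # a real backend would wrap whisperx here

# Settings like BACKENDS__TRANSCRIPTION=whisperx would then resolve via lookup:
backend_cls = TRANSCRIPTION_BACKENDS["whisperx"]
print(backend_cls.__name__)
# WhisperXBackend
```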
| File | Purpose |
|---|---|
| `compose.yaml` | Standalone server — use `--profile cuda` or `--profile cpu` |
| `compose-kafka.yaml` | Distributed stack — API server + Kafka + MinIO + workers via `--profile cuda` / `--profile cpu` |
Workers are opt-in via profiles so `docker compose up` never accidentally starts a GPU process on a machine that doesn't have one.
Issues, forks, and pull requests are welcome.
GNU General Public License v3.0 — see LICENSE for details.