Rust implementation for the CohereLabs/cohere-transcribe-03-2026 model. Includes a self-contained CLI and an OpenAI-compatible API server for AI agents.
Supports English, French, German, Spanish, Italian, Portuguese, Dutch, Polish, Greek, Arabic, Japanese, Chinese, Vietnamese, and Korean.
No build tools needed — just download and run.
Go to Releases and download the zip for your platform:
| Platform | Asset |
|---|---|
| Linux x86_64 (CPU) | transcribe-linux-x86_64.zip |
| Linux x86_64 (CUDA 12.6) | transcribe-linux-x86_64-cuda.zip |
| Linux aarch64 (CPU, SVE) | transcribe-linux-aarch64.zip |
| Linux aarch64 (CUDA 12.6) | transcribe-linux-aarch64-cuda.zip |
| macOS Apple Silicon | transcribe-macos-aarch64.zip |
# Example for Linux x86_64:
unzip transcribe-linux-x86_64.zip
cd transcribe-linux-x86_64The model is gated — you must log in to HuggingFace and accept the license terms first: https://huggingface.co/CohereLabs/cohere-transcribe-03-2026
Then download the weights into a models/ directory inside the release folder:
pip install huggingface_hub
huggingface-cli login # paste your HF_TOKEN when prompted
huggingface-cli download CohereLabs/cohere-transcribe-03-2026 \
--local-dir models/cohere-transcribe-03-2026The release includes a pre-generated vocab.json. Copy it to the model folder:
cp vocab.json models/cohere-transcribe-03-2026/./transcribe --model-dir models/cohere-transcribe-03-2026 --language en recording.wavNo LD_LIBRARY_PATH or DYLD_LIBRARY_PATH needed — the binary has RPATH baked in
and finds libtorch/ (Linux) or mlx.metallib (macOS) in the same directory.
# Start the server
./transcribe-server --model-dir models/cohere-transcribe-03-2026 --port 8080 &
# Wait for "Listening on ..." message, then:
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F "file=@recording.wav" \
-F "model=cohere-transcribe" \
-F "language=en"The server is OpenAI Whisper API compatible — works with any OpenAI client library.
USAGE:
transcribe --model-dir <PATH> [OPTIONS] <AUDIO>...
ARGUMENTS:
<AUDIO>... One or more audio files to transcribe
OPTIONS:
-m, --model-dir Path to the model directory
-l, --language Language code [default: en]
--no-punctuation Strip punctuation from output
--max-tokens Max tokens to generate per segment [default: 448]
-v, --verbose Enable info logging (-vv for debug)
-h, --help Print help
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
en |
English | de |
German | nl |
Dutch |
fr |
French | es |
Spanish | pl |
Polish |
ar |
Arabic | it |
Italian | el |
Greek |
ja |
Japanese | pt |
Portuguese | vi |
Vietnamese |
zh |
Chinese | ko |
Korean |
WAV, FLAC, MP3, AAC, OGG — anything supported by symphonia. Audio is automatically converted to 16 kHz mono.
Files longer than ~35 seconds are split into overlapping chunks (5 s overlap) and the results joined automatically.
# Transcribe a single file
./transcribe -m models/cohere-transcribe-03-2026 interview.mp3
# French, no punctuation
./transcribe -m models/cohere-transcribe-03-2026 --language fr --no-punctuation speech.wav
# Multiple files — prints filename before each transcript
./transcribe -m models/cohere-transcribe-03-2026 call1.wav call2.wav call3.flac
# Show model loading progress
./transcribe -m models/cohere-transcribe-03-2026 -v audio.wavtranscribe-server is an OpenAI-compatible HTTP server built on Axum.
It serves the same model as the CLI and is a drop-in replacement for the OpenAI Whisper API.
./transcribe-server \
--model-dir models/cohere-transcribe-03-2026 \
--host 0.0.0.0 \
--port 8080The server loads the model at startup (~30–90 s depending on hardware), then prints:
Listening on http://0.0.0.0:8080
Endpoint: POST http://0.0.0.0:8080/v1/audio/transcriptions
| Method | Path | Description |
|---|---|---|
POST |
/v1/audio/transcriptions |
Transcribe audio — OpenAI compatible |
GET |
/health |
Liveness probe → {"status":"ok"} |
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F "file=@recording.wav" \
-F "model=cohere-transcribe" \
-F "language=en" \
-F "response_format=json"Form fields (all OpenAI-compatible):
| Field | Required | Default | Notes |
|---|---|---|---|
file |
yes | — | Audio bytes (WAV, MP3, FLAC, AAC, OGG) |
model |
yes | — | Any string; the server ignores it |
language |
no | en |
ISO-639-1 code |
response_format |
no | json |
json, text, verbose_json, srt, vtt |
temperature |
no | — | Ignored; always greedy |
prompt |
no | — | Ignored |
Response formats:
# json (default)
{"text": "Hello, world."}
# text
Hello, world.
# verbose_json
{"task":"transcribe","language":"en","duration":3.0,"text":"Hello, world.","segments":[...]}
# srt
1
00:00:00,000 --> 00:00:03,000
Hello, world.
# vtt
WEBVTT
00:00:00.000 --> 00:00:03.000
Hello, world.from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-used", # required by the client, ignored by the server
)
with open("recording.wav", "rb") as f:
transcript = client.audio.transcriptions.create(
model="cohere-transcribe",
file=f,
language="en",
)
print(transcript.text)import OpenAI from "openai";
import fs from "fs";
const openai = new OpenAI({
baseURL: "http://localhost:8080/v1",
apiKey: "not-used",
});
const transcript = await openai.audio.transcriptions.create({
model: "cohere-transcribe",
file: fs.createReadStream("recording.wav"),
language: "en",
});
console.log(transcript.text);OPTIONS:
-m, --model-dir Model directory (required)
--host Bind address [default: 0.0.0.0]
-p, --port Port [default: 8080]
-l, --language Default language when not specified by client [default: en]
--max-tokens Max tokens per segment [default: 448]
-v, --verbose Logging (-v info, -vv debug)
-h, --help Print help
Two compute backends are available — select one at compile time:
| Backend | Platform | Feature flag | Accelerator |
|---|---|---|---|
| libtorch (default) | Linux x86_64, Linux aarch64 | --features tch-backend |
CPU (BLAS-optimized) |
| MLX | macOS Apple Silicon | --features mlx |
Apple GPU (Metal) |
Both backends produce identical output from the same weights.
All platforms:
- Rust stable (1.70+) — install from rustup.rs
- 8 GB RAM — the model weights expand to ~5.6 GB at runtime
- Python + sentencepiece — one-time only, to extract
vocab.json
Linux (libtorch backend):
- libtorch C++ library — downloaded once, ~500 MB
macOS Apple Silicon (MLX backend):
- macOS 14+ with an M-series chip
- mlx-c is built automatically from the git submodule by
build.rs
Pick the build for your platform and extract it to /opt/libtorch.
This is the C++ library from PyTorch.org; no Python runtime is involved.
Linux x86_64:
curl -Lo libtorch.zip \
'https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcpu.zip'
sudo unzip libtorch.zip -d /optLinux ARM64 (AWS Graviton3, Ampere Altra — requires SVE support):
curl -Lo libtorch.tar.gz \
'https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-aarch64-2.7.1.tar.gz'
sudo tar xzf libtorch.tar.gz -C /optBoth commands produce /opt/libtorch/. Set LIBTORCH=/your/path to use a different location.
Docker on macOS: Extract libtorch to a native Linux path such as
/opt/libtorch, not onto the macOS volume mount (e.g./Users/…). The Linux linker cannot read large shared libraries through the virtiofs layer.
pip install huggingface_hub
huggingface-cli download CohereLabs/cohere-transcribe-03-2026 \
--local-dir models/cohere-transcribe-03-2026The model uses a SentencePiece tokenizer. Run this script once to produce vocab.json,
which the Rust binary reads at runtime. Python is not needed after this step.
pip install sentencepiece
python tools/extract_vocab.py --model_dir models/cohere-transcribe-03-2026Linux (libtorch backend, default):
LIBTORCH=/opt/libtorch cargo build --releaseThe LIBTORCH path is baked into the binary's RPATH by build.rs, so the binary
runs without LD_LIBRARY_PATH.
macOS Apple Silicon (MLX backend):
git submodule update --init --recursive
cargo build --release --no-default-features --features mlxDocker on macOS (Linux builds): If the project source is on a macOS volume mount, set
CARGO_TARGET_DIRto a native Linux path to prevent SIGBUS during compilation:LIBTORCH=/opt/libtorch CARGO_TARGET_DIR=/tmp/cohere_target cargo build --release -j 1
./target/release/transcribe --model-dir models/cohere-transcribe-03-2026 recording.wavNo environment variables needed — RPATH is baked in at build time.
libtorch not found (Linux)
Set LIBTORCH=/path/to/libtorch before building, or install to /opt/libtorch.
Missing required file 'vocab.json'
Run python tools/extract_vocab.py --model_dir <model_dir>, or copy vocab.json
from the release zip into the model directory.
Process killed immediately (exit 137)
Out of memory. The model needs ~5.6 GB of RAM. Check with free -h (Linux) or
Activity Monitor (macOS).
Illegal instruction (exit 132) on ARM Linux
The ARM64 libtorch build requires SVE (Scalable Vector Extension), available on
Graviton3, Ampere Altra, and similar server-class CPUs. It does not run on
Apple Silicon Linux VMs, which only expose NEON. Use a native Linux ARM64 server with SVE,
or use the macOS MLX backend on Apple hardware.
ELF section name out of range at link time (Linux)
libtorch is on a Docker volume-mounted macOS path. Move it to a native Linux
path such as /opt/libtorch.
Benchmarked on an M4 Mac Mini (MLX/Metal GPU backend). Model loading takes a few seconds on first run, but once loaded, inference is 10x–24x faster than real-time. NVIDIA GPU builds (CUDA libtorch) will be significantly faster still.
| Audio | Duration | Wall time | Notes |
|---|---|---|---|
| Demo WAV (real speech) | 5.44 s | 12.9 s | Includes 3.3 s model load + Metal compile |
| Synthetic 60 s tone | 60 s | 18.1 s | Tests chunking (chunks split at ~35 s) |
The server loads the model once at startup. Subsequent requests skip model loading and Metal shader compilation, making warm requests extremely fast.
| Request | Audio | Response time | Notes |
|---|---|---|---|
| 1st request (cold Metal) | 5.44 s | 8.3 s | Includes Metal shader compilation |
| 2nd+ requests (warm) | 5.44 s | 0.4 s | 13x faster than audio duration |
| 60 s chunked (warm) | 60 s | 2.5 s | 24x faster than audio duration |