feat(controller): add first-class whisper (speaches) audio transcription runtime by Defilan · Pull Request #613 · defilantech/LLMKube

Defilan · 2026-06-02T21:39:27Z

What

Add a first-class whisper runtime to the operator that serves OpenAI-compatible audio transcription, backed by speaches (faster-whisper / CTranslate2). An InferenceService with runtime: whisper deploys a transcription service reachable at a ClusterIP on /v1/audio/transcriptions, with the same lifecycle, scaling, GPU scheduling, and probe story as every other runtime.

Why

LLMKube only served text generation. Speech-to-text is a common on-prem need (meeting transcription, call analysis, pipeline preprocessing) and partners integrating against the OpenAI-compatible surface want audio too. speaches was chosen because it speaks the OpenAI audio API natively.

Fixes #612

How

WhisperBackend (internal/controller/runtime_whisper.go): container speaches, port 8000, HTTP /health probes, NeedsModelInit=false (speaches fetches CTranslate2 models from HuggingFace at request time), config via env vars rather than CLI flags.
Widened the optional EnvBuilder interface to BuildEnv(isvc, model) so a backend can derive env values from the Model spec (device from hardware.accelerator, compute type from quantization). Updated the vllm/tgi/personaplex implementers and the single deployment_builder call site; they ignore the new arg.
Runtime-aware endpoint path: added an optional EndpointPathProvider interface (whisper returns /v1/audio/transcriptions) and dropped the +kubebuilder:default on EndpointSpec.Path. constructEndpoint and routerProxyEndpoint already default an empty path to /v1/chat/completions, so existing runtimes are unchanged and this is backward compatible. A user-set spec.endpoint.path still wins.
Typed WhisperConfig (compute type, inference device, model TTL, UI toggle, HF/API token secret refs), modeled on TGIConfig. Added whisper to the Runtime enum and regenerated CRDs + chart CRDs.
Image pinned to ghcr.io/speaches-ai/speaches:0.8.3-cuda (CPU users override spec.image with :0.8.3-cpu).
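The runtime-aware path resolution described above can be sketched as follows. This is a minimal illustration, not the operator's code: the `Backend` interface, the method name `DefaultEndpointPath`, and the helper `resolveEndpointPath` are assumptions made for the example; only the interface name `EndpointPathProvider`, the two paths, and the precedence (user-set path, then backend path, then chat-completions default) come from the description.

```go
package main

import "fmt"

// Backend is a hypothetical stand-in for the operator's runtime backend type.
type Backend interface {
	Name() string
}

// EndpointPathProvider is the optional interface named in the PR. The method
// name DefaultEndpointPath is an assumption made for this sketch.
type EndpointPathProvider interface {
	DefaultEndpointPath() string
}

type whisperBackend struct{}

func (whisperBackend) Name() string                { return "whisper" }
func (whisperBackend) DefaultEndpointPath() string { return "/v1/audio/transcriptions" }

// llamacppBackend does not implement EndpointPathProvider, like the
// existing text runtimes.
type llamacppBackend struct{}

func (llamacppBackend) Name() string { return "llamacpp" }

// fallbackPath is the default the PR says empty paths already resolve to.
const fallbackPath = "/v1/chat/completions"

// resolveEndpointPath applies the precedence: user-set path, then the
// backend's own path if it provides one, then the chat-completions default.
func resolveEndpointPath(userPath string, b Backend) string {
	if userPath != "" {
		return userPath
	}
	if p, ok := b.(EndpointPathProvider); ok {
		if path := p.DefaultEndpointPath(); path != "" {
			return path
		}
	}
	return fallbackPath
}

func main() {
	fmt.Println(resolveEndpointPath("", whisperBackend{}))
	fmt.Println(resolveEndpointPath("", llamacppBackend{}))
	fmt.Println(resolveEndpointPath("/custom", whisperBackend{}))
}
```

Because the interface is optional and checked with a type assertion, backends that do not implement it need no changes.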

Scope (v1): targets connected clusters; speaches downloads the model on first request. Persistent model cache + air-gapped support need a volume hook on the runtime interface and are deferred to a follow-up. speaches exposes no Prometheus metrics, so the universal PodMonitor benignly 404-scrapes these pods (DefaultHPAMetric returns "").

Verified speaches v0.8.3 upstream: /health healthcheck, port 8000, env names (WHISPER__INFERENCE_DEVICE, WHISPER__COMPUTE_TYPE, WHISPER__TTL, ENABLE_UI, API_KEY), and that there is no WHISPER__MODEL (models are per-request).
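As a rough sketch of how a widened `BuildEnv(isvc, model)` could turn the Model spec into these env vars: the types `ModelSpec` and `WhisperConfig` below are trimmed-down, hypothetical stand-ins for the CRD types, and the accelerator-to-device and quantization-to-compute-type mappings are assumed for illustration. Only the env var names are taken from the list above.

```go
package main

import (
	"fmt"
	"strconv"
)

// EnvVar mirrors the shape of a container env entry (simplified).
type EnvVar struct {
	Name  string
	Value string
}

// ModelSpec and WhisperConfig are hypothetical stand-ins for the CRD types;
// the real field names may differ.
type ModelSpec struct {
	Accelerator  string // e.g. "cuda", or "" for CPU-only
	Quantization string // e.g. "fp16", "int8"
}

type WhisperConfig struct {
	InferenceDevice string // explicit override, wins over the Model spec
	ComputeType     string // explicit override, wins over the Model spec
	TTLSeconds      *int   // nil leaves the server default in place
}

// buildWhisperEnv derives env vars from the config and Model spec.
// The mapping rules here are assumptions, not the operator's actual table.
func buildWhisperEnv(cfg WhisperConfig, model ModelSpec) []EnvVar {
	device := cfg.InferenceDevice
	if device == "" {
		device = "cpu"
		if model.Accelerator == "cuda" {
			device = "cuda"
		}
	}
	env := []EnvVar{{Name: "WHISPER__INFERENCE_DEVICE", Value: device}}

	computeType := cfg.ComputeType
	if computeType == "" {
		switch model.Quantization {
		case "fp16", "float16":
			computeType = "float16"
		case "int8":
			computeType = "int8"
		}
	}
	// Unknown quantization: emit nothing and let the server pick its default.
	if computeType != "" {
		env = append(env, EnvVar{Name: "WHISPER__COMPUTE_TYPE", Value: computeType})
	}
	if cfg.TTLSeconds != nil {
		env = append(env, EnvVar{Name: "WHISPER__TTL", Value: strconv.Itoa(*cfg.TTLSeconds)})
	}
	return env
}

func main() {
	ttl := 300
	env := buildWhisperEnv(
		WhisperConfig{TTLSeconds: &ttl},
		ModelSpec{Accelerator: "cuda", Quantization: "fp16"},
	)
	for _, e := range env {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

Note that no `WHISPER__MODEL` entry is produced, consistent with models being selected per request.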

Checklist

Tests added/updated
make test passes locally
make lint passes locally (and GOOS=linux golangci-lint run ./...)
Commit messages follow conventional commits
All commits are signed off (git commit -s) per DCO
Documentation updated (if user-facing change)

…ion runtime

Add a `whisper` runtime backed by speaches (faster-whisper, CTranslate2) that serves the OpenAI-compatible audio API (/v1/audio/transcriptions) on port 8000.

- New WhisperBackend: port 8000, /health probes, NeedsModelInit=false (speaches fetches CTranslate2 models from HuggingFace at request time), env-driven config.
- Widen the optional EnvBuilder interface to BuildEnv(isvc, model) so backends can derive env values from the Model spec; update vllm/tgi/personaplex implementers and the deployment_builder call site.
- Add an optional EndpointPathProvider interface and drop the EndpointSpec.Path CRD default so the whisper runtime resolves /v1/audio/transcriptions automatically while text runtimes keep /v1/chat/completions. constructEndpoint and routerProxyEndpoint already default an empty path, so this is backward compatible.
- New typed WhisperConfig (compute type, device, model TTL, UI, HF/API token secret refs); add `whisper` to the Runtime enum; regenerate CRDs + chart CRDs.
- Unit + reconcile tests, examples/whisper-quickstart, and a runtime-table doc update.

v1 targets connected clusters (model downloads on first request); persistent model cache + air-gapped support are deferred to a follow-up volume hook.

Fixes defilantech#612

Signed-off-by: Christopher Maher <chris@mahercode.io>

codecov · 2026-06-02T21:42:40Z

Codecov Report

❌ Patch coverage is 77.94118% with 30 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
api/v1alpha1/zz_generated.deepcopy.go	0.00%	28 Missing ⚠️
internal/controller/runtime_tgi.go	0.00%	1 Missing ⚠️
internal/controller/runtime_vllm.go	0.00%	1 Missing ⚠️


…ackend

Two fixes surfaced by live testing the whisper runtime on a GPU cluster:

- Service/endpoint port no longer hardcoded to 8080. Add resolveServicePort (containerPort -> endpoint.port -> backend.DefaultPort) and use it in constructService, constructEndpoint, and the deployment builder. Fixes the Service targetPort / container port mismatch for runtimes whose default port is not 8080 (whisper/vllm 8000, tgi 80); llamacpp (8080) is unchanged.
- Preload the whisper model. speaches does not download models on the first transcription request (it returns 400 until POST /v1/models/{id}). Add an optional LifecycleProvider interface; WhisperBackend injects a postStart hook that installs model.Spec.Source once the server is healthy, gating Ready on the model being present. The model id is passed via the LLMKUBE_WHISPER_MODEL env var to avoid interpolating CR data into the shell script.

Updated the quickstart example (no endpoint block needed) and docs to reflect that the operator preloads the model rather than relying on lazy download.

Signed-off-by: Christopher Maher <chris@mahercode.io>

Defilan · 2026-06-02T23:32:44Z

Live-tested on a GPU cluster (Shadowstack, RTX). Two follow-up commits push the runtime to genuinely first-class:

Service/endpoint port fix. The live deploy exposed that the Service/endpoint defaulted to 8080 while speaches listens on 8000, so in-cluster routing was broken. Added resolveServicePort (containerPort → endpoint.port → backend.DefaultPort), used in constructService/constructEndpoint/deployment builder. Also fixes the latent mismatch for vllm (8000) and tgi (80); llamacpp (8080) unchanged.
Model preload. speaches v0.8.3 does not download models on first request: it 400s with "Model is not installed locally, use POST /v1/models". Added an optional LifecycleProvider interface; WhisperBackend injects a postStart hook that installs model.Spec.Source once /health is up. The hook gates Ready, so the Service only takes traffic once transcription will succeed.
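The port precedence from the first item (containerPort → endpoint.port → backend.DefaultPort) could look roughly like this. The signature is an assumption: the real resolveServicePort presumably takes the InferenceService and backend rather than three integers, and treating zero as "unset" is a convention chosen for this sketch.

```go
package main

import "fmt"

// resolveServicePort picks the first port that is set, in the order
// container port, endpoint port, backend default. Zero means "unset".
// The three-integer signature is a simplification for illustration.
func resolveServicePort(containerPort, endpointPort, backendDefault int32) int32 {
	if containerPort > 0 {
		return containerPort
	}
	if endpointPort > 0 {
		return endpointPort
	}
	return backendDefault
}

func main() {
	// Nothing set by the user: each runtime falls back to its own default.
	fmt.Println("whisper: ", resolveServicePort(0, 0, 8000))
	fmt.Println("tgi:     ", resolveServicePort(0, 0, 80))
	fmt.Println("llamacpp:", resolveServicePort(0, 0, 8080))
	// An explicit container port overrides both other sources.
	fmt.Println("override:", resolveServicePort(9000, 8081, 8000))
}
```

Using one resolver for the Service, the endpoint, and the container keeps the three from drifting apart, which is the mismatch the live deploy exposed.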

End-to-end result: deploying just a Model + InferenceService(runtime: whisper) → operator preloads large-v3, pod goes Ready, and POST /v1/audio/transcriptions through the ClusterIP returns the correct transcript with no manual steps. GPU confirmed in use (~3.9 GB, float16).
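For readers who want to reproduce the last step, a request against the transcription endpoint can be built as below. This is a sketch under assumptions: the Service hostname, the file name, and the literal model id `large-v3` are placeholders, and the `file` and `model` multipart field names follow the OpenAI audio transcription API that speaches is described as implementing. The helper `newTranscriptionRequest` is hypothetical and only builds the request; sending it is left to the caller.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// newTranscriptionRequest builds a multipart POST for the OpenAI-compatible
// /v1/audio/transcriptions endpoint. baseURL is the Service address.
func newTranscriptionRequest(baseURL, model, filename string, audio []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)

	// "model" selects which installed model handles this request.
	if err := w.WriteField("model", model); err != nil {
		return nil, err
	}
	// "file" carries the raw audio bytes.
	part, err := w.CreateFormFile("file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(audio); err != nil {
		return nil, err
	}
	// Close writes the trailing boundary; without it the body is invalid.
	if err := w.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/audio/transcriptions", &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	// Placeholder Service address and audio; replace with real values.
	req, err := newTranscriptionRequest(
		"http://whisper.default.svc:8000", "large-v3",
		"meeting.wav", []byte("placeholder audio bytes"),
	)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

From inside the cluster, passing the request to `http.DefaultClient.Do` and reading the response body would return the transcript.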

Docs/example updated accordingly (no endpoint block needed; operator preloads rather than lazy-download). Persistent model cache + air-gapped remain the tracked follow-up.
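The preload sequence (wait for /health, then POST /v1/models/{id}) is implemented in the PR as a postStart shell hook that reads the model id from LLMKUBE_WHISPER_MODEL. The Go sketch below only illustrates the same ordering and is not that hook: `preloadModel`, the injected `doFunc`, the retry parameters, and the choice to accept any 2xx status from the install call are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// doFunc performs one HTTP call and returns its status code. It is injected
// so the sequence can be exercised without a running server (hypothetical).
type doFunc func(method, url string) (int, error)

// preloadModel polls /health until it returns 200, then asks the server to
// install the model. It fails if the server never becomes healthy or the
// install call does not return a 2xx status.
func preloadModel(do doFunc, baseURL, modelID string, attempts int, wait time.Duration) error {
	healthy := false
	for i := 0; i < attempts; i++ {
		if code, err := do("GET", baseURL+"/health"); err == nil && code == 200 {
			healthy = true
			break
		}
		time.Sleep(wait)
	}
	if !healthy {
		return errors.New("server did not become healthy")
	}

	code, err := do("POST", baseURL+"/v1/models/"+modelID)
	if err != nil {
		return fmt.Errorf("install %s: %w", modelID, err)
	}
	if code < 200 || code >= 300 {
		return fmt.Errorf("install %s: unexpected status %d", modelID, code)
	}
	return nil
}

func main() {
	// A fake transport: unhealthy for the first two probes, then healthy.
	calls := 0
	fake := func(method, url string) (int, error) {
		calls++
		fmt.Println(method, url)
		if method == "GET" && calls < 3 {
			return 503, nil
		}
		return 200, nil
	}
	if err := preloadModel(fake, "http://localhost:8000", "large-v3", 5, 0); err != nil {
		fmt.Println("preload failed:", err)
	}
}
```

Returning an error when the install fails mirrors the intent of gating Ready on the model being present: a pod whose preload did not succeed should not receive transcription traffic.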

Defilan wants to merge 2 commits into defilantech:main from Defilan:feat/whisper-runtime
