feat(controller): add first-class whisper (speaches) audio transcription runtime #613

Open. Defilan wants to merge 2 commits into defilantech:main from Defilan:feat/whisper-runtime

Defilan commented Jun 2, 2026

What

Add a first-class whisper runtime to the operator that serves OpenAI-compatible audio transcription, backed by speaches (faster-whisper / CTranslate2). An InferenceService with runtime: whisper deploys a transcription service reachable at a ClusterIP on /v1/audio/transcriptions, with the same lifecycle, scaling, GPU scheduling, and probe story as every other runtime.

Why

LLMKube only served text generation. Speech-to-text is a common on-prem need (meeting transcription, call analysis, pipeline preprocessing) and partners integrating against the OpenAI-compatible surface want audio too. speaches was chosen because it speaks the OpenAI audio API natively.

Fixes #612

How

  • WhisperBackend (internal/controller/runtime_whisper.go): container speaches, port 8000, HTTP /health probes, NeedsModelInit=false (speaches fetches CTranslate2 models from HuggingFace at request time), config via env vars rather than CLI flags.
  • Widened the optional EnvBuilder interface to BuildEnv(isvc, model) so a backend can derive env values from the Model spec (device from hardware.accelerator, compute type from quantization). Updated the vllm/tgi/personaplex implementers and the single deployment_builder call site; they ignore the new arg.
  • Runtime-aware endpoint path: added an optional EndpointPathProvider interface (whisper returns /v1/audio/transcriptions) and dropped the +kubebuilder:default on EndpointSpec.Path. constructEndpoint and routerProxyEndpoint already default an empty path to /v1/chat/completions, so existing runtimes are unchanged and this is backward compatible. A user-set spec.endpoint.path still wins.
  • Typed WhisperConfig (compute type, inference device, model TTL, UI toggle, HF/API token secret refs), modeled on TGIConfig. Added whisper to the Runtime enum and regenerated CRDs + chart CRDs.
  • Image pinned to ghcr.io/speaches-ai/speaches:0.8.3-cuda (CPU users override spec.image with :0.8.3-cpu).
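
The runtime-aware path resolution described in the third bullet can be sketched roughly as below. This is an illustration rather than the code in this PR: the `Backend` interface, the `DefaultEndpointPath` method name, and the `resolveEndpointPath` helper are assumptions made for the example; only the `EndpointPathProvider` name and the two paths come from the description above.

```go
package main

import "fmt"

// Backend is a hypothetical stand-in for the operator's runtime backend interface.
type Backend interface {
	DefaultPort() int32
}

// EndpointPathProvider is the optional interface named in the PR.
// The method name is assumed for this sketch.
type EndpointPathProvider interface {
	DefaultEndpointPath() string
}

// defaultChatPath is the fallback the PR says is applied to an empty path.
const defaultChatPath = "/v1/chat/completions"

// llamaCppBackend does not implement EndpointPathProvider.
type llamaCppBackend struct{}

func (llamaCppBackend) DefaultPort() int32 { return 8080 }

// whisperBackend opts in to a runtime-specific path.
type whisperBackend struct{}

func (whisperBackend) DefaultPort() int32          { return 8000 }
func (whisperBackend) DefaultEndpointPath() string { return "/v1/audio/transcriptions" }

// resolveEndpointPath applies the precedence described above:
// user-set path, then the backend's optional default, then the chat default.
func resolveEndpointPath(userPath string, b Backend) string {
	if userPath != "" {
		return userPath
	}
	if p, ok := b.(EndpointPathProvider); ok {
		if path := p.DefaultEndpointPath(); path != "" {
			return path
		}
	}
	return defaultChatPath
}

func main() {
	fmt.Println(resolveEndpointPath("", whisperBackend{}))        // /v1/audio/transcriptions
	fmt.Println(resolveEndpointPath("", llamaCppBackend{}))       // /v1/chat/completions
	fmt.Println(resolveEndpointPath("/custom", whisperBackend{})) // /custom
}
```

Keeping the interface optional means existing backends need no changes, which matches the backward-compatibility claim in the bullet.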

Scope (v1): targets connected clusters; speaches downloads the model on first request. Persistent model cache + air-gapped support need a volume hook on the runtime interface and are deferred to a follow-up. speaches exposes no Prometheus metrics, so the universal PodMonitor benignly 404-scrapes these pods (DefaultHPAMetric returns "").

Verified speaches v0.8.3 upstream: /health healthcheck, port 8000, env names (WHISPER__INFERENCE_DEVICE, WHISPER__COMPUTE_TYPE, WHISPER__TTL, ENABLE_UI, API_KEY), and that there is no WHISPER__MODEL (models are per-request).
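
As a rough sketch of how a backend might derive these env vars from the Model spec: the env names are the ones verified above, but the `ModelSpec` and `EnvVar` types, the `buildWhisperEnv` helper, and in particular the accelerator and quantization value mappings are all assumptions for illustration, not the mapping this PR implements.

```go
package main

import "fmt"

// EnvVar is a local stand-in for the Kubernetes corev1.EnvVar type.
type EnvVar struct {
	Name  string
	Value string
}

// ModelSpec is a hypothetical, trimmed-down view of the Model spec fields
// the PR mentions (hardware.accelerator and quantization).
type ModelSpec struct {
	Accelerator  string
	Quantization string
}

// inferenceDevice maps an accelerator to a device string.
// The accepted values and the "auto" fallback are assumptions.
func inferenceDevice(accelerator string) string {
	switch accelerator {
	case "cuda":
		return "cuda"
	case "cpu":
		return "cpu"
	default:
		return "auto"
	}
}

// computeType maps a quantization label to a compute type.
// The labels and the "default" fallback are assumptions.
func computeType(quantization string) string {
	switch quantization {
	case "fp16":
		return "float16"
	case "int8":
		return "int8"
	default:
		return "default"
	}
}

// buildWhisperEnv returns the derived env vars in a fixed order.
func buildWhisperEnv(m ModelSpec) []EnvVar {
	return []EnvVar{
		{Name: "WHISPER__INFERENCE_DEVICE", Value: inferenceDevice(m.Accelerator)},
		{Name: "WHISPER__COMPUTE_TYPE", Value: computeType(m.Quantization)},
	}
}

func main() {
	for _, e := range buildWhisperEnv(ModelSpec{Accelerator: "cuda", Quantization: "fp16"}) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```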

Checklist

  • Tests added/updated
  • make test passes locally
  • make lint passes locally (and GOOS=linux golangci-lint run ./...)
  • Commit messages follow conventional commits
  • All commits are signed off (git commit -s) per DCO
  • Documentation updated (if user-facing change)

…ion runtime

Add a `whisper` runtime backed by speaches (faster-whisper, CTranslate2) that
serves the OpenAI-compatible audio API (/v1/audio/transcriptions) on port 8000.

- New WhisperBackend: port 8000, /health probes, NeedsModelInit=false (speaches
  fetches CTranslate2 models from HuggingFace at request time), env-driven config.
- Widen the optional EnvBuilder interface to BuildEnv(isvc, model) so backends can
  derive env values from the Model spec; update vllm/tgi/personaplex implementers
  and the deployment_builder call site.
- Add an optional EndpointPathProvider interface and drop the EndpointSpec.Path
  CRD default so the whisper runtime resolves /v1/audio/transcriptions
  automatically while text runtimes keep /v1/chat/completions. constructEndpoint
  and routerProxyEndpoint already default an empty path, so this is backward
  compatible.
- New typed WhisperConfig (compute type, device, model TTL, UI, HF/API token
  secret refs); add `whisper` to the Runtime enum; regenerate CRDs + chart CRDs.
- Unit + reconcile tests, examples/whisper-quickstart, and a runtime-table doc
  update.

v1 targets connected clusters (model downloads on first request); persistent
model cache + air-gapped support are deferred to a follow-up volume hook.

Fixes defilantech#612

Signed-off-by: Christopher Maher <chris@mahercode.io>

codecov Bot commented Jun 2, 2026

Codecov Report

❌ Patch coverage is 77.94118% with 30 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| api/v1alpha1/zz_generated.deepcopy.go | 0.00% | 28 Missing ⚠️ |
| internal/controller/runtime_tgi.go | 0.00% | 1 Missing ⚠️ |
| internal/controller/runtime_vllm.go | 0.00% | 1 Missing ⚠️ |

…ackend

Two fixes surfaced by live testing the whisper runtime on a GPU cluster:

- Service/endpoint port no longer hardcoded to 8080. Add resolveServicePort
  (containerPort -> endpoint.port -> backend.DefaultPort) and use it in
  constructService, constructEndpoint, and the deployment builder. Fixes the
  Service targetPort / container port mismatch for runtimes whose default port
  is not 8080 (whisper/vllm 8000, tgi 80); llamacpp (8080) is unchanged.
- Preload the whisper model. speaches does not download models on the first
  transcription request (it returns 400 until POST /v1/models/{id}). Add an
  optional LifecycleProvider interface; WhisperBackend injects a postStart hook
  that installs model.Spec.Source once the server is healthy, gating Ready on
  the model being present. The model id is passed via the LLMKUBE_WHISPER_MODEL
  env var to avoid interpolating CR data into the shell script.

Updated the quickstart example (no endpoint block needed) and docs to reflect
that the operator preloads the model rather than relying on lazy download.

Signed-off-by: Christopher Maher <chris@mahercode.io>

Defilan commented Jun 2, 2026

Live-tested on a GPU cluster (Shadowstack, RTX). The follow-up commit carries two fixes that make the runtime genuinely first-class:

  1. Service/endpoint port fix. The live deploy exposed that the Service/endpoint defaulted to 8080 while speaches listens on 8000, so in-cluster routing was broken. Added resolveServicePort (containerPort → endpoint.port → backend.DefaultPort), used in constructService/constructEndpoint/deployment builder. Also fixes the latent mismatch for vllm (8000) and tgi (80); llamacpp (8080) unchanged.

  2. Model preload. speaches v0.8.3 does not download models on first request: it 400s with "Model is not installed locally, use POST /v1/models". Added an optional LifecycleProvider interface; WhisperBackend injects a postStart hook that installs model.Spec.Source once /health is up. The hook gates Ready, so the Service only takes traffic once transcription will succeed.
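
The port precedence from item 1 can be sketched as below. The function name `resolveServicePort` and the order (containerPort, then endpoint.port, then the backend default) come from this PR; the flat `int32` signature and treating `0` as "unset" are assumptions for the example.

```go
package main

import "fmt"

// resolveServicePort returns the first port that is set, in the order
// containerPort -> endpoint.port -> backend default.
// A value of 0 is treated as "not set" (an assumption of this sketch).
func resolveServicePort(containerPort, endpointPort, backendDefault int32) int32 {
	if containerPort > 0 {
		return containerPort
	}
	if endpointPort > 0 {
		return endpointPort
	}
	return backendDefault
}

func main() {
	// Nothing set by the user: whisper falls back to its default port.
	fmt.Println(resolveServicePort(0, 0, 8000)) // 8000
	// An explicit endpoint port overrides the backend default.
	fmt.Println(resolveServicePort(0, 8081, 80)) // 8081
	// An explicit container port wins over everything else.
	fmt.Println(resolveServicePort(9000, 8081, 8000)) // 9000
}
```

Using one resolver in the Service, endpoint, and deployment builders keeps the Service targetPort and the container port from drifting apart.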

End-to-end result: deploying just a Model + InferenceService(runtime: whisper) → operator preloads large-v3, pod goes Ready, and POST /v1/audio/transcriptions through the ClusterIP returns the correct transcript with no manual steps. GPU confirmed in use (~3.9 GB, float16).
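
For anyone wanting to reproduce the end-to-end check, a client request could be built roughly like this. The `file` and `model` multipart fields follow the OpenAI audio transcription API; the Service hostname, model id, file name, and the `newTranscriptionRequest` helper are placeholders assumed for the sketch.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// newTranscriptionRequest builds a multipart POST to /v1/audio/transcriptions
// carrying the model id and the audio bytes.
func newTranscriptionRequest(baseURL, model, filename string, audio []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)

	if err := w.WriteField("model", model); err != nil {
		return nil, err
	}
	part, err := w.CreateFormFile("file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(audio); err != nil {
		return nil, err
	}
	// Close writes the trailing multipart boundary.
	if err := w.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/audio/transcriptions", &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	// Hostname and model id are placeholders, not values from this PR.
	req, err := newTranscriptionRequest(
		"http://whisper.default.svc.cluster.local:8000",
		"Systran/faster-whisper-large-v3",
		"meeting.wav",
		[]byte("placeholder audio bytes"),
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path) // POST /v1/audio/transcriptions
}
```

Sending the request with `http.DefaultClient.Do(req)` from inside the cluster should then exercise the ClusterIP path described above.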

Docs and example updated accordingly (no endpoint block needed; the operator preloads the model rather than relying on lazy download). Persistent model cache and air-gapped support remain the tracked follow-up.
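
A minimal sketch of the preload hook idea, assuming `curl` and `/bin/sh` are available in the container (not verified here). The `LLMKUBE_WHISPER_MODEL` env var, the `/health` check, and `POST /v1/models/{id}` come from this PR; the script text, the `EnvVar` type, and the `preloadHook` helper are hypothetical.

```go
package main

import "fmt"

// modelEnvVar is the env var name this PR uses to pass the model id.
const modelEnvVar = "LLMKUBE_WHISPER_MODEL"

// preloadScript is a constant: the model id is read from the environment at
// run time, so CR data is never interpolated into the script text.
const preloadScript = `set -e
until curl -fsS "http://127.0.0.1:8000/health" >/dev/null; do sleep 2; done
curl -fsS -X POST "http://127.0.0.1:8000/v1/models/${LLMKUBE_WHISPER_MODEL}"
`

// EnvVar is a local stand-in for the Kubernetes corev1.EnvVar type.
type EnvVar struct {
	Name  string
	Value string
}

// preloadHook returns the exec command for a postStart hook and the env var
// carrying the model id taken from model.Spec.Source.
func preloadHook(modelSource string) ([]string, []EnvVar) {
	command := []string{"/bin/sh", "-c", preloadScript}
	env := []EnvVar{{Name: modelEnvVar, Value: modelSource}}
	return command, env
}

func main() {
	command, env := preloadHook("Systran/faster-whisper-large-v3") // placeholder id
	fmt.Println(command[0], command[1])
	fmt.Printf("%s=%s\n", env[0].Name, env[0].Value)
}
```

Because the script is the same for every model, a hostile or malformed `spec.source` value can only end up as an env var value, never as shell syntax.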

Development

Successfully merging this pull request may close these issues.

[FEATURE] First-class whisper (speaches) audio-transcription runtime