feat(controller): add first-class whisper (speaches) audio transcription runtime #613
Defilan wants to merge 2 commits
…ion runtime
Add a `whisper` runtime backed by speaches (faster-whisper, CTranslate2) that
serves the OpenAI-compatible audio API (/v1/audio/transcriptions) on port 8000.
- New WhisperBackend: port 8000, /health probes, NeedsModelInit=false (speaches
fetches CTranslate2 models from HuggingFace at request time), env-driven config.
- Widen the optional EnvBuilder interface to BuildEnv(isvc, model) so backends
can derive env values from the Model spec; update vllm/tgi/personaplex
implementers and the deployment_builder call site.
- Add an optional EndpointPathProvider interface and drop the EndpointSpec.Path
CRD default so the whisper runtime resolves /v1/audio/transcriptions
automatically while text runtimes keep /v1/chat/completions. constructEndpoint
and routerProxyEndpoint already default an empty path, so this is backward
compatible.
- New typed WhisperConfig (compute type, device, model TTL, UI, HF/API token
secret refs); add `whisper` to the Runtime enum; regenerate CRDs + chart CRDs.
- Unit + reconcile tests, examples/whisper-quickstart, and a runtime-table doc
update.
v1 targets connected clusters (model downloads on first request); persistent
model cache + air-gapped support are deferred to a follow-up volume hook.
Fixes defilantech#612
Signed-off-by: Christopher Maher <chris@mahercode.io>
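The path resolution described in this commit (user-set path wins, then the backend's own path, then the chat default) can be sketched as below. This is a minimal illustration, not the operator's code: the `Backend` stand-in, the method name `DefaultEndpointPath`, and the helper `resolveEndpointPath` are assumptions; only the interface name `EndpointPathProvider` and the two paths come from the commit message.

```go
package main

import "fmt"

// Backend is a reduced, hypothetical stand-in for the operator's runtime
// backend interface.
type Backend interface {
	DefaultPort() int32
}

// EndpointPathProvider is the optional interface named in the commit message.
// The method name is an assumption made for this sketch.
type EndpointPathProvider interface {
	DefaultEndpointPath() string
}

const defaultChatPath = "/v1/chat/completions"

// whisperBackend opts in to the optional interface.
type whisperBackend struct{}

func (whisperBackend) DefaultPort() int32          { return 8000 }
func (whisperBackend) DefaultEndpointPath() string { return "/v1/audio/transcriptions" }

// llamacppBackend does not implement EndpointPathProvider.
type llamacppBackend struct{}

func (llamacppBackend) DefaultPort() int32 { return 8080 }

// resolveEndpointPath (hypothetical helper) applies the precedence:
// user-set path, then backend-provided path, then the chat default.
func resolveEndpointPath(userPath string, b Backend) string {
	if userPath != "" {
		return userPath
	}
	if p, ok := b.(EndpointPathProvider); ok {
		if path := p.DefaultEndpointPath(); path != "" {
			return path
		}
	}
	return defaultChatPath
}

func main() {
	fmt.Println(resolveEndpointPath("", whisperBackend{}))        // /v1/audio/transcriptions
	fmt.Println(resolveEndpointPath("", llamacppBackend{}))       // /v1/chat/completions
	fmt.Println(resolveEndpointPath("/custom", whisperBackend{})) // /custom
}
```

Because the interface is checked with a type assertion, backends that do not implement it need no changes, which is what keeps the text runtimes on `/v1/chat/completions`.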
…ackend
Two fixes surfaced by live testing the whisper runtime on a GPU cluster:
- Service/endpoint port no longer hardcoded to 8080. Add resolveServicePort
(containerPort -> endpoint.port -> backend.DefaultPort) and use it in
constructService, constructEndpoint, and the deployment builder. Fixes the
Service targetPort / container port mismatch for runtimes whose default port
is not 8080 (whisper/vllm 8000, tgi 80); llamacpp (8080) is unchanged.
- Preload the whisper model. speaches does not download models on the first
transcription request (it returns 400 until POST /v1/models/{id}). Add an
optional LifecycleProvider interface; WhisperBackend injects a postStart hook
that installs model.Spec.Source once the server is healthy, gating Ready on
the model being present. The model id is passed via the LLMKUBE_WHISPER_MODEL
env var to avoid interpolating CR data into the shell script.
Updated the quickstart example (no endpoint block needed) and docs to reflect
that the operator preloads the model rather than relying on lazy download.
Signed-off-by: Christopher Maher <chris@mahercode.io>
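The port precedence from the first fix (containerPort -> endpoint.port -> backend.DefaultPort) might look roughly like this. The signature is an assumption: the real `resolveServicePort` presumably takes the operator's CR and backend types, while this sketch uses plain integers with 0 meaning "unset".

```go
package main

import "fmt"

// resolveServicePort returns the first port that is set, in the order
// containerPort -> endpointPort -> backendDefault. A value of 0 means unset.
// The flat int32 signature is a simplification for illustration.
func resolveServicePort(containerPort, endpointPort, backendDefault int32) int32 {
	if containerPort > 0 {
		return containerPort
	}
	if endpointPort > 0 {
		return endpointPort
	}
	return backendDefault
}

func main() {
	fmt.Println(resolveServicePort(0, 0, 8000))    // 8000 (whisper/vllm default)
	fmt.Println(resolveServicePort(0, 0, 80))      // 80 (tgi default)
	fmt.Println(resolveServicePort(0, 9000, 8080)) // 9000 (endpoint.port overrides the default)
	fmt.Println(resolveServicePort(7000, 9000, 8080)) // 7000 (containerPort wins)
}
```

Using the same function for the Service, the endpoint, and the deployment builder is what keeps the Service targetPort and the container port in agreement.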
Live-tested on a GPU cluster (Shadowstack, RTX). Two follow-up commits push the runtime to genuinely first-class:
End-to-end result: deploying just a Model + InferenceService (runtime: whisper) → operator preloads large-v3 and the pod goes Ready.
Docs/example updated accordingly (no endpoint block needed; operator preloads rather than lazy-download). Persistent model cache + air-gapped support remain the tracked follow-up.
What
Add a first-class `whisper` runtime to the operator that serves OpenAI-compatible audio transcription, backed by speaches (faster-whisper / CTranslate2). An `InferenceService` with `runtime: whisper` deploys a transcription service reachable at a ClusterIP on `/v1/audio/transcriptions`, with the same lifecycle, scaling, GPU scheduling, and probe story as every other runtime.
Why
LLMKube only served text generation. Speech-to-text is a common on-prem need (meeting transcription, call analysis, pipeline preprocessing) and partners integrating against the OpenAI-compatible surface want audio too. speaches was chosen because it speaks the OpenAI audio API natively.
Fixes #612
How
- New `WhisperBackend` (`internal/controller/runtime_whisper.go`): container `speaches`, port 8000, HTTP `/health` probes, `NeedsModelInit=false` (speaches fetches CTranslate2 models from HuggingFace at request time), config via env vars rather than CLI flags.
- Widened the optional `EnvBuilder` interface to `BuildEnv(isvc, model)` so a backend can derive env values from the Model spec (device from `hardware.accelerator`, compute type from `quantization`). Updated the vllm/tgi/personaplex implementers and the single `deployment_builder` call site; they ignore the new arg.
- Added an optional `EndpointPathProvider` interface (whisper returns `/v1/audio/transcriptions`) and dropped the `+kubebuilder:default` on `EndpointSpec.Path`. `constructEndpoint` and `routerProxyEndpoint` already default an empty path to `/v1/chat/completions`, so existing runtimes are unchanged and this is backward compatible. A user-set `spec.endpoint.path` still wins.
- New typed `WhisperConfig` (compute type, inference device, model TTL, UI toggle, HF/API token secret refs), modeled on `TGIConfig`. Added `whisper` to the `Runtime` enum and regenerated CRDs + chart CRDs.
- Default image `ghcr.io/speaches-ai/speaches:0.8.3-cuda` (CPU users override `spec.image` with `:0.8.3-cpu`).

Scope (v1): targets connected clusters; speaches downloads the model on first request. Persistent model cache + air-gapped support need a volume hook on the runtime interface and are deferred to a follow-up. speaches exposes no Prometheus metrics, so the universal PodMonitor benignly 404-scrapes these pods (`DefaultHPAMetric` returns `""`).

Verified speaches v0.8.3 upstream: `/health` healthcheck, port 8000, env names (`WHISPER__INFERENCE_DEVICE`, `WHISPER__COMPUTE_TYPE`, `WHISPER__TTL`, `ENABLE_UI`, `API_KEY`), and that there is no `WHISPER__MODEL` (models are per-request).
Checklist
- `make test` passes locally
- `make lint` passes locally (and `GOOS=linux golangci-lint run ./...`)
- Commits are signed off (`git commit -s`) per DCO
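The env-driven configuration described under How (device from `hardware.accelerator`, compute type from `quantization`, explicit `WhisperConfig` values on top) could be sketched as follows. The struct shapes, the function name `buildWhisperEnv`, and the mapping rules (only `cuda` maps to the `cuda` device, quantization is passed through unchanged, explicit config overrides derived values) are assumptions for illustration; only the `WHISPER__*` env names come from the PR description.

```go
package main

import (
	"fmt"
	"strconv"
)

// EnvVar is a simplified stand-in for corev1.EnvVar.
type EnvVar struct{ Name, Value string }

// ModelSpec holds the two Model fields the description mentions
// (hypothetical flat shape).
type ModelSpec struct {
	Accelerator  string // from hardware.accelerator
	Quantization string // from quantization
}

// WhisperConfig is a reduced, hypothetical version of the typed config.
type WhisperConfig struct {
	Device      string
	ComputeType string
	TTLSeconds  *int
}

// buildWhisperEnv derives env vars from the Model spec and lets explicit
// WhisperConfig values override them. The mapping rules are assumptions.
func buildWhisperEnv(cfg WhisperConfig, model ModelSpec) []EnvVar {
	device := cfg.Device
	if device == "" {
		device = "cpu"
		if model.Accelerator == "cuda" {
			device = "cuda"
		}
	}
	computeType := cfg.ComputeType
	if computeType == "" {
		computeType = model.Quantization
	}

	env := []EnvVar{{Name: "WHISPER__INFERENCE_DEVICE", Value: device}}
	if computeType != "" {
		env = append(env, EnvVar{Name: "WHISPER__COMPUTE_TYPE", Value: computeType})
	}
	if cfg.TTLSeconds != nil {
		env = append(env, EnvVar{Name: "WHISPER__TTL", Value: strconv.Itoa(*cfg.TTLSeconds)})
	}
	return env
}

// lookup returns the value of the named env var, or "" if it is absent.
func lookup(env []EnvVar, name string) string {
	for _, e := range env {
		if e.Name == name {
			return e.Value
		}
	}
	return ""
}

func main() {
	env := buildWhisperEnv(WhisperConfig{}, ModelSpec{Accelerator: "cuda", Quantization: "int8"})
	for _, e := range env {
		fmt.Println(e.Name + "=" + e.Value)
	}
}
```

There is deliberately no model env var in the output: as noted in the description, speaches has no `WHISPER__MODEL` and models are selected per request.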