Commit 53ad099
disagg: add readinessProbe to worker pods — fixes T12 KubeDiscoveryClient 0-instances
Root cause found in source: Dynamo runtime's KubeDiscoveryClient
(lib/runtime/src/discovery/kube/daemon.rs:246 in v1.1.0 and 0.9.0)
filters EndpointSlices by endpoint.conditions.ready==true. Workers
whose pod Ready=False are filtered out, so Frontend's /v1/models
returns {"data":[]} and /v1/completions returns 404.
Our evidence showed ready=False on both prefill + decode worker
EndpointSlices because no readinessProbe was defined — k8s used the
default Pod Ready criterion, which stays False during vLLM's 60-180s
model load. Frontend was ready but alone; workers never appeared in
its instance map despite registering DWMs with matching namespaces.
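For illustration only, an EndpointSlice in the state described above
would look roughly like the fragment below. The object name and address
are hypothetical placeholders, not values from our cluster; only the
conditions.ready field is the part the discovery filter is described as
reading.

```yaml
# Illustrative sketch: name and address are made up for this example.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: example-decode-worker-abc12   # hypothetical name
addressType: IPv4
endpoints:
  - addresses:
      - "10.0.0.12"                   # hypothetical pod IP
    conditions:
      ready: false                    # endpoint is skipped while this is false
```

An endpoint like this one would be dropped by a ready==true filter even
though the worker process behind it has already registered.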
Fix: explicit TCP readinessProbe on DYN_SYSTEM_PORT 9191 for both
worker services:
- initialDelaySeconds: 180 (vLLM model load + cudagraph warmup)
- periodSeconds: 10
- timeoutSeconds: 3
- failureThreshold: 30 (tolerates 30 x 10s = 5 min of failed probes
  after the initial delay)
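A minimal sketch of the probe stanza those settings describe. The field
names are the standard Kubernetes Probe fields; where the stanza sits
inside each worker service spec is an assumption, since it depends on
the manifest layout.

```yaml
# Sketch only: placement within the worker spec is assumed.
readinessProbe:
  tcpSocket:
    port: 9191              # DYN_SYSTEM_PORT
  initialDelaySeconds: 180  # vLLM model load + cudagraph warmup
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 30      # 30 probes x 10s period = 300s
```

With successThreshold left at its default of 1, the first successful
TCP connect after the initial delay marks the container Ready.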
Why DYN_SYSTEM_PORT and not vLLM's port: DYN_SYSTEM_ENABLED=true
(already set) makes the runtime open a system TCP listener on
DYN_SYSTEM_PORT as soon as the Dynamo framework initializes —
before vLLM finishes loading weights. A TCP probe on 9191 is a
cleaner "Dynamo is alive" signal than an HTTP /health probe on the
same port. The upstream operator attempts to inject the HTTP probe,
but our evidence shows it was not applied on our cluster. That
indicates either a regression in the operator's subComponentType:
prefill|decode injection or a default failureThreshold too short for
model-load times.
Key reconciliation: the operator code (component_worker.go) says
"ReadinessProbe in Dynamo worker context doesn't determine that the
worker is ready to receive traffic" — which is TRUE for Dynamo's
INTERNAL NATS/etcd routing but FALSE for the k8s-native
KubeDiscoveryClient path used by Frontend. The comment also
acknowledges: "Still important for external dependencies that rely
on Pod Readiness." Our case is exactly that.
This commit unblocks T12 /v1/completions end-to-end without any
upstream runtime change. Upstream issue ai-dynamo/dynamo#9200 has
been updated with the root-cause analysis + this workaround.
Validation on cluster deferred until H100 lock frees (currently
held by wave8-1778045853 until ~08:37 UTC 2026-05-06). Commit made
now so reviewers can proceed with an eye-check.
1 parent 23b0063 commit 53ad099
1 file changed
Lines changed: 23 additions & 0 deletions
Diff (line content not preserved in this capture):
- Hunk 1: 13 lines added after original line 118 (new lines 119-131)
- Hunk 2: 10 lines added after original line 206 (new lines 220-229)