Commit 53ad099

disagg: add readinessProbe to worker pods — fixes T12 KubeDiscoveryClient 0-instances

Root cause found in source: Dynamo runtime's KubeDiscoveryClient (lib/runtime/src/discovery/kube/daemon.rs:246 in v1.1.0 and 0.9.0) filters EndpointSlices by endpoint.conditions.ready==true. Workers whose pod Ready=False are filtered out, so Frontend's /v1/models returns {"data":[]} and /v1/completions returns 404.

Our evidence showed ready=False on both prefill + decode worker EndpointSlices because no readinessProbe was defined: k8s used the default Pod Ready criterion, which stays False during vLLM's 60-180s model load. Frontend was ready but alone; workers never appeared in its instance map despite registering DWMs with matching namespaces.

Fix: explicit TCP readinessProbe on DYN_SYSTEM_PORT 9191 for both worker services:
- initialDelaySeconds: 180 (vLLM model load + cudagraph warmup)
- periodSeconds: 10
- timeoutSeconds: 3
- failureThreshold: 30 (tolerates 5 min total before failing)

Why DYN_SYSTEM_PORT and not vLLM's port: DYN_SYSTEM_ENABLED=true (already set) makes the runtime open a system TCP listener on DYN_SYSTEM_PORT as soon as the Dynamo framework initializes, before vLLM finishes loading weights. A TCP probe on 9191 is a cleaner "Dynamo is alive" signal than an HTTP /health probe on the same port. The upstream operator attempts to inject the HTTP probe, but our evidence shows it wasn't actually applied on our cluster, indicating either a regression in the operator's subComponentType: prefill|decode injection or a default failureThreshold too short for model-load times.

Key reconciliation: the operator code (component_worker.go) says "ReadinessProbe in Dynamo worker context doesn't determine that the worker is ready to receive traffic", which is TRUE for Dynamo's INTERNAL NATS/etcd routing but FALSE for the k8s-native KubeDiscoveryClient path used by Frontend. The comment also acknowledges: "Still important for external dependencies that rely on Pod Readiness." Our case is exactly that.

This commit unblocks T12 /v1/completions end-to-end without any upstream runtime change. Upstream issue ai-dynamo/dynamo#9200 has been updated with the root-cause analysis + this workaround. Validation on cluster is deferred until the H100 lock frees (currently held by wave8-1778045853 until ~08:37 UTC 2026-05-06). The commit is made now so reviewers can start a visual review of the manifest in the meantime.
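
The filtering behaviour described in the commit message can be modelled with a short sketch. This is not Dynamo's Rust code from daemon.rs; it is a hypothetical Python illustration that represents an EndpointSlice as a plain dict using the Kubernetes field names `endpoints[].addresses` and `endpoints[].conditions.ready`. The function name `ready_addresses` and the sample IP are invented for the example, and treating a missing `ready` value as "not ready" is an assumption of the sketch.

```python
from typing import Any


def ready_addresses(endpoint_slice: dict[str, Any]) -> list[str]:
    """Return addresses of endpoints whose conditions.ready is exactly True."""
    addresses: list[str] = []
    for endpoint in endpoint_slice.get("endpoints") or []:
        conditions = endpoint.get("conditions") or {}
        # Strict filter (assumption of this sketch): ready=False, None,
        # or a missing key all cause the endpoint to be dropped.
        if conditions.get("ready") is True:
            addresses.extend(endpoint.get("addresses") or [])
    return addresses


# Hypothetical slices for one worker: during model load and after the probe passes.
loading = {"endpoints": [{"addresses": ["10.0.0.11"], "conditions": {"ready": False}}]}
serving = {"endpoints": [{"addresses": ["10.0.0.11"], "conditions": {"ready": True}}]}

print(ready_addresses(loading))  # []
print(ready_addresses(serving))  # ['10.0.0.11']
```

With a filter of this shape, a worker whose endpoint never reports ready=true contributes no addresses, which matches the symptom of an empty /v1/models list.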
1 parent 23b0063 commit 53ad099

1 file changed

Lines changed: 23 additions & 0 deletions

2.projects/dynamo-inference/k8s/dgd-dynamo-combined-vllm.yaml

@@ -116,6 +116,19 @@ spec:
             --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
             --max-model-len 4096 \
             --gpu-memory-utilization 0.85
+      # Readiness probe on DYN_SYSTEM_PORT — required for Frontend discovery.
+      # The Dynamo runtime's KubeDiscoveryClient (daemon.rs:246 in 1.1.0)
+      # filters EndpointSlices by endpoint.conditions.ready==true. Without
+      # a probe, k8s keeps Pod Ready=False during model load, which blocks
+      # Service endpoint readiness, which hides the worker from Frontend.
+      # Root-cause evidence: docs/evidence/multinode-2026-05-06-rev5/
+      readinessProbe:
+        tcpSocket:
+          port: 9191
+        initialDelaySeconds: 180
+        periodSeconds: 10
+        timeoutSeconds: 3
+        failureThreshold: 30
       resources:
         limits:
           cpu: "16"
@@ -204,6 +217,16 @@ spec:
             --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
             --max-model-len 4096 \
             --gpu-memory-utilization 0.85
+      # Readiness probe — see PrefillWorker above for full rationale.
+      # Without this, EndpointSlice stays ready=False during model load
+      # and Frontend's KubeDiscoveryClient returns 0 instances.
+      readinessProbe:
+        tcpSocket:
+          port: 9191
+        initialDelaySeconds: 180
+        periodSeconds: 10
+        timeoutSeconds: 3
+        failureThreshold: 30
       resources:
         limits:
           cpu: "16"
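
For a manual check of the probe settings in this diff, the sketch below approximates what a `tcpSocket` readiness probe does: it attempts a TCP connection and counts success if the connection opens within the timeout. It is an illustration under stated assumptions, not the kubelet's implementation. The helper names `tcp_probe` and `failure_window_seconds` are hypothetical; the constants are copied from the manifest.

```python
import socket

# Values copied from the readinessProbe in the diff above.
PORT = 9191
PERIOD_S = 10
TIMEOUT_S = 3
FAILURE_THRESHOLD = 30


def tcp_probe(host: str, port: int = PORT, timeout: float = TIMEOUT_S) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def failure_window_seconds(period: int = PERIOD_S,
                           threshold: int = FAILURE_THRESHOLD) -> int:
    """Approximate span of consecutive failed probes before the container is marked not ready."""
    return period * threshold


print(failure_window_seconds())  # 300
```

The 300-second result corresponds to the "5 min" figure in the commit message. Running `tcp_probe("<pod-ip>")` from inside the cluster would indicate whether the system listener on 9191 is accepting connections; the pod IP is left as a placeholder because it depends on the deployment.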