Commit 81327dd

rev6 — T12 readinessProbe fix: KubeDiscoveryClient PASS, /v1/models returns model
- Confirmed readinessProbe fix unblocks Frontend's KubeDiscoveryClient: 6 instances (was 0 in rev5). EndpointSlices now ready=true, /v1/models returns the registered model.
- T12d still NO-GO on a DIFFERENT root cause: NIXL falls back to UCX despite NIXL_BACKEND=LIBFABRIC. Cross-node UCX handshake on EFA fails. Separate issue from the readiness bug this revision closed.
- YAML changes:
  * readinessProbe httpGet /health on 9090 (matches operator-injected probe, extended failureThreshold=60 to tolerate model load)
  * DYN_SYSTEM_PORT stays at operator default 9090 (not 9191) so probe port matches the runtime listener
  * envFromSecret: hf-token (was hf-token-secret which had empty HF_TOKEN)
  * shared PVC claim: dynamo-shared-storage (was fsx-pvc, not present on this cluster)
- Evidence bundle: docs/evidence/multinode-2026-05-06-rev6/
1 parent 53ad099 commit 81327dd

9 files changed

Lines changed: 1298 additions & 19 deletions

2.projects/dynamo-inference/k8s/dgd-dynamo-combined-vllm.yaml

Lines changed: 20 additions & 19 deletions
@@ -55,11 +55,12 @@ spec:
  # Frontend — HTTP ingress, no GPU
  # -----------------------------------------------------------------------
  Frontend:
+ envFromSecret: hf-token
  componentType: frontend
  replicas: 1
  extraPodSpec:
  mainContainer:
- image: 159553542841.dkr.ecr.us-west-2.amazonaws.com/dynamo-combined-efa:latest
+ image: 058264135704.dkr.ecr.us-east-2.amazonaws.com/dynamo-efa:9467d1460c71
  imagePullPolicy: IfNotPresent
  command: ["/bin/bash", "-c"]
  args:
@@ -80,7 +81,7 @@ spec:
  # PrefillWorker — 1 GPU, EFA-attached
  # -----------------------------------------------------------------------
  PrefillWorker:
- envFromSecret: hf-token-secret
+ envFromSecret: hf-token
  componentType: worker
  subComponentType: prefill
  replicas: 1
@@ -95,11 +96,11 @@ spec:
  - name: hugepages
  emptyDir: { medium: HugePages }
  - name: shared
- persistentVolumeClaim: { claimName: fsx-pvc }
+ persistentVolumeClaim: { claimName: dynamo-shared-storage }
  - name: shm
  emptyDir: { medium: Memory, sizeLimit: 64Gi }
  mainContainer:
- image: 159553542841.dkr.ecr.us-west-2.amazonaws.com/dynamo-combined-efa:latest
+ image: 058264135704.dkr.ecr.us-east-2.amazonaws.com/dynamo-efa:9467d1460c71
  imagePullPolicy: IfNotPresent
  securityContext: { privileged: true }
  command: ["/bin/bash", "-c"]
@@ -123,12 +124,13 @@ spec:
  # Service endpoint readiness, which hides the worker from Frontend.
  # Root-cause evidence: docs/evidence/multinode-2026-05-06-rev5/
  readinessProbe:
- tcpSocket:
- port: 9191
- initialDelaySeconds: 180
+ httpGet:
+ path: /health
+ port: 9090
+ initialDelaySeconds: 120
  periodSeconds: 10
- timeoutSeconds: 3
- failureThreshold: 30
+ timeoutSeconds: 5
+ failureThreshold: 60
  resources:
  limits:
  cpu: "16"
@@ -151,7 +153,6 @@ spec:
  - { name: DYNAMO_BACKEND, value: "vllm" }
  - { name: ETCD_ENDPOINTS, value: "http://dynamo-platform-etcd.default.svc.cluster.local:2379" }
  - { name: NATS_SERVER, value: "nats://dynamo-platform-nats.default.svc.cluster.local:4222" }
- - { name: DYN_SYSTEM_PORT, value: "9191" }
  - { name: DYN_SYSTEM_ENABLED, value: "true" }
  # NIXL KV-cache transport over EFA
  - { name: NIXL_BACKEND, value: "LIBFABRIC" }
@@ -172,7 +173,7 @@ spec:
  # DecodeWorker — 1 GPU, EFA-attached, anti-affinity to prefill
  # -----------------------------------------------------------------------
  DecodeWorker:
- envFromSecret: hf-token-secret
+ envFromSecret: hf-token
  componentType: worker
  subComponentType: decode
  replicas: 1
@@ -196,11 +197,11 @@ spec:
  - name: hugepages
  emptyDir: { medium: HugePages }
  - name: shared
- persistentVolumeClaim: { claimName: fsx-pvc }
+ persistentVolumeClaim: { claimName: dynamo-shared-storage }
  - name: shm
  emptyDir: { medium: Memory, sizeLimit: 64Gi }
  mainContainer:
- image: 159553542841.dkr.ecr.us-west-2.amazonaws.com/dynamo-combined-efa:latest
+ image: 058264135704.dkr.ecr.us-east-2.amazonaws.com/dynamo-efa:9467d1460c71
  imagePullPolicy: IfNotPresent
  securityContext: { privileged: true }
  command: ["/bin/bash", "-c"]
@@ -221,12 +222,13 @@ spec:
  # Without this, EndpointSlice stays ready=False during model load
  # and Frontend's KubeDiscoveryClient returns 0 instances.
  readinessProbe:
- tcpSocket:
- port: 9191
- initialDelaySeconds: 180
+ httpGet:
+ path: /health
+ port: 9090
+ initialDelaySeconds: 120
  periodSeconds: 10
- timeoutSeconds: 3
- failureThreshold: 30
+ timeoutSeconds: 5
+ failureThreshold: 60
  resources:
  limits:
  cpu: "16"
@@ -249,7 +251,6 @@ spec:
  - { name: DYNAMO_BACKEND, value: "vllm" }
  - { name: ETCD_ENDPOINTS, value: "http://dynamo-platform-etcd.default.svc.cluster.local:2379" }
  - { name: NATS_SERVER, value: "nats://dynamo-platform-nats.default.svc.cluster.local:4222" }
- - { name: DYN_SYSTEM_PORT, value: "9191" }
  - { name: DYN_SYSTEM_ENABLED, value: "true" }
  - { name: NIXL_BACKEND, value: "LIBFABRIC" }
  - { name: NIXL_SKIP_TOPOLOGY_CHECK, value: "1" }
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
# rev6: T12 ReadinessProbe Fix Validation

## Summary

**T12a/b/c (discovery chain): PASS.** The readinessProbe fix on worker pods
resolves the `KubeDiscoveryClient::list returning 0 instances` bug reported
in rev5. All three Dynamo services (Frontend, PrefillWorker, DecodeWorker)
reach `Running+Ready`, their EndpointSlices carry `ready=true`, and Frontend
returns the model in `/v1/models`.

**T12d (end-to-end /v1/completions): NO-GO.** This gate is blocked by a **separate**
downstream bug: NIXL backend selection falls back to UCX even when
`NIXL_BACKEND=LIBFABRIC` is set, and the cross-node UCX handshake fails with
`handshake_failed`. The failure is not caused by the namespace / readiness issue
this revision was meant to close out.

## What changed vs rev5

- `k8s/dgd-dynamo-combined-vllm.yaml`: added an explicit `readinessProbe`
  on `PrefillWorker` and `DecodeWorker` (httpGet `/health` on
  `DYN_SYSTEM_PORT=9090`, initialDelay 120 s, failureThreshold 60).
- Rationale documented in `docs/DEEPEP-HANDOFF-2026-04-22.md` and the new
  skill `~/.claude/skills/dynamo-kube-discovery-readiness/SKILL.md`.
- Upstream PR filed to fix the misleading comment in
  `component_worker.go`: ai-dynamo/dynamo#9201.

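The effect of the timing change can be checked with a little arithmetic. The sketch below compares the nominal probe window (initial delay plus `failureThreshold` probe periods) before and after this revision, using the values shown in the diff. The `ProbeTiming` class and its method name are illustrative, not part of Dynamo or Kubernetes. Note that a readiness probe only gates endpoint readiness and never restarts the container, so this window is a rough budget for model load rather than a hard deadline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeTiming:
    """Subset of Kubernetes probe timing fields touched in this revision (illustrative type)."""
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int

    def nominal_window_seconds(self) -> int:
        # Initial delay plus the time spanned by failure_threshold probe periods.
        return self.initial_delay_seconds + self.period_seconds * self.failure_threshold


# Values copied from the worker readinessProbe in the diff.
PREVIOUS = ProbeTiming(initial_delay_seconds=180, period_seconds=10, failure_threshold=30)
REV6 = ProbeTiming(initial_delay_seconds=120, period_seconds=10, failure_threshold=60)

if __name__ == "__main__":
    print("previous window:", PREVIOUS.nominal_window_seconds(), "s")
    print("rev6 window:", REV6.nominal_window_seconds(), "s")
```

Under these assumptions the nominal window grows from 480 s to 720 s even though the initial delay shrinks.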
## Gate matrix

| Gate | Status | Evidence |
| ---- | ------ | -------- |
| T1–T11 | PASS (same as rev5) | See rev5 evidence bundle |
| T12a: pods Ready | **PASS** | `t12-pods.txt` shows 3/3 Running 1/1 Ready |
| T12b: EndpointSlices ready=true | **PASS** | `t12-endpointslices.yaml` |
| T12c: `/v1/models` returns model | **PASS** | Frontend `KubeDiscoveryClient::list returning 6 instances`; curl shows `{"id":"meta-llama/Llama-3.1-8B-Instruct"...}` |
| T12d: `/v1/completions` returns text | **NO-GO** | `t12-decode.log` shows `NIXL transfer failure: handshake_failed`; UCX backend instantiated instead of LIBFABRIC |

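A gate like T12b can be re-checked mechanically. The following is a minimal sketch, assuming the EndpointSlices are exported as JSON (for example with `kubectl get endpointslices -o json`) rather than the YAML stored in the evidence bundle, so that only the standard library is needed. The function name and the sample data are hypothetical. It treats only an explicit `ready: true` as passing, which is a deliberately strict choice for a gate check.

```python
import json
from typing import Any


def unready_endpoints(endpoint_slices: dict[str, Any]) -> list[str]:
    """Return "slice-name/address" for every endpoint not explicitly marked ready."""
    problems: list[str] = []
    for item in endpoint_slices.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        for endpoint in item.get("endpoints") or []:
            ready = (endpoint.get("conditions") or {}).get("ready")
            # Strict gate: anything other than an explicit True counts as a failure.
            if ready is not True:
                for address in endpoint.get("addresses", []):
                    problems.append(f"{name}/{address}")
    return problems


# Hypothetical sample shaped like an EndpointSlice list; names and addresses are made up.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "worker-a"},
     "endpoints": [{"addresses": ["10.0.0.1"], "conditions": {"ready": true}}]},
    {"metadata": {"name": "worker-b"},
     "endpoints": [{"addresses": ["10.0.0.2"], "conditions": {"ready": false}}]}
  ]
}
""")

if __name__ == "__main__":
    print(unready_endpoints(SAMPLE))
```

An empty result corresponds to the T12b PASS condition; any entry identifies the slice and address that Frontend discovery would not see.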
## Root-cause chain confirmed

The rev5 hypothesis (EndpointSlice `ready=False` blocks Frontend discovery)
is now **confirmed** by the fix passing T12a–c:

```
rev5 (no probe)           rev6 (httpGet /health probe)
─────────────────         ─────────────────────────────
Pod.Ready = False         Pod.Ready = True
EndpointSlice.ready=F     EndpointSlice.ready=T
Frontend.instances = 0    Frontend.instances = 6
/v1/models → []           /v1/models → [{"id":"...3.1-8B..."}]
```

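The last row of the comparison can be turned into a small check. This sketch assumes the `/v1/models` body follows the OpenAI-style envelope (`{"object": "list", "data": [...]}`); that shape is an assumption, since the evidence above only shows an abbreviated list. The helper names `model_ids` and `t12c_passes` are hypothetical.

```python
import json


def model_ids(models_response: str) -> list[str]:
    """Extract model ids from a /v1/models response body."""
    payload = json.loads(models_response)
    # Assumption: OpenAI-style envelope with a "data" list.
    # A bare JSON list is also accepted, matching the abbreviated form in the summary.
    entries = payload.get("data", []) if isinstance(payload, dict) else payload
    return [entry["id"] for entry in entries if isinstance(entry, dict) and "id" in entry]


def t12c_passes(models_response: str, expected_model: str) -> bool:
    """True when the expected model id is listed in the response."""
    return expected_model in model_ids(models_response)


if __name__ == "__main__":
    rev5_body = '{"object": "list", "data": []}'
    rev6_body = '{"object": "list", "data": [{"id": "meta-llama/Llama-3.1-8B-Instruct"}]}'
    print(t12c_passes(rev5_body, "meta-llama/Llama-3.1-8B-Instruct"))
    print(t12c_passes(rev6_body, "meta-llama/Llama-3.1-8B-Instruct"))
```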
## Remaining T12d blocker (separate from this revision)

NIXL `add_remote_agent` fails with `handshake_failed` across nodes.
Decode log excerpt:

```
Backend UCX was instantiated
...
NIXL transfer failure: handshake_failed
request_id: 1c25e542-9420-...
remote_host: 10.1.0.198
remote_port: 5700
```

Despite setting:

- `NIXL_BACKEND=LIBFABRIC`
- `VLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC`
- `FI_PROVIDER=efa`, `FI_EFA_USE_DEVICE_RDMA=1`

the NIXL Python `_api.py` selects UCX. Both `libplugin_LIBFABRIC.so` and
`libplugin_UCX.so` are present in `/opt/nvidia/nvda_nixl/lib64/plugins/`,
so the plugin is available but not chosen. This is the next layer to
debug; it does not affect the T12 KubeDiscoveryClient closure.

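The mismatch between the requested and instantiated backend can be spotted by scanning the worker log. The sketch below relies only on the `Backend UCX was instantiated` line quoted in the excerpt; that other backends log the same phrasing is an assumption, and the function names are hypothetical.

```python
import re

# Pattern taken from the decode log excerpt; assumed to hold for other backends too.
BACKEND_LINE = re.compile(r"Backend (\w+) was instantiated")


def instantiated_backends(log_text: str) -> list[str]:
    """Return every backend name reported as instantiated, in log order."""
    return BACKEND_LINE.findall(log_text)


def backend_mismatch(log_text: str, requested: str) -> list[str]:
    """Return instantiated backends that differ from the requested one."""
    return [b for b in instantiated_backends(log_text) if b.upper() != requested.upper()]


EXCERPT = """\
Backend UCX was instantiated
...
NIXL transfer failure: handshake_failed
"""

if __name__ == "__main__":
    # With NIXL_BACKEND=LIBFABRIC requested, any entry here flags the fallback.
    print(backend_mismatch(EXCERPT, "LIBFABRIC"))
```

Running this against `t12-decode.log` with the value of `NIXL_BACKEND` gives a quick signal of whether a later revision has fixed the backend selection.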
## Environment

- Image: `058264135704.dkr.ecr.us-east-2.amazonaws.com/dynamo-efa:9467d1460c71`
  (Dynamo 1.1.0)
- Operator: `nvcr.io/nvidia/ai-dynamo/kubernetes-operator:1.0.1`
- Model: `meta-llama/Llama-3.1-8B-Instruct`
- Nodes: `hyperpod-i-0a3eb6d3953cceaa7` (frontend + prefill),
  `ip-10-1-0-198` (decode). `hyperpod-i-01aee349f9991c414` is cordoned
  due to an unrelated containerd pause-image issue.

## Files

- `t12-dgd-applied.yaml`: live DGD spec
- `t12-pods.txt`: pod status (all Ready)
- `t12-endpointslices.yaml`: EndpointSlice readiness (all ready=true)
- `t12-dwm.yaml`: DynamoWorkerMetadata CRs (present, names match pods)
- `t12-frontend.log`: Frontend discovery trace
- `t12-prefill.log`: PrefillWorker model load + endpoint register
- `t12-decode.log`: DecodeWorker NIXL handshake failure
