
Commit 4c43f0f

rev7 — T12 FULL PASS: /v1/completions returns real tokens over EFA + NIXL LIBFABRIC
THE FIX (one line in the --kv-transfer-config JSON):

-'{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
+'{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'

ROOT CAUSE: vLLM NixlConnector defaults to backends=["UCX"] at nixl_connector.py:1022-1024. Neither NIXL_BACKEND nor VLLM_NIXL_KVCACHE_BACKEND is read anywhere, so the env vars we'd been setting since rev2 were silent no-ops (vLLM even emits "Unknown vLLM environment variable detected" for the latter). Only the extra_config.backends JSON path selects the libfabric plugin.

GATES (all PASS):
- Pods Running & Ready
- EndpointSlices ready=true (from rev6 readinessProbe fix)
- KubeDiscoveryClient returns 6 instances (up from 0 in rev5)
- /v1/models returns {"id":"meta-llama/Llama-3.1-8B-Instruct",...}
- NIXL decode log: "Backend LIBFABRIC was instantiated" (was UCX in rev6)
- /v1/completions non-stream HTTP 200: " Paris. The capital of France is Paris..."
- /v1/completions SSE HTTP 200: full token stream + [DONE]
- /v1/chat/completions HTTP 200: "It's nice to meet you..."

EVIDENCE:
- docs/evidence/multinode-2026-05-06-rev7/ — full DGD, pods, EndpointSlices, logs for Frontend + Prefill + Decode, plus the 3 curl response bodies
- docs/T12-HYPOTHESES-AND-FINDINGS.md — full debug trail rev3 -> rev7 so future revs don't re-walk the same dead ends

UPSTREAM:
- vllm-project/vllm#41814 filed: ask vLLM to add a NIXL_BACKEND env read and document the extra_config.backends path
- Side bug in Dynamo handlers.py:2014 identified (bare "error" string triggers Rust enum deserialize failure on empty-outputs sad path). Tracked as Dynamo follow-up, not blocking T12.

CLIENT CAVEAT: Always send explicit "stream":true or "stream":false in the request body. Omitting it hits a Dynamo Frontend fold path that chokes on the finish_reason enum shape. This is unrelated to transport and is avoided by sending the explicit flag.
1 parent 81327dd commit 4c43f0f

15 files changed

Lines changed: 1817 additions & 0 deletions
Lines changed: 244 additions & 0 deletions
@@ -0,0 +1,244 @@
# T12 — Hypotheses and findings log (rev3 → rev7)

This document captures the chain of hypotheses, false leads, validated findings,
and upstream issues opened while debugging T12 (`/v1/completions` end-to-end on
disaggregated vLLM + EFA RDMA + NIXL). It exists so future revs don't re-walk
the same dead ends.

## Timeline

| Rev | Date (UTC) | Outcome | Primary hypothesis under test |
| --- | ---------- | ------- | ----------------------------- |
| rev3 | 2026-05-05 | NO-GO | Namespace mismatch between 3 separate DGDs |
| rev4 | 2026-05-06 | NO-GO | Namespace mismatch, operator hash suffix |
| rev5 | 2026-05-06 | NO-GO (3 of 4 gates) | Dynamo 1.1.0 bump + discovery timing |
| rev6 | 2026-05-06 | Partial PASS (T12a/b/c) | Missing worker `readinessProbe` blocks EndpointSlice |
| rev7 | 2026-05-06 | **FULL PASS** | vLLM NIXL backend default is hardcoded UCX |

## The hypothesis chain

### H1 — Three DGDs create per-service namespace mismatch (rev3)

**Hypothesis**: The operator stamps each service with
`<k8s-ns>-<dgd-name>-<service>` as the Dynamo namespace. Three separate DGDs
→ three different namespaces → Frontend can't see workers.

**Evidence**: `kubectl get dgd -o yaml` shows the operator warning
"`spec.services[X].dynamoNamespace is deprecated and ignored`" on all three.
Frontend log: `KubeDiscoveryClient::list returning 0 instances`.

**Outcome**: Merge to single DGD. Partial — see H2.

**Artifact**: skill `dgd-operator-namespace-suffix`.

### H2 — Single DGD still adds a per-worker hash suffix (rev4)

**Hypothesis**: Even with one DGD, the operator adds a content-based hash
suffix to worker namespaces (but not Frontend), so Frontend can't reach
workers.

**Evidence**: Worker namespaces observed as
`default-dynamo-combined-vllm-ae74a2d2` vs Frontend's
`default-dynamo-combined-vllm`.

**Fix**: `DYN_NAMESPACE=default-dynamo-combined-vllm` +
`DYN_NAMESPACE_WORKER_SUFFIX=""` on all services.

**Outcome**: Namespaces align. Still `0 instances`. H2 was necessary but not
sufficient.

### H3 — Dynamo 1.0.1 has a `KubeDiscoveryClient` bug fixed in 1.1.0 (rev5)

**Hypothesis**: Upstream issue 9200 implies 1.1.0 fixes the namespace/discovery
filter. Bump image to 1.1.0 and retry.

**Evidence**: Source inspection of `lib/runtime/src/discovery/kube/daemon.rs`
shows identical predicate at line 246 in both 1.0.1 and 1.1.0.

**Outcome**: Same symptom, same `0 instances`. Bump didn't help — the bug
isn't in the namespace/CR matching logic.

### H4 — `ready=False` on worker EndpointSlices blocks discovery (rev5 → rev6)

**Hypothesis**: The `daemon.rs:246` predicate filters on
`endpoint.conditions.ready == true`. Without a `readinessProbe`, kubelet keeps
Pod Ready=False during model load, which keeps the EndpointSlice not-ready,
which filters out the worker silently.

**Evidence**:
```
$ kubectl get endpointslice -o jsonpath='{.items[*].endpoints[*].conditions.ready}'
false false false ← all workers
```
Source: `lib/runtime/src/discovery/kube/utils.rs:22-51` and
`daemon.rs:246`.

**Fix**: Add explicit `readinessProbe` (`httpGet /health`,
`failureThreshold: 60`, `initialDelaySeconds: 120`) to each worker's
`mainContainer`. Match the operator's injected port (`system`, 9090) and
handler type to avoid "may not specify more than 1 handler type" k8s rejection.
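
The probe values above imply a model-load window, using the same formula the regression checklist in this document compares load time against. The sketch below only does that arithmetic. `readiness_budget_seconds` is a hypothetical helper, and `periodSeconds=10` is an assumption (the Kubernetes default), since the fix does not state a period.

```python
def readiness_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Window implied by the probe settings: initialDelaySeconds + periodSeconds * failureThreshold."""
    return initial_delay + period * failure_threshold


# initialDelaySeconds=120 and failureThreshold=60 come from the fix above.
# periodSeconds=10 is assumed (Kubernetes default); adjust if the DGD sets it.
budget = readiness_budget_seconds(initial_delay=120, period=10, failure_threshold=60)
print(f"model load must finish within {budget} s")  # 120 + 10 * 60 = 720
```

If the measured model-load time approaches this number, raising `failureThreshold` is the least invasive knob.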

**Outcome**: **T12a/b/c PASS.** Frontend reports `KubeDiscoveryClient::list
returning 6 instances`. `/v1/models` returns the registered model.

**Artifact**: skill `dynamo-kube-discovery-readiness`. Upstream docs PR
`ai-dynamo/dynamo#9201` to fix the misleading `component_worker.go` comment.

### H5 — NIXL falls back to UCX despite `NIXL_BACKEND=LIBFABRIC` (rev6 → rev7)

**Hypothesis (initial, WRONG)**: Setting `NIXL_BACKEND=LIBFABRIC` and
`VLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC` env vars should select the libfabric
backend plugin.

**What we observed**: Decode log prints `Backend UCX was instantiated` and
subsequent `handshake_failed` on `add_remote_agent()` with EFA cross-node
transfers. vLLM warning: `Unknown vLLM environment variable detected:
VLLM_NIXL_KVCACHE_BACKEND`.

**Actual root cause (validated via source read)**:
`vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1022-1024`:
```python
self.nixl_backends = vllm_config.kv_transfer_config.get_from_extra_config(
    "backends", ["UCX"]
)
```
There is no env var read path. `NIXL_BACKEND`, `VLLM_NIXL_KVCACHE_BACKEND`,
and every other env we had set since rev2 were silent no-ops. The only way
to select LIBFABRIC is via the `kv_connector_extra_config.backends` field
inside the `--kv-transfer-config` JSON.

**Fix**:
```yaml
args:
  - --kv-transfer-config
  - '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'
```
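
Hand-editing that JSON string inside YAML is easy to get wrong, and a malformed or missing `kv_connector_extra_config` quietly yields UCX again. A minimal sketch of generating and sanity-checking the value follows. Both function names are hypothetical, and `selected_backends` only imitates the default lookup quoted above; it does not call vLLM.

```python
import json


def kv_transfer_config(backends):
    """Build the --kv-transfer-config value with an explicit NIXL backend list."""
    if not backends:
        raise ValueError("backends must be a non-empty list, e.g. ['LIBFABRIC']")
    cfg = {
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {"backends": list(backends)},
    }
    # Compact separators keep the string identical to the one used in the DGD.
    return json.dumps(cfg, separators=(",", ":"))


def selected_backends(config_json):
    """Imitate the connector's lookup: default to ["UCX"] when the key is absent."""
    cfg = json.loads(config_json)
    return cfg.get("kv_connector_extra_config", {}).get("backends", ["UCX"])


rev6 = '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
rev7 = kv_transfer_config(["LIBFABRIC"])
print(selected_backends(rev6))  # the rev6 string carries no backends key
print(selected_backends(rev7))
```

Running this check in CI against the rendered DGD args would catch a regression before any pod starts.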

**Outcome**: Decode log: `Backend LIBFABRIC was instantiated`. No more
`handshake_failed`. NIXL exchanges libfabric-native EP names.

**Upstream issues filed**:
- `vllm-project/vllm#41814` — "NixlConnector hardcodes backends=\"UCX\" default; no env-var override path"
- Cross-references `aws-samples/awsome-inference#72` and the earlier
  `ai-dynamo/dynamo#9200`.

### H6 — `fold completions stream: invalid type: unit variant, expected newtype variant` (rev7 intermediate, WRONG interpretation)

**Hypothesis (initial)**: Operator 1.0.1 vs runtime 1.1.0 version skew, or
prefill is returning a full-response shape when it should return KV handles
only.

**What we observed**:
```
Failed deserializing JSON to response
err: invalid type: unit variant, expected newtype variant at line 1 column 55
json_str: {"data":{"data":{"token_ids":[],"finish_reason":"error",...}}}
```
The Frontend routed the request to `component=prefill` (not `backend`/decode),
and the worker returned a double-wrapped JSON with `finish_reason:"error"`
and zero tokens. The Rust deserializer on the Frontend choked at column 55
on the `"error"` string.

**Actual root cause**: Client-side. Our `curl` request body omitted the
`stream` field, and a body without it sends the Dynamo Frontend down a code
path that folds the streaming frames into a single completion. When the
upstream shape doesn't match the `ChoiceFinishReason` Rust enum (which expects
the `{"error":<msg>}` newtype, not the bare string `"error"`), the fold fails.
An explicit `stream:true` or `stream:false` takes a different code path that
parses correctly.

**Proof**:
- `curl -d '...,"stream":false'` → HTTP 200 with real tokens.
- `curl -d '...,"stream":true'` → HTTP 200 SSE tokens + `[DONE]`.
- `/v1/chat/completions` → HTTP 200 with real assistant message.
- All three hit `component=backend` (decode worker), not `component=prefill`
  as the failing case did. So the request *was* being routed correctly once
  the client sent a well-formed body.

**Side note**: rev7 prefill *did* also receive and complete its request
(request_id `4e808b0b-...`) in 357 ms, so the KV handles were produced
correctly. The fold error was purely in how the Frontend consumed what
*both* workers returned. The `{"data":{"data":...}}` envelope is the normal
Dynamo push-handler wrapper; it was only misinterpreted because the
implicit-stream fold path expects it with specific enum shapes.

**No code fix required for our rev7 PASS.** Client should send explicit
`stream:false` or `stream:true`.
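
One way to make the explicit flag hard to forget is to build request bodies through a helper that refuses to omit it. This is a sketch under assumptions: `completion_body` is a hypothetical client-side helper, and the prompt and `max_tokens` value are illustrative, not taken from the evidence bundle.

```python
import json


def completion_body(model, prompt, stream, max_tokens=20):
    """Serialize a /v1/completions body that always carries an explicit stream flag."""
    if not isinstance(stream, bool):
        # Reject None or truthy stand-ins so the field is never dropped or mistyped.
        raise TypeError("stream must be True or False, not omitted")
    return json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens, "stream": stream}
    )


# Illustrative values; only the model name appears elsewhere in this document.
body = completion_body("meta-llama/Llama-3.1-8B-Instruct", "The capital of France is", stream=False)
print(body)
```

The resulting string can be passed to `curl -d` or any HTTP client unchanged.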

**Caveat (from second independent researcher pass)**: the underlying bug in
Dynamo's worker Python code is real and still present. On the sad path where
vLLM's `RequestOutput.outputs` is empty, `components/src/dynamo/vllm/handlers.py:2014`
emits `"finish_reason":"error"` (bare string). The Rust `FinishReason` enum at
`lib/llm/src/protocols/common.rs:44-56` is:

```rust
#[serde(rename = "error")]
Error(String), // newtype — expects {"error":"<msg>"} or "error: <msg>"
```

So a bare `"error"` triggers the `invalid type: unit variant, expected newtype
variant` deserialize error. A sister emission at `handlers.py:1705-1710` uses
the correct form: `"finish_reason": "error: No outputs from vLLM engine"`.

Our rev7 request stream never hits this path because it succeeds. But if a
future dispatch returns empty outputs (e.g., NIXL hiccup, engine abort), the
Frontend will again return HTTP 500 instead of surfacing a real error. Track as
Dynamo follow-up:

- Patch `handlers.py:2014` to emit `"error: <reason>"` instead of bare
  `"error"`.
- Or harden the Rust enum to accept `BareError` via `#[serde(untagged)]`.

Filed as follow-up. Not required for T12 closure.
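
The first follow-up option amounts to never letting a bare `"error"` leave the worker. The sketch below shows that normalization in isolation. `normalize_finish_reason` is a hypothetical helper written for illustration, not the code at `handlers.py:2014` and not the eventual Dynamo patch; the default detail string is also an assumption.

```python
def normalize_finish_reason(reason, detail="unspecified"):
    """Rewrite a bare "error" as "error: <detail>"; pass every other value through."""
    if reason == "error":
        return f"error: {detail}"
    return reason


# A bare error gains a message, matching the form used by the sister emission.
print(normalize_finish_reason("error", "No outputs from vLLM engine"))
# Normal finish reasons and already-qualified errors are left untouched.
print(normalize_finish_reason("length"))
print(normalize_finish_reason("error: engine abort"))
```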

## Knobs and settings that turned out to be no-ops

These were set at various points between rev2 and rev7. Only the first two
rows are true no-ops: removing them would not change the rev7 PASS outcome.
The remaining rows are listed for completeness and stay in the YAML, either
because another subsystem reads them or as a defensive measure:

| Env / arg | Effect | Keep? |
| --- | --- | --- |
| `NIXL_BACKEND=LIBFABRIC` | No read path in vLLM or NIXL Python API | Remove (was the whole trap) |
| `VLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC` | Not a recognized vLLM env (warning emitted) | Remove |
| `NIXL_SKIP_TOPOLOGY_CHECK=1` | Read by the NIXL topology module; kept as a defensive measure | Keep |
| `NIXL_LIBFABRIC_MAX_RAILS=1` | Read by libfabric plugin at runtime; active | Keep |
| `FI_PROVIDER=efa`, `FI_EFA_USE_DEVICE_RDMA=1` | Read by libfabric at provider init; active | Keep |
| `FI_EFA_ENABLE_SHM=0`, `FI_EFA_ENABLE_SHM_TRANSFER=0` | Read by libfabric provider; active | Keep |
| `DYN_NAMESPACE=...` | Dynamo 1.1.0 honors this for namespace override; active | Keep (already load-bearing) |
| `DYN_NAMESPACE_WORKER_SUFFIX=""` | Dynamo 1.1.0 honors this; strips the hash suffix | Keep |
| `DYN_SYSTEM_PORT=9090` | Operator default; probe port alignment | Keep |
| `NIXL_PLUGIN_DIR=/opt/dynamo/.../plugins` | NIXL reads this to locate `libplugin_LIBFABRIC.so` | Keep (load-bearing) |

## What the rev6 readinessProbe + rev7 extra_config together unlock

The full dependency chain that had to land:

1. **Worker pod has `readinessProbe`** — otherwise EndpointSlice `ready=false`
2. **EndpointSlice `ready=true`** — otherwise `daemon.rs:246` filters
3. **KubeDiscoveryClient returns > 0 instances** — otherwise `/v1/models` is empty
4. **`/v1/models` registers the model** — otherwise Frontend routes 404
5. **NIXL instantiates LIBFABRIC (not UCX)** — otherwise cross-node handshake fails
6. **NIXL handshake succeeds** — otherwise KV transfer never starts
7. **Client sends explicit `stream:true/false`** — otherwise the implicit-fold path blows up on enum deserialization

All seven gates now pass. T12 is closed.

## Skills authored from this work

- `dgd-operator-namespace-suffix` — Layer 1-2 (namespace merge + DYN_NAMESPACE override)
- `dynamo-kube-discovery-readiness` — Layer 3 (the readinessProbe fix that was the real gate)
- `codebuild-dockerhub-429-recovery` — CodeBuild retry + public-ECR mirror pattern
- New skill forthcoming: `nixl-libfabric-backend-selection` — document the `extra_config.backends` route and the no-op envs

## Upstream issues / PRs opened

- `ai-dynamo/dynamo#9200` — KubeDiscoveryClient diagnostic gap
- `ai-dynamo/dynamo#9201` — docs PR: clarify `component_worker.go` ReadinessProbe comment
- `vllm-project/vllm#41814` — NixlConnector hardcoded UCX default + suggested env var + docs

## What to do if T12 regresses

1. Check `kubectl get endpointslice -o jsonpath='{.items[*].endpoints[*].conditions.ready}'`. All values should be `true`. If any are `false`, either the probe is misconfigured or model load takes longer than `initialDelaySeconds + periodSeconds * failureThreshold`.
2. Check the decode log for `Backend LIBFABRIC was instantiated`. If it says UCX, the `kv_connector_extra_config` JSON is missing or malformed.
3. Check that the client request has an explicit `stream:true` or `stream:false`. If the body omits it, expect the `unit variant` deserialize error.
4. For deeper failures, see the rev7 evidence bundle for a known-good snapshot.
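
Checks 1 and 2 can be scripted against saved output. The following is a rough sketch, assuming the EndpointSlice list has been captured with `kubectl get endpointslice -o json` and the decode log saved as text. Both function names are hypothetical; the JSON fields used (`items`, `endpoints`, `addresses`, `conditions.ready`) are the standard EndpointSlice fields, and the sample address is made up.

```python
import re


def not_ready_endpoints(endpointslice_list):
    """Return the addresses of every endpoint whose conditions.ready is not true."""
    bad = []
    for item in endpointslice_list.get("items", []):
        for endpoint in item.get("endpoints") or []:
            ready = (endpoint.get("conditions") or {}).get("ready")
            if ready is not True:
                bad.extend(endpoint.get("addresses", []))
    return bad


def instantiated_backend(log_text):
    """Return the backend named in the first 'Backend X was instantiated' line, if any."""
    match = re.search(r"Backend (\S+) was instantiated", log_text)
    return match.group(1) if match else None


# Illustrative inputs; real ones would come from kubectl and the decode pod log.
slices = {"items": [{"endpoints": [{"addresses": ["10.0.0.2"], "conditions": {"ready": False}}]}]}
log = "NIXL INFO _api.py:361 Backend LIBFABRIC was instantiated"
print(not_ready_endpoints(slices))
print(instantiated_backend(log))
```

An empty list from the first function and `LIBFABRIC` from the second correspond to gates 2 and 5 in the dependency chain.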
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
# rev7 — T12d PASS — end-to-end disaggregated /v1/completions

**Status: ALL GATES GREEN.** Disaggregated inference returns real tokens over
EFA RDMA + NIXL LIBFABRIC transport on the `dynamo-combined-efa` image.

## Summary of the rev6 → rev7 delta

One-line fix in `k8s/dgd-dynamo-combined-vllm.yaml`: extend
`--kv-transfer-config` JSON with `kv_connector_extra_config.backends`:

```yaml
# rev6 (broken — silently fell back to UCX which can't handshake cross-node):
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# rev7 (forces LIBFABRIC, cross-node transfer works):
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'
```

## Root cause (independent investigation, confirmed)

vLLM's `NixlConnector` has the default hardcoded at
`vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1022-1024`:

```python
self.nixl_backends = vllm_config.kv_transfer_config.get_from_extra_config(
    "backends", ["UCX"]
)
```

No env-var read path. `NIXL_BACKEND`, `VLLM_NIXL_KVCACHE_BACKEND` (and every
other env we tried from rev2 onward) are silently ignored. The only way to
pick LIBFABRIC is via the JSON extra_config.

Upstream issue filed: `vllm-project/vllm#41814`.

## Gate matrix

| Gate | rev5 | rev6 | rev7 | Evidence |
| ---- | ---- | ---- | ---- | -------- |
| Pods Running & Ready | FAIL | PASS | **PASS** | `t12-pods.txt` |
| EndpointSlices `ready=true` | FAIL | PASS | **PASS** | `t12-endpointslices.yaml` |
| KubeDiscoveryClient instances > 0 | 0 | 6 | **6** | `t12-frontend-PASS.log` |
| `/v1/models` returns model | FAIL | PASS | **PASS** | same log |
| NIXL backend == LIBFABRIC | N/A | FAIL (UCX) | **PASS** | `t12-decode-PASS.log`: `Backend LIBFABRIC was instantiated` |
| `/v1/completions` non-stream | FAIL | FAIL | **PASS** | `t12d-nostream.json` |
| `/v1/completions` SSE stream | FAIL | FAIL | **PASS** | `t12d-stream.sse` |
| `/v1/chat/completions` | FAIL | FAIL | **PASS** | `t12d-chat.json` |

## Proof of work

`t12d-nostream.json` (HTTP 200, 745 ms):

```json
{"id":"cmpl-0807a0be-...","choices":[{"text":" Paris. The capital of France is Paris. The capital of France is Paris. The capital of France","index":0,"finish_reason":"length"}],"object":"text_completion","usage":{"prompt_tokens":5,"completion_tokens":20,"total_tokens":25}}
```

`t12d-chat.json` (HTTP 200, 71 ms):

```json
{"id":"chatcmpl-25a2658e-...","choices":[{"index":0,"message":{"content":"It's nice to meet you. Is there something","role":"assistant"}...}],"object":"chat.completion","usage":{"prompt_tokens":36,"completion_tokens":10,"total_tokens":46}}
```

Worker trace (`t12-decode-PASS.log`):

```
NIXL INFO _api.py:361 Backend LIBFABRIC was instantiated
handle_payload: request received component=backend endpoint=generate ...
handle_payload: request completed (elapsed_ms=737)
```

## Environment

- Image: `058264135704.dkr.ecr.us-east-2.amazonaws.com/dynamo-efa:9467d1460c71` (Dynamo 1.1.0 + vLLM 0.19.1)
- Operator: `nvcr.io/nvidia/ai-dynamo/kubernetes-operator:1.0.1`
- Model: `meta-llama/Llama-3.1-8B-Instruct`
- Transport: EFA RDMA via libfabric 2.4.0amzn3.0, NIXL 0.6.x with LIBFABRIC plugin
- Nodes: `hyperpod-i-0a3eb6d3953cceaa7` (Frontend + Decode on H100), `ip-10-1-0-198` (Prefill on H100)

## Known minor warning (non-fatal)

```
W libfabric_rail_manager.cpp:543] Could not deduce average EFA device upstream link bandwidth,
W libfabric_rail_manager.cpp:259] Using default (all) rail selection policy for DRAM memory type
```

NIXL falls back to all-rail selection. No measured impact on T12d at our message sizes. Will tune `NIXL_LIBFABRIC_MAX_RAILS` in a follow-up if p50 latency becomes a gate.

## Files

- `t12d-nostream.json` — HTTP 200 JSON completion
- `t12d-stream.sse` — HTTP 200 SSE completion tokens + `[DONE]`
- `t12d-chat.json` — HTTP 200 chat completion
- `t12-dgd-applied.yaml` — DGD spec as applied
- `t12-pods.txt` — pod Ready status
- `t12-endpointslices.yaml` — EndpointSlice ready=true on all 3
- `t12-decode-PASS.log` — Backend LIBFABRIC + all 3 request completions
- `t12-prefill-PASS.log` — prefill request trace
- `t12-frontend-PASS.log` — Frontend routing trace
