| 1 | +# T12 — Hypotheses and findings log (rev3 → rev7) |
| 2 | + |
| 3 | +This document captures the chain of hypotheses, false leads, validated findings, |
| 4 | +and upstream issues opened while debugging T12 (`/v1/completions` end-to-end on |
| 5 | +disaggregated vLLM + EFA RDMA + NIXL). It exists so future revs don't re-walk |
| 6 | +the same dead ends. |
| 7 | + |
| 8 | +## Timeline |
| 9 | + |
| 10 | +| Rev | Date (UTC) | Outcome | Primary hypothesis under test | |
| 11 | +| --- | ---------- | ------- | ----------------------------- | |
| 12 | +| rev3 | 2026-05-05 | NO-GO | Namespace mismatch between 3 separate DGDs | |
| 13 | +| rev4 | 2026-05-06 | NO-GO | Namespace mismatch, operator hash suffix | |
| 14 | +| rev5 | 2026-05-06 | NO-GO (3 of 4 gates) | Dynamo 1.1.0 bump + discovery timing | |
| 15 | +| rev6 | 2026-05-06 | Partial PASS (T12a/b/c) | Missing worker `readinessProbe` blocks EndpointSlice | |
| 16 | +| rev7 | 2026-05-06 | **FULL PASS** | vLLM NIXL backend default is hardcoded UCX | |
| 17 | + |
| 18 | +## The hypothesis chain |
| 19 | + |
| 20 | +### H1 — Three DGDs create per-service namespace mismatch (rev3) |
| 21 | + |
| 22 | +**Hypothesis**: The operator stamps each service with |
| 23 | +`<k8s-ns>-<dgd-name>-<service>` as the Dynamo namespace. Three separate DGDs |
| 24 | +→ three different namespaces → Frontend can't see workers. |
| 25 | + |
| 26 | +**Evidence**: `kubectl get dgd -o yaml` shows the operator warning |
| 27 | +"`spec.services[X].dynamoNamespace is deprecated and ignored`" on all three. |
| 28 | +Frontend log: `KubeDiscoveryClient::list returning 0 instances`. |
| 29 | + |
| 30 | +**Outcome**: Merge to single DGD. Partial — see H2. |
| 31 | + |
| 32 | +**Artifact**: skill `dgd-operator-namespace-suffix`. |
| 33 | + |
| 34 | +### H2 — Single DGD still adds a per-worker hash suffix (rev4) |
| 35 | + |
| 36 | +**Hypothesis**: Even with one DGD, the operator adds a content-based hash |
| 37 | +suffix to worker namespaces (but not Frontend), so Frontend can't reach |
| 38 | +workers. |
| 39 | + |
| 40 | +**Evidence**: Worker namespaces observed as |
| 41 | +`default-dynamo-combined-vllm-ae74a2d2` vs Frontend's |
| 42 | +`default-dynamo-combined-vllm`. |
| 43 | + |
| 44 | +**Fix**: `DYN_NAMESPACE=default-dynamo-combined-vllm` + |
| 45 | +`DYN_NAMESPACE_WORKER_SUFFIX=""` on all services. |
| 46 | + |
| 47 | +**Outcome**: Namespaces align. Still `0 instances`. H2 was necessary but not |
| 48 | +sufficient. |
| 49 | + |
| 50 | +### H3 — Dynamo 1.0.1 has a `KubeDiscoveryClient` bug fixed in 1.1.0 (rev5) |
| 51 | + |
| 52 | +**Hypothesis**: Upstream issue 9200 implies 1.1.0 fixes the namespace/discovery |
| 53 | +filter. Bump image to 1.1.0 and retry. |
| 54 | + |
| 55 | +**Evidence**: Source inspection of `lib/runtime/src/discovery/kube/daemon.rs` |
| 56 | +shows identical predicate at line 246 in both 1.0.1 and 1.1.0. |
| 57 | + |
| 58 | +**Outcome**: Same symptom, same `0 instances`. Bump didn't help — the bug |
| 59 | +isn't in the namespace/CR matching logic. |
| 60 | + |
| 61 | +### H4 — `ready=False` on worker EndpointSlices blocks discovery (rev5 → rev6) |
| 62 | + |
| 63 | +**Hypothesis**: The `daemon.rs:246` predicate filters on |
| 64 | +`endpoint.conditions.ready == true`. Without a `readinessProbe`, kubelet keeps |
| 65 | +Pod Ready=False during model load, which keeps the EndpointSlice not-ready, |
| 66 | +which filters out the worker silently. |
| 67 | + |
| 68 | +**Evidence**: |
| 69 | +``` |
| 70 | +$ kubectl get endpointslice -o jsonpath='{.items[*].endpoints[*].conditions.ready}' |
| 71 | +false false false ← all workers |
| 72 | +``` |
| 73 | +Source: `lib/runtime/src/discovery/kube/utils.rs:22-51` and |
| 74 | +`daemon.rs:246`. |
| 75 | + |
| 76 | +**Fix**: Add explicit `readinessProbe` (`httpGet /health`, |
| 77 | +`failureThreshold: 60`, `initialDelaySeconds: 120`) to each worker's |
| 78 | +`mainContainer`. Match the operator's injected port (`system`, 9090) and |
| 79 | +handler type to avoid "may not specify more than 1 handler type" k8s rejection. |
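The probe values above imply a time window that model load has to fit into. As a rough sketch of the arithmetic used in the regression checklist at the end of this document (`initialDelaySeconds + periodSeconds * failureThreshold`), the snippet below computes that window. `periodSeconds=10` is an assumption here (the Kubernetes default when the field is unset); the other two values are the ones from the fix.

```python
def readiness_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Window (in seconds) covered by the probe settings, using the
    initialDelaySeconds + periodSeconds * failureThreshold formula."""
    return initial_delay + period * failure_threshold


# initial_delay and failure_threshold are the values from the fix above.
# period=10 is an ASSUMPTION: the Kubernetes default for periodSeconds.
budget = readiness_budget_seconds(initial_delay=120, period=10, failure_threshold=60)
print(budget)  # 720
```

If model load on a given instance type is expected to exceed this window, raising `failureThreshold` is probably the least invasive knob.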
| 80 | + |
| 81 | +**Outcome**: **T12a/b/c PASS.** Frontend reports `KubeDiscoveryClient::list |
| 82 | +returning 6 instances`. `/v1/models` returns the registered model. |
| 83 | + |
| 84 | +**Artifact**: skill `dynamo-kube-discovery-readiness`. Upstream docs PR |
| 85 | +`ai-dynamo/dynamo#9201` to fix the misleading `component_worker.go` comment. |
| 86 | + |
| 87 | +### H5 — NIXL falls back to UCX despite `NIXL_BACKEND=LIBFABRIC` (rev6 → rev7) |
| 88 | + |
| 89 | +**Hypothesis (initial, WRONG)**: Setting `NIXL_BACKEND=LIBFABRIC` and |
| 90 | +`VLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC` env vars should select the libfabric |
| 91 | +backend plugin. |
| 92 | + |
| 93 | +**What we observed**: Decode log prints `Backend UCX was instantiated` and |
| 94 | +subsequent `handshake_failed` on `add_remote_agent()` with EFA cross-node |
| 95 | +transfers. vLLM warning: `Unknown vLLM environment variable detected: |
| 96 | +VLLM_NIXL_KVCACHE_BACKEND`. |
| 97 | + |
| 98 | +**Actual root cause (validated via source read)**: |
| 99 | +`vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1022-1024`: |
| 100 | +```python |
| 101 | +self.nixl_backends = vllm_config.kv_transfer_config.get_from_extra_config( |
| 102 | + "backends", ["UCX"] |
| 103 | +) |
| 104 | +``` |
| 105 | +There is no env var read path. `NIXL_BACKEND`, `VLLM_NIXL_KVCACHE_BACKEND`, |
| 106 | +and any other backend-selection env var we had set since rev2 were silent |
| 107 | +no-ops for backend choice. The only way to select LIBFABRIC is via the |
| 108 | +`kv_connector_extra_config.backends` field inside the `--kv-transfer-config` JSON. |
| 109 | + |
| 110 | +**Fix**: |
| 111 | +```yaml |
| 112 | +args: |
| 113 | + - --kv-transfer-config |
| 114 | + - '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}' |
| 115 | +``` |
| 116 | + |
| 117 | +**Outcome**: Decode log: `Backend LIBFABRIC was instantiated`. No more |
| 118 | +`handshake_failed`. NIXL exchanges libfabric-native EP names. |
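To make the default-to-UCX behaviour concrete, here is a minimal sketch that builds the `--kv-transfer-config` value and mimics the lookup quoted above. `build_kv_transfer_config` and `resolve_backends` are hypothetical helpers written for this log, not vLLM functions; `resolve_backends` only imitates the "read `backends`, default to `["UCX"]`" behaviour of the quoted snippet.

```python
import json


def build_kv_transfer_config(backends):
    """Return the compact JSON string passed to --kv-transfer-config."""
    cfg = {
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {"backends": list(backends)},
    }
    return json.dumps(cfg, separators=(",", ":"))


def resolve_backends(kv_transfer_config_json):
    """Hypothetical stand-in for the quoted lookup: use `backends` from the
    extra config if present, otherwise fall back to ["UCX"]."""
    cfg = json.loads(kv_transfer_config_json)
    extra = cfg.get("kv_connector_extra_config") or {}
    return extra.get("backends", ["UCX"])


arg = build_kv_transfer_config(["LIBFABRIC"])
print(arg)
print(resolve_backends(arg))  # ['LIBFABRIC']
# Without the extra config, the default wins regardless of any env var.
print(resolve_backends('{"kv_connector":"NixlConnector","kv_role":"kv_both"}'))  # ['UCX']
```

Generating the JSON from a dict rather than hand-editing the string avoids the "missing or malformed" failure mode, where a typo silently falls back to UCX.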
| 119 | + |
| 120 | +**Upstream issues filed**: |
| 121 | +- `vllm-project/vllm#41814` — "NixlConnector hardcodes backends=\"UCX\" default; no env-var override path" |
| 122 | +- Cross-references `aws-samples/awsome-inference#72` and the earlier |
| 123 | + `ai-dynamo/dynamo#9200`. |
| 124 | + |
| 125 | +### H6 — `fold completions stream: invalid type: unit variant, expected newtype variant` (rev7 intermediate, WRONG interpretation) |
| 126 | + |
| 127 | +**Hypothesis (initial)**: Operator 1.0.1 vs runtime 1.1.0 version skew, or |
| 128 | +prefill is returning a full-response shape when it should return KV handles |
| 129 | +only. |
| 130 | + |
| 131 | +**What we observed**: |
| 132 | +``` |
| 133 | +Failed deserializing JSON to response |
| 134 | + err: invalid type: unit variant, expected newtype variant at line 1 column 55 |
| 135 | + json_str: {"data":{"data":{"token_ids":[],"finish_reason":"error",...}}} |
| 136 | +``` |
| 137 | +The Frontend routed the request to `component=prefill` (not `backend`/decode), |
| 138 | +and the worker returned a double-wrapped JSON with `finish_reason:"error"` |
| 139 | +and zero tokens. The Rust deserializer on the Frontend choked at column 55 |
| 140 | +on the `"error"` string. |
| 141 | + |
| 142 | +**Actual root cause**: Client-side. A request body with no explicit `stream` |
| 143 | +field (as sent by our `curl` test) takes a Dynamo code path that folds the |
| 144 | +streaming frames into a single completion. When the upstream shape doesn't |
| 145 | +match the `ChoiceFinishReason` Rust enum (which expects `{"error":<msg>}` |
| 146 | +newtype, not bare string `"error"`), the fold fails. A body with explicit |
| 147 | +`stream:true` or `stream:false` takes a different code path that parses correctly. |
| 148 | +
|
| 149 | +**Proof**: |
| 150 | +- `curl -d '...,"stream":false'` → HTTP 200 with real tokens. |
| 151 | +- `curl -d '...,"stream":true'` → HTTP 200 SSE tokens + `[DONE]`. |
| 152 | +- `/v1/chat/completions` → HTTP 200 with real assistant message. |
| 153 | +- All three hit `component=backend` (decode worker), not `component=prefill` |
| 154 | + as the failing case did. So the request *was* being routed correctly once |
| 155 | + the client sent a well-formed body. |
| 156 | +
|
| 157 | +**Side note**: rev7 prefill *did* also receive and complete its request |
| 158 | +(request_id `4e808b0b-...`) in 357 ms — the KV handles were produced |
| 159 | +correctly. The fold error was purely in how the Frontend consumed what |
| 160 | +*both* workers returned. The `{"data":{"data":...}}` envelope is the normal |
| 161 | +Dynamo push-handler wrapper; it was only misinterpreted because the |
| 162 | +implicit-stream fold path expects it with specific enum shapes. |
| 163 | +
|
| 164 | +**No code fix required for our rev7 PASS.** Client should send explicit |
| 165 | +`stream:false` or `stream:true`. |
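One way to keep clients off the implicit-fold path is to make `stream` a required argument when building the request body. The helper below is a sketch under that assumption. `build_completion_body` and the model name `example-model` are hypothetical, and only standard OpenAI-style completion fields are used.

```python
import json


def build_completion_body(model, prompt, stream, max_tokens=16):
    """Build a /v1/completions JSON body that always carries an explicit
    boolean `stream` field."""
    if not isinstance(stream, bool):
        raise TypeError("stream must be an explicit bool (True or False)")
    return json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens, "stream": stream}
    )


# "example-model" is a placeholder, not the model used in rev7.
body = build_completion_body("example-model", "Hello", stream=False)
print(body)
```

The resulting string can be passed to `curl -d` or any HTTP client.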
| 166 | + |
| 167 | +**Caveat (from second independent researcher pass)**: the underlying bug in |
| 168 | +Dynamo's worker Python code is real and still present. On the sad path where |
| 169 | +vLLM's `RequestOutput.outputs` is empty, `components/src/dynamo/vllm/handlers.py:2014` |
| 170 | +emits `"finish_reason":"error"` (bare string). The Rust `FinishReason` enum at |
| 171 | +`lib/llm/src/protocols/common.rs:44-56` is: |
| 172 | + |
| 173 | +```rust |
| 174 | +#[serde(rename = "error")] |
| 175 | +Error(String), // newtype — expects {"error":"<msg>"} or "error: <msg>" |
| 176 | +``` |
| 177 | + |
| 178 | +So `"error"` bare triggers the `invalid type: unit variant, expected newtype |
| 179 | +variant` deserialize error. A sister emission at `handlers.py:1705-1710` uses |
| 180 | +the correct form: `"finish_reason": "error: No outputs from vLLM engine"`. |
| 181 | + |
| 182 | +Our rev7 request stream never hits this path because it succeeds. But if a |
| 183 | +future dispatch returns empty outputs (e.g., NIXL hiccup, engine abort), the |
| 184 | +Frontend will again HTTP 500 instead of surfacing a real error. Track as |
| 185 | +Dynamo follow-up: |
| 186 | + |
| 187 | +- Patch `handlers.py:2014` to emit `"error: <reason>"` instead of bare |
| 188 | + `"error"`. |
| 189 | +- Or harden the Rust enum to accept `BareError` via `#[serde(untagged)]`. |
| 190 | + |
| 191 | +Filed as follow-up. Not required for T12 closure. |
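The first follow-up option can be sketched as a small formatting helper on the worker side. This is illustrative only: `error_finish_reason` is a hypothetical name, not a function in Dynamo's `handlers.py`, and it assumes (as stated above) that the `"error: <msg>"` string form is what the Frontend parses correctly.

```python
def error_finish_reason(reason):
    """Format a finish_reason string in the "error: <msg>" form instead of
    the bare "error" that triggers the deserialize failure."""
    reason = (reason or "").strip() or "unknown"
    return f"error: {reason}"


print(error_finish_reason("No outputs from vLLM engine"))
# error: No outputs from vLLM engine
print(error_finish_reason(""))
# error: unknown
```

Falling back to `unknown` keeps the emitted value well-formed even when no reason is available.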
| 192 | + |
| 193 | +## Knobs and settings that turned out to be no-ops |
| 194 | + |
| 195 | +These were set at various points between rev2 and rev7. Only the first two |
| 196 | +are true no-ops and can be removed without changing the rev7 PASS outcome. |
| 197 | +The rest are read by some subsystem and stay in the YAML, either defensively or because they are load-bearing: |
| 198 | + |
| 199 | +| Env / arg | Effect | Keep? | |
| 200 | +| --- | --- | --- | |
| 201 | +| `NIXL_BACKEND=LIBFABRIC` | No read path in vLLM or NIXL Python API | Remove (was the whole trap) | |
| 202 | +| `VLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC` | Not a recognized vLLM env (warning emitted) | Remove | |
| 203 | +| `NIXL_SKIP_TOPOLOGY_CHECK=1` | Read by the NIXL topology module; kept as a defensive setting | Keep |
| 204 | +| `NIXL_LIBFABRIC_MAX_RAILS=1` | Read by libfabric plugin at runtime — active | Keep | |
| 205 | +| `FI_PROVIDER=efa`, `FI_EFA_USE_DEVICE_RDMA=1` | Read by libfabric at provider init — active | Keep | |
| 206 | +| `FI_EFA_ENABLE_SHM=0`, `FI_EFA_ENABLE_SHM_TRANSFER=0` | Read by libfabric provider — active | Keep | |
| 207 | +| `DYN_NAMESPACE=...` | Dynamo 1.1.0 honors this for namespace override — active | Keep (already load-bearing) | |
| 208 | +| `DYN_NAMESPACE_WORKER_SUFFIX=""` | Dynamo 1.1.0 honors — strips the hash suffix | Keep | |
| 209 | +| `DYN_SYSTEM_PORT=9090` | Operator default; probe port alignment | Keep | |
| 210 | +| `NIXL_PLUGIN_DIR=/opt/dynamo/.../plugins` | NIXL reads this to locate `libplugin_LIBFABRIC.so` | Keep (load-bearing) | |
| 211 | + |
| 212 | +## What the rev6 readinessProbe + rev7 extra_config together unlock |
| 213 | + |
| 214 | +The full dependency chain that had to land: |
| 215 | + |
| 216 | +1. **Worker pod has `readinessProbe`** — otherwise EndpointSlice `ready=false` |
| 217 | +2. **EndpointSlice `ready=true`** — otherwise `daemon.rs:246` filters |
| 218 | +3. **KubeDiscoveryClient returns > 0 instances** — otherwise `/v1/models` is empty |
| 219 | +4. **`/v1/models` registers the model** — otherwise Frontend routes 404 |
| 220 | +5. **NIXL instantiates LIBFABRIC (not UCX)** — otherwise cross-node handshake fails |
| 221 | +6. **NIXL handshake succeeds** — otherwise KV transfer never starts |
| 222 | +7. **Client sends explicit `stream:true/false`** — otherwise the implicit-fold path blows up on enum deserialization |
| 223 | + |
| 224 | +All seven gates now pass. T12 is closed. |
| 225 | + |
| 226 | +## Skills authored from this work |
| 227 | + |
| 228 | +- `dgd-operator-namespace-suffix` — Layer 1-2 (namespace merge + DYN_NAMESPACE override) |
| 229 | +- `dynamo-kube-discovery-readiness` — Layer 3 (the readinessProbe fix that was the real gate) |
| 230 | +- `codebuild-dockerhub-429-recovery` — CodeBuild retry + public-ECR mirror pattern |
| 231 | +- New skill forthcoming: `nixl-libfabric-backend-selection` — document the `extra_config.backends` route and the no-op envs |
| 232 | + |
| 233 | +## Upstream issues / PRs opened |
| 234 | + |
| 235 | +- `ai-dynamo/dynamo#9200` — KubeDiscoveryClient diagnostic gap |
| 236 | +- `ai-dynamo/dynamo#9201` — docs PR: clarify `component_worker.go` ReadinessProbe comment |
| 237 | +- `vllm-project/vllm#41814` — NixlConnector hardcoded UCX default + suggested env var + docs |
| 238 | + |
| 239 | +## What to do if T12 regresses |
| 240 | + |
| 241 | +1. Check `kubectl get endpointslice -o jsonpath='{.items[*].endpoints[*].conditions.ready}'`; all values should be `true`. If any are `false`, either the probe is misconfigured or model load takes longer than `initialDelaySeconds + periodSeconds * failureThreshold`. |
| 242 | +2. Check decode log for `Backend LIBFABRIC was instantiated`. If it says UCX, the `kv_connector_extra_config` JSON is missing or malformed. |
| 243 | +3. Check client request has explicit `stream:true` or `stream:false`. If `curl` omits it, expect the `unit variant` deserialize error. |
| 244 | +4. For deeper failures, see the rev7 evidence bundle for a known-good snapshot. |
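The first three checks can be scripted against captured output. The sketch below takes plain strings (the `kubectl` jsonpath output, a decode log excerpt, and the client request body), so it makes no assumptions about cluster access. All three function names are hypothetical helpers for this log.

```python
import json


def all_endpoints_ready(jsonpath_output):
    """True only if the jsonpath output is non-empty and every value is 'true'."""
    values = jsonpath_output.split()
    return bool(values) and all(v == "true" for v in values)


def instantiated_backend(decode_log):
    """Return the backend name from the first 'Backend X was instantiated'
    line, or None if no such line is present."""
    prefix, suffix = "Backend ", " was instantiated"
    for line in decode_log.splitlines():
        start = line.find(prefix)
        end = line.find(suffix)
        if start != -1 and end > start:
            return line[start + len(prefix):end]
    return None


def has_explicit_stream(request_body):
    """True if the JSON request body carries a boolean `stream` field."""
    return isinstance(json.loads(request_body).get("stream"), bool)


print(all_endpoints_ready("false false false"))                      # False
print(instantiated_backend("INFO Backend UCX was instantiated"))     # UCX
print(has_explicit_stream('{"prompt": "hi", "stream": false}'))      # True
```

A result of `False`, `UCX`, or `False` respectively points at gates 1-2, 5, or 7 in the dependency chain above.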