79 commits
25b858c
Add Dynamo combined image (vLLM + TRT-LLM) with EFA/NIXL RDMA
dmvevents Mar 17, 2026
f9ce6af
Make Dynamo multi-GPU universal: verified instances table, instance-s…
dmvevents Mar 17, 2026
ae86406
Update Dockerfile to install curl before trying to use it
iankouls-aws Apr 9, 2026
aa84e5d
Modified Dockerfile.dynamo-combined-efa to fix failing build
iankouls-aws Apr 9, 2026
ee7cd4a
Install Intel MKL libraries in Dockerfile
iankouls-aws Apr 14, 2026
ca76f4c
Add symlinks for CUDA libraries in Dockerfile
iankouls-aws Apr 14, 2026
2c2acf2
Add symlink for cupti library in Dockerfile
iankouls-aws Apr 15, 2026
2e23142
Add symlink for libcusparseLt in Dockerfile
iankouls-aws Apr 15, 2026
9e3e6cc
Add symlink for nvshmem library in Dockerfile
iankouls-aws Apr 15, 2026
779a65a
Refactor CUDA library exposure for TRT-LLM
iankouls-aws Apr 15, 2026
fa55fed
Refactor symlink creation for NVIDIA libraries
iankouls-aws Apr 15, 2026
07edc6d
Move symlink setup for NVIDIA libraries as last RUN command
iankouls-aws Apr 15, 2026
54e36f7
Include CUDA math libraries in Dockerfile
iankouls-aws Apr 15, 2026
217352d
Refactor Dockerfile for CUDA and library compatibility
iankouls-aws Apr 16, 2026
bdba9bb
Add THIRD-PARTY-LICENSES
iankouls-aws Apr 17, 2026
346de59
Dockerfile.dynamo-combined-efa: clean overlay rewrite
antonaws Apr 18, 2026
660ee6b
Dockerfile.dynamo-combined-efa: self-contained build
antonaws Apr 18, 2026
816a624
k8s/dynamo-combined: switch to DynamoGraphDeployment CRD
antonaws Apr 18, 2026
0035dbb
dynamo-combined-efa: use sibling Dockerfile.efa as base
antonaws Apr 18, 2026
6e5c8b9
README: fix Dynamo install — anonymous helm fetch, correct chart vers…
antonaws Apr 20, 2026
0f461ff
dynamo-inference: sync from antonai-work/dynamo-workshop (verified SB…
dmvevents Apr 25, 2026
e2aac0d
sbom: derive per-backend SBOMs from combined image (Alex insight 2026…
dmvevents Apr 25, 2026
8ed1749
dynamo-inference: build fixes + SBOM + RDMA E2E evidence
dmvevents Apr 25, 2026
ad86680
dynamo-inference: add E2E RDMA validation evidence for awsi-efa-base:v1
dmvevents Apr 25, 2026
51aa22c
dynamo-inference: combined image v2→v8 iteration — TRT-LLM backend de…
dmvevents Apr 25, 2026
42fb905
sbom: awsi-dynamo-combined-efa:v8 (SPDX + CycloneDX + condensed licen…
dmvevents Apr 25, 2026
20c9282
dynamo-inference: CRITICAL — add aws-ofi-nccl + NCCL to combined image
dmvevents Apr 25, 2026
d4ab1e2
dockerfiles: drop public.ecr.aws/v9l4g5s4 default from NETWORKING_BASE
dmvevents Apr 28, 2026
8e54327
sbom: external trivy CVE scans (v0.69.3 fresh DB, ground truth)
dmvevents Apr 28, 2026
bd541b4
ci: add CodeBuild specs + fix build.sh for option-A (no ARG default)
dmvevents Apr 28, 2026
49e138c
dockerfiles: public-FROM-only — inline EFA + networking stack, drop A…
dmvevents Apr 28, 2026
6ccd514
ci: commercial-licenses.md — move out of gitignored docs/ + update to…
dmvevents Apr 29, 2026
0d73383
ci: fix buildspec CodeBuild parse error — delegate post_build to shel…
dmvevents Apr 29, 2026
4a57b38
ci: buildspec post_build — re-export env + use CODEBUILD_SRC_DIR
dmvevents Apr 29, 2026
81f61cc
build: single-image A100→B300 support + --base-image arg
dmvevents May 4, 2026
d35812d
tests: on-cluster smoke harness for dynamo-efa (10 gates)
dmvevents May 5, 2026
99afaa7
smoke: post-rename verification PASS for dynamo-efa:d35812db45d6
dmvevents May 5, 2026
5d31837
multinode: T11 NCCL AllReduce PASS + T12 Dynamo disagg PARTIAL
dmvevents May 5, 2026
ce21673
disagg: fix NIXL plugin discovery + Dynamo 0.16 kv-transfer-config mi…
dmvevents May 5, 2026
a1725d4
dockerfile: add nccl-tests build step in combined networking-builder
dmvevents May 5, 2026
ef04cd9
evidence: dynamo-efa:a1725d43e5c0 post-fix validation (rev2)
dmvevents May 5, 2026
b1f64c6
disagg: merge 3 DGDs → 1 DGD (upstream canonical) + rev3 evidence
dmvevents May 5, 2026
9592bdf
evidence: rev4 — DYN_NAMESPACE_WORKER_SUFFIX override attempt (T12 st…
dmvevents May 6, 2026
b2f1c78
dynamo: bump 1.0.1 → 1.1.0 (NGC runtime)
dmvevents May 6, 2026
9467d14
docs: OKRs 2026-05-06 — 3 objectives, 8 KRs
dmvevents May 6, 2026
7490773
rev5 — scaffold evidence dir for Dynamo 1.1.0 validation
dmvevents May 6, 2026
a47f11e
KR 2.3 — TRT-LLM-on-1.2.0 decision doc (DRAFT)
dmvevents May 6, 2026
da86404
rev5 — T12d watcher script
dmvevents May 6, 2026
f6a8394
OKRs — progress update (dyno session contributions)
dmvevents May 6, 2026
e932250
rev5 — fill-rev5-pass.sh auto-populator
dmvevents May 6, 2026
009954d
rev5 — fill-rev5-nogo.sh companion populator
dmvevents May 6, 2026
37eda25
rev5 — OPERATOR-FLOW.md end-to-end runbook
dmvevents May 6, 2026
3a421f2
rev5 — 1.1.0 validation: T1-T11+T12a/b/c PASS, T12d NO-GO (upstream #…
dmvevents May 6, 2026
be53d65
docs: REPRODUCIBILITY.md — external readers build from commit SHA
dmvevents May 6, 2026
23b0063
OKRs — post-T12d update: O1 closed, O3 dual-blocked
dmvevents May 6, 2026
53ad099
disagg: add readinessProbe to worker pods — fixes T12 KubeDiscoveryCl…
dmvevents May 6, 2026
81327dd
rev6 — T12 readinessProbe fix: KubeDiscoveryClient PASS, /v1/models r…
dmvevents May 6, 2026
4c43f0f
rev7 — T12 FULL PASS: /v1/completions returns real tokens over EFA + …
dmvevents May 6, 2026
19a2aed
build: restore -a flag for arch-targeted image, fat list remains default
dmvevents May 8, 2026
9190b8f
fix(efa): install and configure sshd for NCCL multinode tests
May 12, 2026
c76284b
test(efa): MPIJob for NCCL all_reduce on 2x g5.8xlarge (verified)
May 12, 2026
b86e5e0
test(efa): bump NCCL all_reduce iterations on g5 manifest (20-40s run…
May 12, 2026
474c7a2
test(efa): g5 NCCL manifest — hostNetwork + /sys infiniband + RDMA=0
May 12, 2026
d22f420
fix(efa): pin aws-ofi-nccl to v1.19.1 via upstream source build
May 13, 2026
47c2f2c
rev8: restore kv_connector_extra_config:LIBFABRIC + nodeSelector p5.4…
May 21, 2026
83d1384
docs(evidence): rev8 PR #72 E2E live trace + workshop UCX/NIXL findings
May 21, 2026
f028fdb
docs(evidence): rev8 backend coverage matrix — UCX vs LIBFABRIC acros…
May 21, 2026
66c9fe8
docs(evidence): nixlbench buildability finding — headers missing in i…
May 21, 2026
6b93ff1
rev8: fix nixlbench silent build failure + add CLAUDE.md for the subp…
May 21, 2026
1f4d500
rev8: fix nixlbench apt deps to match proven networking-base recipe
May 21, 2026
bb83ad3
rev8: include /opt/nixlbench in trtllm-stage + vllm-stage of combined…
May 21, 2026
33b2c52
rev8: add nixlbench build stage to Dockerfile.dynamo-combined-efa
May 21, 2026
e2a09b8
rev8: fix nixlbench ETCD runtime + runtime libs (round 4 verdict)
May 21, 2026
4dd67bd
rev8: write etcd-cpp-api.pc manually after etcd-cpp-apiv3 install
May 22, 2026
520cfc5
rev8: copy etcd-cpp-api libs forward + add cpprest/protobuf/grpc runt…
May 22, 2026
0e58e90
2.projects/dynamo-inference: add evidence/ — sanitized validation arc…
May 22, 2026
966619e
evidence/BUILD.md: pin both image digests in 2-image hierarchy
May 22, 2026
29cf1d4
evidence: fix recursive ECR_REGISTRY:-${ECR_REGISTRY} default
May 23, 2026
ce80c01
evidence: Cpass reproducibility verification — 3/3 PASS
May 23, 2026
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
# Per-run smoke evidence is captured in docs/evidence/<date>-<tag>/ when meaningful.
2.projects/dynamo-inference/tests/out/
30 changes: 29 additions & 1 deletion 2.projects/dynamo-inference/ATTRIBUTION.md
@@ -32,6 +32,30 @@ This project incorporates components from various open-source projects. We ackno
- **License:** NVIDIA CUDA Toolkit EULA
- **Description:** GPU computing platform and programming model.

### GDRCopy
- **Source:** https://github.com/NVIDIA/gdrcopy
- **License:** MIT License
- **Description:** Low-latency GPU memory copy library leveraging NVIDIA GPUDirect RDMA.

---

## Inference Backends (Combined Image)

### FlashInfer
- **Source:** https://github.com/flashinfer-ai/flashinfer
- **License:** Apache License 2.0
- **Description:** Optimized FlashAttention kernels for LLM inference.

### LMCache
- **Source:** https://github.com/LMCache/LMCache
- **License:** Apache License 2.0
- **Description:** KV-cache reuse library for LLM inference acceleration.

### FFmpeg
- **Source:** https://ffmpeg.org/
- **License:** LGPL 2.1+ (Apache-only codecs used in build)
- **Description:** Multimedia framework for audio/video processing in multimodal models.

---

## Communication Libraries
@@ -120,6 +144,10 @@ This project incorporates components from various open-source projects. We ackno
| vLLM | Apache-2.0 |
| PyTorch | BSD-3-Clause |
| Transformers | Apache-2.0 |
| GDRCopy | MIT |
| FlashInfer | Apache-2.0 |
| LMCache | Apache-2.0 |
| FFmpeg | LGPL-2.1+ |

---

@@ -139,4 +167,4 @@ For questions about licensing or attribution, please open an issue in the reposi

---

**Last Updated:** November 2025
**Last Updated:** March 2026
91 changes: 91 additions & 0 deletions 2.projects/dynamo-inference/CLAUDE.md
@@ -0,0 +1,91 @@
# Dynamo + NIXL + EFA — what works, what's broken, what to fix (2026-05-21 rev8)

## TL;DR

This subproject builds three artifacts:

| Artifact | What | Source of truth |
|---|---|---|
| **`efa:gpu`** (a.k.a. `public.ecr.aws/hpc-cloud/efa:gpu`) | EFA + UCX + NIXL + NCCL networking base | `Dockerfile.efa` |
| **`dynamo-efa:<sha>`** | Dynamo 1.1.0 + vLLM 0.19 + TRT-LLM, layered over `efa:gpu` | `Dockerfile.dynamo-combined-efa` |
| **DGD manifest** | DynamoGraphDeployment K8s YAML (Frontend + Prefill + Decode) | `k8s/dgd-dynamo-combined-vllm.yaml` |

**Critical:** the manifest's `--kv-transfer-config` MUST include `kv_connector_extra_config:{backends:["LIBFABRIC"]}`. Without it, NIXL falls back to UCX, which cannot complete the handshake on EFA's RDM endpoint type. This was discovered in rev7 (commit `4c43f0f`), but the fix was applied only via `kubectl edit` and was not committed back to the manifest until rev8 (commit `47c2f2c`).

## What works (verified live on H100 P5.48xlarge cross-node 2026-05-21)

| Layer | Test | Result |
|---|---|---|
| L1 | `fi_pingpong -p efa` cross-node | ✅ 213 MB/s |
| L1 | `mpijob-nccl-allreduce-g5.yaml` | ✅ (NCCL collective on g5) |
| L2 | `nixl_example LIBFABRIC` (NIXL official self-test) | ✅ |
| L2 | NIXL Python API cross-node (via Dynamo prefill→decode) | ✅ |
| L3 | `/v1/completions` (Llama-3.1-8B disagg) | ✅ HTTP 200 in 1.85s |
| L3 wire | `rdma_read_bytes` delta during one request | ✅ exactly 2 MiB across 4 NICs |

## Known failures (and the fix in each case)

### 1. Default `--kv-transfer-config` selects UCX → handshake fails on EFA

**Symptom:** Decode worker raises `nixl_cu12._bindings.nixlBackendError: NIXL_ERR_BACKEND` from `loadRemoteMD()`. Frontend returns HTTP 500 with `Failed to fold completions stream … invalid type: unit variant` (this is the secondary symptom — a known Dynamo 1.1.0 handlers.py bug that emits bare `finish_reason: "error"` and breaks the Rust enum deserializer).

**Fix (already applied, rev8 commit `47c2f2c`):** `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'`

**Why no env-var fix:** vLLM `nixl_connector.py:1022-1024` hardcodes the default to `["UCX"]` and ignores `NIXL_BACKEND` / `VLLM_NIXL_KVCACHE_BACKEND` (filed as `vllm-project/vllm#41814`). Only the JSON `kv_connector_extra_config.backends` path is read.
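
Because only the JSON path is read, a pre-apply check on the manifest can catch a missing backend entry before it reaches the cluster. The following is a minimal sketch, not part of this repo: `check_libfabric_backend` is a hypothetical helper, and it assumes the JSON appears in the manifest with plain (unescaped) double quotes, as in the fix above.

```bash
#!/usr/bin/env bash
# Sketch only: check_libfabric_backend is a hypothetical helper name.
# Assumption: the kv-transfer-config JSON is stored with unescaped double
# quotes, for example inside a single-quoted argument in the manifest.

check_libfabric_backend() {
  local manifest="$1"
  # Match "kv_connector_extra_config": {"backends": ["LIBFABRIC" ...
  # while tolerating optional whitespace between JSON tokens.
  local pattern='"kv_connector_extra_config"[[:space:]]*:[[:space:]]*\{[[:space:]]*"backends"[[:space:]]*:[[:space:]]*\[[[:space:]]*"LIBFABRIC"'
  if grep -Eq "$pattern" "$manifest"; then
    echo "OK: $manifest selects LIBFABRIC"
  else
    echo "FAIL: $manifest does not set kv_connector_extra_config.backends to LIBFABRIC" >&2
    return 1
  fi
}

# Example (run from the subproject root):
#   check_libfabric_backend k8s/dgd-dynamo-combined-vllm.yaml && \
#     kubectl apply -f k8s/dgd-dynamo-combined-vllm.yaml
```

A text match like this is deliberately crude: it only confirms that LIBFABRIC is listed first in `backends`, and it would need adjusting if the manifest escapes the quotes differently.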

### 2. PrefillWorker can land on a P4d (no native RDMA WRITE)

**Symptom:** Anti-affinity puts Prefill on a P4d.24xlarge node; `loadRemoteMD()` succeeds but cross-node KV transfer never produces RDMA bytes (P4d EFA has `max_qp_rd_atom=0`, cuco I250).

**Fix (already applied, rev8 commit `47c2f2c`):** `nodeSelector: node.kubernetes.io/instance-type: ml.p5.48xlarge` on both PrefillWorker and DecodeWorker. P5en works too — relax to a label-based selector if you want H200 support.

### 3. Workshop UCX page (`ucx_perftest`) fails on hostNetwork pods

**Symptom:** UCX picks the EC2 metadata link-local IP (`169.254.0.1`) for OOB rendezvous, gets `connect: Connection refused`, declares peer "unreachable".

**Fix (workshop docs need updating):** Set `UCX_NET_DEVICES=` to scope to actual EFA interfaces, or `UCX_TLS=rc,sm,self` to drop TCP entirely. Note: UCX is NOT the production transport on EFA — it's an educational comparison only. The page should make this explicit.

### 4. Workshop NIXL page defaults to UCX

**Symptom:** `/opt/nvidia/nvda_nixl/bin/nixl_example` (no args) selects UCX by default and fails the same way Dynamo did before rev7.

**Fix (workshop docs):** Always pass `LIBFABRIC` as the backend arg: `nixl_example LIBFABRIC`. PASS verified live.

### 5. `nixlbench` binary is silently missing from `efa:gpu`

**Symptom:** `Dockerfile.efa` has a `--- nixlbench ---` build stage at line ~191, but the resulting `/opt/nixlbench/bin/nixlbench` is NOT in the running image. The meson build fails silently because `pkg-config`, `libgflags-dev`, and `libetcd-cpp-api-dev` are missing at configure time.

**Fix (already applied, rev8 commit pending):** Added `apt-get install pkg-config libgflags-dev libetcd-cpp-api-dev` before the meson setup, plus a `test -x /opt/nixlbench/bin/nixlbench` post-build assertion so the build fails loudly instead of silently shipping a broken stage.
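
The post-build assertion amounts to refusing to continue when the expected binary is absent. A minimal sketch of that idea as a reusable shell function follows; `assert_executable` is a hypothetical name used for illustration, and the actual Dockerfile uses the plain `test -x` form quoted above.

```bash
#!/usr/bin/env bash
# Sketch only: assert_executable is a hypothetical helper, not the repo's code.
# It wraps `test -x` with a message so a missing binary is visible in the
# build log instead of surfacing later at runtime.

assert_executable() {
  local path="$1"
  if [ -x "$path" ]; then
    echo "OK: $path is present and executable"
  else
    echo "FAIL: $path is missing or not executable" >&2
    return 1
  fi
}

# Example, placed immediately after the install step of a build script:
#   assert_executable /opt/nixlbench/bin/nixlbench || exit 1
```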

### 6. `hf-token` Secret key name doesn't match what the Dynamo Frontend's HF SDK expects

**Symptom:** Frontend hits HTTP 401 fetching `USE_POLICY.md` from HuggingFace because the Secret only exposes `token` (lowercase) via `envFromSecret`, not the `HF_TOKEN`/`HUGGING_FACE_HUB_TOKEN` keys the SDK reads.

**Fix (cluster-side patch, no manifest mutation):**
```bash
TOK=$(kubectl get secret hf-token -n default -o jsonpath='{.data.token}')
kubectl patch secret hf-token -n default --type=json -p="[
{\"op\":\"add\",\"path\":\"/data/HF_TOKEN\",\"value\":\"$TOK\"},
{\"op\":\"add\",\"path\":\"/data/HUGGING_FACE_HUB_TOKEN\",\"value\":\"$TOK\"}
]"
kubectl delete pod -l nvidia.com/dynamo-component=Frontend -n default # force restart
```

## How to verify a deployment

After `kubectl apply -f k8s/dgd-dynamo-combined-vllm.yaml`:

1. `kubectl get dgd dynamo-combined-vllm` → READY=True (3/3 services)
2. `kubectl get pods -l nvidia.com/dynamo-graph-deployment-name=dynamo-combined-vllm -o wide` → Prefill + Decode on **different P5 nodes**, Frontend anywhere
3. Decode pod log contains `Backend LIBFABRIC was instantiated` (NOT `Backend UCX`)
4. `curl -X POST http://localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"Hello","max_tokens":10,"stream":false}'` → HTTP 200 with non-empty `choices[0].text`
5. Compare hw_counter `rdma_read_bytes` across `/sys/class/infiniband/rdmap*/ports/1/hw_counters/` before vs after the request → delta > 1 MiB on the prefill node

If any of these checks fails, see `docs/evidence/rev8-pr72-e2e-2026-05-21/SUMMARY.md` (full 2-round trace) or `docs/evidence/rev8-pr72-e2e-2026-05-21/TEAM-FINDINGS.md` (workshop UCX/NIXL fixes) for the corresponding fix.
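
Steps 3 and 5 of the checklist can be scripted. The sketch below is illustrative only: both function names are hypothetical, the log strings are the ones quoted in step 3, and the counter path is the one given in step 5. The counter helper takes the sysfs root as an optional argument so it can be exercised against a fake directory tree.

```bash
#!/usr/bin/env bash
# Sketch only: both helper names are hypothetical.

# Step 3: read a pod log on stdin; succeed only if LIBFABRIC was instantiated
# and no "Backend UCX" line appears.
decode_log_uses_libfabric() {
  local log
  log="$(cat)"
  printf '%s\n' "$log" | grep -q 'Backend LIBFABRIC was instantiated' || return 1
  ! printf '%s\n' "$log" | grep -q 'Backend UCX'
}

# Step 5: sum rdma_read_bytes over every rdmap* device under a sysfs root.
# Prints 0 when no counters are found.
sum_rdma_read_bytes() {
  local root="${1:-/sys/class/infiniband}"
  local total=0 f value
  for f in "$root"/rdmap*/ports/1/hw_counters/rdma_read_bytes; do
    [ -r "$f" ] || continue
    value="$(cat "$f")"
    total=$(( total + value ))
  done
  echo "$total"
}

# Example, run on the prefill node (the pod name is a placeholder):
#   kubectl logs <decode-pod> | decode_log_uses_libfabric && echo "backend OK"
#   before=$(sum_rdma_read_bytes)
#   ... send the /v1/completions request from step 4 ...
#   after=$(sum_rdma_read_bytes)
#   [ $(( after - before )) -gt $(( 1024 * 1024 )) ] && echo "RDMA traffic observed"
```

Summing across all `rdmap*` devices matters because the transfer is spread over several NICs; a single-device reading would understate the delta.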

## Hard rules for editing this subproject

- **Never default to UCX backend on EFA.** Every NIXL consumer (vLLM, nixl_example, nixlbench, workshop tests) must explicitly select `LIBFABRIC`. UCX cannot complete the handshake on EFA's RDM endpoint type.
- **Never apply a fix in-cluster only.** If you `kubectl edit` a manifest and it works, commit the same edit back to the source YAML in this directory. The rev7 silent-rollback (commit `4c43f0f` advertised the fix as code but only added evidence docs) cost the team a re-debug cycle.
- **Worker pods must have a `readinessProbe`** on `/health`, port 9090. Without it, EndpointSlice stays Ready=False during model load and KubeDiscoveryClient (Dynamo 1.1.0 daemon.rs:246) returns 0 instances. See rev6 commit `81327dd` for the canonical probe spec.
- **Cross-node KV transfer must be P5↔P5 (or P5en↔P5en).** P4d has `max_qp_rd_atom=0` and cannot do native RDMA WRITE/READ; `loadRemoteMD()` may even succeed but counters won't move.
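
The last rule can be checked on a node before scheduling workers there. The sketch below is an assumption-laden illustration: `rdma_read_capable` is a hypothetical helper, and it assumes the device attributes are printed by `ibv_devinfo -v` as lines of the form `max_qp_rd_atom: <n>`.

```bash
#!/usr/bin/env bash
# Sketch only: rdma_read_capable is a hypothetical helper.
# Assumption: stdin is `ibv_devinfo -v` output, with one
# "max_qp_rd_atom: <n>" line per device.

# Succeeds only if at least one max_qp_rd_atom line is present and none is 0.
rdma_read_capable() {
  awk '
    $1 == "max_qp_rd_atom:" { seen = 1; if ($2 + 0 == 0) bad = 1 }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}

# Example, run inside a pod or on the node itself:
#   ibv_devinfo -v | rdma_read_capable || echo "node unsuitable for cross-node KV transfer"
```

Treating a missing attribute as a failure is a deliberate choice here: a node with no visible RDMA devices is no more usable for KV transfer than one reporting zero.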