docs(gms): standalone (non-Kubernetes) shadow-engine failover guide and recipe by galletas1712 · Pull Request #11000 · ai-dynamo/dynamo

galletas1712 · 2026-06-26T21:24:29Z

Overview

Adds external-facing documentation and a runnable example for GPU Memory
Service (GMS) shadow-engine failover with plain inference-engine processes — no
Kubernetes. It gives users (and teams integrating the GMS wheel standalone,
e.g. the TensorRT-LLM team) a concrete, operator-free shape for the feature,
distilled from the existing tests/gpu_memory_service/test_shadow_failover.py
flow. Docs + example only — no library/runtime/packaging code changes.

Details

New/changed files under lib/gpu_memory_service/:

- docs/standalone-usage.md: a concise user/integration guide covering what GMS
  gives you, the single-node flow (with a mermaid sequence diagram), what an
  engine must implement to be a GMS shadow (GMS load, GMS-aware sleep/wake =
  unmap/remap, memory-accounting fixes, scratch KV, hold-until-promoted), and a
  "Multiple nodes: what's extra" section (with a topology diagram) scoped to
  only the GMS-specific deltas: whole-group shadows sharing resident weights,
  one leader flock, and whole-group failover. WideEP is folded in as a large
  instance of multi-node.
- examples/shadow_failover/run.sh: one bare-bones script run on every
  node (keyed on NODE_RANK): single-node by default, multi-node when
  NNODES>1. Starts a per-node GMS server (plus etcd/nats/frontend on the
  leader) and a primary + shadow engine in autonomous shadow mode sharing a
  flock. It does not kill anything; you trigger failover by killing the
  primary's recorded process group, and the kernel promotes the shadow.
- examples/shadow_failover/README.md: prerequisites, single- and
  multi-node run commands, and the failover trigger.
- README.md: a short "Documentation" pointer to the guide and recipe.
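The failover trigger described for run.sh (record each engine's process group, then kill the primary's group by hand) can be sketched in Python as follows. This is an illustration only, not the contents of run.sh: the helper names `launch_in_own_group` and `trigger_failover` are hypothetical, and a sleeping child process stands in for the primary engine.

```python
import os
import signal
import subprocess
import sys


def launch_in_own_group(argv):
    """Start argv as the leader of a new session, and so of a new process group."""
    proc = subprocess.Popen(argv, start_new_session=True)
    # The child called setsid() before exec, so its PGID equals its PID.
    return proc, proc.pid


def trigger_failover(primary_pgid):
    """SIGKILL every process in the primary's recorded process group."""
    os.killpg(primary_pgid, signal.SIGKILL)


if __name__ == "__main__":
    # Stand-in for the primary engine: a child that only sleeps.
    primary, pgid = launch_in_own_group(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(f"recorded primary PGID: {pgid}")
    trigger_failover(pgid)
    primary.wait()
    print(f"primary return code: {primary.returncode}")
```

Killing the whole group rather than a single PID matters when an engine forks workers: every process in the group receives the signal, so none is left behind holding resources.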

The framework-integration packaging change (moving the vLLM/SGLang integrations
behind extras so the published wheel stays lean) is intentionally not in this
PR and will follow separately.

Validation

- bash -n passes on run.sh; no echo/curl; SPDX headers on all new files;
  run.sh is executable.
- All relative doc links resolve; both mermaid diagrams parse (the earlier parse
  error from a ; statement-separator in a Note is fixed).
- A read-only reviewer pass over the diff (EXIT-trap safety, readiness gating,
  link/factual checks, vocabulary consistency) was addressed.
- Not executed end-to-end: running it needs a CUDA GPU + etcd/NATS (and
  multiple nodes for the multi-node path). The commands are reconstructed from
  the verified test_shadow_failover.py harness and the operator's multi-node
  launch flags; smoke-test on a GPU host before relying on it. shellcheck was
  not available in the authoring environment (bash -n only).
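The mermaid parse error mentioned above (a ; inside a Note line) is the kind of thing a small lint can catch before a docs build. The sketch below is a heuristic line scan, not a mermaid parser, and the function name `notes_with_semicolons` is hypothetical.

```python
def notes_with_semicolons(mermaid_src):
    """Return (line_number, text) for each Note line that contains ';'."""
    hits = []
    for number, line in enumerate(mermaid_src.splitlines(), start=1):
        stripped = line.strip()
        # Heuristic: sequence-diagram notes start with the keyword "Note".
        if stripped.lower().startswith("note ") and ";" in stripped:
            hits.append((number, stripped))
    return hits


if __name__ == "__main__":
    diagram = "\n".join(
        [
            "sequenceDiagram",
            "  Note over Shadow: parked; waiting for lock",
            "  Note over Shadow: parked, waiting for lock",
        ]
    )
    for number, text in notes_with_semicolons(diagram):
        print(f"line {number}: {text}")
```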

Where should the reviewer start?

- lib/gpu_memory_service/docs/standalone-usage.md: is the "GMS-specific vs.
  ordinary multi-node" framing right, and are the scratch-KV and failover
  explanations accurate?
- lib/gpu_memory_service/examples/shadow_failover/run.sh: the multi-node
  launch wiring (per-cohort master ports, headless non-leader ranks,
  leader-only etcd/nats/frontend) and the cleanup trap.

Related Issues

🚫 This PR is NOT linked to an issue:

Confirmed — no related issue

Summary
-------
Adds a self-contained, runnable recipe under lib/gpu_memory_service/examples/shadow_failover/ that demonstrates GPU Memory Service (GMS) shadow-engine failover for vLLM on a single node / single GPU WITHOUT the Dynamo Kubernetes operator. This gives users (and the TensorRT-LLM team evaluating the GMS wheel) a concrete, operator-free shape for the feature, distilled from the existing tests/gpu_memory_service/test_shadow_failover.py harness.

The recipe uses the simple manual control-endpoint orchestration flavor (POST /engine/control/{sleep,wake_up} + process-group SIGKILL of the primary), not the autonomous flock path used by the operator, so the mechanics are visible and easy to drive by hand. A primary engine loads weights once and publishes them into a per-GPU GMS server; a pre-initialized shadow imports the resident weights (no second disk load) and takes over when the primary is killed, without reloading model weights.

Files:
- README.md: user-flow walkthrough (steps + benefit), how-it-works, verification, cleanup, and scope/caveats (single-node only; TRT-LLM has GMS weight-sharing but not the failover lifecycle yet).
- run_demo.sh: one-shot orchestrator (10 steps) with a cleanup trap.
- start_infra.sh: start etcd + nats-server -js.
- start_gms.sh: start the production GMS server supervisor.
- run_engine.sh: shared primary/shadow launcher (RW writer vs RO importer).
- kill_primary.sh: process-group SIGKILL failure injection.
- verify.sh: assert takeover via a frontend completion.
- common.sh: shared env defaults + readiness helpers.

Validation
----------
- bash -n passes on all 7 scripts.
- shellcheck -x reports zero warnings.
- SPDX headers present on every file; all .sh files are executable.
- Not executed end to end (requires a CUDA GPU + etcd/NATS); commands are reconstructed from the verified test harness and should be smoke-tested on a single-GPU box before relying on them.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

Summary
-------
Adds lib/gpu_memory_service/docs/standalone-usage.md: a concise, steps-oriented user-flow doc for using GPU Memory Service WITHOUT the Dynamo Kubernetes operator, written for the "what do I do / what do I get" question raised when scoping the GMS wheel for standalone (e.g. TensorRT-LLM) use.

It covers:
- what GMS is and the benefit (skip weight reload; warm shadow takeover),
- the framework-agnostic user flow (start server, launch engine as a GMS client, add standby),
- single-node shadow failover (linking the runnable recipe),
- the autonomous-flock vs manual-control activation mechanisms, and
- an explicit "what is automatic vs. what needs a control plane" section for multi-node and WideEP (GMS gives fast per-rank weight re-materialization; the detect / serve-degraded / re-spawn / rejoin orchestration is the engine's/operator's responsibility, not GMS).

It also lists the wheel import surface external engines depend on, and notes the framework integration subpackages are Dynamo-runtime glue, not part of the standalone surface.

Also adds a small "Documentation" pointer in the GMS README linking the new guide and the runnable recipe so they are discoverable.

Validation
----------
- SPDX header present; markdown only, no code changes.
- All relative cross-links resolve (../README.md, ../examples/shadow_failover/README.md, ../../../docs/kubernetes/shadow-engine-failover.md).
- Claims spot-checked against the GMS source (server entrypoint, socket naming, exposed import surface, finalize_gms_write return type).

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
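The "relative cross-links resolve" check can be approximated locally with a few lines of Python. This is a minimal sketch under stated assumptions, not the lychee job that runs in CI: it only understands inline `[text](target)` links, and the function name `broken_relative_links` is hypothetical.

```python
import re
from pathlib import Path

# Inline markdown links of the form [text](target); reference-style links are ignored.
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Anything starting with a URL scheme, such as https: or mailto:.
SCHEME = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(md_path):
    """Return the relative link targets in md_path that do not exist on disk."""
    md_path = Path(md_path)
    broken = []
    for target in LINK.findall(md_path.read_text(encoding="utf-8")):
        if SCHEME.match(target) or target.startswith("#"):
            continue  # absolute URL or same-page anchor
        relative = target.split("#", 1)[0]  # drop any fragment
        if not (md_path.parent / relative).exists():
            broken.append(target)
    return broken
```

For example, `broken_relative_links("lib/gpu_memory_service/docs/standalone-usage.md")` would return an empty list when every relative target in the guide exists.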

Summary
-------
Addresses review feedback that the recipe scripts were too large/convoluted and that the writing should be more general: focused on the key assumptions/flows a user must provide, plus what the engine wrapper and the operator do so a reader can replicate shadow failover with just the GMS wheel.

Recipe (lib/gpu_memory_service/examples/shadow_failover/):
- Collapsed 7 scripts into a single minimal run.sh (~70 lines, no banner spam, minimal prints, one cleanup trap). Deleted common.sh, start_infra.sh, start_gms.sh, run_engine.sh, kill_primary.sh, verify.sh, run_demo.sh.
- Trimmed README.md to Overview / Prerequisites / Run it / What it shows, and a pointer to the conceptual doc for the orchestration explanation.

Docs (lib/gpu_memory_service/docs/standalone-usage.md):
- Added a general "Orchestration: who does what (and how to replicate it)" section split into three layers: (1) the GMS wheel primitives, (2) the engine wrapper in components/src/dynamo/vllm + integrations/vllm (GMS-aware sleep/wake = unmap/remap, memory accounting, scratch KV, ENGINE_ID RW/RO, flock-gated discovery registration), and (3) the Dynamo operator (DRA GPU sharing, shared lock/socket volume, failover env injection, RestartPolicy Never + cascade controller, multi-node per-rank GMS + pod-index rendezvous). Each layer states what you must reimplement to DIY; multi-node/WideEP folded in as "what's automatic vs. what needs a control plane".

Validation
----------
- bash -n run.sh passes; shellcheck clean. Not executed (needs GPU + infra).
- SPDX headers intact; relative cross-links resolve.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

Summary
-------
Rewrites lib/gpu_memory_service/docs/standalone-usage.md from the perspective of someone running plain inference-engine processes on one or a couple of nodes (launched via Slurm/ssh/systemd/supervisor, not Kubernetes). Drops the DRA/Grove/pod/operator-internals framing and instead answers "what do I have to do to support shadow-engine failover?".

Structure:
- The pieces involved: GMS (weights + lock), a standby-capable engine, and your orchestration.
- Single node: what you do (start GMS server; run a writer + RO shadows sharing one lock file; autonomous flock promotion vs manual), and what you must provide yourself (GPU headroom for co-residency, shared lock path, routing to the active engine, relaunching the dead one).
- What the engine itself must support (GMS load, GMS-aware sleep/wake = unmap/remap, memory-accounting fixes, scratch KV, hold-until-active) with a backend table; framed as the work to do when integrating a non-vLLM engine.
- A couple of nodes: run whole engine groups (active + standby) that pair up by slot via per-group rendezvous, promotion stays a node-local leader flock, and the four things your launcher must build (detect, cohort-atomic teardown, promote, relaunch). Notes the Kubernetes operator is just an implementation of those four steps. WideEP folded in as the same loop.
- Retains the wheel import-surface section.

Validation
----------
- Markdown only; SPDX header intact; all relative cross-links resolve (../README.md, ../examples/shadow_failover/README.md, ../../../docs/kubernetes/shadow-engine-failover.md).

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

Switch the single-node recipe to the simplest flow: two shadow-mode engines share a flock, kill the primary, the kernel hands the lock to the shadow. No echo signposting (comments only), no curl / control endpoints / frontend. Observe the takeover in the shadow log. README and the standalone doc updated to attribute the recipe to the autonomous-flock path (manual control kept as the alternative for an external controller).

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
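The "kernel hands the lock to the shadow" behaviour is plain flock(2) semantics and can be observed without GMS or a GPU. The sketch below is a stand-in, not the recipe: two small Python children play the primary and the shadow, and the lock file name and the `demo` helper are hypothetical.

```python
import os
import subprocess
import sys
import tempfile

# Child program: block on an exclusive flock, announce ownership, then idle.
HOLDER = r"""
import fcntl, sys, time
lock_file = open(sys.argv[1], "w")
fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks while another process holds it
print(sys.argv[2] + " holds the lock", flush=True)
time.sleep(60)
"""


def spawn(lock_path, name):
    return subprocess.Popen(
        [sys.executable, "-c", HOLDER, lock_path, name],
        stdout=subprocess.PIPE,
        text=True,
    )


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "failover.lock")
        primary = spawn(lock_path, "primary")
        first = primary.stdout.readline().strip()  # primary now owns the lock
        shadow = spawn(lock_path, "shadow")  # shadow blocks inside flock()
        primary.kill()  # SIGKILL: the kernel closes its descriptors, releasing the lock
        primary.wait()
        second = shadow.stdout.readline().strip()  # unblocks once the lock is free
        shadow.kill()
        shadow.wait()
        return [first, second]


if __name__ == "__main__":
    print(demo())
```

No release call is ever made by the primary: the lock goes away with the process, which is why a SIGKILL is enough to promote the waiter.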

The 'What the GMS wheel must expose' section is internal packaging/ops guidance, not user/integration content. Remove it from the external-facing standalone usage guide and fix the README pointer.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

- run.sh: replace the WARMUP fixed-sleep tunable with a real readiness gate: wait until the shadow logs that it is parked on the flock ("waiting for lock") before killing the primary. Remove the WARMUP override from the README.
- docs: rename "A couple of nodes" -> "Multiple nodes" and fold the WideEP content into it (WideEP is just a large multi-node deployment), leaving two failover sections: single-node and multi-node.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
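A readiness gate of this shape (poll the shadow's log for the "waiting for lock" marker instead of sleeping for a fixed time) might look like the following in Python. It is a sketch, not the run.sh implementation: the function name, the `is_alive` callback, and the timeout are assumptions added for illustration.

```python
import time


def wait_for_marker(log_path, marker, is_alive, timeout=60.0, poll=0.2):
    """Return once marker appears in the log; raise if the process dies or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with open(log_path, "r", errors="replace") as log:
                if marker in log.read():
                    return
        except FileNotFoundError:
            pass  # the log file has not been created yet
        if not is_alive():
            # Abort instead of spinning forever on a process that will never log.
            raise RuntimeError(f"process exited before logging {marker!r}")
        time.sleep(poll)
    raise TimeoutError(f"{marker!r} not seen in {log_path} within {timeout}s")
```

With a `subprocess.Popen` handle named `shadow`, a call could read `wait_for_marker("shadow.log", "waiting for lock", lambda: shadow.poll() is None)`.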

- run.sh: EXIT trap no longer defaults an unset PGID to 0 (which would `kill -KILL -0` the caller's own process group when the script fails before the engines launch); guard each group-kill on a set PGID.
- run.sh: the "waiting for lock" readiness gate now aborts if the shadow process dies before parking, instead of spinning forever.
- docs: use consistent primary/shadow framing in the single-node recap; drop the exact TRT-LLM source-identity filename (external, unverifiable here) in favor of a general phrasing.
- recipe README: reflect that the gate waits on the shadow being parked, not a timer.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
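The EXIT-trap fix comes down to never signalling a process group that was not actually recorded. A Python rendering of that guard, offered as a sketch rather than a translation of run.sh, is shown below; the function name and the injectable `killpg` parameter (used only to make the guard testable) are hypothetical.

```python
import os
import signal


def kill_groups(pgids, killpg=os.killpg):
    """SIGKILL each recorded process group, skipping unset or unsafe values."""
    killed = []
    for pgid in pgids:
        # None means the engine never launched. 0 would address the caller's
        # own process group, and 1 belongs to init, so neither is signalled.
        if pgid is None or pgid <= 1:
            continue
        try:
            killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            continue  # the group is already gone
        killed.append(pgid)
    return killed
```

Registering it with `atexit.register(lambda: kill_groups([primary_pgid, shadow_pgid]))` would give cleanup that stays harmless when the script fails before either engine has started.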

- Cut ~45% (8.5 -> ~5 min): fold "pieces involved" into the intro, trim prose.
- Restructure multi-node as "what's extra" on top of single-node (3 changes: standby = whole group, per-group rendezvous, one leader flock + whole-group failover). Drop the failure-detection / "four steps to build" / Slurm control-plane framing: the user triggers the failover (kills the primary).
- Fix the scratch-KV explanation: placeholder un-backed KV lets a (re)initializing shadow capture CUDA graphs without consuming real GPU memory for KV; on promotion the real GMS-backed KV is swapped in at the same VAs so the graphs stay valid.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

- Drop the per-group rendezvous bullet: a shared master addr/port + --node-rank is ordinary multi-node deployment, not specific to GMS shadow failover. The multi-node section now lists only what's special to GMS: standby = whole group sharing resident weights (no extra weight memory), promotion via one leader flock, and whole-group failover.
- Add a sequence diagram for the single-node failover flow and a topology diagram for the multi-node active/standby layout.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

Replace active/standby with primary/shadow throughout the standalone guide, including the prose, the engine-requirements list, the table header, and both mermaid diagram labels.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

…node

- docs: a ';' inside a mermaid Note line broke the sequence diagram (mermaid treats ';' as a statement separator); rephrase it. Remove the em-dashes from the prose.
- recipe: reframe single-GPU -> single-node (any GPU count; add a TP knob) and make run.sh multi-node-capable from the SAME script, keyed on NODE_RANK (leader runs etcd/nats/frontend; per-cohort master ports; non-leader ranks run headless). The script no longer kills anything: it records each engine's PGID and the user triggers failover by killing the primary, after which the flock promotes the shadow automatically.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
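The "same script on every node, keyed on NODE_RANK" layout can be modelled as a pure function from rank to the set of things a node starts. The sketch below is an assumption-heavy illustration, not run.sh: it assumes rank 0 is the leader, uses made-up port numbers, and the name `plan_node` is hypothetical. Only NODE_RANK, NNODES, and the leader-only etcd/nats/frontend split come from the commit message.

```python
import os


def plan_node(node_rank, nnodes, base_master_port=29500):
    """Describe what one node would start. Port numbers are illustrative only."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("NODE_RANK must be in [0, NNODES)")
    leader = node_rank == 0  # assumption: rank 0 acts as the leader
    services = ["gms-server"]  # every node runs its own GMS server
    if leader:
        services += ["etcd", "nats", "frontend"]  # leader-only infrastructure
    engines = {}
    for offset, cohort in enumerate(("primary", "shadow")):
        engines[cohort] = {
            # One rendezvous port per cohort so primary and shadow groups do not collide.
            "master_port": base_master_port + offset,
            "headless": nnodes > 1 and not leader,
        }
    return {"multi_node": nnodes > 1, "services": services, "engines": engines}


if __name__ == "__main__":
    rank = int(os.environ.get("NODE_RANK", "0"))
    count = int(os.environ.get("NNODES", "1"))
    print(plan_node(rank, count))
```

Keeping the decision in one function means the single-node case is just `plan_node(0, 1)`, with no separate code path to maintain.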

datadog-official · 2026-06-26T21:24:58Z

⚠️ Warnings

🚦 3 Pipeline jobs failed

Docs link check | lychee

PR | dynamo-runtime / rust-gpu

PR | dynamo-status-check

This comment will be updated automatically if new data arrives.

Commit SHA: 811972d

github-actions · 2026-06-26T21:27:20Z

🌿 Fern Docs Preview: https://nvidia-preview-91787112-8cc6-4f5d-b141-07b4e409e8df.docs.buildwithfern.com/dynamo/dev

…per)

The "What the engine must support" summary wrongly credited dynamo.vllm with implementing the whole list. Split it correctly: the vLLM patches for GMS integration (gpu_memory_service.integrations.vllm) provide the weight load, GMS-aware sleep/wake, memory-accounting fixes, and scratch KV; the Dynamo vLLM wrapper (components/src/dynamo/vllm) drives the hold-until-promoted activation (flock gate + deferred discovery registration) and wires them together. Adjust the vLLM row in the backend table to match.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

galletas1712 added 13 commits June 26, 2026 10:56

docs(gms): drop GPU-headroom/MPS note from single-node section

dcbcb99

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

pull-request-size Bot added the size/L label Jun 26, 2026

galletas1712 temporarily deployed to external_collaborator June 26, 2026 21:24 — with GitHub Actions Inactive

github-actions Bot added docs documentation Improvements or additions to documentation labels Jun 26, 2026

galletas1712 temporarily deployed to external_collaborator June 26, 2026 21:30 — with GitHub Actions Inactive

galletas1712 wants to merge 14 commits into main from schwinns/gms-vllm-shadow-failover-recipe
