docs(gms): standalone (non-Kubernetes) shadow-engine failover guide and recipe #11000
Draft
galletas1712 wants to merge 14 commits into
Summary
-------
Adds a self-contained, runnable recipe under
lib/gpu_memory_service/examples/shadow_failover/ that demonstrates GPU Memory
Service (GMS) shadow-engine failover for vLLM on a single node / single GPU
WITHOUT the Dynamo Kubernetes operator. This gives users (and the TensorRT-LLM
team evaluating the GMS wheel) a concrete, operator-free shape for the feature,
distilled from the existing tests/gpu_memory_service/test_shadow_failover.py
harness.
The recipe uses the simple manual control-endpoint orchestration flavor
(POST /engine/control/{sleep,wake_up} + process-group SIGKILL of the primary),
not the autonomous flock path used by the operator, so the mechanics are
visible and easy to drive by hand. A primary engine loads weights once and
publishes them into a per-GPU GMS server; a pre-initialized shadow imports the
resident weights (no second disk load) and takes over when the primary is
killed, without reloading model weights.
Files:
- README.md: user-flow walkthrough (steps + benefit), how-it-works,
  verification, cleanup, and scope/caveats (single-node only;
  TRT-LLM has GMS weight-sharing but not the failover
  lifecycle yet).
- run_demo.sh: one-shot orchestrator (10 steps) with a cleanup trap.
- start_infra.sh: start etcd + nats-server -js.
- start_gms.sh: start the production GMS server supervisor.
- run_engine.sh: shared primary/shadow launcher (RW writer vs RO importer).
- kill_primary.sh: process-group SIGKILL failure injection.
- verify.sh: assert takeover via a frontend completion.
- common.sh: shared env defaults + readiness helpers.
Validation
----------
- bash -n passes on all 7 scripts.
- shellcheck -x reports zero warnings.
- SPDX headers present on every file; all .sh files are executable.
- Not executed end to end (requires a CUDA GPU + etcd/NATS); commands are
reconstructed from the verified test harness and should be smoke-tested on a
single-GPU box before relying on them.
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Summary
-------
Adds lib/gpu_memory_service/docs/standalone-usage.md: a concise, steps-oriented user-flow doc for using GPU Memory Service WITHOUT the Dynamo Kubernetes operator, written for the "what do I do / what do I get" question raised when scoping the GMS wheel for standalone (e.g. TensorRT-LLM) use.
It covers: what GMS is and the benefit (skip weight reload; warm shadow takeover), the framework-agnostic user flow (start server, launch engine as a GMS client, add standby), single-node shadow failover (linking the runnable recipe), the autonomous-flock vs manual-control activation mechanisms, and an explicit "what is automatic vs. what needs a control plane" section for multi-node and WideEP (GMS gives fast per-rank weight re-materialization; the detect / serve-degraded / re-spawn / rejoin orchestration is the engine's/operator's responsibility, not GMS).
It also lists the wheel import surface external engines depend on, and notes the framework integration subpackages are Dynamo-runtime glue, not part of the standalone surface.
Also adds a small "Documentation" pointer in the GMS README linking the new guide and the runnable recipe so they are discoverable.
Validation
----------
- SPDX header present; markdown only, no code changes.
- All relative cross-links resolve (../README.md, ../examples/shadow_failover/README.md, ../../../docs/kubernetes/shadow-engine-failover.md).
- Claims spot-checked against the GMS source (server entrypoint, socket naming, exposed import surface, finalize_gms_write return type).
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Summary
-------
Addresses review feedback that the recipe scripts were too large/convoluted and that the writing should be more general: focused on the key assumptions/flows a user must provide, plus what the engine wrapper and the operator do so a reader can replicate shadow failover with just the GMS wheel.
Recipe (lib/gpu_memory_service/examples/shadow_failover/):
- Collapsed 7 scripts into a single minimal run.sh (~70 lines, no banner spam, minimal prints, one cleanup trap). Deleted common.sh, start_infra.sh, start_gms.sh, run_engine.sh, kill_primary.sh, verify.sh, run_demo.sh.
- Trimmed README.md to Overview / Prerequisites / Run it / What it shows, and a pointer to the conceptual doc for the orchestration explanation.
Docs (lib/gpu_memory_service/docs/standalone-usage.md):
- Added a general "Orchestration: who does what (and how to replicate it)" section split into three layers: (1) the GMS wheel primitives, (2) the engine wrapper in components/src/dynamo/vllm + integrations/vllm (GMS-aware sleep/wake = unmap/remap, memory accounting, scratch KV, ENGINE_ID RW/RO, flock-gated discovery registration), and (3) the Dynamo operator (DRA GPU sharing, shared lock/socket volume, failover env injection, RestartPolicy Never + cascade controller, multi-node per-rank GMS + pod-index rendezvous). Each layer states what you must reimplement to DIY; multi-node/WideEP folded in as "what's automatic vs. what needs a control plane".
Validation
----------
- bash -n run.sh passes; shellcheck clean. Not executed (needs GPU + infra).
- SPDX headers intact; relative cross-links resolve.
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Summary
-------
Rewrites lib/gpu_memory_service/docs/standalone-usage.md from the perspective of someone running plain inference-engine processes on one or a couple of nodes (launched via Slurm/ssh/systemd/supervisor, not Kubernetes). Drops the DRA/Grove/pod/operator-internals framing and instead answers "what do I have to do to support shadow-engine failover?".
Structure:
- The pieces involved: GMS (weights + lock), a standby-capable engine, and your orchestration.
- Single node: what you do (start GMS server; run a writer + RO shadows sharing one lock file; autonomous flock promotion vs manual), and what you must provide yourself (GPU headroom for co-residency, shared lock path, routing to the active engine, relaunching the dead one).
- What the engine itself must support (GMS load, GMS-aware sleep/wake = unmap/remap, memory-accounting fixes, scratch KV, hold-until-active) with a backend table; framed as the work to do when integrating a non-vLLM engine.
- A couple of nodes: run whole engine groups (active + standby) that pair up by slot via per-group rendezvous, promotion stays a node-local leader flock, and the four things your launcher must build (detect, cohort-atomic teardown, promote, relaunch). Notes the Kubernetes operator is just an implementation of those four steps. WideEP folded in as the same loop.
- Retains the wheel import-surface section.
Validation
----------
- Markdown only; SPDX header intact; all relative cross-links resolve (../README.md, ../examples/shadow_failover/README.md, ../../../docs/kubernetes/shadow-engine-failover.md).
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Switch the single-node recipe to the simplest flow: two shadow-mode engines share a flock, kill the primary, the kernel hands the lock to the shadow. No echo signposting (comments only), no curl / control endpoints / frontend. Observe the takeover in the shadow log. README and the standalone doc updated to attribute the recipe to the autonomous-flock path (manual control kept as the alternative for an external controller). Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
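For readers who want to see the flock handoff in isolation, the following is a minimal sketch and not the recipe's run.sh. It assumes Linux with the util-linux `flock` command; `run_engine`, its arguments, and the log messages are hypothetical stand-ins, and `sleep` takes the place of a real engine.

```shell
# run_engine LOCKFILE LABEL LOG
# Hypothetical stand-in for an engine started in shadow mode. Launch it in
# the background (run_engine ... &) so that $! is the process to kill.
run_engine() {
  exec 9>>"$1"                          # open the shared lock file on fd 9
  echo "$2: waiting for lock" >>"$3"
  flock -x 9                            # blocks while another process holds the lock
  echo "$2: lock acquired, serving" >>"$3"
  exec sleep 300                        # stand-in for serving; fd 9 stays open
}

# Usage (assumed paths):
#   run_engine /tmp/demo.lock primary /tmp/demo.log & primary_pid=$!
#   run_engine /tmp/demo.lock shadow  /tmp/demo.log & shadow_pid=$!
#   kill -KILL "$primary_pid"    # the kernel drops the lock; the shadow acquires it
```

The lock belongs to the open file description, so it is released as soon as the last process holding fd 9 exits, however it dies. That is why no explicit handoff code is needed on the primary's side.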
The 'What the GMS wheel must expose' section is internal packaging/ops guidance, not user/integration content. Remove it from the external-facing standalone usage guide and fix the README pointer. Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
- run.sh: replace the WARMUP fixed-sleep tunable with a real readiness gate —
wait until the shadow logs that it is parked on the flock ("waiting for lock")
before killing the primary. Remove the WARMUP override from the README.
- docs: rename "A couple of nodes" -> "Multiple nodes" and fold the WideEP
content into it (WideEP is just a large multi-node deployment), leaving two
failover sections: single-node and multi-node.
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
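A readiness gate of this kind can be sketched as below. This is an illustration under assumptions, not the code in run.sh: the function name, its arguments, and the timeout are hypothetical, and only the "waiting for lock" log text comes from the commit message. The liveness check keeps the loop from spinning forever if the shadow exits before it parks.

```shell
# wait_for_parked LOGFILE PID [TIMEOUT_SECONDS]
# Returns 0 once LOGFILE contains "waiting for lock".
# Returns 1 if PID is no longer running, or if the (assumed) timeout elapses.
wait_for_parked() {
  wfp_log="$1"
  wfp_pid="$2"
  wfp_left="${3:-600}"
  while [ "$wfp_left" -gt 0 ]; do
    if grep -q "waiting for lock" "$wfp_log" 2>/dev/null; then
      return 0                          # shadow is parked on the flock
    fi
    if ! kill -0 "$wfp_pid" 2>/dev/null; then
      return 1                          # shadow died before parking
    fi
    sleep 1
    wfp_left=$((wfp_left - 1))
  done
  return 1                              # gave up waiting
}

# Usage (assumed names): wait_for_parked shadow.log "$SHADOW_PID" || exit 1
```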
- run.sh: EXIT trap no longer defaults an unset PGID to 0 (which would `kill -KILL -0` the caller's own process group when the script fails before the engines launch); guard each group-kill on a set PGID.
- run.sh: the "waiting for lock" readiness gate now aborts if the shadow process dies before parking, instead of spinning forever.
- docs: use consistent primary/shadow framing in the single-node recap; drop the exact TRT-LLM source-identity filename (external, unverifiable here) in favor of a general phrasing.
- recipe README: reflect that the gate waits on the shadow being parked, not a timer.
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
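The PGID guard described in the first bullet might look like the sketch below. The variable and function names are hypothetical and the actual run.sh may be structured differently; the point shown is that an empty or zero PGID is skipped rather than passed to `kill`.

```shell
# Hypothetical variable names; they stay empty until the engines are launched.
PRIMARY_PGID=""
SHADOW_PGID=""

# kill_group PGID
# SIGKILL a whole process group, but only when PGID is an integer greater
# than 1. An empty or zero value is skipped, because `kill -KILL -0` would
# signal the caller's own process group.
kill_group() {
  case "${1:-}" in
    ''|*[!0-9]*) return 0 ;;            # unset or non-numeric: nothing to kill
  esac
  [ "$1" -gt 1 ] || return 0            # refuse 0 (own group) and 1 (init)
  kill -KILL -- "-$1" 2>/dev/null || true
}

cleanup() {
  kill_group "$PRIMARY_PGID"
  kill_group "$SHADOW_PGID"
}
trap cleanup EXIT
```

With this shape, a failure before the engines launch leaves both variables empty and the EXIT trap does nothing.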
- Cut ~45% (8.5 -> ~5 min): fold "pieces involved" into the intro, trim prose.
- Restructure multi-node as "what's extra" on top of single-node (3 changes: standby = whole group, per-group rendezvous, one leader flock + whole-group failover). Drop the failure-detection / "four steps to build" / Slurm control-plane framing: the user triggers the failover (kills the primary).
- Fix the scratch-KV explanation: placeholder un-backed KV lets a (re)initializing shadow capture CUDA graphs without consuming real GPU memory for KV; on promotion the real GMS-backed KV is swapped in at the same VAs so the graphs stay valid.
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
- Drop the per-group rendezvous bullet: a shared master addr/port + --node-rank is ordinary multi-node deployment, not specific to GMS shadow failover. The multi-node section now lists only what's special to GMS: standby = whole group sharing resident weights (no extra weight memory), promotion via one leader flock, and whole-group failover.
- Add a sequence diagram for the single-node failover flow and a topology diagram for the multi-node active/standby layout.
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Replace active/standby with primary/shadow throughout the standalone guide, including the prose, the engine-requirements list, the table header, and both mermaid diagram labels. Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
…node
- docs: a ';' inside a mermaid Note line broke the sequence diagram (mermaid treats ';' as a statement separator); rephrase it. Remove the em-dashes from the prose.
- recipe: reframe single-GPU -> single-node (any GPU count; add a TP knob) and make run.sh multi-node-capable from the SAME script, keyed on NODE_RANK (leader runs etcd/nats/frontend; per-cohort master ports; non-leader ranks run headless). The script no longer kills anything: it records each engine's PGID and the user triggers failover by killing the primary, after which the flock promotes the shadow automatically.
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
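To make the NODE_RANK keying concrete, here is a rough sketch of how one script could pick its role per node. It is not taken from run.sh: the function names, the service labels, the `BASE_MASTER_PORT` variable, its 29500 default, and the "base + cohort index" port scheme are all assumptions for illustration.

```shell
# Hypothetical defaults; the real run.sh may use different names and values.
NODE_RANK="${NODE_RANK:-0}"
NNODES="${NNODES:-1}"
BASE_MASTER_PORT="${BASE_MASTER_PORT:-29500}"

# is_leader: true only on rank 0, which also hosts etcd, NATS, and the frontend.
is_leader() { [ "$NODE_RANK" -eq 0 ]; }

# node_services: print what this rank would start, one service per line.
node_services() {
  if is_leader; then
    printf '%s\n' etcd nats-server frontend gms-server primary-engine shadow-engine
  else
    # Non-leader ranks run the same engines headless, with no infra services.
    printf '%s\n' gms-server primary-engine-headless shadow-engine-headless
  fi
}

# cohort_master_port INDEX
# Assumed scheme: each cohort (0 = primary group, 1 = shadow group) gets its
# own rendezvous port so the two groups never collide.
cohort_master_port() { echo $((BASE_MASTER_PORT + $1)); }
```

Every node runs a GMS server in this sketch, which matches the per-node GMS server described for the recipe; only the infra services are leader-only.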
…per)
The "What the engine must support" summary wrongly credited dynamo.vllm with implementing the whole list. Split it correctly: the vLLM patches for GMS integration (gpu_memory_service.integrations.vllm) provide the weight load, GMS-aware sleep/wake, memory-accounting fixes, and scratch KV; the Dynamo vLLM wrapper (components/src/dynamo/vllm) drives the hold-until-promoted activation (flock gate + deferred discovery registration) and wires them together. Adjust the vLLM row in the backend table to match.
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Overview
Adds external-facing documentation and a runnable example for GPU Memory
Service (GMS) shadow-engine failover with plain inference-engine processes — no
Kubernetes. It gives users (and teams integrating the GMS wheel standalone,
e.g. the TensorRT-LLM team) a concrete, operator-free shape for the feature,
distilled from the existing
tests/gpu_memory_service/test_shadow_failover.py flow. Docs + example only: no library/runtime/packaging code changes.
Details
New/changed files under lib/gpu_memory_service/:
- docs/standalone-usage.md: a concise user/integration guide covering what GMS
  gives you, the single-node flow (with a mermaid sequence diagram), what an
  engine must implement to be a GMS shadow (GMS load, GMS-aware sleep/wake =
  unmap/remap, memory-accounting fixes, scratch KV, hold-until-promoted), and a
  "Multiple nodes: what's extra" section (with a topology diagram) scoped to
  only the GMS-specific deltas: whole-group shadows sharing resident weights,
  one leader flock, and whole-group failover. WideEP is folded in as a large
  instance of multi-node.
- examples/shadow_failover/run.sh: one bare-bones script run on every node
  (keyed on NODE_RANK): single-node by default, multi-node when NNODES>1.
  Starts a per-node GMS server (plus etcd/nats/frontend on the leader) and a
  primary + shadow engine in autonomous shadow mode sharing a flock. It does
  not kill anything; you trigger failover by killing the primary's recorded
  process group and the kernel promotes the shadow.
- examples/shadow_failover/README.md: prerequisites, single- and multi-node
  run commands, and the failover trigger.
- README.md: a short "Documentation" pointer to the guide and recipe.
The framework-integration packaging change (moving the vLLM/SGLang integrations
behind extras so the published wheel stays lean) is intentionally not in this
PR and will follow separately.
Validation
- bash -n passes on run.sh; no echo/curl; SPDX headers on all new files; run.sh is executable.
- … error from a ; statement-separator in a Note is fixed).
- … link/factual checks, vocabulary consistency) was addressed.
- … multiple nodes for the multi-node path). The commands are reconstructed from
  the verified test_shadow_failover.py harness and the operator's multi-node
  launch flags; smoke-test on a GPU host before relying on it. shellcheck was
  not available in the authoring environment (bash -n only).
Where should the reviewer start?
- lib/gpu_memory_service/docs/standalone-usage.md: is the "GMS-specific vs.
  ordinary multi-node" framing right, and are the scratch-KV and failover
  explanations accurate?
- lib/gpu_memory_service/examples/shadow_failover/run.sh: the multi-node
  launch wiring (per-cohort master ports, headless non-leader ranks,
  leader-only etcd/nats/frontend) and the cleanup trap.
Related Issues
🚫 This PR is NOT linked to an issue: