
docs(gms): standalone (non-Kubernetes) shadow-engine failover guide and recipe #11000

Draft
galletas1712 wants to merge 14 commits into main from schwinns/gms-vllm-shadow-failover-recipe

Conversation

@galletas1712

Contributor

Overview

Adds external-facing documentation and a runnable example for GPU Memory
Service (GMS) shadow-engine failover with plain inference-engine processes, no
Kubernetes. It gives users (and teams integrating the GMS wheel standalone,
e.g. the TensorRT-LLM team) a concrete, operator-free shape for the feature,
distilled from the existing tests/gpu_memory_service/test_shadow_failover.py
flow. Docs and example only: no library/runtime/packaging code changes.

Details

New/changed files under lib/gpu_memory_service/:

  • docs/standalone-usage.md — a concise user/integration guide: what GMS
    gives you, the single-node flow (with a mermaid sequence diagram), what an
    engine must implement to be a GMS shadow (GMS load, GMS-aware sleep/wake =
    unmap/remap, memory-accounting fixes, scratch KV, hold-until-promoted), and a
    "Multiple nodes: what's extra" section (with a topology diagram) scoped to
    only the GMS-specific deltas — whole-group shadows sharing resident weights,
    one leader flock, and whole-group failover. WideEP is folded in as a large
    instance of multi-node.
  • examples/shadow_failover/run.sh — one bare-bones script run on every
    node (keyed on NODE_RANK): single-node by default, multi-node when
    NNODES>1. Starts a per-node GMS server (plus etcd/nats/frontend on the
    leader) and a primary + shadow engine in autonomous shadow mode sharing a
    flock. It does not kill anything; you trigger failover by killing the
    primary's recorded process group, at which point the kernel releases the
    primary's flock and the waiting shadow acquires it and is promoted.
  • examples/shadow_failover/README.md — prerequisites, single- and
    multi-node run commands, and the failover trigger.
  • README.md — a short "Documentation" pointer to the guide and recipe.
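
The NODE_RANK/NNODES keying described for run.sh can be pictured with a small sketch like the one below. It is not the script from this PR: the function name `node_roles`, the printed role labels, and the choice of rank 0 as the leader are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decide what a node starts from NODE_RANK and NNODES.
# The role labels are illustrative, and treating rank 0 as the leader is an
# assumption; the real run.sh may differ.
set -eu

node_roles() {
  local node_rank="${1:-0}" nnodes="${2:-1}"
  # Every node runs its own GMS server plus a primary and a shadow engine.
  local roles="gms primary shadow"
  # Only the leader additionally runs etcd, NATS, and the frontend.
  if [ "$node_rank" -eq 0 ]; then
    roles="etcd nats frontend $roles"
  fi
  # Multi-node mode applies only when NNODES is greater than 1.
  if [ "$nnodes" -gt 1 ]; then
    echo "multi-node: $roles"
  else
    echo "single-node: $roles"
  fi
}

node_roles "${NODE_RANK:-0}" "${NNODES:-1}"
```

Keeping the decision in one function means the same file can be copied to every node and only the two environment variables change.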

The framework-integration packaging change (moving the vLLM/SGLang integrations
behind extras so the published wheel stays lean) is intentionally not in this
PR and will follow separately.

Validation

  • bash -n passes on run.sh; no echo/curl; SPDX headers on all new files;
    run.sh is executable.
  • All relative doc links resolve; both mermaid diagrams parse (the earlier parse
    error from a ; statement-separator in a Note is fixed).
  • A read-only reviewer pass over the diff (EXIT-trap safety, readiness gating,
    link/factual checks, vocabulary consistency) was addressed.
  • Not executed end-to-end: running it needs a CUDA GPU + etcd/NATS (and
    multiple nodes for the multi-node path). The commands are reconstructed from
    the verified test_shadow_failover.py harness and the operator's multi-node
    launch flags; smoke-test on a GPU host before relying on it. shellcheck was
    not available in the authoring environment (bash -n only).

Where should the reviewer start?

  • lib/gpu_memory_service/docs/standalone-usage.md — is the "GMS-specific vs.
    ordinary multi-node" framing right, and are the scratch-KV and failover
    explanations accurate?
  • lib/gpu_memory_service/examples/shadow_failover/run.sh — the multi-node
    launch wiring (per-cohort master ports, headless non-leader ranks,
    leader-only etcd/nats/frontend) and the cleanup trap.

Related Issues

🚫 This PR is NOT linked to an issue:

  • Confirmed — no related issue

Summary
-------
Adds a self-contained, runnable recipe under
lib/gpu_memory_service/examples/shadow_failover/ that demonstrates GPU Memory
Service (GMS) shadow-engine failover for vLLM on a single node / single GPU
WITHOUT the Dynamo Kubernetes operator. This gives users (and the TensorRT-LLM
team evaluating the GMS wheel) a concrete, operator-free shape for the feature,
distilled from the existing tests/gpu_memory_service/test_shadow_failover.py
harness.

The recipe uses the simple manual control-endpoint orchestration flavor
(POST /engine/control/{sleep,wake_up} + process-group SIGKILL of the primary),
not the autonomous flock path used by the operator, so the mechanics are
visible and easy to drive by hand. A primary engine loads weights once and
publishes them into a per-GPU GMS server; a pre-initialized shadow imports the
resident weights (no second disk load) and takes over when the primary is
killed, without reloading model weights.

Files:
- README.md          user-flow walkthrough (steps + benefit), how-it-works,
                     verification, cleanup, and scope/caveats (single-node only;
                     TRT-LLM has GMS weight-sharing but not the failover
                     lifecycle yet).
- run_demo.sh        one-shot orchestrator (10 steps) with a cleanup trap.
- start_infra.sh     start etcd + nats-server -js.
- start_gms.sh       start the production GMS server supervisor.
- run_engine.sh      shared primary/shadow launcher (RW writer vs RO importer).
- kill_primary.sh    process-group SIGKILL failure injection.
- verify.sh          assert takeover via a frontend completion.
- common.sh          shared env defaults + readiness helpers.

Validation
----------
- bash -n passes on all 7 scripts.
- shellcheck -x reports zero warnings.
- SPDX headers present on every file; all .sh files are executable.
- Not executed end to end (requires a CUDA GPU + etcd/NATS); commands are
  reconstructed from the verified test harness and should be smoke-tested on a
  single-GPU box before relying on them.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Summary
-------
Adds lib/gpu_memory_service/docs/standalone-usage.md: a concise, steps-oriented
user-flow doc for using GPU Memory Service WITHOUT the Dynamo Kubernetes
operator, written for the "what do I do / what do I get" question raised when
scoping the GMS wheel for standalone (e.g. TensorRT-LLM) use.

It covers: what GMS is and the benefit (skip weight reload; warm shadow
takeover), the framework-agnostic user flow (start server, launch engine as a
GMS client, add standby), single-node shadow failover (linking the runnable
recipe), the autonomous-flock vs manual-control activation mechanisms, and an
explicit "what is automatic vs. what needs a control plane" section for
multi-node and WideEP (GMS gives fast per-rank weight re-materialization; the
detect / serve-degraded / re-spawn / rejoin orchestration is the
engine's/operator's responsibility, not GMS). It also lists the wheel import
surface external engines depend on, and notes the framework integration
subpackages are Dynamo-runtime glue, not part of the standalone surface.

Also adds a small "Documentation" pointer in the GMS README linking the new
guide and the runnable recipe so they are discoverable.

Validation
----------
- SPDX header present; markdown only, no code changes.
- All relative cross-links resolve (../README.md,
  ../examples/shadow_failover/README.md,
  ../../../docs/kubernetes/shadow-engine-failover.md).
- Claims spot-checked against the GMS source (server entrypoint, socket naming,
  exposed import surface, finalize_gms_write return type).

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Summary
-------
Addresses review feedback that the recipe scripts were too large/convoluted and
that the writing should be more general — focused on the key assumptions/flows a
user must provide, plus what the engine wrapper and the operator do so a reader
can replicate shadow failover with just the GMS wheel.

Recipe (lib/gpu_memory_service/examples/shadow_failover/):
- Collapsed 7 scripts into a single minimal run.sh (~70 lines, no banner spam,
  minimal prints, one cleanup trap). Deleted common.sh, start_infra.sh,
  start_gms.sh, run_engine.sh, kill_primary.sh, verify.sh, run_demo.sh.
- Trimmed README.md to Overview / Prerequisites / Run it / What it shows, and a
  pointer to the conceptual doc for the orchestration explanation.

Docs (lib/gpu_memory_service/docs/standalone-usage.md):
- Added a general "Orchestration: who does what (and how to replicate it)"
  section split into three layers: (1) the GMS wheel primitives, (2) the engine
  wrapper in components/src/dynamo/vllm + integrations/vllm (GMS-aware
  sleep/wake = unmap/remap, memory accounting, scratch KV, ENGINE_ID RW/RO,
  flock-gated discovery registration), and (3) the Dynamo operator (DRA GPU
  sharing, shared lock/socket volume, failover env injection, RestartPolicy
  Never + cascade controller, multi-node per-rank GMS + pod-index rendezvous).
  Each layer states what you must reimplement to DIY; multi-node/WideEP folded
  in as "what's automatic vs. what needs a control plane".

Validation
----------
- bash -n run.sh passes; shellcheck clean. Not executed (needs GPU + infra).
- SPDX headers intact; relative cross-links resolve.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Summary
-------
Rewrites lib/gpu_memory_service/docs/standalone-usage.md from the perspective of
someone running plain inference-engine processes on one or a couple of nodes
(launched via Slurm/ssh/systemd/supervisor — not Kubernetes). Drops the
DRA/Grove/pod/operator-internals framing and instead answers "what do I have to
do to support shadow-engine failover?".

Structure:
- The pieces involved: GMS (weights + lock), a standby-capable engine, and your
  orchestration.
- Single node: what you do (start GMS server; run a writer + RO shadows sharing
  one lock file; autonomous flock promotion vs manual), and what you must
  provide yourself (GPU headroom for co-residency, shared lock path, routing to
  the active engine, relaunching the dead one).
- What the engine itself must support (GMS load, GMS-aware sleep/wake =
  unmap/remap, memory-accounting fixes, scratch KV, hold-until-active) with a
  backend table; framed as the work to do when integrating a non-vLLM engine.
- A couple of nodes: run whole engine groups (active + standby) that pair up by
  slot via per-group rendezvous, promotion stays a node-local leader flock, and
  the four things your launcher must build (detect, cohort-atomic teardown,
  promote, relaunch). Notes the Kubernetes operator is just an implementation of
  those four steps. WideEP folded in as the same loop.
- Retains the wheel import-surface section.

Validation
----------
- Markdown only; SPDX header intact; all relative cross-links resolve
  (../README.md, ../examples/shadow_failover/README.md,
  ../../../docs/kubernetes/shadow-engine-failover.md).

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Switch the single-node recipe to the simplest flow: two shadow-mode engines
share a flock, kill the primary, the kernel hands the lock to the shadow. No
echo signposting (comments only), no curl / control endpoints / frontend.
Observe the takeover in the shadow log. README and the standalone doc updated to
attribute the recipe to the autonomous-flock path (manual control kept as the
alternative for an external controller).

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
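
The "kill the primary, the kernel hands the lock to the shadow" flow in the commit above can be reproduced without a GPU using two stand-in processes and the util-linux `flock` command. This is only a sketch of the locking behaviour: the lock path, the log file, and the `sleep` stand-ins are assumptions, and nothing here touches GMS or a real engine.

```shell
#!/usr/bin/env bash
# Stand-in sketch of flock-based promotion: no GPU, GMS, or real engine.
set -eu

workdir="$(mktemp -d)"
lock="$workdir/engine.lock"       # hypothetical lock path shared by both "engines"
shadow_log="$workdir/shadow.log"

# "Primary": open the lock file on fd 9, take the exclusive lock, and keep
# holding it for as long as the process lives.
( exec 9>"$lock"; flock -x 9; exec sleep 300 ) &
primary_pid=$!

# Wait until the primary owns the lock (a non-blocking attempt then fails).
while flock -n "$lock" true; do sleep 0.1; done

# "Shadow": log that it is parked, block on the same lock, then report promotion.
( echo "waiting for lock" >"$shadow_log"
  exec 9>"$lock"; flock -x 9
  echo "promoted" >>"$shadow_log" ) &
shadow_pid=$!

# Give the shadow a moment to park, then kill the primary. The kernel closes
# the dead process's descriptors, which releases the lock and wakes the shadow.
sleep 0.5
kill -KILL "$primary_pid"
wait "$primary_pid" 2>/dev/null || true
wait "$shadow_pid"

promotion_result="$(tail -n 1 "$shadow_log")"
echo "shadow status: $promotion_result"
rm -rf "$workdir"
```

No controller is involved in the hand-off: the only coordination is the kernel releasing the lock when the holder's file descriptor goes away.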
The 'What the GMS wheel must expose' section is internal packaging/ops guidance, not user/integration content. Remove it from the external-facing standalone usage guide and fix the README pointer.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
- run.sh: replace the WARMUP fixed-sleep tunable with a real readiness gate —
  wait until the shadow logs that it is parked on the flock ("waiting for lock")
  before killing the primary. Remove the WARMUP override from the README.
- docs: rename "A couple of nodes" -> "Multiple nodes" and fold the WideEP
  content into it (WideEP is just a large multi-node deployment), leaving two
  failover sections: single-node and multi-node.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
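
A readiness gate of the kind described above could look roughly like this sketch. The function name `wait_for_parked`, its arguments, and the timeout are hypothetical; only the "waiting for lock" log text comes from the commit message. The liveness check is included so the loop cannot spin forever if the shadow dies early.

```shell
#!/usr/bin/env bash
# Sketch of a log-based readiness gate that also notices a dead shadow.
set -eu

# wait_for_parked LOG PID [TIMEOUT_SECONDS]
# Returns 0 once LOG contains "waiting for lock". Returns 1 if PID exits
# first or the timeout (an assumption added here) elapses.
wait_for_parked() {
  local log="$1" pid="$2" timeout="${3:-60}" waited=0
  until grep -q "waiting for lock" "$log" 2>/dev/null; do
    # kill -0 sends no signal; it only checks that the process still exists.
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "shadow exited before parking on the lock" >&2
      return 1
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for the shadow to park" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

A caller would run something like `wait_for_parked shadow.log "$SHADOW_PID" || exit 1` before telling the user it is safe to kill the primary.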
- run.sh: EXIT trap no longer defaults an unset PGID to 0 (which would
  `kill -KILL -0` the caller's own process group when the script fails before
  the engines launch); guard each group-kill on a set PGID.
- run.sh: the "waiting for lock" readiness gate now aborts if the shadow
  process dies before parking, instead of spinning forever.
- docs: use consistent primary/shadow framing in the single-node recap; drop
  the exact TRT-LLM source-identity filename (external, unverifiable here) in
  favor of a general phrasing.
- recipe README: reflect that the gate waits on the shadow being parked, not a
  timer.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
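
The EXIT-trap hazard fixed above is easy to reproduce: an unset PGID that defaults to 0 turns the cleanup into `kill -KILL -0`, which signals the script's own process group. A guarded cleanup might be sketched as follows; the variable and function names are hypothetical and the extra rejection of non-numeric values goes slightly beyond what the commit describes.

```shell
#!/usr/bin/env bash
# Sketch of an EXIT trap that only kills process groups it actually recorded.
set -eu

PRIMARY_PGID=""   # hypothetical names; each is set once its engine has launched
SHADOW_PGID=""

kill_group() {
  local pgid="${1:-}"
  # An empty or non-numeric PGID means the engine never started: do nothing.
  case "$pgid" in
    ''|*[!0-9]*) return 0 ;;
  esac
  # PGID 0 would target our own group, so refuse it (and PID 1's group).
  [ "$pgid" -gt 1 ] || return 0
  kill -KILL -- "-$pgid" 2>/dev/null || true
}

cleanup() {
  kill_group "$PRIMARY_PGID"
  kill_group "$SHADOW_PGID"
}
trap cleanup EXIT
```

If the script fails before either engine launches, both variables are still empty and the trap is a no-op.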
- Cut the guide by ~45% (reading time 8.5 -> ~5 min): fold "pieces involved" into the intro, trim prose.
- Restructure multi-node as "what's extra" on top of single-node (3 changes:
  standby = whole group, per-group rendezvous, one leader flock + whole-group
  failover). Drop the failure-detection / "four steps to build" / Slurm
  control-plane framing — the user triggers the failover (kills the primary).
- Fix the scratch-KV explanation: placeholder un-backed KV lets a (re)initializing
  shadow capture CUDA graphs without consuming real GPU memory for KV; on
  promotion the real GMS-backed KV is swapped in at the same VAs so the graphs
  stay valid.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
- Drop the per-group rendezvous bullet: a shared master addr/port + --node-rank
  is ordinary multi-node deployment, not specific to GMS shadow failover. The
  multi-node section now lists only what's special to GMS: standby = whole group
  sharing resident weights (no extra weight memory), promotion via one leader
  flock, and whole-group failover.
- Add a sequence diagram for the single-node failover flow and a topology
  diagram for the multi-node active/standby layout.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Replace active/standby with primary/shadow throughout the standalone guide, including the prose, the engine-requirements list, the table header, and both mermaid diagram labels.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
…node

- docs: a ';' inside a mermaid Note line broke the sequence diagram (mermaid
  treats ';' as a statement separator); rephrase it. Remove the em-dashes from
  the prose.
- recipe: reframe single-GPU -> single-node (any GPU count; add a TP knob) and
  make run.sh multi-node-capable from the SAME script, keyed on NODE_RANK
  (leader runs etcd/nats/frontend; per-cohort master ports; non-leader ranks run
  headless). The script no longer kills anything: it records each engine's PGID
  and the user triggers failover by killing the primary, after which the flock
  promotes the shadow automatically.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
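
Recording a PGID and later killing that group, as the last commit describes, can be sketched with stand-in processes. The PGID file and the `sleep`-based "engine" are assumptions for illustration; the real run.sh and its recorded paths are not shown in this conversation.

```shell
#!/usr/bin/env bash
# Sketch: run a stand-in "engine" in its own process group, record the PGID,
# then kill the whole group the way a user would to trigger failover.
set -eu

pgid_file="$(mktemp)"   # hypothetical place to record the primary's PGID

# setsid starts a new session, so the sh below leads a new process group whose
# PGID equals its own PID. It records that PID, forks a worker, and waits.
setsid sh -c 'echo "$$" >"$1"; sleep 300 & wait' engine "$pgid_file" &
engine_pid=$!

# Wait until the engine has written its PGID.
until [ -s "$pgid_file" ]; do sleep 0.1; done
primary_pgid="$(cat "$pgid_file")"

# Failover trigger: a negative PID argument tells kill to signal every
# process in that group, so the engine and its worker die together.
kill -KILL -- "-$primary_pgid"

engine_status=0
wait "$engine_pid" || engine_status=$?
echo "primary group $primary_pgid killed, engine exit status $engine_status"
rm -f "$pgid_file"
```

Killing the group rather than a single PID matters because an engine typically forks workers, and any survivor would keep the lock's file descriptor open and delay the shadow's promotion.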
@galletas1712 galletas1712 temporarily deployed to external_collaborator June 26, 2026 21:24 — with GitHub Actions Inactive
@github-actions github-actions Bot added the docs and documentation labels Jun 26, 2026
@datadog-official

datadog-official Bot commented Jun 26, 2026


Pipelines

⚠️ Warnings

🚦 3 Pipeline jobs failed

Docs link check | lychee

PR | dynamo-runtime / rust-gpu

PR | dynamo-status-check

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 811972d

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

…per)

The "What the engine must support" summary wrongly credited dynamo.vllm with
implementing the whole list. Split it correctly: the vLLM patches for GMS
integration (gpu_memory_service.integrations.vllm) provide the weight load,
GMS-aware sleep/wake, memory-accounting fixes, and scratch KV; the Dynamo vLLM
wrapper (components/src/dynamo/vllm) drives the hold-until-promoted activation
(flock gate + deferred discovery registration) and wires them together. Adjust
the vLLM row in the backend table to match.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
@galletas1712 galletas1712 temporarily deployed to external_collaborator June 26, 2026 21:30 — with GitHub Actions Inactive

Labels

docs, documentation, size/L


1 participant