| Docs | Roadmap | Recipes | Examples | Prebuilt Containers | Blog | Design Proposals |
The open-source, datacenter-scale inference stack. Dynamo is the orchestration layer above inference engines: it doesn't replace SGLang, TensorRT-LLM, or vLLM; it turns them into a coordinated multi-node inference system. Disaggregated serving, intelligent routing, multi-tier KV caching, and automatic scaling work together to maximize throughput and minimize latency for LLM, reasoning, multimodal, and video-generation workloads.
Built in Rust for performance, Python for extensibility.
Dynamo is a good fit if:
- You're serving LLMs across multiple GPUs or nodes and need to coordinate them
- You want KV-aware routing to avoid redundant prefill computation
- You need to independently scale prefill and decode (disaggregated serving)
- You want automatic scaling that meets latency SLAs at minimum total cost of ownership (TCO)
- You need fast cold-starts when spinning up new replicas
If you're running a single model on a single GPU, your inference engine alone is probably sufficient.
Feature support at a glance:
| Feature | SGLang | TensorRT-LLM | vLLM |
|---|---|---|---|
| Disaggregated Serving | ✅ | ✅ | ✅ |
| KV-Aware Routing | ✅ | ✅ | ✅ |
| SLA-Based Planner | ✅ | ✅ | ✅ |
| KVBM | 🚧 | ✅ | ✅ |
| Multimodal | ✅ | ✅ | ✅ |
| Tool Calling | ✅ | ✅ | ✅ |
Full Feature Matrix → covers LoRA, request migration, speculative decoding, and feature interactions.
| Result | Context |
|---|---|
| 7x higher throughput per GPU | DeepSeek R1 on GB200 NVL72 with Dynamo vs. B200 without (InferenceX) |
| 7x faster model startup | ModelExpress weight streaming (DeepSeek-V3 on H200) |
| 2x faster time to first token | KV-aware routing, Qwen3-Coder 480B (Baseten benchmark) |
| 80% fewer SLA breaches | Planner autoscaling at 5% lower TCO (Alibaba APSARA 2025 @ 2:50:00) |
| 750x higher throughput | DeepSeek-R1 on GB300 NVL72 (InferenceXv2) |
Most inference engines optimize a single GPU or a single node. Dynamo is the orchestration layer above them — it turns a cluster of GPUs into a coordinated inference system.
| Capability | What it does | Why it matters |
|---|---|---|
| Disaggregated Prefill/Decode | Separates prefill and decode into independently scalable GPU pools | Maximizes GPU utilization; each phase runs on hardware tuned for its workload |
| KV-Aware Routing | Routes requests based on worker load and KV cache overlap | Eliminates redundant prefill computation — 2x faster TTFT |
| KV Block Manager (KVBM) | Offloads KV cache across GPU → CPU → SSD → remote storage | Extends effective context length beyond GPU memory |
| ModelExpress | Streams model weights GPU-to-GPU via NIXL/NVLink | 7x faster cold-start for new replicas |
| Planner | SLA-driven autoscaler that profiles workloads and right-sizes pools | Meets latency targets at minimum total cost of ownership (TCO) |
| Grove | K8s operator for topology-aware gang scheduling (NVL72) | Places workloads optimally across racks, hosts, and NUMA nodes |
| AIConfigurator | Simulates 10K+ deployment configs in seconds | Finds optimal serving config without burning GPU-hours |
| Fault Tolerance | Canary health checks + in-flight request migration | Workers fail; user requests don't |
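To make the routing idea concrete, here is a toy scorer that prefers workers with high KV-cache prefix overlap and low load. This is an illustrative sketch only; Dynamo's actual router, its data structures, and its weighting differ.

```python
# Toy sketch of KV-aware routing: pick the worker that maximizes
# cached-prefix overlap while penalizing current load.
# Illustrative only -- not Dynamo's real implementation.

def prefix_overlap(request_blocks: list, cached_blocks: set) -> int:
    """Count leading request blocks already present in a worker's KV cache."""
    n = 0
    for block in request_blocks:
        if block not in cached_blocks:
            break
        n += 1
    return n

def pick_worker(request_blocks, workers, overlap_weight=1.0, load_weight=1.0):
    """workers: mapping of worker_id -> (cached_block_set, active_request_count)."""
    def score(item):
        _, (cached, load) = item
        return overlap_weight * prefix_overlap(request_blocks, cached) - load_weight * load
    return max(workers.items(), key=score)[0]

workers = {
    "w0": ({1, 2, 3}, 1),  # warm cache, lightly loaded
    "w1": (set(), 0),      # cold cache, idle
}
print(pick_worker([1, 2, 3, 4], workers))  # w0 wins: 3 cached blocks outweigh its load
```

A cold-cache request (no overlap anywhere) falls back to plain load balancing, which is why the same scorer handles both cases.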
- Zero-config deploy (DGDR) (beta): Specify model, HW, and SLA in one YAML — AIConfigurator auto-profiles the workload, Planner optimizes the topology, and Dynamo deploys
- Agentic inference: Per-request hints for latency priority, expected output length, and cache pinning TTL. LangChain + NeMo Agent Toolkit integrations
- Multimodal E/P/D: Disaggregated encode/prefill/decode with embedding cache — 30% faster TTFT on image workloads
- Video generation: Native FastVideo + SGLang Diffusion support — real-time 1080p on single B200
- K8s Inference Gateway plugin: KV-aware routing inside the standard Kubernetes gateway
- Storage-tier KV offload: S3/Azure blob support + global KV events for cluster-wide cache visibility
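Per-request hints of the kind described above could ride along as extra fields in an OpenAI-style request body. The field names below are hypothetical, chosen only to illustrate the shape; they are not Dynamo's shipped API.

```python
import json

# Hypothetical chat request carrying agentic-inference hints.
# The "hints" block and its field names are illustrative only.
request = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Plan the next tool call."}],
    "max_tokens": 100,
    "hints": {                       # hypothetical extension block
        "latency_priority": "high",  # prioritize TTFT for this request
        "expected_output_tokens": 64,
        "cache_pin_ttl_s": 300,      # keep this agent's KV prefix warm
    },
}
body = json.dumps(request)
```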
```bash
# Pull a prebuilt container (SGLang example)
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0

# Inside the container: start the frontend and a worker
python3 -m dynamo.frontend --http-port 8000 --discovery-backend file > /dev/null 2>&1 &
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file &

# Send a request
curl -s localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "Hello!"}],
  "max_tokens": 100
}' | jq
```
Also available: `tensorrtllm-runtime:1.0.0` and `vllm-runtime:1.0.0`.
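The same request can be sent from Python through the frontend's OpenAI-compatible endpoint. A minimal standard-library sketch, assuming the frontend from the previous step is listening on localhost:8000:

```python
import json
import urllib.request

# Minimal OpenAI-compatible client for a Dynamo frontend on localhost:8000.
def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    payload = {
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# print(chat("Hello!"))  # uncomment with a running frontend
```

Any OpenAI SDK pointed at the same base URL works the same way.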
```bash
pip install "ai-dynamo[sglang]"   # or [vllm] or [trtllm]
```
Then start the frontend and a worker as shown above. See the full installation guide for system dependencies and backend-specific notes.
For production multi-node clusters, install the Dynamo Platform and deploy with a single manifest:
```yaml
# Zero-config deploy: specify model + SLA, Dynamo handles the rest
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model
spec:
  model: Qwen/Qwen3-0.6B
  backend: vllm
  sla:
    ttft: 200.0  # ms
    itl: 20.0    # ms
  autoApply: true
```
Pre-built recipes for common models:
| Model | Framework | Mode | Recipe |
|---|---|---|---|
| Llama-3-70B | vLLM | Aggregated | View |
| DeepSeek-R1 | SGLang | Disaggregated | View |
| Qwen3-32B-FP8 | TensorRT-LLM | Aggregated | View |
See recipes/ for the full list. Cloud-specific guides: AWS EKS · Google GKE
For contributors who want to build and develop locally. See the full build guide for details.
```bash
# Install system deps (Ubuntu 24.04)
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh && source $HOME/.cargo/env

# Create venv and build
uv venv dynamo && source dynamo/bin/activate
uv pip install pip maturin
cd lib/bindings/python && maturin develop --uv && cd $PROJECT_ROOT
uv pip install -e lib/gpu_memory_service
uv pip install -e .
```
VSCode/Cursor users: see the `.devcontainer` for a pre-configured dev environment.
Dynamo is built in the open with an OSS-first development model. We welcome contributions of all kinds.
- Contribution Guide — How to contribute code, docs, and recipes
- Design Proposals — RFCs for major features
- Office Hours — Biweekly community calls
- Discord — Chat with the team and community
- Dynamo Day Recordings — Deep dives from production users
- [03/15] Dynamo 1.0 is here — production-ready with strong community adoption
- [03/15] NVIDIA Blackwell Ultra sets new inference records in MLPerf
- [03/15] NVIDIA Blackwell leads on SemiAnalysis InferenceMax benchmarks
- [12/05] Moonshot AI's Kimi K2 achieves 10x inference speedup with Dynamo on GB200
- [12/02] Mistral AI runs Mistral Large 3 with 10x faster inference using Dynamo
- [11/20] Dell integrates PowerScale with NIXL for 19x faster TTFT
Older news
- Support Matrix — Hardware, OS, CUDA, and backend versions
- Feature Matrix — Detailed backend compatibility
- Release Artifacts — Containers, wheels, Helm charts
- Service Discovery — K8s-native vs etcd vs file-based discovery
- Benchmarking Guide — Compare deployment topologies with AIPerf
