docs(operator): clarify worker ReadinessProbe gates external KubeDiscoveryClient routing#9201
docs(operator): clarify worker ReadinessProbe gates external KubeDiscoveryClient routing#9201dmvevents wants to merge 2 commits into
Conversation
The pre-existing comment block in component_worker.go said the ReadinessProbe "doesn't determine that the worker is ready to receive traffic" and qualified only that it's "important for external dependencies that rely on Pod Readiness." This was correct for Dynamo's internal NATS/etcd KvStore transport, but misled several deployments into omitting the probe, which then made the Frontend's KubeDiscoveryClient path return 0 instances. The runtime's discovery daemon at lib/runtime/src/discovery/kube/daemon.rs:246 explicitly filters EndpointSlices by endpoint.conditions.ready == true before correlating with DynamoWorkerMetadata CRs, so a worker whose Pod stays Ready=False during model load (e.g., large LLM weight load, cudagraph capture, NIXL handshake) is filtered out of the Frontend's instance map — even though the worker has registered its CR and is serving traffic over the internal KvStore transport. This commit rewords the comment so future contributors understand that the ReadinessProbe is load-bearing for any deployment that exposes the worker via a Kubernetes Service (which all current operator-managed DGDs do through the auto-generated headless svc). No code change — comment-only. Reproducing case: ai-dynamo#9200
|
👋 Hi dmvevents! Thank you for contributing to ai-dynamo/dynamo. Just a reminder: The 🚀 |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughThe pull request expands an explanatory comment in the Dynamo worker's container configuration. The comment now details the readiness probe's relationship to external KvStore-based routing, EndpointSlices, and the KubeDiscoveryClient, referencing issue 9200. No functional code changes. ChangesReadiness Probe Comment Clarification
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.1)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@julienmancuso / @hhzhang16 — this is a docs-only change on |
|
@dmvevents I've approved, but you need to sign-off on your commit! (to pass DCO) |
|
/ok to test 7668750 |
|
@hhzhang16 — the previous CI run got CANCELLED (Mirror Repository to GitLab + Trigger CI Pipeline). Last commit is unchanged at |
Overview
One-line-scope change: comment-only edit in
deploy/operator/internal/dynamo/component_worker.go(no code change, no test change). Rewords a pre-existing comment block that was factually correct for Dynamo's internal KvStore transport but misled multiple downstream deployments when the Frontend'sKubeDiscoveryClientpath is in play.Problem
The pre-existing comment (lines 54–56 in
mainat the time of this patch) reads:This is half-true. It's accurate for the NATS/etcd KvStore transport (Dynamo's internal routing). It's not accurate for any deployment where Frontend uses the k8s-native discovery path — which is the default when the operator provisions a
Servicefor worker pods.The runtime's
KubeDiscoveryClientatlib/runtime/src/discovery/kube/daemon.rs:246explicitly filters EndpointSlices byendpoint.conditions.ready == truebefore correlating them withDynamoWorkerMetadataCRs. Workers whose Pod Ready staysFalseduring model load (large LLM weight load, cudagraph capture, NIXL handshake) are filtered out of the Frontend's instance map, even though the worker has registered its CR and is serving traffic over the internal KvStore transport.The net effect: Frontend
/v1/modelsreturns{"data":[]}and/v1/completionsreturns HTTP 404 for up to 5+ minutes, with no obvious diagnostic unless you read both the Rust daemon and the Go operator code side by side.Fix
Replace the 3-line comment with an 11-line block that:
ReadinessProbedoes gate external Kubernetes Service / EndpointSlice routinglib/runtime/src/discovery/kube/daemon.rs:246) where that filtering happensNo function, probe config, or container mutation is touched. Pure doc change.
Why this is worth merging
Testing
container.ReadinessProbeassertions unchangedgit diff --statconfirms onlycomponent_worker.gotouchedLinked
failureThreshold: 3is too tight for typical model-load times, is a potential follow-up PR)Downstream context
This contribution originated from the aws-samples/awsome-inference#72 PR which bundles Dynamo + AWS EFA RDMA + 1.0.1 → 1.1.0 bump. Happy to move the PR to a nvidia.com-hosted fork if that's the preferred contribution flow for the Dynamo repo; please let me know.
Summary by CodeRabbit
Note: This release contains internal documentation improvements with no user-facing changes or functional updates.