Skip to content

docs(operator): clarify worker ReadinessProbe gates external KubeDiscoveryClient routing#9201

Open
dmvevents wants to merge 2 commits into
ai-dynamo:mainfrom
dmvevents:fix/readinessprobe-comment-v2
Open

docs(operator): clarify worker ReadinessProbe gates external KubeDiscoveryClient routing#9201
dmvevents wants to merge 2 commits into
ai-dynamo:mainfrom
dmvevents:fix/readinessprobe-comment-v2

Conversation

@dmvevents

@dmvevents dmvevents commented May 6, 2026

Copy link
Copy Markdown

Overview

One-line-scope change: comment-only edit in deploy/operator/internal/dynamo/component_worker.go (no code change, no test change). Rewords a pre-existing comment block that was factually correct for Dynamo's internal KvStore transport but misled multiple downstream deployments when the Frontend's KubeDiscoveryClient path is in play.

Problem

The pre-existing comment (lines 54–56 in main at the time of this patch) reads:

// ReadinessProbe in Dynamo worker context doesn't determine that the worker is ready to receive traffic
// Since worker registration is done through external KvStore and Transport does not use Kubernetes Service
// Still important for external depencies that rely on Pod Readiness

This is half-true. It's accurate for the NATS/etcd KvStore transport (Dynamo's internal routing). It's not accurate for any deployment where Frontend uses the k8s-native discovery path — which is the default when the operator provisions a Service for worker pods.

The runtime's KubeDiscoveryClient at lib/runtime/src/discovery/kube/daemon.rs:246 explicitly filters EndpointSlices by endpoint.conditions.ready == true before correlating them with DynamoWorkerMetadata CRs. Workers whose Pod Ready stays False during model load (large LLM weight load, cudagraph capture, NIXL handshake) are filtered out of the Frontend's instance map, even though the worker has registered its CR and is serving traffic over the internal KvStore transport.

The net effect: Frontend /v1/models returns {"data":[]} and /v1/completions returns HTTP 404 for up to 5+ minutes, with no obvious diagnostic unless you read both the Rust daemon and the Go operator code side by side.

Fix

Replace the 3-line comment with an 11-line block that:

  1. States explicitly that ReadinessProbe does gate external Kubernetes Service / EndpointSlice routing
  2. Names the runtime file + line (lib/runtime/src/discovery/kube/daemon.rs:246) where that filtering happens
  3. Clarifies that the prior wording was only correct for the internal KvStore transport
  4. Cross-links the reproducing case (ai-dynamo/dynamo#9200)

No function, probe config, or container mutation is touched. Pure doc change.

Why this is worth merging

  • Zero risk. Compile is unaffected; probe behavior is unchanged.
  • Real operational value. At least one deployment (aws-samples/awsome-inference#72) traced a 6-revision debugging cycle to the ambiguity in this exact comment.
  • Preserves the original author's intent — they were correct about the KvStore case. The fix just adds the Service-routing case on equal footing.

Testing

  • Comment-only; existing unit tests cover container.ReadinessProbe assertions unchanged
  • Local git diff --stat confirms only component_worker.go touched
  • No functional regression expected (requires CI confirm)

Linked

  • Fixes half of #9200 (documentation side; the operational side, whether the probe's default failureThreshold: 3 is too tight for typical model-load times, is a potential follow-up PR)

Downstream context

This contribution originated from the aws-samples/awsome-inference#72 PR which bundles Dynamo + AWS EFA RDMA + 1.0.1 → 1.1.0 bump. Happy to move the PR to a nvidia.com-hosted fork if that's the preferred contribution flow for the Dynamo repo; please let me know.


Open in Devin Review

Summary by CodeRabbit

  • Documentation
    • Updated internal documentation for container configuration readiness probe details.

Note: This release contains internal documentation improvements with no user-facing changes or functional updates.

The pre-existing comment block in component_worker.go said the
ReadinessProbe "doesn't determine that the worker is ready to receive
traffic" and qualified only that it's "important for external
dependencies that rely on Pod Readiness."

This was correct for Dynamo's internal NATS/etcd KvStore transport,
but misled several deployments into omitting the probe, which then
made the Frontend's KubeDiscoveryClient path return 0 instances.

The runtime's discovery daemon at
lib/runtime/src/discovery/kube/daemon.rs:246 explicitly filters
EndpointSlices by endpoint.conditions.ready == true before
correlating with DynamoWorkerMetadata CRs, so a worker whose Pod
stays Ready=False during model load (e.g., large LLM weight load,
cudagraph capture, NIXL handshake) is filtered out of the Frontend's
instance map — even though the worker has registered its CR and is
serving traffic over the internal KvStore transport.

This commit rewords the comment so future contributors understand
that the ReadinessProbe is load-bearing for any deployment that
exposes the worker via a Kubernetes Service (which all current
operator-managed DGDs do through the auto-generated headless svc).

No code change — comment-only.

Reproducing case: ai-dynamo#9200
@dmvevents dmvevents requested a review from a team as a code owner May 6, 2026 06:38
@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

👋 Hi dmvevents! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions Bot added docs external-contribution Pull request is from an external contributor labels May 6, 2026
@github-actions github-actions Bot added the deployment::k8s Relates to dynamo deployment in kubernetes label May 6, 2026
@coderabbitai

coderabbitai Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0c57f288-9337-43f4-a5b4-d990056ccb92

📥 Commits

Reviewing files that changed from the base of the PR and between 87c5090 and fdc4acb.

📒 Files selected for processing (1)
  • deploy/operator/internal/dynamo/component_worker.go

Walkthrough

The pull request expands an explanatory comment in the Dynamo worker's container configuration. The comment now details the readiness probe's relationship to external KvStore-based routing, EndpointSlices, and the KubeDiscoveryClient, referencing issue 9200. No functional code changes.

Changes

Readiness Probe Comment Clarification

Layer / File(s) Summary
Documentation / Comments
deploy/operator/internal/dynamo/component_worker.go
Readiness probe comment expanded from a brief note to a detailed multi-line explanation of its role in external routing and KubeDiscoveryClient behavior, with GitHub issue reference.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and clearly describes the main change: clarifying that the ReadinessProbe comment gates external KubeDiscoveryClient routing.
Description check ✅ Passed The PR description covers all required template sections: Overview (comment-only change), Details (problem, fix, rationale), Where to review (specific file), and Related Issues (references #9200).
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.1)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@dmvevents

Copy link
Copy Markdown
Author

@julienmancuso / @hhzhang16 — this is a docs-only change on deploy/operator/internal/dynamo/component_worker.go clarifying the ReadinessProbe comment (full context in linked issue #9200). dynamo-ops bot has already approved and all non-gated checks are green; the block is @ai-dynamo/dynamo-deploy-codeowners review since the paths in CODEOWNERS require a human approver. Would one of you be able to take a quick look? Happy to rebase if anything has gone stale.

@hhzhang16

Copy link
Copy Markdown
Contributor

@dmvevents I've approved, but you need to sign-off on your commit! (to pass DCO)

@hhzhang16

Copy link
Copy Markdown
Contributor

/ok to test 7668750

@dmvevents

Copy link
Copy Markdown
Author

@hhzhang16 — the previous CI run got CANCELLED (Mirror Repository to GitLab + Trigger CI Pipeline). Last commit is unchanged at 7668750. Could you re-trigger via /ok to test? PR is APPROVED and only blocked on the missing CI pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

deployment::k8s Relates to dynamo deployment in kubernetes docs external-contribution Pull request is from an external contributor size/S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants