[release-4.22] NVIDIA-596: Enable dpu healthcheck#3009

Open
tsorya wants to merge 2 commits into openshift:release-4.22 from tsorya:cherry-pick-2941-to-release-4.22

tsorya commented May 17, 2026

This is a manual cherry-pick of #2941 to release-4.22.

Changes

Cherry-picks both commits from #2941:

  1. NVIDIA-596: Enable DPU healthcheck — Adds configurable DPU node lease health monitoring (dpu-node-lease-renew-interval, dpu-node-lease-duration) read from the hardware-offload-config ConfigMap. Env vars OVNKUBE_NODE_LEASE_RENEW_INTERVAL / OVNKUBE_NODE_LEASE_DURATION are injected into ovnkube-node for dpu-host/dpu modes. Script-lib translates them into --dpu-node-lease-renew-interval / --dpu-node-lease-duration CLI flags. Setting renew-interval to 0 disables the health check.

  2. NVIDIA-616: Bump Multus CNI API version to 1.1.0 — Required for CNI STATUS and GC verbs used by the DPU health check. Backward compatible with 0.3.1.
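
The version bump matters because a CNI plugin can only be driven with a configuration version that it advertises. Here is a minimal sketch of that check, assuming the JSON shape of the CNI VERSION result (`cniVersion`, `supportedVersions`). The helper name and the sample version lists are hypothetical and do not describe any specific plugin:

```python
import json


def supports_config_version(version_result: str, config_version: str) -> bool:
    """Return True if the plugin's VERSION result lists config_version."""
    result = json.loads(version_result)
    return config_version in result.get("supportedVersions", [])


# Hypothetical VERSION results; the version lists are illustrative only.
older_plugin = json.dumps(
    {"cniVersion": "1.0.0", "supportedVersions": ["0.3.1", "0.4.0", "1.0.0"]}
)
newer_plugin = json.dumps(
    {"cniVersion": "1.1.0", "supportedVersions": ["0.3.1", "0.4.0", "1.0.0", "1.1.0"]}
)

print(supports_config_version(older_plugin, "1.1.0"))  # False
print(supports_config_version(newer_plugin, "1.1.0"))  # True
```

Under these assumptions, a plugin that stops at 1.0.0 would need either an update or a configuration that pins an older cniVersion.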

Conflict Resolution

One conflict in bindata/network/ovn-kubernetes/common/008-script-lib.yaml at the end of the start-ovnkube-node() function: release-4.22 still has ${egress_features_enable_flag} and ${multi_external_gateway_enable_flag} (which were removed on master by #2944). Resolution keeps these existing lines and appends the new ${dpu_lease_flags}.

This is compatible with #2997 (cherry-pick of #2944) which is pending on release-4.22 — when #2997 merges, it will cleanly remove the egress/gateway lines while ${dpu_lease_flags} remains.

tsorya and others added 2 commits May 16, 2026 21:24
Add configurable DPU node lease health monitoring to detect when the
DPU-side OVN-Kubernetes component is down or not installed. Without
this, pods are scheduled to DPU-accelerated nodes regardless of DPU
readiness, causing silent 2-minute CNI ADD timeouts with no visibility
or automated remediation.

DPU lease configuration:
- Read dpu-node-lease-renew-interval and dpu-node-lease-duration from
  the hardware-offload-config ConfigMap (defaults: 10s / 40s).
- Inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION
  env vars into ovnkube-controller for dpu-host/dpu node modes.
- Script-lib translates env vars into --dpu-node-lease-renew-interval
  and --dpu-node-lease-duration CLI flags for ovnkube-node.
- Setting renew-interval to 0 disables the health check; duration
  must always be > 0 (required by ovn-kubernetes).
- Lease namespace is derived via downward API (fieldRef).
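
The translation from environment variables to CLI flags described above could look roughly like the following. This is a sketch in Python rather than the actual script-lib shell code. The helper name, the `--flag=value` form, the treatment of values as whole seconds, and the place where defaults are applied are all assumptions; only the variable names, flag names, defaults, and the two rules (0 disables, duration > 0) come from the commit message:

```python
import os

# Defaults quoted in the commit message: 10s renew interval, 40s lease duration.
DEFAULT_RENEW_INTERVAL_S = 10
DEFAULT_LEASE_DURATION_S = 40


def dpu_lease_flags(env=None):
    """Build the DPU lease CLI flags from the injected environment variables.

    Hypothetical helper: values are parsed as whole seconds, which is an
    assumption of this sketch and not something the PR states.
    """
    env = os.environ if env is None else env
    renew = int(env.get("OVNKUBE_NODE_LEASE_RENEW_INTERVAL", DEFAULT_RENEW_INTERVAL_S))
    duration = int(env.get("OVNKUBE_NODE_LEASE_DURATION", DEFAULT_LEASE_DURATION_S))

    # A renew interval of 0 is valid and disables the health check.
    if renew < 0:
        raise ValueError("renew interval must be >= 0 (0 disables the health check)")
    # The duration must be positive even when the health check is disabled.
    if duration <= 0:
        raise ValueError("lease duration must be > 0")

    return [
        f"--dpu-node-lease-renew-interval={renew}",
        f"--dpu-node-lease-duration={duration}",
    ]


def health_check_enabled(env=None):
    """Return True unless the renew interval is explicitly set to 0."""
    env = os.environ if env is None else env
    return int(env.get("OVNKUBE_NODE_LEASE_RENEW_INTERVAL", DEFAULT_RENEW_INTERVAL_S)) != 0
```

For example, `dpu_lease_flags({"OVNKUBE_NODE_LEASE_RENEW_INTERVAL": "0"})` still emits both flags, with the interval set to 0 and the duration left at its default.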

Jira: https://issues.redhat.com/browse/NVIDIA-596
Made-with: Cursor
Signed-off-by: Igal Tsoiref <itsoiref@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

openshift-ci-robot commented May 17, 2026

@tsorya: This pull request references NVIDIA-596 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set.

openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) May 17, 2026
coderabbitai Bot commented May 17, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

openshift-ci Bot requested review from marty-power and taanyas May 17, 2026 01:26

openshift-ci Bot commented May 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tsorya
Once this PR has been reviewed and has the lgtm label, please assign danwinship for approval. For more information see the Code Review Process.

tsorya commented May 17, 2026

/payload 4.22 ci blocking
/payload 4.22 nightly blocking

openshift-ci Bot commented May 17, 2026

@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aks
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/65638f00-5190-11f1-9814-cded5d8a1eab-0

trigger 13 job(s) of type blocking for the nightly release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips
  • periodic-ci-openshift-release-main-ci-4.22-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-1of2
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-2of2
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-1of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-2of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-3of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv4
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/65638f00-5190-11f1-9814-cded5d8a1eab-1

tsorya commented May 18, 2026

/payload 4.22 ci blocking
/payload 4.22 nightly blocking

openshift-ci Bot commented May 18, 2026

@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aks
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/daa71920-5263-11f1-9827-2279928092a2-0

trigger 13 job(s) of type blocking for the nightly release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips
  • periodic-ci-openshift-release-main-ci-4.22-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-1of2
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-2of2
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-1of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-2of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-3of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv4
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/daa71920-5263-11f1-9827-2279928092a2-1

tsorya commented May 18, 2026

/retest-required

machine424 commented

/hold
please see https://redhat.atlassian.net/browse/OCPBUGS-86033

openshift-ci Bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) May 18, 2026

tsorya commented May 18, 2026

> /hold
> please see https://redhat.atlassian.net/browse/OCPBUGS-86033

Is there an issue open on the Multus side, or on the Cilium side?

machine424 commented

> > /hold
> > please see https://redhat.atlassian.net/browse/OCPBUGS-86033
>
> Is there an issue open on the Multus side, or on the Cilium side?

I can see cilium/cilium#30363, which covers bumping the cniVersion for Cilium.

mgencur commented May 21, 2026

/hold
Can we first discuss the effect on OpenShift support for Cilium? This was reported and rejected: https://redhat.atlassian.net/browse/OCPBUGS-86033
However, OpenShift/HyperShift supports Cilium as a CNI provider, and that has worked so far. Merging this would introduce a regression, which makes it a blocker.
Note: I'm trying to find a workaround in Cilium.

mgencur commented May 21, 2026

This works for me: openshift/release#79593, which overrides the CNI version in the Cilium config.
