[release-4.22] NVIDIA-596: Enable dpu healthcheck #3009
Add configurable DPU node lease health monitoring to detect when the DPU-side OVN-Kubernetes component is down or not installed. Without this, pods are scheduled to DPU-accelerated nodes regardless of DPU readiness, causing silent 2-minute CNI ADD timeouts with no visibility or automated remediation.

DPU lease configuration:
- Read dpu-node-lease-renew-interval and dpu-node-lease-duration from the hardware-offload-config ConfigMap (defaults: 10s / 40s).
- Inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION env vars into ovnkube-controller for dpu-host/dpu node modes.
- Script-lib translates env vars into --dpu-node-lease-renew-interval and --dpu-node-lease-duration CLI flags for ovnkube-node.
- Setting renew-interval to 0 disables the health check; duration must always be > 0 (required by ovn-kubernetes).
- Lease namespace is derived via downward API (fieldRef).

Jira: https://issues.redhat.com/browse/NVIDIA-596

Made-with: Cursor
Signed-off-by: Igal Tsoiref <itsoiref@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@tsorya: This pull request references NVIDIA-596 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set.
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml; Review profile: CHILL; Plan: Enterprise.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: tsorya
/payload 4.22 ci blocking
@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/65638f00-5190-11f1-9814-cded5d8a1eab-0
trigger 13 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/65638f00-5190-11f1-9814-cded5d8a1eab-1
/payload 4.22 ci blocking
@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/daa71920-5263-11f1-9827-2279928092a2-0
trigger 13 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/daa71920-5263-11f1-9827-2279928092a2-1
/retest-required
/hold
Is there any issue open on the Multus side, or on the Cilium side?
I can see cilium/cilium#30363 about bumping cniVersion for Cilium.
/hold
This works for me: openshift/release#79593 overrides the CNI version in the Cilium config.
This is a manual cherry-pick of #2941 to release-4.22.
Changes
Cherry-picks both commits from #2941:
- NVIDIA-596: Enable DPU healthcheck: Adds configurable DPU node lease health monitoring (`dpu-node-lease-renew-interval`, `dpu-node-lease-duration`) read from the `hardware-offload-config` ConfigMap. Env vars `OVNKUBE_NODE_LEASE_RENEW_INTERVAL` / `OVNKUBE_NODE_LEASE_DURATION` are injected into ovnkube-node for dpu-host/dpu modes. Script-lib translates them into `--dpu-node-lease-renew-interval` / `--dpu-node-lease-duration` CLI flags. Setting renew-interval to 0 disables the health check.
- NVIDIA-616: Bump Multus CNI API version to 1.1.0: Required for CNI STATUS and GC verbs used by the DPU health check. Backward compatible with 0.3.1.
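For readers unfamiliar with the script-lib pattern, the env-var-to-flag translation described above could look roughly like the sketch below. This is not the code from #2941: the function name `build_dpu_lease_flags`, its node-mode argument, and the `--flag=value` spelling are assumptions made for illustration, and the values are passed through untouched because the description does not say whether the flags take bare seconds or a duration string.

```shell
# Hypothetical helper, not the actual script-lib code from this PR.
# Builds the DPU lease CLI flags from the env vars named in the description.
build_dpu_lease_flags() {
  local mode="${1:-}"   # node mode, e.g. "dpu-host" or "dpu" (assumed argument)
  local flags=""

  # Only dpu-host/dpu node modes receive the lease flags.
  case "${mode}" in
    dpu-host|dpu) ;;
    *) echo ""; return 0 ;;
  esac

  # Each flag is emitted only when its env var is set and non-empty.
  if [ -n "${OVNKUBE_NODE_LEASE_RENEW_INTERVAL:-}" ]; then
    flags="--dpu-node-lease-renew-interval=${OVNKUBE_NODE_LEASE_RENEW_INTERVAL}"
  fi
  if [ -n "${OVNKUBE_NODE_LEASE_DURATION:-}" ]; then
    # ${flags:+...} adds a separating space only if a flag is already present.
    flags="${flags:+${flags} }--dpu-node-lease-duration=${OVNKUBE_NODE_LEASE_DURATION}"
  fi

  echo "${flags}"
}
```

With `OVNKUBE_NODE_LEASE_RENEW_INTERVAL=10` and `OVNKUBE_NODE_LEASE_DURATION=40` exported, `build_dpu_lease_flags dpu-host` prints both flags on one line, while any other mode prints an empty string so the caller can append the result unconditionally.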
Conflict Resolution
One conflict in `bindata/network/ovn-kubernetes/common/008-script-lib.yaml` at the end of the `start-ovnkube-node()` function: release-4.22 still has `${egress_features_enable_flag}` and `${multi_external_gateway_enable_flag}` (which were removed on master by #2944). Resolution keeps these existing lines and appends the new `${dpu_lease_flags}`.

This is compatible with #2997 (cherry-pick of #2944), which is pending on release-4.22: when #2997 merges, it will cleanly remove the egress/gateway lines while `${dpu_lease_flags}` remains.
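The shape of the resolved tail can be sketched as follows. `print_node_args` is a hypothetical stand-in for the real command line at the end of `start-ovnkube-node()`, which passes many more flags than shown here. Only the three variable names come from the PR, and the flag values used with it are placeholders.

```shell
# Illustrative stand-in for the tail of the ovnkube-node command line.
# Prints one argument per line instead of executing anything.
print_node_args() {
  # Expansions are intentionally unquoted: an empty or unset variable
  # contributes no argument, and a variable holding several flags is
  # split into separate words.
  # shellcheck disable=SC2086
  printf '%s\n' \
    ${egress_features_enable_flag:-} \
    ${multi_external_gateway_enable_flag:-} \
    ${dpu_lease_flags:-}
}
```

Because each variable sits on its own continuation line, dropping the first two lines later (as #2997 is expected to do) leaves the `${dpu_lease_flags}` line untouched. Leaving the two egress/gateway variables unset in this sketch approximates that state: only the DPU lease flags are printed.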