fix(check): mirror GetVCenter FD-selection in CheckNode readOnly gate (OCPBUGS-63365)#1
Conversation
ca2895f to
4d43673
Compare
|
@coderabbitai review |
1 similar comment
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
| break | ||
| } | ||
| } | ||
| } else if len(ctx.PlatformSpec.FailureDomains) > 0 { |
There was a problem hiding this comment.
If the length of failure domain is > 1 and the labels are either blank OR not found in our existing failure domain list, this would signal an invalid / mis-labeled node. We should log error. For defaulting to the first (index 0) failure domain, we should only do this if length of failure domains is 1. When length is one, we do not require vcenter to have labels placed correctly so that nodes are labeled as we expect in lines 75-82.
There was a problem hiding this comment.
👋 Ambient, i agree with @vr4manta, please update the code
There was a problem hiding this comment.
Done — updated in the latest push. The else if fallback to FDs[0] is now guarded to len(FDs) == 1 only. For multi-FD clusters, missing labels and unmatched labels each log a klog.Warningf and leave readOnly=false (full permission check runs). All 5 tests still pass.
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe node permission check's Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
4d43673 to
2e7b81a
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
pkg/check/node_cluster_permissions.go (1)
85-94:⚠️ Potential issue | 🟠 MajorRestrict first-FD fallback to single-FD clusters only.
At Line 92,
matchingRPfalls back to index0for any non-empty FD list. With multiple FDs and missing labels, this can incorrectly markreadOnly=trueand skip cluster-permission checks for an invalid/mislabeled node.Proposed fix
- var matchingRP string - if len(region) > 0 && len(zone) > 0 && len(ctx.PlatformSpec.FailureDomains) > 1 { - for _, fd := range ctx.PlatformSpec.FailureDomains { - if fd.Region == region && fd.Zone == zone { - matchingRP = fd.Topology.ResourcePool - break - } - } - } else if len(ctx.PlatformSpec.FailureDomains) > 0 { - matchingRP = ctx.PlatformSpec.FailureDomains[0].Topology.ResourcePool - } + var matchingRP string + if len(ctx.PlatformSpec.FailureDomains) > 1 { + if len(region) > 0 && len(zone) > 0 { + for _, fd := range ctx.PlatformSpec.FailureDomains { + if fd.Region == region && fd.Zone == zone { + matchingRP = fd.Topology.ResourcePool + break + } + } + } + } else if len(ctx.PlatformSpec.FailureDomains) == 1 { + matchingRP = ctx.PlatformSpec.FailureDomains[0].Topology.ResourcePool + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/check/node_cluster_permissions.go` around lines 85 - 94, The code currently falls back to ctx.PlatformSpec.FailureDomains[0].Topology.ResourcePool whenever FailureDomains is non-empty, which can misassign matchingRP for multi-FD clusters; change the fallback so matchingRP is set from the first FD only when len(ctx.PlatformSpec.FailureDomains) == 1 (i.e., restrict the else branch to a single-FD cluster), leaving the multi-FD path to require explicit region+zone matching (references: matchingRP, region, zone, ctx.PlatformSpec.FailureDomains).
🧹 Nitpick comments (1)
pkg/check/node_cluster_permissions_test.go (1)
138-261: Add a multi-FD + missing-labels negative test case.Current additions validate beta-label matching and single-FD fallback, but they don’t lock in the inverse behavior for multi-FD unlabeled nodes. Add a test where
len(FailureDomains) > 1and node labels are absent/unmatched, then assertCheckNodereturns the missing-cluster-permissions error (i.e., no implicit first-FD fallback).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/check/node_cluster_permissions_test.go` around lines 138 - 261, Add a negative test that ensures multi-FD clusters do NOT fallback to the first FD when node topology labels are absent: create a new test (e.g., TestCheckComputeClusterPermissions_MultiFD_NoLabels_RequiresClusterPermissions) that mirrors the existing tests but uses len(inf.Spec.PlatformSpec.VSphere.FailureDomains) > 1, leaves the node without topology labels, gives one FD a custom ResourcePool, calls setMissingClusterPermissionsAuthManager(ctx) to simulate missing cluster perms, obtains vCenter/VM via GetVCenter and testlib.GetVM, then calls check.CheckNode(ctx, node, vm) and assert that it returns an error (use assert.Error and/or assert.Contains on the error message such as "missing privileges for compute cluster") to lock in the expected negative behavior when no labels exist in a multi-FD setup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/check/node_cluster_permissions.go`:
- Around line 85-94: The code currently falls back to
ctx.PlatformSpec.FailureDomains[0].Topology.ResourcePool whenever FailureDomains
is non-empty, which can misassign matchingRP for multi-FD clusters; change the
fallback so matchingRP is set from the first FD only when
len(ctx.PlatformSpec.FailureDomains) == 1 (i.e., restrict the else branch to a
single-FD cluster), leaving the multi-FD path to require explicit region+zone
matching (references: matchingRP, region, zone,
ctx.PlatformSpec.FailureDomains).
---
Nitpick comments:
In `@pkg/check/node_cluster_permissions_test.go`:
- Around line 138-261: Add a negative test that ensures multi-FD clusters do NOT
fallback to the first FD when node topology labels are absent: create a new test
(e.g.,
TestCheckComputeClusterPermissions_MultiFD_NoLabels_RequiresClusterPermissions)
that mirrors the existing tests but uses
len(inf.Spec.PlatformSpec.VSphere.FailureDomains) > 1, leaves the node without
topology labels, gives one FD a custom ResourcePool, calls
setMissingClusterPermissionsAuthManager(ctx) to simulate missing cluster perms,
obtains vCenter/VM via GetVCenter and testlib.GetVM, then calls
check.CheckNode(ctx, node, vm) and assert that it returns an error (use
assert.Error and/or assert.Contains on the error message such as "missing
privileges for compute cluster") to lock in the expected negative behavior when
no labels exist in a multi-FD setup.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 909517a3-6858-432e-b7b4-eac35d7d10c1
📒 Files selected for processing (2)
pkg/check/node_cluster_permissions.gopkg/check/node_cluster_permissions_test.go
👋 Ambient address coderabbit feedback |
CheckComputeClusterPermissions.CheckNode() uses a readOnly flag to skip the full cluster-permission check when a custom resource pool is configured. The readOnly determination had two gaps versus the authoritative GetVCenter() function in pkg/check/util.go: 1. No beta-label fallback: CheckNode read only GA topology labels (topology.kubernetes.io/region/zone) but not the deprecated beta labels (failure-domain.beta.kubernetes.io/region/zone). Nodes still using beta labels never matched any failure domain, leaving readOnly=false. 2. No single-FD fallback: GetVCenter (OCPBUGS-59319) uses the first failure domain unconditionally when len(FDs)==1. CheckNode did not apply this rule, so newly-provisioned workers without topology labels always triggered a full cluster permission check. Both gaps produce false-positive VSphereOpenshiftNodeHealthFail alerts on clusters using a custom resource pool. Fix: rewrite the readOnly determination to mirror GetVCenter's FD-selection: 1. Read GA labels; fall back to beta labels if absent. 2. When len(FDs) > 1: match FD by region+zone. 3. Otherwise: use the first FD unconditionally (same as GetVCenter). Two new regression tests are added; both fail without the fix and pass after. Fixes: OCPBUGS-63365 Related: OCPBUGS-59319 Co-Authored-By: rvanderp3 <rvanderp3@users.noreply.github.com>
2e7b81a to
0db3b79
Compare
|
Addressed both items:
|
OCPBUGS-63365: Fix false-positive CheckComputeClusterPermissions alerts with custom ResourcePool
Problem
Clusters using a custom vSphere resource pool (non-default, i.e. the pool path
does not end with
/Resources) see intermittentVSphereOpenshiftNodeHealthFailalerts claiming:
The alerts fire even when the vSphere user has correct permissions. They clear
once topology labels propagate to the node.
Root Cause
CheckComputeClusterPermissions.CheckNode()uses areadOnlyflag to skipthe full cluster-permission check when a custom resource pool is configured
(permissions are on the RP, not the cluster root).
The
readOnlydetermination had two gaps relative to the authoritativeGetVCenter()function (pkg/check/util.go:313–326):No beta-label fallback — only GA labels (
topology.kubernetes.io/region/zone)were read. Nodes with deprecated beta labels (
failure-domain.beta.kubernetes.io/region/zone)never matched any failure domain, leaving
readOnly=false.No single-FD fallback —
GetVCenter()(OCPBUGS-59319) treats a clusterwith exactly one failure domain as "topology-unaware" and selects the first FD
regardless of node labels.
CheckNode()did not apply the same rule, sonodes without labels (e.g., newly-provisioned workers before the node controller
has set topology labels) always triggered the full permission check.
Fix
Rewrote the
readOnlydetermination inCheckNode()to mirrorGetVCenter()'sfailure-domain selection:
len(FDs) > 1: match the FD by region + zone.GetVCenter()).Testing
Two new regression tests were added — both fail without the fix and pass after:
TestCheckComputeClusterPermissions_BetaLabels_CustomRP_ReadOnlyTestCheckComputeClusterPermissions_SingleFD_NoLabels_CustomRPAll 5
TestCheckComputeClusterPermissions_*tests pass. Fullpkg/checksuite passes.
Files Changed
pkg/check/node_cluster_permissions.goreadOnlyblock (~20 lines)pkg/check/node_cluster_permissions_test.goRelated
GetVCenter()that this fix mirrorsreadOnlyintroduction for legacy INI configSummary by CodeRabbit
Bug Fixes
Tests