
fix(check): mirror GetVCenter FD-selection in CheckNode readOnly gate (OCPBUGS-63365) #1

Draft

ambient-code[bot] wants to merge 1 commit into master from bugfix/OCPBUGS-63365-cluster-permission-readOnly
Conversation

@ambient-code ambient-code Bot commented Apr 24, 2026

OCPBUGS-63365: Fix false-positive CheckComputeClusterPermissions alerts with custom ResourcePool

Problem

Clusters using a custom vSphere resource pool (non-default, i.e. the pool path
does not end with /Resources) see intermittent VSphereOpenshiftNodeHealthFail
alerts claiming:

missing privileges for compute cluster DC0_C0: VirtualMachine.Config.AddNewDisk

The alerts fire even when the vSphere user has correct permissions. They clear
once topology labels propagate to the node.

Root Cause

CheckComputeClusterPermissions.CheckNode() uses a readOnly flag to skip
the full cluster-permission check when a custom resource pool is configured
(permissions are on the RP, not the cluster root).

The readOnly determination had two gaps relative to the authoritative
GetVCenter() function (pkg/check/util.go:313–326):

  1. No beta-label fallback — only GA labels (topology.kubernetes.io/region/zone)
    were read. Nodes with deprecated beta labels (failure-domain.beta.kubernetes.io/region/zone)
    never matched any failure domain, leaving readOnly=false.

  2. No single-FD fallback — GetVCenter() (OCPBUGS-59319) treats a cluster
    with exactly one failure domain as "topology-unaware" and selects the first FD
    regardless of node labels. CheckNode() did not apply the same rule, so
    nodes without labels (e.g., newly-provisioned workers before the node controller
    has set topology labels) always triggered the full permission check.

Fix

Rewrote the readOnly determination in CheckNode() to mirror GetVCenter()'s
failure-domain selection:

  1. Read GA topology labels; fall back to beta labels if absent.
  2. When len(FDs) > 1: match the FD by region + zone.
  3. Otherwise: use the first FD unconditionally (same rule as GetVCenter()).

```diff
-    region := node.Labels["topology.kubernetes.io/region"]
-    zone := node.Labels["topology.kubernetes.io/zone"]
-    for _, fd := range ctx.PlatformSpec.FailureDomains {
-        rp := fd.Topology.ResourcePool
-        if fd.Region == region && fd.Zone == zone && rp != "" && !strings.HasSuffix(rp, "/Resources") {
-            readOnly = true
-        }
-    }
+    region := node.Labels[v1.LabelTopologyRegion]
+    if len(region) == 0 {
+        region = node.Labels[v1.LabelFailureDomainBetaRegion]
+    }
+    zone := node.Labels[v1.LabelTopologyZone]
+    if len(zone) == 0 {
+        zone = node.Labels[v1.LabelFailureDomainBetaZone]
+    }
+    var matchingRP string
+    if len(region) > 0 && len(zone) > 0 && len(ctx.PlatformSpec.FailureDomains) > 1 {
+        for _, fd := range ctx.PlatformSpec.FailureDomains {
+            if fd.Region == region && fd.Zone == zone {
+                matchingRP = fd.Topology.ResourcePool
+                break
+            }
+        }
+    } else if len(ctx.PlatformSpec.FailureDomains) > 0 {
+        matchingRP = ctx.PlatformSpec.FailureDomains[0].Topology.ResourcePool
+    }
+    if matchingRP != "" && !strings.HasSuffix(matchingRP, "/Resources") {
+        readOnly = true
+    }
```
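For illustration, the selection logic can be exercised in isolation. This is a simplified, self-contained sketch: the `FailureDomain`/`Topology` types and the label-key constants below are stand-ins for the real openshift/api and k8s.io/api definitions, not the actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for the real platform-spec types (names are illustrative only).
type Topology struct{ ResourcePool string }
type FailureDomain struct {
	Region, Zone string
	Topology     Topology
}

// Well-known Kubernetes topology label keys: GA form and deprecated beta form.
const (
	labelTopologyRegion          = "topology.kubernetes.io/region"
	labelTopologyZone            = "topology.kubernetes.io/zone"
	labelFailureDomainBetaRegion = "failure-domain.beta.kubernetes.io/region"
	labelFailureDomainBetaZone   = "failure-domain.beta.kubernetes.io/zone"
)

// isReadOnly mirrors the FD-selection in the diff above: read GA labels with
// beta fallback, match region+zone when there are multiple FDs, otherwise take
// the first FD; readOnly only when the selected pool is custom (non-default).
func isReadOnly(labels map[string]string, fds []FailureDomain) bool {
	region := labels[labelTopologyRegion]
	if region == "" {
		region = labels[labelFailureDomainBetaRegion]
	}
	zone := labels[labelTopologyZone]
	if zone == "" {
		zone = labels[labelFailureDomainBetaZone]
	}
	var matchingRP string
	if region != "" && zone != "" && len(fds) > 1 {
		for _, fd := range fds {
			if fd.Region == region && fd.Zone == zone {
				matchingRP = fd.Topology.ResourcePool
				break
			}
		}
	} else if len(fds) > 0 {
		matchingRP = fds[0].Topology.ResourcePool
	}
	return matchingRP != "" && !strings.HasSuffix(matchingRP, "/Resources")
}

func main() {
	fds := []FailureDomain{{Region: "r1", Zone: "z1",
		Topology: Topology{ResourcePool: "/DC0/host/DC0_C0/Resources/custom"}}}
	// Unlabeled node, single FD, custom pool: the fallback selects the FD
	// and the check is skipped (readOnly).
	fmt.Println(isReadOnly(nil, fds)) // prints: true
}
```

The same call with a default pool (path ending in `/Resources`) returns false, so the full permission check still runs for default-pool clusters.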

Testing

Two new regression tests were added — both fail without the fix and pass after:

| Test | Gap covered |
| --- | --- |
| `TestCheckComputeClusterPermissions_BetaLabels_CustomRP_ReadOnly` | Beta-label fallback (uses 2 FDs to force the multi-FD path) |
| `TestCheckComputeClusterPermissions_SingleFD_NoLabels_CustomRP` | Single-FD fallback (node with no topology labels) |

All 5 TestCheckComputeClusterPermissions_* tests pass. Full pkg/check
suite passes.

Files Changed

| File | Change |
| --- | --- |
| `pkg/check/node_cluster_permissions.go` | Rewrote readOnly block (~20 lines) |
| `pkg/check/node_cluster_permissions_test.go` | Added 2 regression tests (+125 lines) |

Related

  • OCPBUGS-59319 — introduced the single-FD fallback in GetVCenter() that this fix mirrors
  • OCPBUGS-926 — original readOnly introduction for legacy INI config

Summary by CodeRabbit

Bug Fixes

  • Enhanced cluster permission checks to more accurately determine when a cluster should operate in read-only mode, with improved support for different topology label formats and better fallback handling.

Tests

  • Added new test cases covering node cluster permission scenarios, including cases with custom resource pools and missing or legacy topology labels.

@ambient-code ambient-code Bot force-pushed the bugfix/OCPBUGS-63365-cluster-permission-readOnly branch from ca2895f to 4d43673 Compare April 24, 2026 19:07

ambient-code Bot commented Apr 24, 2026

@coderabbitai review

1 similar comment
@rvanderp3

@coderabbitai review


coderabbitai Bot commented Apr 24, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Comment thread on pkg/check/node_cluster_permissions.go (outdated):

```go
				break
			}
		}
	} else if len(ctx.PlatformSpec.FailureDomains) > 0 {
```

If the failure-domain list has length > 1 and the labels are either blank or not found in our existing failure-domain list, this signals an invalid / mis-labeled node; we should log an error. Defaulting to the first (index 0) failure domain should only happen when the length of the failure-domain list is 1. When the length is one, we do not require vCenter to have labels placed correctly so that nodes are labeled as we expect in lines 75-82.


👋 Ambient, I agree with @vr4manta; please update the code.


Done — updated in the latest push. The else if fallback to FDs[0] is now guarded to len(FDs) == 1 only. For multi-FD clusters, missing labels and unmatched labels each log a klog.Warningf and leave readOnly=false (full permission check runs). All 5 tests still pass.


coderabbitai Bot commented Apr 24, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: d2e5f80b-f0fc-48a7-9281-855ce81f23c3

📝 Walkthrough

The node permission check's readOnly determination was refactored to align with GetVCenter's failure-domain selection logic. GA topology labels now fall back to beta labels, and failure-domain selection uses region+zone matching for multiple domains or defaults to the first domain. Two new test cases verify correct behavior with deprecated labels and single failure-domain scenarios.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **ReadOnly Determination Logic**<br>`pkg/check/node_cluster_permissions.go` | Refactored label reading to prioritize GA topology labels (`LabelTopologyRegion`/`LabelTopologyZone`) with beta label fallback (`LabelFailureDomainBetaRegion`/`LabelFailureDomainBetaZone`). Updated failure-domain selection to match region+zone across multiple domains or fall back to the first domain. readOnly is now set only when the selected ResourcePool is non-empty and excludes the "/Resources" suffix. Comments updated accordingly. |
| **Test Coverage**<br>`pkg/check/node_cluster_permissions_test.go` | Added two node-level permission tests covering edge cases: beta failure-domain label handling with multiple failure domains, and single-failure-domain fallback with missing topology labels. Both tests verify CheckNode treats clusters as read-only when custom ResourcePools are present. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hops through the logic, label by label,
GA first, then beta—fallback's reliable.
Failure domains aligned with GetVCenter's way,
ReadOnly rules keeping permissions at bay!
Tests hopping in to validate the changes today. 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly identifies the main change: aligning the failure-domain selection logic in CheckNode's readOnly gate with GetVCenter's logic. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@ambient-code ambient-code Bot force-pushed the bugfix/OCPBUGS-63365-cluster-permission-readOnly branch from 4d43673 to 2e7b81a Compare April 24, 2026 19:25

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
pkg/check/node_cluster_permissions.go (1)

85-94: ⚠️ Potential issue | 🟠 Major

Restrict first-FD fallback to single-FD clusters only.

At Line 92, matchingRP falls back to index 0 for any non-empty FD list. With multiple FDs and missing labels, this can incorrectly mark readOnly=true and skip cluster-permission checks for an invalid/mislabeled node.

Proposed fix
```diff
-		var matchingRP string
-		if len(region) > 0 && len(zone) > 0 && len(ctx.PlatformSpec.FailureDomains) > 1 {
-			for _, fd := range ctx.PlatformSpec.FailureDomains {
-				if fd.Region == region && fd.Zone == zone {
-					matchingRP = fd.Topology.ResourcePool
-					break
-				}
-			}
-		} else if len(ctx.PlatformSpec.FailureDomains) > 0 {
-			matchingRP = ctx.PlatformSpec.FailureDomains[0].Topology.ResourcePool
-		}
+		var matchingRP string
+		if len(ctx.PlatformSpec.FailureDomains) > 1 {
+			if len(region) > 0 && len(zone) > 0 {
+				for _, fd := range ctx.PlatformSpec.FailureDomains {
+					if fd.Region == region && fd.Zone == zone {
+						matchingRP = fd.Topology.ResourcePool
+						break
+					}
+				}
+			}
+		} else if len(ctx.PlatformSpec.FailureDomains) == 1 {
+			matchingRP = ctx.PlatformSpec.FailureDomains[0].Topology.ResourcePool
+		}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/check/node_cluster_permissions.go` around lines 85 - 94, The code
currently falls back to ctx.PlatformSpec.FailureDomains[0].Topology.ResourcePool
whenever FailureDomains is non-empty, which can misassign matchingRP for
multi-FD clusters; change the fallback so matchingRP is set from the first FD
only when len(ctx.PlatformSpec.FailureDomains) == 1 (i.e., restrict the else
branch to a single-FD cluster), leaving the multi-FD path to require explicit
region+zone matching (references: matchingRP, region, zone,
ctx.PlatformSpec.FailureDomains).
🧹 Nitpick comments (1)
pkg/check/node_cluster_permissions_test.go (1)

138-261: Add a multi-FD + missing-labels negative test case.

Current additions validate beta-label matching and single-FD fallback, but they don’t lock in the inverse behavior for multi-FD unlabeled nodes. Add a test where len(FailureDomains) > 1 and node labels are absent/unmatched, then assert CheckNode returns the missing-cluster-permissions error (i.e., no implicit first-FD fallback).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/check/node_cluster_permissions_test.go` around lines 138 - 261, Add a
negative test that ensures multi-FD clusters do NOT fallback to the first FD
when node topology labels are absent: create a new test (e.g.,
TestCheckComputeClusterPermissions_MultiFD_NoLabels_RequiresClusterPermissions)
that mirrors the existing tests but uses
len(inf.Spec.PlatformSpec.VSphere.FailureDomains) > 1, leaves the node without
topology labels, gives one FD a custom ResourcePool, calls
setMissingClusterPermissionsAuthManager(ctx) to simulate missing cluster perms,
obtains vCenter/VM via GetVCenter and testlib.GetVM, then calls
check.CheckNode(ctx, node, vm) and assert that it returns an error (use
assert.Error and/or assert.Contains on the error message such as "missing
privileges for compute cluster") to lock in the expected negative behavior when
no labels exist in a multi-FD setup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 909517a3-6858-432e-b7b4-eac35d7d10c1

📥 Commits

Reviewing files that changed from the base of the PR and between 842aed3 and 4d43673.

📒 Files selected for processing (2)
  • pkg/check/node_cluster_permissions.go
  • pkg/check/node_cluster_permissions_test.go

@rvanderp3
👋 Ambient, address the CodeRabbit feedback

CheckComputeClusterPermissions.CheckNode() uses a readOnly flag to skip
the full cluster-permission check when a custom resource pool is configured.
The readOnly determination had two gaps versus the authoritative GetVCenter()
function in pkg/check/util.go:

1. No beta-label fallback: CheckNode read only GA topology labels
   (topology.kubernetes.io/region/zone) but not the deprecated beta labels
   (failure-domain.beta.kubernetes.io/region/zone). Nodes still using beta
   labels never matched any failure domain, leaving readOnly=false.

2. No single-FD fallback: GetVCenter (OCPBUGS-59319) uses the first failure
   domain unconditionally when len(FDs)==1. CheckNode did not apply this rule,
   so newly-provisioned workers without topology labels always triggered a full
   cluster permission check.

Both gaps produce false-positive VSphereOpenshiftNodeHealthFail alerts on
clusters using a custom resource pool.

Fix: rewrite the readOnly determination to mirror GetVCenter's FD-selection:
  1. Read GA labels; fall back to beta labels if absent.
  2. When len(FDs) > 1: match FD by region+zone.
  3. Otherwise: use the first FD unconditionally (same as GetVCenter).

Two new regression tests are added; both fail without the fix and pass after.

Fixes: OCPBUGS-63365
Related: OCPBUGS-59319
Co-Authored-By: rvanderp3 <rvanderp3@users.noreply.github.com>
@ambient-code ambient-code Bot force-pushed the bugfix/OCPBUGS-63365-cluster-permission-readOnly branch from 2e7b81a to 0db3b79 Compare April 24, 2026 19:38

ambient-code Bot commented Apr 24, 2026

Addressed both items:

  1. Duplicate (already fixed in previous push) — the else if fallback to FDs[0] is guarded to len(FDs) == 1 only. Multi-FD clusters with missing or unmatched labels log a warning and leave readOnly=false.

  2. Nitpick — added TestCheckComputeClusterPermissions_MultiFD_NoLabels_RequiresClusterPermissions: multi-FD cluster, node has no labels, missing cluster permissions → asserts CheckNode returns an error containing "missing privileges for compute cluster". All 6 TestCheckComputeClusterPermissions_* tests pass.
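The post-review behavior described above can be sketched as a standalone Go program. This is an approximation under stated assumptions: the types and label keys are simplified stand-ins, and the standard `log` package stands in for `klog.Warningf`; it is not the exact merged code.

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

// Stand-in types; the real code uses the openshift/api vSphere platform types.
type Topology struct{ ResourcePool string }
type FailureDomain struct {
	Region, Zone string
	Topology     Topology
}

// isReadOnly reflects the post-review logic: the first-FD fallback applies
// only to single-FD clusters; multi-FD clusters with missing or unmatched
// labels log a warning and run the full permission check (readOnly=false).
func isReadOnly(nodeName string, labels map[string]string, fds []FailureDomain) bool {
	region := labels["topology.kubernetes.io/region"]
	if region == "" {
		region = labels["failure-domain.beta.kubernetes.io/region"]
	}
	zone := labels["topology.kubernetes.io/zone"]
	if zone == "" {
		zone = labels["failure-domain.beta.kubernetes.io/zone"]
	}
	var matchingRP string
	switch {
	case len(fds) > 1:
		if region == "" || zone == "" {
			log.Printf("node %s has no topology labels in a multi-FD cluster; running full check", nodeName)
			return false
		}
		found := false
		for _, fd := range fds {
			if fd.Region == region && fd.Zone == zone {
				matchingRP = fd.Topology.ResourcePool
				found = true
				break
			}
		}
		if !found {
			log.Printf("node %s labels %s/%s match no failure domain; running full check", nodeName, region, zone)
			return false
		}
	case len(fds) == 1:
		// A single-FD cluster is topology-unaware: use the FD unconditionally.
		matchingRP = fds[0].Topology.ResourcePool
	}
	return matchingRP != "" && !strings.HasSuffix(matchingRP, "/Resources")
}

func main() {
	multi := []FailureDomain{
		{Region: "r1", Zone: "z1", Topology: Topology{ResourcePool: "/DC0/host/DC0_C0/Resources/custom"}},
		{Region: "r2", Zone: "z2", Topology: Topology{ResourcePool: "/DC0/host/DC0_C1/Resources/custom"}},
	}
	// Multi-FD cluster, unlabeled node: no first-FD fallback, full check runs.
	fmt.Println(isReadOnly("worker-0", nil, multi)) // prints: false
}
```

This is exactly the negative case the new multi-FD regression test locks in: an unlabeled node in a multi-FD cluster must not be treated as read-only.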
