OCPBUGS-56274: add datacenter consistency check#212

Open
RomanBednar wants to merge 1 commit into openshift:main from RomanBednar:OCPBUGS-56274

Conversation

@RomanBednar
Contributor

@RomanBednar RomanBednar commented Feb 19, 2026

When using zonal deployments of vSphere with OpenShift, if a datacenter referenced by a failure domain in the Infrastructure CR (infrastructure.config.openshift.io/cluster) is missing from the cloud provider config (cloud-provider-config ConfigMap in openshift-config), the CSI driver silently fails to find VMs in that zone, causing the cluster to degrade. The vSphere Problem Detector (VPD) had no check to detect this misconfiguration. This fix adds a new cluster-level check, CheckDatacenterConsistency, that compares each failure domain's required datacenter against the datacenters listed in the parsed cloud.conf (ctx.VMConfig.Config.VirtualCenter[server].Datacenters). When a datacenter is absent, VPD emits a WARNING that names the missing datacenter and the affected failure domain, and instructs the administrator to update the cloud-provider-config ConfigMap in the openshift-config namespace.

Cluster Setup

Two failure domains configured:

  • us-east-1 → datacenter nested-devqedatacenter-1
  • us-west-1 → datacenter nested-devqedatacenter-2

Both on vCenter 232-15-184-10.in-addr.arpa.

Simulating the Bug

The datacenter nested-devqedatacenter-2 was removed from cloud-provider-config:

# Edit cloud-provider-config to remove nested-devqedatacenter-2
oc -n openshift-config edit configmap cloud-provider-config
# Changed: datacenters = nested-devqedatacenter-1,nested-devqedatacenter-2
# To:      datacenters = nested-devqedatacenter-1

# Verified propagation to vsphere-csi-config-secret:
oc -n openshift-cluster-csi-drivers get secret/vsphere-csi-config-secret \
  -o jsonpath='{.data.cloud\.conf}' | base64 -d
# Output confirmed: datacenters = nested-devqedatacenter-1

Unpatched Behaviour (openshift/main)

export KUBECONFIG=/Users/MAC/openshift/clusters/vsphere/cluster-01/auth/kubeconfig
git checkout openshift/main && make
./vsphere-problem-detector start -v 5 \
  --kubeconfig=$KUBECONFIG \
  --namespace=openshift-cluster-storage-operator

Relevant log lines:

I0219 16:17:18.909862   17481 infra_config.go:15] Checking infrastructure and cloud provider config for consistency.
I0219 16:17:18.909897   17481 vsphere_check.go:302] CheckInfraConfig passed
I0219 16:17:24.169406   17481 vsphere_check.go:109] Finished running all vSphere specific checks in the cluster
I0219 16:17:24.307163   17481 event.go:377] ... type: 'Normal' reason: 'SucceededVSphereCheckInfraConfig' Check succeeded

No warning or error about the missing datacenter nested-devqedatacenter-2.

Patched Behaviour (OCPBUGS-56274)

git checkout OCPBUGS-56274 && make
./vsphere-problem-detector start -v 5 \
  --kubeconfig=$KUBECONFIG \
  --namespace=openshift-cluster-storage-operator

Relevant log lines:

I0219 16:23:24.680681   32885 datacenter_consistency.go:16] Checking datacenter consistency between failure domains and cloud provider config.
W0219 16:23:24.680821   32885 datacenter_consistency.go:50] Datacenter-Consistency: failure domain "us-west-1" (infrastructure.config.openshift.io/cluster) requires datacenter "nested-devqedatacenter-2" on vCenter "232-15-184-10.in-addr.arpa", but it is not listed in the cloud provider config (datacenters = "nested-devqedatacenter-1" in vsphere-csi-config-secret, namespace openshift-cluster-csi-drivers). Add "nested-devqedatacenter-2" to the datacenters list in the cloud-provider-config ConfigMap in the openshift-config namespace.
I0219 16:23:24.680835   32885 vsphere_check.go:299] CheckDatacenterConsistency failed: Datacenter-Consistency: failure domain "us-west-1" ...
I0219 16:23:30.292865   32885 event.go:377] ... type: 'Warning' reason: 'FailedVSphereCheckDatacenterConsistency' Datacenter-Consistency: failure domain "us-west-1" (infrastructure.config.openshift.io/cluster) requires datacenter "nested-devqedatacenter-2" on vCenter "232-15-184-10.in-addr.arpa" ...

WARNING emitted, explicitly naming nested-devqedatacenter-2 as missing, with remediation instructions.

Summary by CodeRabbit

  • New Features

    • Added a validation that ensures vSphere failure domains reference datacenters present in the configured vCenter settings; emits warnings and returns an error when mismatches are found.
  • Tests

    • Added comprehensive tests covering legacy behavior, single/multi-vCenter setups, missing datacenters, and datacenter-list parsing.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Feb 19, 2026
@openshift-ci-robot
Contributor

@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci Bot requested review from dfajmon and mpatlasov February 19, 2026 15:41
@openshift-ci
Contributor

openshift-ci Bot commented Feb 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RomanBednar

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 19, 2026
@RomanBednar
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Feb 19, 2026
@openshift-ci-robot
Contributor

@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (wduan@redhat.com), skipping review request.

@RomanBednar
Contributor Author

/assign @gnufied

For review.

@RomanBednar
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 14, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Apr 14, 2026

Walkthrough

Adds a new cluster-level check, CheckDatacenterConsistency, that fetches the Infrastructure CR and validates each vSphere failure domain's referenced datacenter against configured datacenters in the cloud provider config and the Infrastructure vCenters list, reporting mismatches as accumulated errors.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Datacenter consistency check: pkg/check/datacenter_consistency.go | New exported function CheckDatacenterConsistency(ctx *CheckContext) error that: fetches Infrastructure; skips legacy/no-failure-domain cases; for each vSphere failure domain looks up the vCenter entry in the cloud config, parses its Datacenters string (parseDatacenters), compares required datacenter presence, logs warnings and accumulates errors; performs a second pass comparing against infra.Spec.PlatformSpec.VSphere.VCenters. |
| Tests: pkg/check/datacenter_consistency_test.go | New table-driven tests for CheckDatacenterConsistency covering legacy/no-FD, successful validations (single/multi vCenter, ini/yaml variants), missing datacenters, unknown vCenter entries, and unit tests for parseDatacenters parsing/trimming behavior. |
| Check registration: pkg/check/interface.go | Added "CheckDatacenterConsistency": CheckDatacenterConsistency to DefaultClusterChecks map (formatting/alignment adjusted). |

Sequence Diagram(s)

sequenceDiagram
  participant Runner as Runner
  participant Check as CheckDatacenterConsistency
  participant Kube as KubeClient
  participant Config as CloudConfig
  participant Infra as Infrastructure
  participant Logger as Logger/ErrorAccum

  Runner->>Check: invoke CheckDatacenterConsistency(ctx)
  Check->>Kube: GetInfrastructure(ctx)
  Kube-->>Check: Infrastructure (or error)
  alt fetch error
    Check->>Logger: log error
    Check-->>Runner: return error
  else infra fetched
    Check->>Infra: read PlatformSpec.VSphere / FailureDomains
    alt no vSphere or no FailureDomains
      Check->>Logger: debug skip
      Check-->>Runner: return nil
    else have failure domains
      loop for each FailureDomain
        Check->>Config: lookup cfg.VirtualCenter[fd.Server]
        alt config entry missing
          Check->>Logger: debug skip for that fd
        else entry present
          Check->>Check: parseDatacenters(entry.Datacenters)
          Check->>Check: compare parsed list with fd.Topology.Datacenter
          alt mismatch
            Check->>Logger: warn + append error
          end
        end
      end
      loop second pass for each FailureDomain
        Check->>Infra: lookup matched vCenter in infra.Spec.PlatformSpec.VSphere.VCenters
        Check->>Check: compare infra vcenter.datacenters with fd.Topology.Datacenter
        alt mismatch
          Check->>Logger: warn + append error
        end
      end
      Check-->>Runner: return joined errors if any, else nil
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding a new datacenter consistency check to the vSphere Problem Detector. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | The pull request uses standard Go testing functions without Ginkgo, making the Ginkgo test name stability check not applicable. |
| Test Structure And Quality | ✅ Passed | Test code demonstrates strong structure following Go patterns with table-driven subtests, proper setup/cleanup, appropriate timeouts, and consistency with codebase patterns. |
| Microshift Test Compatibility | ✅ Passed | PR adds standard Go unit tests using testing.T framework, not Ginkgo e2e tests, so MicroShift Test Compatibility check does not apply. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR adds standard Go unit tests with testing.T package, not Ginkgo e2e tests targeted by this check. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | The PR adds only a diagnostic validation function that checks infrastructure configuration consistency; it does not introduce deployment manifests, workloads, or scheduling constraints. |
| Ote Binary Stdout Contract | ✅ Passed | Code analysis reveals no process-level stdout writes in new/modified files violating OTE Binary Stdout Contract. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | Tests use standard Go unit testing with mocking, no IPv4 hardcodes, no external connectivity requirements. |

@openshift-ci-robot
Contributor

@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (wduan@redhat.com), skipping review request.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/check/datacenter_consistency_test.go`:
- Around line 103-104: Save the original value of the global timeout
(util.Timeout) before mutating it in the subtest, then restore it after the
subtest finishes (e.g., store orig := *util.Timeout and use defer to set
*util.Timeout = orig) so changes around setting *util.Timeout = time.Second in
the test that calls CheckDatacenterConsistency(ctx) do not leak to other tests;
apply this restore pattern around where you modify util.Timeout in
pkg/check/datacenter_consistency_test.go.

📥 Commits

Reviewing files that changed from the base of the PR and between 36a0ee6 and 5039e0d.

📒 Files selected for processing (3)
  • pkg/check/datacenter_consistency.go
  • pkg/check/datacenter_consistency_test.go
  • pkg/check/interface.go

@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Apr 21, 2026
@openshift-ci-robot
Contributor

@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/check/datacenter_consistency.go`:
- Around line 57-62: The error constructed into err via fmt.Errorf in
datacenter_consistency.go ends the message with a trailing period which violates
Go's ST1005 rule; update the format string in that fmt.Errorf (the one
referencing fd.Name, fd.Topology.Datacenter, fd.Server and vc.Datacenters) to
remove the final punctuation so the error string does not end with a period
(ensure the rest of the message and argument ordering remain unchanged).
- Around line 41-47: The fmt.Errorf call that builds the error (assigned to err)
in datacenter_consistency.go ends the formatted message with a trailing period
("%s namespace."); remove the final punctuation so the error string does not end
with a period — update the fmt.Errorf format string (the call that references
fd.Name, fd.Topology.Datacenter, fd.Server, vcConfig.Datacenters,
fd.Topology.Datacenter, util.CloudConfigNamespace) to omit the trailing "." at
the end of the message.

📥 Commits

Reviewing files that changed from the base of the PR and between 5039e0d and 9b88326.

📒 Files selected for processing (3)
  • pkg/check/datacenter_consistency.go
  • pkg/check/datacenter_consistency_test.go
  • pkg/check/interface.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/check/interface.go

@RomanBednar
Contributor Author

/hold

Need to get an env and test this first.

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 21, 2026
@gnufied
Member

gnufied commented Apr 29, 2026

@RomanBednar is this PR fixed now?

@RomanBednar
Contributor Author

/retest

@RomanBednar
Contributor Author

@gnufied The infra check was not included in the original spec, so I had to reconfigure the zonal cluster, add a test case for it, and retest everything. It is ready for review now. Here's the test for a missing datacenter in the Infrastructure object:

# 1. Backup Infrastructure CR
oc get infrastructure cluster -o yaml > /tmp/infrastructure-backup.yaml

# 2. Remove nested-devqedatacenter-2 from vcenters
oc patch infrastructure cluster --type=json \
  -p='[{"op":"replace","path":"/spec/platformSpec/vsphere/vcenters/0/datacenters","value":["nested-devqedatacenter-1"]}]'

# 3. Verify
oc get infrastructure cluster -o jsonpath='{.spec.platformSpec.vsphere.vcenters[0].datacenters}'
# Output: ["nested-devqedatacenter-1"]

# 4-5. Restart VPD and check logs
oc -n openshift-cluster-storage-operator delete pod -l name=vsphere-problem-detector-operator
oc -n openshift-cluster-storage-operator wait --for=condition=Ready pod \
  -l name=vsphere-problem-detector-operator --timeout=120s
# (waited 45s)
oc -n openshift-cluster-storage-operator logs deployment/vsphere-problem-detector-operator \
  | grep -i "datacenter.consistency\|CheckDatacenterConsistency"

Log output:

I0513 11:16:04.727922  1 datacenter_consistency.go:16] Checking datacenter consistency between failure domains and cloud provider config.
W0513 11:16:04.727951  1 datacenter_consistency.go:64] datacenter-Consistency: failure domain "us-west-1" references datacenter "nested-devqedatacenter-2" on vCenter "197-15-184-10.in-addr.arpa", but it is not listed in the vcenters section of infrastructure.config.openshift.io/cluster (datacenters = [nested-devqedatacenter-1]), add "nested-devqedatacenter-2" to the datacenters list for vCenter "197-15-184-10.in-addr.arpa" in the Infrastructure CR
I0513 11:16:04.728127  1 vsphere_check.go:299] CheckDatacenterConsistency failed: ...
I0513 11:16:05.142746  1 event.go:377] ... type: 'Warning' reason: 'FailedVSphereCheckDatacenterConsistency' ...

@RomanBednar
Contributor Author

/jira refresh

@openshift-ci-robot
Contributor

@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@RomanBednar
Contributor Author

/unhold

openshift-ci Bot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on May 13, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 13, 2026

@RomanBednar: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@RomanBednar
Contributor Author

/jira refresh

openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on May 13, 2026
@openshift-ci-robot
Contributor

@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (wduan@redhat.com), skipping review request.


@RomanBednar
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
