OCPBUGS-56274: add datacenter consistency check#212
Conversation
|
@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: RomanBednar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/jira refresh |
|
@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (wduan@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
c4dfdec to
5039e0d
Compare
|
/assign @gnufied For review. |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
WalkthroughAdds a new cluster-level check, CheckDatacenterConsistency, that fetches the Infrastructure CR and validates each vSphere failure domain's referenced datacenter against configured datacenters in the cloud provider config and the Infrastructure vCenters list, reporting mismatches as accumulated errors. Changes
Sequence Diagram(s)sequenceDiagram
participant Runner as Runner
participant Check as CheckDatacenterConsistency
participant Kube as KubeClient
participant Config as CloudConfig
participant Infra as Infrastructure
participant Logger as Logger/ErrorAccum
Runner->>Check: invoke CheckDatacenterConsistency(ctx)
Check->>Kube: GetInfrastructure(ctx)
Kube-->>Check: Infrastructure (or error)
alt fetch error
Check->>Logger: log error
Check-->>Runner: return error
else infra fetched
Check->>Infra: read PlatformSpec.VSphere / FailureDomains
alt no vSphere or no FailureDomains
Check->>Logger: debug skip
Check-->>Runner: return nil
else have failure domains
loop for each FailureDomain
Check->>Config: lookup cfg.VirtualCenter[fd.Server]
alt config entry missing
Check->>Logger: debug skip for that fd
else entry present
Check->>Check: parseDatacenters(entry.Datacenters)
Check->>Check: compare parsed list with fd.Topology.Datacenter
alt mismatch
Check->>Logger: warn + append error
end
end
end
loop second pass for each FailureDomain
Check->>Infra: lookup matched vCenter in infra.Spec.PlatformSpec.VSphere.VCenters
Check->>Check: compare infra vcenter.datacenters with fd.Topology.Datacenter
alt mismatch
Check->>Logger: warn + append error
end
end
Check-->>Runner: return joined errors if any, else nil
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (wduan@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/check/datacenter_consistency_test.go`:
- Around line 103-104: Save the original value of the global timeout
(util.Timeout) before mutating it in the subtest, then restore it after the
subtest finishes (e.g., store orig := *util.Timeout and use defer to set
*util.Timeout = orig) so changes around setting *util.Timeout = time.Second in
the test that calls CheckDatacenterConsistency(ctx) do not leak to other tests;
apply this restore pattern around where you modify util.Timeout in
pkg/check/datacenter_consistency_test.go.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: cc2fd6fa-3677-4457-8bdf-7479823f7f60
📒 Files selected for processing (3)
pkg/check/datacenter_consistency.gopkg/check/datacenter_consistency_test.gopkg/check/interface.go
5039e0d to
9b88326
Compare
|
@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/check/datacenter_consistency.go`:
- Around line 57-62: The error constructed into err via fmt.Errorf in
datacenter_consistency.go ends the message with a trailing period which violates
Go's ST1005 rule; update the format string in that fmt.Errorf (the one
referencing fd.Name, fd.Topology.Datacenter, fd.Server and vc.Datacenters) to
remove the final punctuation so the error string does not end with a period
(ensure the rest of the message and argument ordering remain unchanged).
- Around line 41-47: The fmt.Errorf call that builds the error (assigned to err)
in datacenter_consistency.go ends the formatted message with a trailing period
("%s namespace."); remove the final punctuation so the error string does not end
with a period — update the fmt.Errorf format string (the call that references
fd.Name, fd.Topology.Datacenter, fd.Server, vcConfig.Datacenters,
fd.Topology.Datacenter, util.CloudConfigNamespace) to omit the trailing "." at
the end of the message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 40e991a5-bede-4476-934d-bbc167071ca2
📒 Files selected for processing (3)
pkg/check/datacenter_consistency.gopkg/check/datacenter_consistency_test.gopkg/check/interface.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/check/interface.go
|
/hold Need to get an env and test this first. |
|
@RomanBednar is this PR fixed now? |
|
/retest |
|
@gnufied The infa check was not included in the original spec, I had to reconfigure the zonal cluster, add a test case for it and retest everything; it is ready for review now. Here's the test for missing DC in infra object: # 1. Backup Infrastructure CR
oc get infrastructure cluster -o yaml > /tmp/infrastructure-backup.yaml
# 2. Remove nested-devqedatacenter-2 from vcenters
oc patch infrastructure cluster --type=json \
-p='[{"op":"replace","path":"/spec/platformSpec/vsphere/vcenters/0/datacenters","value":["nested-devqedatacenter-1"]}]'
# 3. Verify
oc get infrastructure cluster -o jsonpath='{.spec.platformSpec.vsphere.vcenters[0].datacenters}'
# Output: ["nested-devqedatacenter-1"]
# 4-5. Restart VPD and check logs
oc -n openshift-cluster-storage-operator delete pod -l name=vsphere-problem-detector-operator
oc -n openshift-cluster-storage-operator wait --for=condition=Ready pod \
-l name=vsphere-problem-detector-operator --timeout=120s
# (waited 45s)
oc -n openshift-cluster-storage-operator logs deployment/vsphere-problem-detector-operator \
| grep -i "datacenter.consistency\|CheckDatacenterConsistency"Log output: |
|
/jira refresh |
|
@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/unhold |
|
@RomanBednar: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/jira refresh |
|
@RomanBednar: This pull request references Jira Issue OCPBUGS-56274, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (wduan@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
When using zonal deployments of vSphere with OpenShift, if a datacenter referenced by a failure domain in the Infrastructure CR (
infrastructure.config.openshift.io/cluster) is missing from the cloud provider config (cloud-provider-configConfigMap inopenshift-config), the CSI driver silently fails to find VMs in that zone, causing the cluster to degrade. The vSphere Problem Detector (VPD) had no check to detect this misconfiguration. This fix adds a new cluster-level check,CheckDatacenterConsistency, that compares each failure domain's required datacenter against the datacenters listed in the parsedcloud.conf(ctx.VMConfig.Config.VirtualCenter[server].Datacenters). When a datacenter is absent, VPD emits a WARNING naming the missing datacenter, the affected failure domain, and instructs the administrator to update thecloud-provider-configConfigMap in theopenshift-confignamespace.Cluster Setup
Two failure domains configured:
us-east-1→ datacenternested-devqedatacenter-1us-west-1→ datacenternested-devqedatacenter-2Both on vCenter
232-15-184-10.in-addr.arpa.Simulating the Bug
The datacenter
nested-devqedatacenter-2was removed fromcloud-provider-config:Unpatched Behaviour (openshift/main)
Relevant log lines:
No warning or error about the missing datacenter
nested-devqedatacenter-2.Patched Behaviour (OCPBUGS-56274)
Relevant log lines:
WARNING emitted, explicitly naming
nested-devqedatacenter-2as missing, with remediation instructions.Summary by CodeRabbit
New Features
Tests