OCPBUGS-62517: Scale to replicas=2 and enable PDB on HighlyAvailable topology #202
Conversation
@tmshort: This pull request references Jira Issue OCPBUGS-62517, which is valid. 3 validation(s) were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira (jiazha@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (5)
✅ Files skipped from review due to trivial changes (3)
Walkthrough
Operator now fetches the cluster
Changes
Infrastructure Topology Detection and Response
Sequence Diagram

```mermaid
sequenceDiagram
    participant Operator as Operator (process)
    participant API as Kubernetes API (Infrastructure)
    participant Informer as Informer Cache
    participant Manager as Process Manager
    Operator->>API: GET Infrastructure "cluster"
    API-->>Operator: Infrastructure (with ControlPlaneTopology)
    Operator->>Operator: store initial topology in Builder.Infrastructure
    Operator->>Informer: register Add/Update handler
    Operator->>Operator: continue running / render Helm with infra
    Note over API,Manager: Cluster control plane topology changes
    API->>Informer: Infrastructure object updated
    Informer->>Operator: handler invoked with new Infrastructure
    Operator->>Operator: compare new vs initial topology
    alt topology changed
        Operator->>Operator: log change
        Operator->>Manager: os.Exit(0)
        Manager->>Operator: restart process (fresh render)
    else unchanged
        Operator->>Operator: no action
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 warning, 2 inconclusive)
✅ Passed checks (9 passed)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: tmshort. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/cluster-olm-operator/main.go`:
- Line 382: The UpdateFunc handler is using the parameter name `new`, which
shadows Go's built-in new function; rename that parameter (e.g., to `updated` or
`newObj`) in the UpdateFunc signature and update all references inside the
closure (for example the cast `new.(*configv1.Infrastructure)`) to use the new
parameter name so the shadowing is removed and the cast still targets the same
object type.
- Around line 238-241: The informer only registers an UpdateFunc, so topology
changes that occur as Add events between capturing initialTopology (from infra
:= cl.ConfigClient.ConfigV1().Infrastructures().Get(...)) and the informer’s
initial LIST are missed; register an AddFunc alongside the existing UpdateFunc
on the same informer (the handler that currently processes Update events) to
detect and process added Infrastructure objects the same way as updates,
ensuring the operator re-renders manifests for that topology change; also rename
the UpdateFunc parameter currently named "new" to something like "newObj" or
"newInfra" to avoid shadowing the built-in new() and satisfy the linter.
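Both findings point at the same registration site, so they can be sketched together. The block below is a minimal, self-contained illustration, not the PR's actual code: the `Infrastructure` struct and the `exit` callback are hypothetical stand-ins for `configv1.Infrastructure` and `os.Exit(0)`, and the two returned functions have the shape client-go's `cache.ResourceEventHandlerFuncs` expects.

```go
package main

import "fmt"

// Infrastructure is a hypothetical stand-in for configv1.Infrastructure;
// only the field the handler reads is modeled here.
type Infrastructure struct {
	ControlPlaneTopology string
}

// topologyChangeHandler builds Add and Update callbacks. Registering the
// Add callback covers objects delivered by the informer's initial LIST
// after initialTopology was captured, and the Update parameter is named
// newObj so it does not shadow the predeclared identifier new.
func topologyChangeHandler(initialTopology string, exit func()) (add func(obj interface{}), update func(oldObj, newObj interface{})) {
	check := func(obj interface{}) {
		infra, ok := obj.(*Infrastructure)
		if !ok {
			return // unexpected type in the cache; ignore
		}
		if infra.ControlPlaneTopology != initialTopology {
			// The operator would log the change and call os.Exit(0) so
			// its deployment controller restarts it with a fresh render.
			exit()
		}
	}
	add = func(obj interface{}) { check(obj) }
	update = func(_, newObj interface{}) { check(newObj) }
	return add, update
}

func main() {
	exited := false
	add, update := topologyChangeHandler("SingleReplica", func() { exited = true })
	add(&Infrastructure{ControlPlaneTopology: "SingleReplica"}) // same topology: no exit
	update(nil, &Infrastructure{ControlPlaneTopology: "HighlyAvailable"})
	fmt.Println("exit requested:", exited) // exit requested: true
}
```

Passing the same `check` closure to both callbacks keeps Add and Update events on one code path, which is the behavior the second review comment asks for.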
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1443d7f9-1505-4594-b512-d3799aa05c3c
📒 Files selected for processing (5)
cmd/cluster-olm-operator/main.go
pkg/clients/clients.go
pkg/controller/builder.go
pkg/controller/helm.go
pkg/helmvalues/helmvalues.go
…topology

Rolling updates in HighlyAvailable clusters leave catalogd and operator-controller unavailable when the only running pod is evicted before its replacement is ready.

Fetch the cluster Infrastructure resource at startup and check ControlPlaneTopology. When a HighlyAvailable topology is detected (HighlyAvailable, HighlyAvailableArbiter, or DualReplica), override the Helm values to replicas=2 and podDisruptionBudget.enabled=true before rendering manifests. SingleReplica (SNO) and External topologies keep the static defaults of replicas=1 and PDB disabled.

When a topology change is observed at runtime via the infrastructure informer (exceedingly rare), the operator exits so its deployment controller restarts it, triggering a fresh Helm render with the correct values for the new topology.

Changes:
- helmvalues: add SetIntValue and SetBoolValue helpers
- clients: add InfrastructureClient backed by the config informer
- controller/builder: add Infrastructure field to Builder
- controller/helm: apply HA replica/PDB overrides in renderHelmTemplate; add isHighlyAvailableTopology helper
- main: fetch infra at startup, pass to Builder, watch for topology changes and exit to trigger re-render

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
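The topology gate described in the commit message can be sketched as follows. The helper name `isHighlyAvailableTopology` comes from the change list above, but everything else here is a simplified stand-in: the `ControlPlaneTopology` constants mirror the real `configv1` enum values, and the flat values map with `setIntValue`/`setBoolValue` approximates the PR's helmvalues helpers.

```go
package main

import "fmt"

// ControlPlaneTopology mirrors the configv1 enum; the real constants
// live in github.com/openshift/api/config/v1.
type ControlPlaneTopology string

const (
	HighlyAvailable        ControlPlaneTopology = "HighlyAvailable"
	HighlyAvailableArbiter ControlPlaneTopology = "HighlyAvailableArbiter"
	DualReplica            ControlPlaneTopology = "DualReplica"
	SingleReplica          ControlPlaneTopology = "SingleReplica"
	External               ControlPlaneTopology = "External"
)

// isHighlyAvailableTopology reports whether the control plane topology
// warrants two replicas and an enabled PodDisruptionBudget.
func isHighlyAvailableTopology(t ControlPlaneTopology) bool {
	switch t {
	case HighlyAvailable, HighlyAvailableArbiter, DualReplica:
		return true
	}
	return false
}

// setIntValue and setBoolValue stand in for the SetIntValue/SetBoolValue
// helpers added by this PR; here they just write into a flat values map.
func setIntValue(values map[string]interface{}, key string, v int64)  { values[key] = v }
func setBoolValue(values map[string]interface{}, key string, v bool) { values[key] = v }

// applyTopologyOverrides mutates the Helm values before rendering:
// HA topologies get replicas=2 and an enabled PDB, all others keep
// the static defaults already present in the map.
func applyTopologyOverrides(values map[string]interface{}, t ControlPlaneTopology) {
	if isHighlyAvailableTopology(t) {
		setIntValue(values, "replicas", 2)
		setBoolValue(values, "podDisruptionBudget.enabled", true)
	}
}

func main() {
	// Static defaults: one replica, PDB disabled.
	values := map[string]interface{}{
		"replicas":                    int64(1),
		"podDisruptionBudget.enabled": false,
	}
	applyTopologyOverrides(values, HighlyAvailable)
	fmt.Println(values["replicas"], values["podDisruptionBudget.enabled"]) // 2 true
}
```

Keeping the override in one function, applied just before rendering, matches the commit's description of renderHelmTemplate adjusting values rather than maintaining per-topology value files.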
0196c65 to ebc3002 (Compare)
/retest

/test openshift-e2e-aws-customnoupgrade

/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node 10

@tmshort: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0c1d0f30-4884-11f1-99af-496e1ffa1ecc-0

/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-ovn-upgrade 10

@tmshort: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/140b8be0-4884-11f1-8a23-d6ca6fc82744-0

/test openshift-e2e-aws-customnoupgrade

customnoupgrade has failed a number of times, and seems to be unrelated to this change; other components have failing tests.

All aggregate jobs succeeded.

/test openshift-e2e-aws-customnoupgrade

/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-ovn-upgrade 10

@tmshort: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/86ce31c0-48fa-11f1-82c1-76bc25872c9e-0

customnoupgrade is also failing in #203

The aggregate payload tests passed.

/test openshift-e2e-aws-customnoupgrade

/test openshift-e2e-aws-customnoupgrade

/override openshift-e2e-aws-customnoupgrade
@tmshort: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/override ci/prow/openshift-e2e-aws-customnoupgrade

@tmshort: Overrode contexts on behalf of tmshort: ci/prow/openshift-e2e-aws-customnoupgrade

@tmshort: all tests passed! Full PR test history. Your PR dashboard.
Finally got a cluster-bot up with this PR in it. I consider the existing tests for HA kit and consistent component reporting to be sufficient basis to avoid writing new OTE scenarios to cover this post-merge. I have performed pre-merge verification for HA scenarios as
/verified by @grokspawn |
@grokspawn: This PR has been marked as verified by
/lgtm
@tmshort: Jira Issue OCPBUGS-62517: Some pull requests linked via external trackers have merged:
The following pull request, linked via external tracker, has not merged: All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with
Jira Issue OCPBUGS-62517 has not been moved to the MODIFIED state. This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.
/jira refresh
@tmshort: Jira Issue OCPBUGS-62517: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-62517 has been moved to the MODIFIED state.