[release-4.19] OCPBUGS-77313: Wait for revision stability before removing etcd members #1603

Closed
openshift-cherrypick-robot wants to merge 1 commit into openshift:release-4.19 from openshift-cherrypick-robot:cherry-pick-1571-to-release-4.19

Conversation

@openshift-cherrypick-robot

This is an automated cherry-pick of #1571

/assign hasbro17

/cherrypick release-4.18

@openshift-ci-robot

@openshift-cherrypick-robot: An error was encountered cloning bug for cherrypick for bug OCPBUGS-77313 on the Jira server at https://redhat.atlassian.net. No known errors were detected, please see the full error message for details.

Full error message. request failed. Please analyze the request body for more details. Status code: 400: {"errorMessages":["QA Contact: User 'Ge Liu' is not valid for this user picker."],"errors":{"customfield_10470":"User 'Ge Liu' is not valid for this user picker."}}

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.


@coderabbitai

coderabbitai Bot commented Apr 28, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5e9b277d-c665-423d-b821-2e1c01c8d9d7


@openshift-ci openshift-ci Bot requested review from Elbehery and dusk125 April 28, 2026 17:05
@hasbro17

Contributor

/hold

Need to fix the invalid cloned backport bug

@openshift-ci openshift-ci Bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Apr 28, 2026
Previously, the ClusterMemberRemovalController would remove etcd members
during revision rollouts, causing cluster degradation when simultaneously
deleting multiple control plane machines with the OnDelete strategy.

During a revision rollout, etcd members can temporarily appear unhealthy
while their pods are reinstalled to the latest revision. This is different
from members being indefinitely unhealthy on a stable revision.

Additionally, the EtcdEndpointsController pauses during revision rollouts,
so when a replacement machine is added and triggers a rollout, the
etcd-endpoints configmap won't update. This causes API servers on the old
revision to use removed member endpoints, leading to API unavailability.

This change adds a revision stability check before allowing member removal,
so that members are removed only when revisions are stable and any member
still reported unhealthy is persistently unhealthy rather than mid-reinstall.
This explicitly codifies the 4.17 behavior where the operator waited for all
revisions to complete before removing members and lifecycle hooks.
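
The gating described above could look roughly like the following. This is a minimal sketch, not the code from this change: `NodeStatus`, `revisionsStable`, and `membersToRemove` are hypothetical names, and the convention that a target revision of 0 means "no rollout pending" is an assumption made for the example.

```go
package main

import "fmt"

// NodeStatus is a hypothetical, trimmed-down stand-in for the per-node
// static pod status an operator tracks. It is not the operator's real type.
type NodeStatus struct {
	NodeName        string
	CurrentRevision int32
	TargetRevision  int32 // assumption: 0 means no rollout is pending
}

// revisionsStable reports whether every node already runs the latest
// revision and none of them is in the middle of a rollout.
func revisionsStable(latest int32, nodes []NodeStatus) bool {
	if len(nodes) == 0 {
		return false // nothing observed yet, so treat as unstable
	}
	for _, n := range nodes {
		if n.TargetRevision != 0 && n.TargetRevision != n.CurrentRevision {
			return false // a rollout is in progress on this node
		}
		if n.CurrentRevision != latest {
			return false // this node has not caught up yet
		}
	}
	return true
}

// membersToRemove (hypothetical) returns the unhealthy members only when
// revisions are stable. During a rollout it returns nil, because members
// may look unhealthy merely while their pods are being reinstalled.
func membersToRemove(latest int32, nodes []NodeStatus, unhealthy []string) []string {
	if !revisionsStable(latest, nodes) {
		return nil
	}
	return unhealthy
}

func main() {
	stable := []NodeStatus{
		{NodeName: "master-0", CurrentRevision: 7},
		{NodeName: "master-1", CurrentRevision: 7},
	}
	rolling := []NodeStatus{
		{NodeName: "master-0", CurrentRevision: 7},
		{NodeName: "master-1", CurrentRevision: 6, TargetRevision: 7},
	}
	unhealthy := []string{"master-2"}

	fmt.Println(membersToRemove(7, stable, unhealthy))  // [master-2]
	fmt.Println(membersToRemove(7, rolling, unhealthy)) // []
}
```

Returning nothing while a rollout is in progress means the decision is simply retried on a later sync, which fits the "wait for all revisions to complete" behavior described above.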

Additionally, the ClusterMemberRemovalController now verifies that the live
etcd membership matches the configmap before proceeding with member removal,
preventing potential issues during rapid member deletion.
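
As an illustration of the membership check, the comparison can be treated as set equality between the addresses of the live members and the addresses recorded in the configmap. This is a sketch under stated assumptions: `membershipMatches` is a hypothetical helper, and the configmap is assumed to map an opaque member key to a member address. The real data layout may differ.

```go
package main

import "fmt"

// membershipMatches (hypothetical) reports whether the live member
// addresses and the addresses stored as configmap values form the same set.
// Assumption: configMapData maps an opaque member key to a member address.
func membershipMatches(liveMemberAddrs []string, configMapData map[string]string) bool {
	live := make(map[string]struct{}, len(liveMemberAddrs))
	for _, addr := range liveMemberAddrs {
		live[addr] = struct{}{}
	}
	recorded := make(map[string]struct{}, len(configMapData))
	for _, addr := range configMapData {
		recorded[addr] = struct{}{}
	}
	if len(live) != len(recorded) {
		return false
	}
	for addr := range live {
		if _, ok := recorded[addr]; !ok {
			return false
		}
	}
	return true
}

func main() {
	cm := map[string]string{"m1": "10.0.0.1", "m2": "10.0.0.2", "m3": "10.0.0.3"}

	// Live membership agrees with the configmap: removal may proceed.
	fmt.Println(membershipMatches([]string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}, cm)) // true

	// A member was already removed but the configmap is stale: wait.
	fmt.Println(membershipMatches([]string{"10.0.0.1", "10.0.0.2"}, cm)) // false
}
```

Comparing sets rather than ordered lists avoids false mismatches when the two sources enumerate members in different orders.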

(cherry picked from commit 0168733)
@openshift-cherrypick-robot openshift-cherrypick-robot force-pushed the cherry-pick-1571-to-release-4.19 branch from a347b28 to 25aeba3 Compare May 7, 2026 17:23
@hasbro17

hasbro17 commented May 7, 2026

Contributor

Closing to redo the failed backport.

@hasbro17 hasbro17 closed this May 7, 2026
@openshift-ci

openshift-ci Bot commented May 7, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hasbro17. For more information see the Code Review Process.


@hasbro17

hasbro17 commented May 7, 2026

Contributor

New 4.19 backport: #1613
