ETCD-704 — cluster-restore.sh: move extra /var/lib/etcd files to backup#1628

Open
apurvanisal5 wants to merge 1 commit into openshift:release-4.22 from apurvanisal5:etcd-704-cluster-restore-backup-remaining

Conversation

apurvanisal5 commented Jun 7, 2026

Summary

  • Legacy cluster-restore.sh fails with "folder /var/lib/etcd is not empty" when extra files still exist under /var/lib/etcd after member/ is moved.
  • Add backup_remaining_etcd_data_dir_contents() to move remaining top-level files to /var/lib/etcd-backup instead of exiting.

Jira

Fixes: ETCD-704

Verification

OCP 4.22.0-rc.4, AWS IPI 3-node HA:

  • Legacy script fails when seed files are present in /var/lib/etcd
  • Patched script moves the demo files to /var/lib/etcd-backup and finishes with SNAPSHOT RESTORE COMPLETED
  • Full HA restore succeeds; cluster is healthy; testing-seed-project is restored from backup

Test plan

  • Reproduce legacy failure with extra files in /var/lib/etcd
  • Patched restore moves extras to /var/lib/etcd-backup
  • Full 3-node HA restore succeeds
  • etcd data restored from snapshot
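
The legacy failure from the first test-plan item can be approximated in a scratch directory. This is a minimal sketch built only from the behaviour described in this PR, not the actual cluster-restore.sh code: the function name `legacy_prepare_data_dir`, the variable names, and the sample file names are hypothetical, and temporary directories stand in for /var/lib/etcd and /var/lib/etcd-backup.

```shell
#!/usr/bin/env bash
# Sketch only: imitates the legacy "data dir must be empty" check described
# in this PR. Nothing here touches the real /var/lib/etcd.
set -eu

# Hypothetical stand-ins for /var/lib/etcd and /var/lib/etcd-backup.
SCRATCH="$(mktemp -d)"
ETCD_DATA_DIR="${SCRATCH}/etcd"
ETCD_DATA_DIR_BACKUP="${SCRATCH}/etcd-backup"
mkdir -p "${ETCD_DATA_DIR}/member" "${ETCD_DATA_DIR_BACKUP}"

# Seed the data dir: the expected entries plus one stray file.
touch "${ETCD_DATA_DIR}/revision.json" \
      "${ETCD_DATA_DIR}/etcd_perf_sample" \
      "${ETCD_DATA_DIR}/stray-snapshot.db"

# Assumed shape of the legacy steps: move member/ and revision.json,
# delete etcd_perf*, then refuse to continue if anything is left.
legacy_prepare_data_dir() {
  mv "${ETCD_DATA_DIR}/member" "${ETCD_DATA_DIR_BACKUP}/"
  mv "${ETCD_DATA_DIR}/revision.json" "${ETCD_DATA_DIR_BACKUP}/"
  rm -rf "${ETCD_DATA_DIR}"/etcd_perf*
  if [ -n "$(ls -A "${ETCD_DATA_DIR}")" ]; then
    echo "folder ${ETCD_DATA_DIR} is not empty" >&2
    return 1
  fi
}

# Capture the status instead of letting set -e abort the demo.
if legacy_prepare_data_dir; then
  LEGACY_STATUS=0
else
  LEGACY_STATUS=$?
fi
echo "legacy check exit status: ${LEGACY_STATUS}"
```

In this sketch the stray file is neither moved nor deleted, so the emptiness check is what stops the run, which matches the failure mode the test plan asks reviewers to reproduce.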

When cluster-restore.sh runs the restore-pod path, it moves member/ and
revision.json to /var/lib/etcd-backup, deletes etcd_perf*, then exits if
anything remains in /var/lib/etcd. Extra files (perf artifacts, stray
snapshots, etc.) cause DR restore to fail before the restore pod starts.

Add backup_remaining_etcd_data_dir_contents() to move all remaining
top-level entries to /var/lib/etcd-backup instead of failing.

Fixes: ETCD-704
Related: https://access.redhat.com/solutions/6958920
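
One way the helper described above could look, as a sketch under assumptions rather than the code in this PR: the function name comes from the description, but the two positional arguments, the log message, and the scratch directories used in the demo are hypothetical. The real function may read its paths from script-level variables instead.

```shell
#!/usr/bin/env bash
# Sketch only: a possible shape for backup_remaining_etcd_data_dir_contents().
# The argument list and messages are assumptions, not taken from the PR diff.
set -eu

# Usage (assumed): backup_remaining_etcd_data_dir_contents DATA_DIR BACKUP_DIR
# Moves every remaining top-level entry, hidden ones included, to BACKUP_DIR.
backup_remaining_etcd_data_dir_contents() {
  local data_dir="$1"
  local backup_dir="$2"
  local entry
  mkdir -p "${backup_dir}"
  # Three globs cover regular names, dotfiles, and names starting with "..".
  for entry in "${data_dir}"/* "${data_dir}"/.[!.]* "${data_dir}"/..?*; do
    # Skip patterns that matched nothing (they stay as literal strings).
    [ -e "${entry}" ] || [ -L "${entry}" ] || continue
    echo "Moving ${entry} to ${backup_dir}"
    mv -- "${entry}" "${backup_dir}/"
  done
}

# Demo against scratch directories instead of /var/lib/etcd.
SCRATCH="$(mktemp -d)"
DATA_DIR="${SCRATCH}/etcd"
BACKUP_DIR="${SCRATCH}/etcd-backup"
mkdir -p "${DATA_DIR}/extra-dir"
touch "${DATA_DIR}/stray-snapshot.db" \
      "${DATA_DIR}/.hidden" \
      "${DATA_DIR}/extra-dir/nested"

backup_remaining_etcd_data_dir_contents "${DATA_DIR}" "${BACKUP_DIR}"

# Count what is left behind in the data dir after the move.
REMAINING="$(find "${DATA_DIR}" -mindepth 1 | wc -l | tr -d ' ')"
echo "entries left in data dir: ${REMAINING}"
```

Moving whole top-level entries keeps directory trees intact in the backup location. A plain `mv` overwrites a same-named file already present in the backup directory, so a production version may want a collision check; whether the PR handles that case is not stated here.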

coderabbitai Bot commented Jun 7, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


openshift-ci Bot requested review from jubittajohn and tjungblu June 7, 2026 19:10

openshift-ci Bot (Contributor) commented Jun 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign dusk125 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@apurvanisal5 (Author)

ETCD-704-VERIFICATION-OUTPUTS.txt
Reproduction details

apurvanisal5 (Author) commented Jun 8, 2026

CI analysis for failed required jobs:

| Job | Failed test | Notes |
| --- | --- | --- |
| e2e-agnostic-ovn | ResourceQuota terminating scopes | sig-api-machinery, unrelated |
| e2e-aws-ovn-single-node | Pod InPlace Resize | sig-node, unrelated |
| e2e-gcp-operator-disruptive | TestPeriodicBackupHappyPath (timeout) | backup test, not cluster-restore.sh |
Manual verification on OCP 4.22.0-rc.4 (3-node HA): legacy restore fails with extra files in /var/lib/etcd; patched script moves files to /var/lib/etcd-backup and completes SNAPSHOT RESTORE. Full HA restore verified (ETCD-704).

/retest required

apurvanisal5 (Author) commented Jun 8, 2026

2 of 3 required jobs are now green. The remaining failure is TestRetentionBySize,
a backup retention count flake (found 6 groups vs. expected 4-5) that is unrelated to cluster-restore.sh.
TestPeriodicBackupHappyPath and TestBackupScript passed on the same run.
Manual ETCD-704 HA restore verified on 4.22.

/test e2e-gcp-operator-disruptive

apurvanisal5 (Author) commented Jun 8, 2026

Latest e2e-gcp-operator-disruptive run: all operator e2e tests passed (47m).
The job failed only in the gather-must-gather post-step because of a GitHub camgi.tar
download infra flake, which is unrelated to ETCD-704.

Previous failures were TestRetentionBySize / TestPeriodicBackupHappyPath flakes.
Manual 3-node HA restore verified on 4.22.

/test e2e-gcp-operator-disruptive

openshift-ci Bot (Contributor) commented Jun 8, 2026

@apurvanisal5: all tests passed!


@apurvanisal5 (Author)

/label merge-review-needed

openshift-ci Bot (Contributor) commented Jun 8, 2026

@apurvanisal5: The label(s) /label merge-review-needed cannot be applied. These labels are supported: acknowledge-critical-fixes-only, platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, ux-approved, no-qe, rebase/manual, cluster-config-api-changed, run-integration-tests, verified, ready-for-human-review, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/skip-dependent-bug-check, jira/valid-bug, ok-to-test, stability-fix-approved, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?


@apurvanisal5 (Author)

/label ready-for-human-review

openshift-ci Bot added the ready-for-human-review label (indicates a PR has been reviewed by automated tools and is ready for human review) Jun 8, 2026