ETCD-704 — cluster-restore.sh: move extra /var/lib/etcd files to backup #1628
When cluster-restore.sh runs the restore-pod path, it moves member/ and revision.json to /var/lib/etcd-backup, deletes etcd_perf*, then exits if anything remains in /var/lib/etcd. Extra files (perf artifacts, stray snapshots, etc.) cause DR restore to fail before the restore pod starts.

Add backup_remaining_etcd_data_dir_contents() to move all remaining top-level entries to /var/lib/etcd-backup instead of failing.

Fixes: ETCD-704
Related: https://access.redhat.com/solutions/6958920
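For readers without the diff at hand, the behavior described above could look roughly like the sketch below. It is not the code from this PR: only the function name and the two default paths come from the description, while the positional arguments, the log line, and the glob-based loop are assumptions made for illustration.

```shell
# Hypothetical sketch of the helper described above; NOT the PR's actual code.
# Assumed interface: $1 = etcd data dir, $2 = backup dir (defaults as in the PR text).
backup_remaining_etcd_data_dir_contents() {
  local data_dir="${1:-/var/lib/etcd}"
  local backup_dir="${2:-/var/lib/etcd-backup}"
  local entry

  mkdir -p "${backup_dir}" || return 1

  # The three globs cover regular names, dotfiles, and names starting with "..",
  # while never matching "." or ".." themselves.
  for entry in "${data_dir}"/* "${data_dir}"/.[!.]* "${data_dir}"/..?*; do
    # An unmatched glob stays literal, so skip anything that does not exist.
    [ -e "${entry}" ] || [ -L "${entry}" ] || continue
    echo "Moving ${entry} to ${backup_dir}"
    mv -- "${entry}" "${backup_dir}/" || return 1
  done
}
```

Only top-level entries are moved, so a stray directory is relocated as a whole. In this sketch a name collision with a non-empty directory already in the backup location makes mv fail, and the function returns non-zero rather than overwriting anything.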
ETCD-704-VERIFICATION-OUTPUTS.txt
CI analysis for failed required jobs:
Manual verification on OCP 4.22.0-rc.4 (3-node HA): the legacy restore fails when extra files are present in /var/lib/etcd; the patched script moves those files to /var/lib/etcd-backup and completes SNAPSHOT RESTORE. Full HA restore verified (ETCD-704).
/retest required
2/3 required jobs are now green. The remaining failure is TestRetentionBySize.
/test e2e-gcp-operator-disruptive
Latest e2e-gcp-operator-disruptive run: all operator e2e tests passed (47m). The previous failures were TestRetentionBySize / TestPeriodicBackupHappyPath flakes.
/test e2e-gcp-operator-disruptive
@apurvanisal5: all tests passed!
/label merge-review-needed
/label ready-for-human-review
Summary
- cluster-restore.sh fails with "folder /var/lib/etcd is not empty" when extra files exist under /var/lib/etcd after member/ is moved.
- Add backup_remaining_etcd_data_dir_contents() to move remaining top-level files to /var/lib/etcd-backup instead of exiting.
Jira
Fixes: ETCD-704
Verification
OCP 4.22.0-rc.4, AWS IPI 3-node HA:
- Legacy restore fails with extra files in /var/lib/etcd
- Patched script moves files to /var/lib/etcd-backup and completes SNAPSHOT RESTORE COMPLETED
- testing-seed-project restored from backup
Test plan
/var/lib/etcd/var/lib/etcd-backup
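The failure described in the Summary can be reproduced in a scratch directory without touching a cluster node. The sketch below is a hypothetical stand-in for the pre-patch emptiness check: the function name and its argument are illustrative, and only the error text is taken from the Summary.

```shell
# Hypothetical stand-in for the pre-patch check; not copied from cluster-restore.sh.
legacy_check_data_dir_empty() {
  local data_dir="$1"
  # ls -A lists everything except "." and "..", so dotfiles count as leftovers.
  if [ -n "$(ls -A "${data_dir}")" ]; then
    echo "folder ${data_dir} is not empty" >&2
    return 1
  fi
}

# Exercise the check against a throwaway directory holding one stray file.
scratch="$(mktemp -d)"
touch "${scratch}/stray-snapshot.db"
if ! legacy_check_data_dir_empty "${scratch}"; then
  echo "restore would stop here, before the restore pod starts"
fi
rm -rf "${scratch}"
```

On a real control-plane node the equivalent check is to place an extra file in /var/lib/etcd before running the restore and confirm afterwards that it ended up in /var/lib/etcd-backup.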