Implement cluster rebalance rollback and cleanup by whitehawk · Pull Request #2251 · arenadata/gpdb

whitehawk · 2026-03-03T04:08:23Z

Implement cluster rebalance rollback and cleanup

This patch adds:

- Full rollback capability (via the '-r' flag) for the rebalance flow.
- Error checking for steps that were marked as errored because the tool
  execution was interrupted (during both normal and rollback flows). Errored
  steps can be retried, rolled back individually, or cancelled.
- Cleanup capability for the case when the execution was interrupted during
  rebalance.
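
The three ways of resolving an errored step can be pictured with a small sketch. Everything in it (the `Step` class, the status names, `resolve_errored_step` and its action strings) is hypothetical and chosen for illustration only; it is not the code from this patch, and it only models a step of the normal flow.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    DONE = "done"
    ERROR = "error"
    CANCELLED = "cancelled"


@dataclass
class Step:
    """Hypothetical stand-in for a rebalance step record."""
    name: str
    status: Status = Status.PENDING
    rollback: bool = False  # True when the step belongs to the rollback flow


def resolve_errored_step(step: Step, action: str) -> Step:
    """Return a new Step describing how an errored step is handled.

    The action names 'retry', 'rollback' and 'cancel' are assumptions.
    """
    if step.status is not Status.ERROR:
        raise ValueError(f"step {step.name!r} is not in ERROR state")
    if action == "retry":
        # Run the same step again in the same flow.
        return Step(step.name, Status.PENDING, step.rollback)
    if action == "rollback":
        # Schedule only this step in the rollback flow.
        return Step(step.name, Status.PENDING, True)
    if action == "cancel":
        # Give up on the step without running anything.
        return Step(step.name, Status.CANCELLED, step.rollback)
    raise ValueError(f"unknown action: {action!r}")
```

The function returns a new record instead of mutating the errored one, so the original error state stays available for logging.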

In order to facilitate the capabilities above:

- Added the SegmentStatus command class to check the status of a running
  segment via 'pg_ctl'.
- Added fault injections inside the 'gprecoverseg' tool for test purposes.
- Renamed 'wrap_state_func_with_faults' to 'wrap_func_with_faults' in the
  fault injector, as it is now used not only with the state machine functions.
- Removed rollback-related statuses from the RebalanceStep class, as they
  duplicate the statuses of the normal flow. Instead, a 'rollback' flag is
  added to distinguish between the steps of the normal and rollback flows.
- Updated the main and shrink state machines to coordinate cleanup and
  rollback between shrink and rebalance.
- Removed the 'gprecoverseg' invocation from 'check_down_segments()', as
  proper error handling is now done by the rebalance execution engine itself.
- Moved 'check_down_segments()' to an earlier point so that the FTS probe runs
  before 'check_running_gputils()' (otherwise, in some test cases,
  'check_running_gputils()' fails when accessing the cluster).
- Added the '--skip-resource-estimation' option to 'gpmovemirrors'. We use
  this option when moving mirrors because, when a failed operation is retried,
  the old datadir may not exist yet, and 'gpmovemirrors' would fail at
  resource estimation. Skipping the estimation on the 'gpmovemirrors' side is
  not a problem, as we have our own resource estimation.
- Added new behave tests for rollback and cleanup.
- Added new steps for behave tests.
- Removed the cases related to failures during switchovers from rebalance
  test '3', as these cases are now checked differently in the newly added
  test cases.
- Removed the cases with delay from rebalance test '3', as such scenarios are
  now tested with the help of fault injections in 'gprecoverseg' at particular
  points of interest.
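
For readers unfamiliar with checking a segment through 'pg_ctl', the idea can be sketched as follows. This is a minimal illustration and not the SegmentStatus class from the patch: the function names and the returned strings are assumptions, and the exit codes are the ones documented for upstream PostgreSQL `pg_ctl status` (0 when the server is running, 3 when it is not, 4 when no accessible data directory is given), which the Greenplum build may or may not follow exactly.

```python
import subprocess
from typing import Callable, Sequence


def run_and_get_exit_code(argv: Sequence[str]) -> int:
    """Run a command, discard its output and return its exit code."""
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def segment_state(
    datadir: str,
    runner: Callable[[Sequence[str]], int] = run_and_get_exit_code,
) -> str:
    """Map the exit code of `pg_ctl status -D <datadir>` to a state name.

    The state names are hypothetical; `runner` is injectable so the
    mapping can be exercised without a real cluster.
    """
    code = runner(["pg_ctl", "status", "-D", datadir])
    if code == 0:
        return "running"
    if code == 3:
        return "stopped"
    if code == 4:
        return "no_datadir"
    # Any other exit code is not interpreted by this sketch.
    return "unknown"
```

Keeping the command runner as a parameter lets a test substitute a stub, which is similar in spirit to how the fault injections mentioned above let tests force specific outcomes.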

whitehawk added 27 commits February 27, 2026 13:41

Add initial error handling

48822fc

Implement full rollback initial draft

b34be7f

Tests and fixes for rollback

211e6bf

Fix test

fb1467e

Merge branch 'feature/ADBDEV-6608' into ADBDEV-9083-re

bdf7626

Refactoring and improvements

35a8f4e

Update check_down_segments

7549678

Add tests and fixes

d706583

Fix tests, drop schema at the end of the rollback

f777e69

Update test descriptions

6e4bcd8

Add test

197c187

Add 6.1.2 test

41c5695

Add test 6.1.3.

682b12c

Add test 6.3.2

50a0033

Add test 6.4.2.

f018d2b

Update comments, remove dbg logs

e971872

Update rollback prepare

8379e87

Truncate steps table

07d3dca

Improve logging

67c24cd

Add test 6.4.3., update test 7.2

7deaba5

Rename wrap_state_func_with_faults() to wrap_func_with_faults()

0a5fbb0

Update basic tests, and fix code for them

3e7d4f8

Add case for 6.2 test, and related fix

43915d0

Add 6.2.2. test

934ad5f

Add test 7.3.2.

a58de48

Refactor

45d69d8

Split 6.2.2. test

ee3f199

whitehawk changed the title ~~Adbdev 9083 re~~ Implement cluster rebalance rollback and cleanup Mar 6, 2026

whitehawk added 2 commits March 6, 2026 13:15

Uncomment some checks

0e3020f

Minor test updates

ecf1efc

whitehawk wants to merge 29 commits into feature/ADBDEV-6608 from ADBDEV-9083-re
