Implement cluster rebalance rollback and cleanup#2251

Draft
whitehawk wants to merge 29 commits into feature/ADBDEV-6608 from ADBDEV-9083-re

whitehawk commented Mar 3, 2026

Implement cluster rebalance rollback and cleanup

This patch adds:

  1. Full rollback capability (via the '-r' flag) for the rebalance flow.
  2. Detection of steps marked as errored because the tool execution was
    interrupted (during both normal and rollback flows). Errored steps can be
    retried, rolled back individually, or cancelled.
  3. Cleanup capability for when the execution was interrupted during rebalance.
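
As an illustration of item 2, here is a minimal sketch of how errored steps might be detected after an interruption and then retried, rolled back, or cancelled. It is not the code from this patch: the `Step`, `StepStatus`, and `Resolution` names, the status values, and the rule that an in-progress step becomes errored are all assumptions made for the example. Only the idea of a 'rollback' flag on a step comes from the description. Python 3.7+ is assumed for `dataclasses`.

```python
from dataclasses import dataclass
from enum import Enum


class StepStatus(Enum):
    # Hypothetical statuses; the real RebalanceStep statuses may differ.
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    DONE = "done"
    ERRORED = "errored"
    CANCELLED = "cancelled"


class Resolution(Enum):
    # The three choices the description lists for an errored step.
    RETRY = "retry"
    ROLLBACK = "rollback"
    CANCEL = "cancel"


@dataclass
class Step:
    name: str
    status: StepStatus = StepStatus.PENDING
    rollback: bool = False  # True when the step belongs to the rollback flow


def mark_interrupted(steps):
    """Mark steps left in progress by an interrupted run as errored.

    Returns the list of all errored steps (assumed detection rule).
    """
    for step in steps:
        if step.status is StepStatus.IN_PROGRESS:
            step.status = StepStatus.ERRORED
    return [step for step in steps if step.status is StepStatus.ERRORED]


def resolve(step, resolution):
    """Apply the chosen resolution to a single errored step."""
    if step.status is not StepStatus.ERRORED:
        raise ValueError("only errored steps can be resolved: %s" % step.name)
    if resolution is Resolution.RETRY:
        step.status = StepStatus.PENDING          # run again in the same flow
    elif resolution is Resolution.ROLLBACK:
        step.rollback = True                      # switch to the rollback flow
        step.status = StepStatus.PENDING
    else:
        step.status = StepStatus.CANCELLED        # drop the step
    return step
```

In this sketch a rolled-back step reuses the normal statuses and differs only in its `rollback` flag, which mirrors the reasoning given for removing the separate rollback statuses.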

In order to facilitate the capabilities above:

  1. Added a SegmentStatus command class to check the status of a running
    segment via 'pg_ctl'.
  2. Added fault injections inside the 'gprecoverseg' tool for testing purposes.
  3. Renamed 'wrap_state_func_with_faults' to 'wrap_func_with_faults' in the
    fault injector, as it is now used not only with state machine functions.
  4. Removed rollback-related statuses from the RebalanceStep class, as they
    duplicated the statuses of the normal flow. Instead, a 'rollback' flag is
    added to distinguish between the steps of the normal and rollback flows.
  5. Updated the main and shrink state machines to coordinate cleanup and
    rollback between shrink and rebalance.
  6. Removed the 'gprecoverseg' invocation from 'check_down_segments()', as
    proper error handling is now done by the rebalance execution engine itself.
  7. Moved 'check_down_segments()' to an earlier point so that the FTS probe runs
    before 'check_running_gputils()' (otherwise, in some test cases
    'check_running_gputils()' fails when accessing the cluster).
  8. Added a '--skip-resource-estimation' option to 'gpmovemirrors'. We use this
    option when moving mirrors because, when retrying a failed operation, in
    some cases the old datadir does not exist yet, and 'gpmovemirrors' would
    fail at resource estimation. Skipping resource estimation on the
    'gpmovemirrors' side is not a problem, as we have our own resource
    estimation.
  9. Added new behave tests for rollback and cleanup.
  10. Added new steps for behave tests.
  11. Removed the cases related to failures during switchovers from rebalance
    test '3', as these cases are now checked differently in the newly added
    test cases.
  12. Removed the cases with delays from rebalance test '3', as such scenarios
    are now tested with fault injections in 'gprecoverseg' at particular points
    of interest.
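
To make item 1 of the list above concrete, here is a minimal sketch of a segment status check built on 'pg_ctl status'. It is not the SegmentStatus class from this patch: the `segment_status` function, the `PgCtlStatus` enum, and the injectable `run` parameter are assumptions for the example. It relies on the documented 'pg_ctl status' exit codes: 0 when the server is running, 3 when it is not running, and 4 when no accessible data directory is specified.

```python
import subprocess
from enum import Enum


class PgCtlStatus(Enum):
    # Exit codes documented for `pg_ctl status`.
    RUNNING = 0
    NOT_RUNNING = 3
    NO_DATADIR = 4
    UNKNOWN = -1  # any other exit code (hypothetical catch-all)


def segment_status(datadir, run=subprocess.run):
    """Return the status of the segment whose data directory is `datadir`.

    `run` defaults to subprocess.run and can be replaced in tests, so the
    sketch does not need a real pg_ctl binary to be exercised.
    """
    proc = run(
        ["pg_ctl", "status", "-D", datadir],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    try:
        return PgCtlStatus(proc.returncode)
    except ValueError:
        # Exit code not covered by the enum above.
        return PgCtlStatus.UNKNOWN
```

A caller could use something like `segment_status(datadir) is PgCtlStatus.RUNNING` to decide whether a step that was interrupted left its segment up. The real command class presumably runs the check on the segment host rather than locally.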

@whitehawk whitehawk changed the title Adbdev 9083 re Implement cluster rebalance rollback and cleanup Mar 6, 2026