Skip to content

CNTRLPLANE-3460: Refactor defrag controller to lower impact of defrag#1618

Open
dusk125 wants to merge 3 commits into
openshift:mainfrom
dusk125:defrag-time-budget
Open

CNTRLPLANE-3460: Refactor defrag controller to lower impact of defrag#1618
dusk125 wants to merge 3 commits into
openshift:mainfrom
dusk125:defrag-time-budget

Conversation

@dusk125

@dusk125 dusk125 commented May 13, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Adds a fixed 5-minute time budget per defrag controller reconciliation loop
  • Before each member's defrag, the controller checks elapsed time; if the budget is exceeded, remaining members are skipped and deferred to the next sync cycle
  • Budget-skipped members do not count toward the degraded failure threshold — only real defrag errors trigger degradation

Test plan

  • Existing unit tests pass unchanged
  • New TestDefragBudgetExceeded test verifies budget enforcement with a near-zero budget (1ns): no defrags occur, a DefragControllerBudgetExceeded event is recorded, and the controller does not degrade
  • Manual verification in a test cluster with fragmented etcd members

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a defragmentation cooldown to prevent repeated defrag operations within a short window.
  • Improvements

    • Defragmentation now targets one fragmented member per run and orders attempts to prioritize highest fragmentation.
    • Failure tracking and degraded-state handling simplified and made more deterministic; successful recovery clears degraded status.
  • Tests

    • Added tests covering cooldown behavior and adjusted existing defrag-related tests.

The defrag controller now tracks elapsed time during each sync and
skips remaining etcd members if the 5-minute budget is exceeded.
Budget-skipped members are deferred to the next reconciliation loop
and do not count toward the degraded failure threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 13, 2026
@openshift-ci

openshift-ci Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown

Walkthrough

Adds a defragmentation cooldown to DefragController, refactors runDefrag to target one fragmented member per sync (filtering learners, sorting by fragmentation, leader last), updates failure/degraded handling, adjusts clearDegraded, updates fakeEtcdClient.Defragment, and adds/updates tests including TestDefragCooldown.

Changes

Defrag controller cooldown & refactor

Layer / File(s) Summary
Cooldown fields, imports, constructor, and sync check
pkg/operator/defragcontroller/defragcontroller.go
Adds imports for cmp/slices, defaultDefragCooldown constant, DefragController fields defragCooldown and lastDefragTime, initializes defragCooldown in NewDefragController, and makes sync skip defrag while cooldown is active.
runDefrag: single-member defrag, sorting and strict failure handling
pkg/operator/defragcontroller/defragcontroller.go
Rewrites runDefrag to filter learner/unstarted members, build member/status mappings, sort non-leader endpoints by fragmentation descending (leader appended last), attempt defrag on the first fragmented member only, record attempt/success/failure events, wait for cluster recovery on success, update lastDefragTime and reset failure counters, and set degraded state on immediate errors. Removes previous multi-member partial-aggregate path.
clearDegraded update
pkg/operator/defragcontroller/defragcontroller.go
clearDegraded no longer returns the status update error; it performs the update and logs warnings on failure.
Tests and fake Etcd helper
pkg/etcdcli/helpers.go, pkg/operator/defragcontroller/defragcontroller_test.go
fakeEtcdClient.Defragment now updates stored StatusResponse.DbSize from DbSizeInUse. Tests updated to set defragCooldown: 0 where needed, TestNewDefragController uses looped syncs, TestNewDefragControllerMultiSyncs adjusts error generation and loop counts, and new TestDefragCooldown validates cooldown blocking and reset behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ❓ Inconclusive Custom check specifies Ginkgo test patterns (It blocks, BeforeEach/AfterEach) but PR uses standard Go table-driven tests with t.Run, not Ginkgo framework. Check is inapplicable. Repository uses standard Go testing, not Ginkgo. Confirm check applies only to Ginkgo-based tests or provide standard Go test quality criteria.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Project uses Go standard testing, not Ginkgo. All test names are stable and deterministic with no dynamic content. Check not applicable.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests were added. All test changes use standard Go unit test pattern (func TestXxx), not Ginkgo. Check not applicable to this PR.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Custom check is not applicable. The PR adds only standard Go unit tests (TestDefragCooldown in defragcontroller_test.go), not Ginkgo e2e tests. This check targets only new Ginkgo e2e tests.
Topology-Aware Scheduling Compatibility ✅ Passed No scheduling constraints introduced. Only Go code changes. Defrag controller already checks ControlPlaneTopology and disables on SNO/HyperShift.
Ote Binary Stdout Contract ✅ Passed No OTE Binary Stdout Contract violations detected. All files are library packages with no process-level functions. The fmt.Printf found is in a test function block, which is allowed.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR adds only traditional Go unit tests (func Test*), not Ginkgo e2e tests. The custom check only applies to Ginkgo e2e tests (It(), Describe(), etc.), so it is not applicable here.
Title check ✅ Passed The title accurately summarizes the main change: refactoring the defrag controller to reduce defragmentation impact through a cooldown mechanism and single-member-per-sync approach.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ingvagabund for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/operator/defragcontroller/defragcontroller_test.go`:
- Around line 464-468: The test uses a tiny non-zero defragBudget which can be
flaky; set the DefragController's defragBudget to 0 (or switch the controller to
use a controllable/test clock) so budget exhaustion is deterministic. Locate
where defragBudget is set in the test setup (the variable named defragBudget
used to construct the controller) and replace 1 * time.Nanosecond with 0 (or
adapt the controller constructor to accept a mock clock and use that in the
test) so calling controller.sync(context.TODO(),
factory.NewSyncContext("defrag-controller", eventRecorder)) always observes an
exhausted budget.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e0de8aac-2c9a-42a2-9dee-693e18c04822

📥 Commits

Reviewing files that changed from the base of the PR and between c0614ca and 2c8da6e.

📒 Files selected for processing (2)
  • pkg/operator/defragcontroller/defragcontroller.go
  • pkg/operator/defragcontroller/defragcontroller_test.go

Comment thread pkg/operator/defragcontroller/defragcontroller_test.go Outdated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dusk125 dusk125 changed the title WIP: Add time budget to defrag controller WIP: CNTRLPLANE-3460: Add time budget to defrag controller May 14, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 14, 2026
@openshift-ci-robot

openshift-ci-robot commented May 14, 2026

Copy link
Copy Markdown

@dusk125: This pull request references CNTRLPLANE-3460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Adds a fixed 5-minute time budget per defrag controller reconciliation loop
  • Before each member's defrag, the controller checks elapsed time; if the budget is exceeded, remaining members are skipped and deferred to the next sync cycle
  • Budget-skipped members do not count toward the degraded failure threshold — only real defrag errors trigger degradation

Test plan

  • Existing unit tests pass unchanged
  • New TestDefragBudgetExceeded test verifies budget enforcement with a near-zero budget (1ns): no defrags occur, a DefragControllerBudgetExceeded event is recorded, and the controller does not degrade
  • Manual verification in a test cluster with fragmented etcd members

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

  • Added a configurable time budget for defragmentation that stops further member processing when exceeded.

  • Improvements

  • Enhanced tracking to report attempted vs. successful defragmentations.

  • Improved error accounting and more accurate partial-failure reporting when budget limits or mapping errors occur.

  • Tests

  • Added a test validating behavior when the defragmentation budget is immediately exhausted.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Rework the defrag controller to defrag at most one etcd member per
reconciliation loop instead of all members in a single pass. After a
successful defrag, a 1-hour cooldown prevents further defrags until the
next cycle, spacing them out to be gentler on the cluster.

Key changes:
- Replace time budget with per-sync cooldown (defaultDefragCooldown = 1h)
- Sort non-leader members by fragmentation descending, leader always last
- Post-defrag: verify member responds via Status(), log before/after DbSize
- Replace deprecated wait.Poll with wait.PollUntilContextTimeout
- Fix fake Defragment() to update per-member status for multi-sync tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dusk125 dusk125 changed the title WIP: CNTRLPLANE-3460: Add time budget to defrag controller WIP: CNTRLPLANE-3460: Refactor defrag controller to lower impact of defrag May 15, 2026
@dusk125 dusk125 marked this pull request as ready for review May 15, 2026 14:22
@dusk125 dusk125 changed the title WIP: CNTRLPLANE-3460: Refactor defrag controller to lower impact of defrag CNTRLPLANE-3460: Refactor defrag controller to lower impact of defrag May 15, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2026
@openshift-ci openshift-ci Bot requested review from ardaguclu and p0lyn0mial May 15, 2026 14:22
@openshift-ci

openshift-ci Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

@dusk125: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ipi-ovn-ipv6 de3f881 link true /test e2e-metal-ipi-ovn-ipv6

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

if len(member.ClientURLs) == 0 {
// skip unstarted member
var (
etcdMembers []*etcdserverpb.Member

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be a hashmap


// Sort non-leader members by fragmentation percentage descending so we
// defrag the most fragmented member first each cycle.
slices.SortFunc(endpointStatus, func(a, b *clientv3.StatusResponse) int {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please add a unit test for this sort?


if !c.lastDefragTime.IsZero() && time.Since(c.lastDefragTime) < c.defragCooldown {
klog.V(4).Infof("Defrag cooldown active, %v remaining", c.defragCooldown-time.Since(c.lastDefragTime))
return nil

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could schedule a next defrag if we know the remaining time here

return nil
}

if !c.lastDefragTime.IsZero() && time.Since(c.lastDefragTime) < c.defragCooldown {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be a problem that a member could exceed the fragmentation limit for 1h in high churn cluster, since this backfoff / cooldown is member agnostic?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm maybe we can frame this cooldown differently.... e.g after a defrag, the cluster must have no unhealthy members for at least 10m before defragmenting the next member

recorder.Eventf("DefragControllerDefragmentSuccess", "etcd member has been defragmented: %s, memberID: %d", member.Name, member.ID)

// Give cluster time to recover before the next sync defrags another member.
if err := wait.PollUntilContextTimeout(ctx, pollWaitDuration, pollTimeoutDuration, false,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the best pattern for controllers. Could we just select a single member try to defrag it and use a queue for any backoff checks like checking the health?

I suppose the defrag is not that often so a 10m between each member defrag shouldn't be an issue?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 144 already has that

if !etcdcli.IsClusterHealthy(memberHealth)

so we can have the backoff work through the sync loop for some period of time

numDefragFailures int
defragWaitDuration time.Duration
defragCooldown time.Duration
lastDefragTime time.Time

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you will have to persist this somewhere (operator conditions?), imagine you're restarting CEO within that hour.

recorder.Eventf("DefragControllerDefragmentAttempt", "Attempting defrag on member: %s, memberID: %x, dbSize: %d, dbInUse: %d, leader ID: %d", member.Name, member.ID, status.DbSize, status.DbSizeInUse, status.Leader)
if _, err := c.defragClient.Defragment(ctx, member); err != nil {
// Defrag can timeout if defragmentation takes longer than etcdcli.DefragDialTimeout.
errMsg := fmt.Sprintf("failed defrag on member: %s, memberID: %x: %v", member.Name, member.ID, err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might not indicate a failure in this case. We have two options: increase the timeout (plus some slack) based on the db size, or just always use some crazy value like 10 minutes.

Not sure it makes sense to poll the status instead, so have a short timeout always and just poll until we're back up and the size has been reduced.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay I see you added the polling on success - but we need the polling on failure 😄

}
recorder.Eventf("DefragControllerDefragmentSuccess", "etcd member has been defragmented: %s, memberID: %d", member.Name, member.ID)

// Give cluster time to recover before the next sync defrags another member.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the defrag call came back, then the cluster should be in healthy state and respond. I think the polling is a bit misplaced here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants