Skip to content

OCPBUGS-84695: update node-to-pacemaker reconciliation logic to prevent pacemaker split-brain#1605

Draft
jaypoulz wants to merge 1 commit into
openshift:mainfrom
jaypoulz:OCPBUGS-84695-tnf-update-setup-node-check
Draft

OCPBUGS-84695: update node-to-pacemaker reconciliation logic to prevent pacemaker split-brain#1605
jaypoulz wants to merge 1 commit into
openshift:mainfrom
jaypoulz:OCPBUGS-84695-tnf-update-setup-node-check

Conversation

@jaypoulz

@jaypoulz jaypoulz commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

The update-setup job has a rare chance of incorrectly treating upgrade-related node reboots as node replacement scenarios, causing pacemaker cluster split-brain.

When a node reboots during upgrade, it appears offline in pacemaker. The update-setup job detects this and runs node replacement logic (removing the offline node from cluster config), even though the node is just temporarily rebooting and will rejoin. This results in mismatched pacemaker configurations:

  • Node-0: Pacemaker configured with 1 node, has quorum
  • Node-1: Pacemaker configured with 2 nodes, no quorum

The fix adds a check to verify if the offline node still exists in Kubernetes before treating it as a replacement scenario. Node replacement requires the user to DELETE the node from Kubernetes, so:

  • If node offline in pacemaker BUT still exists in k8s → Skip replacement logic (node just rebooted, will rejoin naturally)

  • If node offline in pacemaker AND deleted from k8s → Proceed with replacement logic (actual replacement scenario)

This ensures update-setup only runs replacement logic when both conditions are met, preventing split-brain during upgrades.

Tests cover:

  • Node exists in Kubernetes but offline in pacemaker (upgrade reboot) → Should skip replacement logic
  • Node deleted from Kubernetes and offline in pacemaker (replacement) → Should proceed with replacement logic
  • Error handling for NotFound vs other errors

This PR works in tandem with ClusterLabs/resource-agents#2155

Summary by CodeRabbit

  • New Features

    • Event-driven cluster membership reconciliation for node add/ready/delete with targeted, deterministic per-generation update snapshots
    • Staged membership changes that safely remove/add nodes and sync cluster services only when needed
  • Bug Fixes

    • Better alignment of Kubernetes node identity with pacemaker membership, handling name/IP replacements and stale etcd members
    • Fencing and auth flows gated correctly on external etcd transition; auth supports single or dual-node setups
  • Tests

    • Added comprehensive unit tests covering reconciliation, peer IP parsing, configmap selection, and edge cases
  • Chores

    • Granted configmap watch permissions and added a label constant for update snapshots

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 29, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 29, 2026

Copy link
Copy Markdown

@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The update-setup job has a rare chance of incorrectly treating upgrade-related node reboots as node replacement scenarios, causing pacemaker cluster split-brain.

When a node reboots during upgrade, it appears offline in pacemaker. The update-setup job detects this and runs node replacement logic (removing the offline node from cluster config), even though the node is just temporarily rebooting and will rejoin. This results in mismatched pacemaker configurations:

  • Node-0: Pacemaker configured with 1 node, has quorum
  • Node-1: Pacemaker configured with 2 nodes, no quorum

The fix adds a check to verify if the offline node still exists in Kubernetes before treating it as a replacement scenario. Node replacement requires the user to DELETE the node from Kubernetes, so:

  • If node offline in pacemaker BUT still exists in k8s → Skip replacement logic (node just rebooted, will rejoin naturally)

  • If node offline in pacemaker AND deleted from k8s → Proceed with replacement logic (actual replacement scenario)

This ensures update-setup only runs replacement logic when both conditions are met, preventing split-brain during upgrades.

Tests cover:

  • Node exists in Kubernetes but offline in pacemaker (upgrade reboot) → Should skip replacement logic
  • Node deleted from Kubernetes and offline in pacemaker (replacement) → Should proceed with replacement logic
  • Error handling for NotFound vs other errors

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 05af3c8d-7988-4643-b02b-fc77ea42eae4

📥 Commits

Reviewing files that changed from the base of the PR and between 8f6ff88 and e53b901.

📒 Files selected for processing (10)
  • bindata/tnfdeployment/role.yaml
  • pkg/tnf/auth/runner.go
  • pkg/tnf/operator/nodehandler.go
  • pkg/tnf/operator/nodehandler_test.go
  • pkg/tnf/operator/starter.go
  • pkg/tnf/operator/starter_test.go
  • pkg/tnf/pkg/pcs/auth.go
  • pkg/tnf/pkg/tools/tnf_labels.go
  • pkg/tnf/update-setup/runner.go
  • pkg/tnf/update-setup/runner_test.go
🚧 Files skipped from review as they are similar to previous changes (10)
  • pkg/tnf/pkg/pcs/auth.go
  • pkg/tnf/auth/runner.go
  • pkg/tnf/pkg/tools/tnf_labels.go
  • bindata/tnfdeployment/role.yaml
  • pkg/tnf/update-setup/runner_test.go
  • pkg/tnf/operator/starter.go
  • pkg/tnf/update-setup/runner.go
  • pkg/tnf/operator/nodehandler_test.go
  • pkg/tnf/operator/starter_test.go
  • pkg/tnf/operator/nodehandler.go

Walkthrough

Event-driven TNF node handling: per-event add/ready/delete dispatch, single-node update-setup runners using per-generation ConfigMap snapshots, pacemaker+etcd membership reconciliation by comparing Kubernetes node name+IP, staged removals/additions, and related tests, RBAC, and label support.

Changes

TNF Node Lifecycle Refactor: Event-Driven Membership Reconciliation

Layer / File(s) Summary
Label constant and RBAC
pkg/tnf/pkg/tools/tnf_labels.go, bindata/tnfdeployment/role.yaml
Adds TnfUpdateSetupComponentValue and grants get,list,watch on configmaps so the operator can read update-setup snapshot ConfigMaps.
Update-setup reconciliation implementation
pkg/tnf/update-setup/runner.go
Selects highest-generation update-setup ConfigMap for the node, decodes node snapshot, builds deterministic 1–2 node cluster config, reconciles pacemaker membership (name+IP) via pcs cluster config show, stages pacemaker removals/additions, removes unstarted etcd members, waits for stable etcd revision, configures fencing, updates etcd node_ip_map, and conditionally forces/enables/starts pacemaker on additions.
Update-setup unit tests
pkg/tnf/update-setup/runner_test.go
Table-driven tests for pacemaker reconciliation outcomes, etcd peer URL IP extraction (IPv4/IPv6), ConfigMap targeting selection, PCS 401 output detection, and ConfigMap selection logic.
Node handler: event-driven dispatch & snapshotting
pkg/tnf/operator/nodehandler.go
Refactors handleNodes to accept triggeringNode and eventType, adds per-event handlers (add/ready/delete) with external-etcd gating, generates per-generation node-list ConfigMaps, orchestrates auth/update-setup/after-setup jobs targeting a single node, and removes the prior "jobs exist" gating.
Node handler unit tests
pkg/tnf/operator/nodehandler_test.go
Extends TestHandleNodes with triggeringNode/eventType and externalEtcd gating, updates TNF job fixture helpers, and adds TestControlPlaneNodesExceptName.
Operator starter & fencing changes
pkg/tnf/operator/starter.go
Always enqueue add events (no readiness skip), pass explicit event types, add reconcile catch-up for missed two-ready bootstrap at startup, and gate fencing restarts on external etcd transition completion.
Starter unit tests
pkg/tnf/operator/starter_test.go
Update mocks to the new handleNodes signature and add TestHandleFencingSecretChange_skipsBeforeExternalEtcdTransition.
PCS auth and TNF auth runner tweak
pkg/tnf/pkg/pcs/auth.go, pkg/tnf/auth/runner.go
pcs host auth includes second-node args only when both are set; RunTnfAuth uses GetClusterConfigIgnoreMissingNode to allow config generation when nodes are missing.

Sequence Diagram

sequenceDiagram
    participant Node as Node (add/ready/delete)
    participant Starter as Starter
    participant NodeHandler as NodeHandler
    participant UpdateSetup as UpdateSetup
    participant Pacemaker as Pacemaker
    participant Etcd as Etcd
    participant JobCtrl as JobControllers

    Node->>Starter: NodeAdded / NodeReady / NodeDeleted
    Starter->>NodeHandler: handleNodesWithRetry(triggeringNode,eventType)

    alt Add/Delete: wait ExternalEtcd transition
        NodeHandler->>NodeHandler: Check ExternalEtcdCompletedTransition
    end

    NodeHandler->>NodeHandler: List control-plane nodes and check readiness
    NodeHandler->>UpdateSetup: runUpdateSetupFunc(eventType, targetNode)
    UpdateSetup->>UpdateSetup: Select highest-generation ConfigMap and decode node list
    UpdateSetup->>Pacemaker: pcs cluster config show --output-format json
    Pacemaker-->>UpdateSetup: pacemaker node map (name -> peer IP)
    UpdateSetup->>UpdateSetup: compute nodesToRemove and nodesToAdd
    UpdateSetup->>Pacemaker: pcs cluster node remove --force --skip-offline
    UpdateSetup->>Etcd: etcdctl member list -w json
    Etcd-->>UpdateSetup: member list
    UpdateSetup->>Etcd: remove unstarted members (peer IP not in k8s nodes)
    alt Nodes to Add
        UpdateSetup->>Etcd: WaitForStableRevision
        UpdateSetup->>Pacemaker: pcs cluster node add
        UpdateSetup->>UpdateSetup: configure fencing and update etcd node_ip_map
        UpdateSetup->>Pacemaker: pcs cluster force_new_cluster
        UpdateSetup->>Pacemaker: pcs cluster enable/start
    end
    UpdateSetup->>Pacemaker: pcs cluster sync
    UpdateSetup->>Pacemaker: pcs cluster start --all
    UpdateSetup-->>NodeHandler: Reconciliation complete
    NodeHandler->>JobCtrl: start TNF job controllers when ready
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • clobrano
🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning TestUpdateSetupConfigMapSelectsNode and TestPcsClusterNodeRemoveOutputContains401 test multiple scenarios without assertion messages, violating the requirement for meaningful failure messages. Add descriptive failure messages to assertions in these tests or refactor as table-driven tests with named subtests for better diagnostics.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly references the Jira ticket and accurately summarizes the main change: updating node-to-pacemaker reconciliation logic to prevent split-brain, which aligns with the PR objectives and code changes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR does not use Ginkgo framework; all tests use standard Go testing with static, descriptive test case names with no dynamic content.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. All test files use standard Go testing.T with testify assertions, not Ginkgo framework. The check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests added. The PR only adds standard Go unit tests (TestXxx functions using testing.T) to pkg/tnf/ packages, which are not subject to the SNO compatibility check.
Topology-Aware Scheduling Compatibility ✅ Passed PR explicitly supports TNF (2-node) topology. Code validates node counts, uses direct NodeName spec (no affinity/scheduling constraints), and gates operations to 2-node clusters.
Ote Binary Stdout Contract ✅ Passed PR modifies only library functions and tests, not process-level code. No stdout writes at process level (main, init, TestMain, BeforeSuite, etc.) were introduced or modified.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR adds only unit tests (using standard Go testing with *testing.T), not Ginkgo e2e tests. The check targets Ginkgo e2e test patterns (It(), Describe(), etc.), which are not present in this PR.
No-Weak-Crypto ✅ Passed No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons detected in any modified files.
Container-Privileges ✅ Passed Privileged settings in job.yaml and cronjob.yaml are justified for pacemaker/PCS operations requiring root access and process visibility.
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data exposed in logs. All logging is operational data: node names, internal cluster IPs, command outputs—no passwords, tokens, keys, PII, or credentials.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from clobrano and slintes April 29, 2026 15:35
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 29, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

The update-setup job has a rare chance of incorrectly treating upgrade-related node reboots as node replacement scenarios, causing pacemaker cluster split-brain.

When a node reboots during upgrade, it appears offline in pacemaker. The update-setup job detects this and runs node replacement logic (removing the offline node from cluster config), even though the node is just temporarily rebooting and will rejoin. This results in mismatched pacemaker configurations:

  • Node-0: Pacemaker configured with 1 node, has quorum
  • Node-1: Pacemaker configured with 2 nodes, no quorum

The fix adds a check to verify if the offline node still exists in Kubernetes before treating it as a replacement scenario. Node replacement requires the user to DELETE the node from Kubernetes, so:

  • If node offline in pacemaker BUT still exists in k8s → Skip replacement logic (node just rebooted, will rejoin naturally)

  • If node offline in pacemaker AND deleted from k8s → Proceed with replacement logic (actual replacement scenario)

This ensures update-setup only runs replacement logic when both conditions are met, preventing split-brain during upgrades.

Tests cover:

  • Node exists in Kubernetes but offline in pacemaker (upgrade reboot) → Should skip replacement logic
  • Node deleted from Kubernetes and offline in pacemaker (replacement) → Should proceed with replacement logic
  • Error handling for NotFound vs other errors

Summary by CodeRabbit

  • New Features

  • Added a pre-check during the update process to detect whether an offline pacemaker node has been deleted from the cluster. If the node no longer exists, the replacement sequence is automatically skipped; otherwise, replacement proceeds normally.

  • Tests

  • Added unit tests for node existence checking during update operations.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jaypoulz

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot

Copy link
Copy Markdown

@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jaypoulz

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery

@openshift-ci

openshift-ci Bot commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

@jaypoulz: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-1of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-2of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b3ba7540-43e1-11f1-8734-dc441d1b18ce-0

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/tnf/update-setup/runner_test.go (1)

18-102: Tests validate API primitives, but not the runner branch decisions end-to-end.

This table test confirms fake-client Get/IsNotFound behavior, but it doesn’t exercise RunTnfUpdateSetup (or a helper) to assert “skip vs proceed vs fail” return-path decisions directly. For this critical flow, consider extracting the node-existence decision into a small function and unit-test that function.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/update-setup/runner_test.go` around lines 18 - 102, The test is only
exercising the fake client Get behavior, not the runner's branch decisions;
extract the node-existence decision into a small function (e.g., func
checkNodeExists(ctx context.Context, kube kubernetes.Interface, nodeName string)
(exists bool, err error)) and update RunTnfUpdateSetup to call that helper; then
replace the table test to unit-test checkNodeExists directly (using
fake.NewClientset) to assert the three return paths (exists/skip, not
found/proceed, unexpected error/fail). Ensure the helper is named uniquely
(checkNodeExists) and referenced from RunTnfUpdateSetup so the new unit tests
verify the runner branch logic end-to-end.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 124-126: The test passes a non-error value (&runtime.Unknown{}) to
apierrors.NewInternalError via simulateError; replace the runtime.Unknown value
with an actual error (e.g., errors.New(...) or fmt.Errorf(...)) or format the
runtime.Unknown into an error string so apierrors.NewInternalError receives an
error interface; update the simulateError assignment that calls
apierrors.NewInternalError in runner_test.go accordingly (referencing
simulateError, apierrors.NewInternalError, and runtime.Unknown).

---

Nitpick comments:
In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 18-102: The test is only exercising the fake client Get behavior,
not the runner's branch decisions; extract the node-existence decision into a
small function (e.g., func checkNodeExists(ctx context.Context, kube
kubernetes.Interface, nodeName string) (exists bool, err error)) and update
RunTnfUpdateSetup to call that helper; then replace the table test to unit-test
checkNodeExists directly (using fake.NewClientset) to assert the three return
paths (exists/skip, not found/proceed, unexpected error/fail). Ensure the helper
is named uniquely (checkNodeExists) and referenced from RunTnfUpdateSetup so the
new unit tests verify the runner branch logic end-to-end.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b46e7555-1b53-4882-b0b2-fa6f26787667

📥 Commits

Reviewing files that changed from the base of the PR and between 93d2b65 and e9ed644.

📒 Files selected for processing (2)
  • pkg/tnf/update-setup/runner.go
  • pkg/tnf/update-setup/runner_test.go

Comment thread pkg/tnf/update-setup/runner_test.go Outdated
@jaypoulz

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot

Copy link
Copy Markdown

@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jaypoulz jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch from e9ed644 to 0533a44 Compare April 29, 2026 15:53
Comment thread pkg/tnf/update-setup/runner.go Outdated

// Check if offline node still exists in Kubernetes
// Node replacement requires user to DELETE the node from Kubernetes
// If the node still exists, it's likely just rebooted (e.g., during upgrade)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This job only runs when we have 2 ready nodes available... but yeah, pacemaker status might be behind, or reboot by upgrade just started... good catch 🙈

Might be good enough to compare offlineNodeName with currentNodeName and otherNodeName... when one of them matches, skip. But the result is the same I guess :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I realized from this comment is that both approaches have the same problem. Same-name node replacement requires us to re-add the node. So the k8s node check doesn't work because it would block it on the same name, and the currentNodeName and otherNodeName check doesn't work they would match the replaced node.

I think we need to run update-setup to reconcile on node delete because offline is insufficient to distinguish between temporary outage and node-that-needs-replacement.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I'm pushing a new approach. Don't run setup until we have 2 nodes for initial setup, but after that, always reconcile pacemaker state to k8s state. This means that deleting the node in k8s results in us removing it from pacemaker.

This allows us to determine user intention based on the declarative k8s state. So now if you reboot, the k8s nodes are the same, so no-op. If you delete a node, that's an intentional change, so we clean it up on the backend.

It's more involved though, so I'll need to do some soak testing for this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same-name node replacement

Ah yes, of course. Has been too long ago already, me dealing with TNF 😎

Don't run setup until we have 2 nodes for initial setup, but after that, always reconcile pacemaker state to k8s state

Hm, not sure about that tbh...
The idea still was: we have 2 ready nodes, but it's not the initial setup, so "update" pacemaker if needed to reflect reality. Changing the trigger would be a risky change IMHO, and can introduce regressions. I mean, this worked well so far for the intended use case, not?
IMHO we only need to better understand the situation during job execution: do we need to deal with the offline node or not? So: can we check something on the node resource to find out?
Pragmatic idea: check the creation date...? If the node is "new" run the setup, else skip it. 🤔
Maybe there is something better.
Or: I don't remember details, but I think I considered putting a label on nodes after they were successfully configured for pacemaker, but that didn't happen... could be useful in this case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can afford the extra iteration time to get this right since the bug is easy enough to work around at upgrade time. Beyond it fixing this edge case, the other reason I like reworking the reconciliation approach is that I think we will need to in the future to support Adaptable topology.

Scaling from 1 to 2 to 3 nodes with automatic quorum recovery will require us to set up pacemaker on a single node cluster first, then scale up as we go. This would be one of multiple things that we would need to harden to handle that scale-up / scale-down logic.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing is absolutely true though - we can't move forward with this patch before updated podman-etcd to handle the single-node case in the monitor operation first.

@jaypoulz jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch from 0533a44 to 3bf1712 Compare April 29, 2026 21:10

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/tnf/operator/nodehandler.go`:
- Around line 121-139: The current logic uses jobsExist (from tnfSetupJobsExist)
as a post-setup signal which is set true as soon as startTnfJobcontrollers
creates setup/auth/after-setup jobs; change the gate to require a real
completion signal instead: replace or extend tnfSetupJobsExist with a check for
a TNF setup completion marker (for example a specific Job completion condition,
a "tnf-setup-completed" ConfigMap/CR, or a Job status check) and expose it as
something like tnfSetupCompleted(ctx, kubeClient) (or return a second boolean
from tnfSetupJobsExist indicating completion); then use that completion boolean
in the node count checks (i.e. when evaluating len(nodeList) < 2, consult
setupCompleted not jobsExist) so a partially-created setup after a CEO restart
is treated as not completed and the initial 2-node flow is preserved.

In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 149-215: The test in runner_test.go reimplements the
reconciliation logic instead of exercising the production code in runner.go;
extract the reconciliation decision into a single exported or package-private
helper (e.g., ReconcileNodes or decideNodeAddRemove) in runner.go that takes
k8sNodes []string and pacemakerOnline/offline []string and returns
(nodeToRemove, nodeToAdd, skip bool), then update the table-driven tests to call
that helper directly (or alternatively drive RunTnfUpdateSetup using
fakes/mocks) so the tests validate the actual implementation rather than a
duplicate inline algorithm.

In `@pkg/tnf/update-setup/runner.go`:
- Around line 313-326: Replace the unsafe substring check
strings.Contains(peerURL, nodeIP) with exact host comparison: parse each peerURL
(use url.Parse) inside the loop over member.PeerURLs, extract the host portion
(handle optional port with net.SplitHostPort and IPv6 brackets), then compare
the extracted host string exactly to nodeIP before proceeding to format
member.ID, log, and call exec.Execute to remove the member; only call member
remove when the host equals nodeIP to avoid accidental matches.
- Around line 168-176: The removal path must not abort when nodeToRemove isn't
present in cfg.NodeName1/NodeName2; instead, stop using
cfg.NodeName1/NodeName2/NodeIP1/NodeIP2 to derive the member to remove and query
Pacemaker/etcd state directly (e.g., list pacemaker nodes or etcd members) to
find the member ID and peerURL/IP for nodeToRemove, then use that ID to perform
the removal; update the logic in the block that references nodeToRemove (and the
similar code at the other instance around lines 239-248) to call the
pacemaker/etcd discovery functions, handle not-found as a no-op or
log-and-continue, and only fall back to cfg lookups if the live discovery fails.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3032f008-368a-4f16-b3ad-fdf0e5c3b3e4

📥 Commits

Reviewing files that changed from the base of the PR and between 0533a44 and 3bf1712.

📒 Files selected for processing (4)
  • pkg/tnf/operator/nodehandler.go
  • pkg/tnf/operator/nodehandler_test.go
  • pkg/tnf/update-setup/runner.go
  • pkg/tnf/update-setup/runner_test.go

Comment thread pkg/tnf/operator/nodehandler.go Outdated
Comment thread pkg/tnf/update-setup/runner_test.go
Comment thread pkg/tnf/update-setup/runner.go Outdated
Comment thread pkg/tnf/update-setup/runner.go Outdated
@jaypoulz jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch 2 times, most recently from 78f3141 to fad9968 Compare April 29, 2026 21:33

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
pkg/tnf/update-setup/runner.go (2)

271-283: ⚠️ Potential issue | 🔴 Critical

Deleted nodes cannot be resolved through the current cluster config.

nodeNames here are the pacemaker members that Kubernetes no longer has. For that case, getNodeIP(nodeName) fails by definition, so the code aborts after pacemaker removal and leaves the stale etcd member behind.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/update-setup/runner.go` around lines 271 - 283, The loop in
removeEtcdMembersForNodes currently aborts when getNodeIP(nodeName) fails (which
happens for pacemaker-only/deleted nodes) and leaves stale etcd members; change
the logic so a failed getNodeIP does not return but instead logs a warning and
attempts removal by node identity: implement a fallback removal path (e.g., add
removeEtcdMemberByName(nodeName) or enhance removeEtcdMemberByIP to accept a
nodeName fallback) and call that when getNodeIP returns an error; ensure
removeEtcdMembersForNodes still logs success/failure via klog and only returns
on unrecoverable errors from the removal call (removeEtcdMemberByIP or new
removeEtcdMemberByName).

311-324: ⚠️ Potential issue | 🔴 Critical

Match the etcd peer host exactly before removing a member.

strings.Contains(peerURL, nodeIP) is unsafe for values like 10.0.0.1 vs 10.0.0.10, and it is brittle with ports/IPv6 formatting. Parse the peer URL and compare the hostname exactly before calling member remove.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/update-setup/runner.go` around lines 311 - 324, The code in the loop
over memberList.Members uses strings.Contains(peerURL, nodeIP) which can match
substrings and fail on ports/IPv6; update the logic in the loop that iterates
member.PeerURLs (the block around strings.Contains(peerURL, nodeIP)) to parse
each peerURL (use url.Parse), extract the host component and if it contains a
port use net.SplitHostPort to get the exact hostname (handling IPv6 brackets),
then compare that hostname exactly to nodeIP before constructing memberIDHex and
running exec.Execute for member remove; keep the existing memberIDHex formatting
and logging but only call etcdctl member remove when the parsed host equals
nodeIP.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 96-99: When building the k8sNodes set, skip empty node names so
the map does not contain "" which causes reconciliation to treat a non-existent
node as present; update the construction of k8sNodes to only insert
cfg.NodeName1 and cfg.NodeName2 when they are non-empty (e.g. check
len(cfg.NodeName1) > 0 and len(cfg.NodeName2) > 0 before adding), keeping the
map type and usage same so downstream logic that checks membership remains
unchanged.

---

Duplicate comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 271-283: The loop in removeEtcdMembersForNodes currently aborts
when getNodeIP(nodeName) fails (which happens for pacemaker-only/deleted nodes)
and leaves stale etcd members; change the logic so a failed getNodeIP does not
return but instead logs a warning and attempts removal by node identity:
implement a fallback removal path (e.g., add removeEtcdMemberByName(nodeName) or
enhance removeEtcdMemberByIP to accept a nodeName fallback) and call that when
getNodeIP returns an error; ensure removeEtcdMembersForNodes still logs
success/failure via klog and only returns on unrecoverable errors from the
removal call (removeEtcdMemberByIP or new removeEtcdMemberByName).
- Around line 311-324: The code in the loop over memberList.Members uses
strings.Contains(peerURL, nodeIP) which can match substrings and fail on
ports/IPv6; update the logic in the loop that iterates member.PeerURLs (the
block around strings.Contains(peerURL, nodeIP)) to parse each peerURL (use
url.Parse), extract the host component and if it contains a port use
net.SplitHostPort to get the exact hostname (handling IPv6 brackets), then
compare that hostname exactly to nodeIP before constructing memberIDHex and
running exec.Execute for member remove; keep the existing memberIDHex formatting
and logging but only call etcdctl member remove when the parsed host equals
nodeIP.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 975b05c2-c5db-40cd-ab9f-96db7d22ec11

📥 Commits

Reviewing files that changed from the base of the PR and between 3bf1712 and 78f3141.

📒 Files selected for processing (4)
  • pkg/tnf/operator/nodehandler.go
  • pkg/tnf/operator/nodehandler_test.go
  • pkg/tnf/update-setup/runner.go
  • pkg/tnf/update-setup/runner_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/tnf/operator/nodehandler_test.go
  • pkg/tnf/update-setup/runner_test.go
  • pkg/tnf/operator/nodehandler.go

Comment thread pkg/tnf/update-setup/runner.go Outdated
@jaypoulz jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch from fad9968 to 03a2a57 Compare April 29, 2026 21:37

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
pkg/tnf/operator/nodehandler_test.go (1)

354-360: Consider using a prefix check instead of exact name matches.

The hardcoded list of job names is fragile. If tests add new master names (e.g., tnf-after-setup-master-2), this condition would need updating.

♻️ Suggested improvement
 	labels := map[string]string{
 		"app.kubernetes.io/component": "two-node-fencing-setup",
 	}
 	// Mark after-setup jobs to indicate initial setup completion
-	if name == "tnf-after-setup" || name == "tnf-after-setup-master-0" || name == "tnf-after-setup-master-1" {
+	if strings.HasPrefix(name, "tnf-after-setup") {
 		labels["app.kubernetes.io/name"] = "tnf-after-setup"
 	}

This would require adding "strings" to the imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/operator/nodehandler_test.go` around lines 354 - 360, The code
currently checks exact job names when setting labels for after-setup jobs (the
block that updates the labels map when name == "tnf-after-setup" || name ==
"tnf-after-setup-master-0" || name == "tnf-after-setup-master-1"); replace these
exact-match checks with a prefix check (e.g., use strings.HasPrefix(name,
"tnf-after-setup")) so new masters like "tnf-after-setup-master-2" are handled
automatically, and add the "strings" import to the test file.
pkg/tnf/update-setup/runner.go (1)

293-296: Unused parameters: nodeNames and getNodeIP are never used.

The function removes all unstarted etcd members based on member.Name == "", but the passed parameters are never referenced. This makes the function signature misleading and the caller's effort to pass getNodeIP pointless.

♻️ Suggested signature simplification

If the intent is to remove all unstarted members regardless of which nodes were removed:

-func removeEtcdMembersForNodes(ctx context.Context, nodeNames []string, getNodeIP func(string) (string, error)) error {
-	if len(nodeNames) == 0 {
-		return nil
-	}
+func removeUnstartedEtcdMembers(ctx context.Context) error {

Update call sites at lines 180 and 230 accordingly:

-	if err := removeEtcdMembersForNodes(ctx, nodesToRemove, getNodeIP); err != nil {
+	if len(nodesToRemove) > 0 {
+		if err := removeUnstartedEtcdMembers(ctx); err != nil {
+			return err
+		}
+	}

Alternatively, if you want to keep the node-specific behavior for future use, consider using getNodeIP to match members by IP via removeEtcdMemberByIP.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/update-setup/runner.go` around lines 293 - 296, The function
removeEtcdMembersForNodes has unused parameters nodeNames and getNodeIP; either
simplify the signature by removing those parameters and update callers to call
removeEtcdMembersForNodes(ctx) (i.e., make it remove all unstarted members as it
already does), or implement node-specific removal by iterating nodeNames,
calling getNodeIP(node) to obtain IPs and then call removeEtcdMemberByIP(ctx,
ip) for matches; update removeEtcdMembersForNodes to reference getNodeIP and
nodeNames (or remove them) and adjust call sites accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 349-412: The function removeEtcdMemberByIP is dead code (never
called); either delete it or wire it into the existing workflow — specifically
replace or refactor the current removeEtcdMembersForNodes usage to call
removeEtcdMemberByIP for each nodeIP, or remove the entire removeEtcdMemberByIP
function and clean up any now-unused imports (e.g., url, net, strings) and
tests; locate the existing caller(s) of removeEtcdMembersForNodes and either
iterate node IPs invoking removeEtcdMemberByIP (preserving its parsing/hex-ID
logic) or remove the function and ensure no references remain.

---

Nitpick comments:
In `@pkg/tnf/operator/nodehandler_test.go`:
- Around line 354-360: The code currently checks exact job names when setting
labels for after-setup jobs (the block that updates the labels map when name ==
"tnf-after-setup" || name == "tnf-after-setup-master-0" || name ==
"tnf-after-setup-master-1"); replace these exact-match checks with a prefix
check (e.g., use strings.HasPrefix(name, "tnf-after-setup")) so new masters like
"tnf-after-setup-master-2" are handled automatically, and add the "strings"
import to the test file.

In `@pkg/tnf/update-setup/runner.go`:
- Around line 293-296: The function removeEtcdMembersForNodes has unused
parameters nodeNames and getNodeIP; either simplify the signature by removing
those parameters and update callers to call removeEtcdMembersForNodes(ctx)
(i.e., make it remove all unstarted members as it already does), or implement
node-specific removal by iterating nodeNames, calling getNodeIP(node) to obtain
IPs and then call removeEtcdMemberByIP(ctx, ip) for matches; update
removeEtcdMembersForNodes to reference getNodeIP and nodeNames (or remove them)
and adjust call sites accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 273e7f58-c686-46d5-b227-f828930b53ef

📥 Commits

Reviewing files that changed from the base of the PR and between 78f3141 and 03a2a57.

📒 Files selected for processing (4)
  • pkg/tnf/operator/nodehandler.go
  • pkg/tnf/operator/nodehandler_test.go
  • pkg/tnf/update-setup/runner.go
  • pkg/tnf/update-setup/runner_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/tnf/update-setup/runner_test.go

Comment thread pkg/tnf/update-setup/runner.go Outdated

@fonta-rh fonta-rh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: OCPBUGS-84695 — Fix TNF Pacemaker Split-Brain

The core reconciliation approach is solid — comparing k8s ConfigMap (desired state) vs pacemaker XML membership (actual state) is fundamentally better than the old grep Offline | awk shell parsing. Three findings from review:


1. Dead code: removeEtcdMemberByIP() + unused getNodeIP parameter

removeEtcdMemberByIP() (runner.go:472-535, 63 lines) is defined but never called anywhere in this PR. Additionally, the getNodeIP callback is passed to removeEtcdMembersForNodes() at lines 320 and 370, but it's never dereferenced inside the function body — removeEtcdMembersForNodes removes ALL unstarted members regardless of node identity.

Was removeEtcdMemberByIP intended to be wired in as a fallback (e.g., when unstarted-name detection fails) but forgotten? If it's not needed, removing both the function and the dead getNodeIP parameter would clean things up. If removing, consider renaming removeEtcdMembersForNodesremoveAllUnstartedEtcdMembers to match the actual behavior (blanket cleanup, not per-node targeting).

2. Missing tests for intermediate setup states

The tnfSetupJobsExist() semantics changed significantly — from "any TNF job exists" to "after-setup job completed successfully (Succeeded > 0)." This is the behavioral hinge of the fix, but two intermediate states aren't tested:

  • After-setup in progress: tnf-after-setup-master-0 exists with Succeeded: 0tnfSetupJobsExist() returns false. With 1 node present, handleNodes blocks. The createTNFJob() fixture hardcodes Succeeded: 1, so this path can't currently be exercised.

  • Partial setup (setup running, after-setup not yet created): tnf-setup job exists but no tnf-after-setup jobs. The old test used createTNFJob("tnf-setup") — that coverage was lost when fixtures were changed to tnf-after-setup-*.

Both are realistic timing windows during normal cluster lifecycle. Suggested additions:

  • "1 node, after-setup job in progress (not completed) - skip reconciliation"
  • "2 nodes, setup jobs exist but no after-setup jobs - start controllers only"

3. removeEtcdMembersForNodes docstring vs behavior mismatch

The docstring says "removes unstarted etcd members after nodes are removed from pacemaker," implying per-node targeting. The actual behavior removes ALL unstarted members regardless of the nodeNames parameter (which is only used for logging). The behavior is correct and safe for a 2-node cluster, but the docstring could mislead a future engineer debugging member removal issues. Updating it to reflect the blanket-cleanup behavior would prevent misunderstandings.


Overall the approach is a clear improvement. The reconciliation logic in reconcileNodes() is clean, well-extracted for testability, and the test coverage for the core diff logic is thorough (14 cases). Nice work.

@jaypoulz jaypoulz changed the title OCPBUGS-84695: fix TNF pacemaker split-brain by checking node existence before replacement OCPBUGS-84695: update node-to-pacemaker reconciliation logic to prevent pacemaker split-brain Apr 30, 2026
@jaypoulz jaypoulz marked this pull request as draft April 30, 2026 16:23
@jaypoulz jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch 4 times, most recently from cb454cf to 1017cfc Compare April 30, 2026 18:36
@jaypoulz jaypoulz marked this pull request as ready for review April 30, 2026 18:39
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 30, 2026
@openshift-ci openshift-ci Bot requested a review from slintes April 30, 2026 18:40
@jaypoulz

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery

@openshift-ci

openshift-ci Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

@jaypoulz: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-1of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-2of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6124da80-44c4-11f1-9cb9-891f45aba7aa-0

@jaypoulz

Copy link
Copy Markdown
Contributor Author

@CodeRabbit resume

@coderabbitai

coderabbitai Bot commented Apr 30, 2026

Copy link
Copy Markdown
✅ Actions performed

Reviews resumed.

@openshift-ci-robot

Copy link
Copy Markdown

@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The update-setup job has a rare chance of incorrectly treating upgrade-related node reboots as node replacement scenarios, causing pacemaker cluster split-brain.

When a node reboots during upgrade, it appears offline in pacemaker. The update-setup job detects this and runs node replacement logic (removing the offline node from cluster config), even though the node is just temporarily rebooting and will rejoin. This results in mismatched pacemaker configurations:

  • Node-0: Pacemaker configured with 1 node, has quorum
  • Node-1: Pacemaker configured with 2 nodes, no quorum

The fix adds a check to verify if the offline node still exists in Kubernetes before treating it as a replacement scenario. Node replacement requires the user to DELETE the node from Kubernetes, so:

  • If node offline in pacemaker BUT still exists in k8s → Skip replacement logic (node just rebooted, will rejoin naturally)

  • If node offline in pacemaker AND deleted from k8s → Proceed with replacement logic (actual replacement scenario)

This ensures update-setup only runs replacement logic when both conditions are met, preventing split-brain during upgrades.

Tests cover:

  • Node exists in Kubernetes but offline in pacemaker (upgrade reboot) → Should skip replacement logic
  • Node deleted from Kubernetes and offline in pacemaker (replacement) → Should proceed with replacement logic
  • Error handling for NotFound vs other errors

Summary by CodeRabbit

  • New Features

  • Allow cluster to progress with a single ready control-plane node after initial setup; initial setup remains gated until two nodes are present.

  • Bug Fixes

  • Pacemaker membership now reconciles to match configured cluster nodes.

  • Etcd cleanup removes all unstarted members whose IPs no longer match current nodes.

  • Tests

  • Added unit tests covering pacemaker reconciliation and peer-URL IP extraction; expanded node-handler scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
pkg/tnf/operator/nodehandler.go (1)

152-179: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don’t use raw TNF job existence as the post-setup signal.

tnfSetupJobsExist() is still true as soon as any TNF job object exists, so after a CEO restart mid-initial-setup, Line 160 can treat a partially initialized cluster as “already set up” and allow the 1-node reconciliation path on Line 179. Please gate this on a real completion signal instead of any labeled job.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/operator/nodehandler.go` around lines 152 - 179, tnfSetupJobsExist()
currently treats any TNF Job object as "setup done" which lets code at the
decision point between startTnfJobcontrollersFunc(...) and updateSetupFunc(...)
incorrectly take the update-setup path after a partial initial run; change the
gating to check a real completion signal (e.g., a Job completion
status/condition, a specific finished label/annotation on the setup Job, or a
dedicated ConfigMap/CR/Status field) rather than mere existence: update the
pre-check that uses tnfSetupJobsExist() to call a new/extended helper (or
enhance tnfSetupJobsExist) that inspects Job.Status.CompletionTime or a
designated completion marker, only return true when that completion marker is
present, and keep the rest of the flow (startTnfJobcontrollersFunc(...) and
updateSetupFunc(...)) unchanged so only fully completed initial setups take the
update-setup path.
pkg/tnf/update-setup/runner.go (1)

257-280: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Names-only reconciliation misses same-name replacements.

reconcileNodes only keys on node name. If a replacement reuses the Kubernetes node name but comes back with a different IP, this helper returns no changes, and Lines 142-145 exit before pacemaker membership or node_ip_map is refreshed. The reconciliation key needs to include node identity/IP, not just the name.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/update-setup/runner.go` around lines 257 - 280, reconcileNodes
currently compares only node.Name and misses replacements that reuse a name but
change IP; change reconcileNodes signature to accept k8sNodes as a
map[string]string (name -> IP) or a small struct including IP, and use
pacemaker.Node's IP field to build pacemakerNodesMap keyed by a combined
identity e.g. name+"|"+ip (or a tuple-equivalent) instead of name alone; then
compute nodesToRemove/nodesToAdd by comparing those combined keys; update
callers to pass the refreshed node_ip_map (name->ip) into reconcileNodes and
ensure node_ip_map is refreshed before any early exit that depends on
reconciliation (so pacemaker membership vs k8s identity comparison uses
name+IP).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/tnf/operator/nodehandler.go`:
- Around line 152-179: tnfSetupJobsExist() currently treats any TNF Job object
as "setup done" which lets code at the decision point between
startTnfJobcontrollersFunc(...) and updateSetupFunc(...) incorrectly take the
update-setup path after a partial initial run; change the gating to check a real
completion signal (e.g., a Job completion status/condition, a specific finished
label/annotation on the setup Job, or a dedicated ConfigMap/CR/Status field)
rather than mere existence: update the pre-check that uses tnfSetupJobsExist()
to call a new/extended helper (or enhance tnfSetupJobsExist) that inspects
Job.Status.CompletionTime or a designated completion marker, only return true
when that completion marker is present, and keep the rest of the flow
(startTnfJobcontrollersFunc(...) and updateSetupFunc(...)) unchanged so only
fully completed initial setups take the update-setup path.

In `@pkg/tnf/update-setup/runner.go`:
- Around line 257-280: reconcileNodes currently compares only node.Name and
misses replacements that reuse a name but change IP; change reconcileNodes
signature to accept k8sNodes as a map[string]string (name -> IP) or a small
struct including IP, and use pacemaker.Node's IP field to build
pacemakerNodesMap keyed by a combined identity e.g. name+"|"+ip (or a
tuple-equivalent) instead of name alone; then compute nodesToRemove/nodesToAdd
by comparing those combined keys; update callers to pass the refreshed
node_ip_map (name->ip) into reconcileNodes and ensure node_ip_map is refreshed
before any early exit that depends on reconciliation (so pacemaker membership vs
k8s identity comparison uses name+IP).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 45f6ce68-40be-46fd-97e4-b8f86da424ac

📥 Commits

Reviewing files that changed from the base of the PR and between 78f3141 and 1017cfc.

📒 Files selected for processing (4)
  • pkg/tnf/operator/nodehandler.go
  • pkg/tnf/operator/nodehandler_test.go
  • pkg/tnf/update-setup/runner.go
  • pkg/tnf/update-setup/runner_test.go

@jaypoulz jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch 2 times, most recently from 4831238 to bb32848 Compare April 30, 2026 20:23

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/tnf/update-setup/runner.go (1)

190-196: 💤 Low value

Consider individual error handling for node additions.

If runCommands fails partway through multiple node additions, earlier successful additions won't be rolled back, potentially leaving pacemaker in an inconsistent state. However, given the two-node cluster scope and that subsequent runs will re-reconcile, this is acceptable for the current use case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/update-setup/runner.go` around lines 190 - 196, The current batch
execution using runCommands over nodesToAdd can hide which node failed and
doesn't allow per-node handling; change the logic to run each add command
individually inside the loop (for each nodeToAdd build the "/usr/sbin/pcs
cluster node add %s" command and call a single-command runner or runCommands
with a single-item slice), check and return the error immediately including the
failing node name (reference nodesToAdd and runCommands), so callers get
per-node failure context and easier reconciliation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 190-196: The current batch execution using runCommands over
nodesToAdd can hide which node failed and doesn't allow per-node handling;
change the logic to run each add command individually inside the loop (for each
nodeToAdd build the "/usr/sbin/pcs cluster node add %s" command and call a
single-command runner or runCommands with a single-item slice), check and return
the error immediately including the failing node name (reference nodesToAdd and
runCommands), so callers get per-node failure context and easier reconciliation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0dd5a0b1-09dd-4b24-b159-4754dd93a074

📥 Commits

Reviewing files that changed from the base of the PR and between 1017cfc and bb32848.

📒 Files selected for processing (4)
  • pkg/tnf/operator/nodehandler.go
  • pkg/tnf/operator/nodehandler_test.go
  • pkg/tnf/update-setup/runner.go
  • pkg/tnf/update-setup/runner_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/tnf/update-setup/runner_test.go

@jaypoulz

jaypoulz commented May 1, 2026

Copy link
Copy Markdown
Contributor Author

verification failed, moving back to draft.

@jaypoulz jaypoulz marked this pull request as draft May 1, 2026 15:11
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 1, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 178-180: The early return when len(nodesToAdd) == 0 skips the etcd
node_ip_map update and leaves a stale name→IP mapping; instead of returning
immediately in the block that checks nodesToAdd, invoke the same etcd
refresh/pcs update logic that runs on the add path (call the function or code
that performs the "pcs resource update etcd node_ip_map=..." update) so the
node_ip_map is always refreshed after member/pacemaker cleanup; remove the
premature return or move the node_ip_map update into a common post-cleanup step
so that both remove-only and add+remove flows update etcd (refer to the
nodesToAdd check and the code that performs the pcs resource update of etcd
node_ip_map to locate where to change).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 48976c7b-86a1-4c74-b132-783410e46ccb

📥 Commits

Reviewing files that changed from the base of the PR and between bb32848 and 61229f1.

📒 Files selected for processing (4)
  • pkg/tnf/operator/nodehandler.go
  • pkg/tnf/operator/nodehandler_test.go
  • pkg/tnf/update-setup/runner.go
  • pkg/tnf/update-setup/runner_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/tnf/update-setup/runner_test.go

Comment thread pkg/tnf/update-setup/runner.go Outdated
@jaypoulz jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch from 61229f1 to b496a74 Compare May 1, 2026 15:44

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
pkg/tnf/update-setup/runner_test.go (2)

161-188: ⚖️ Poor tradeoff

IP extraction logic is duplicated instead of testing production code.

The test reimplements the IP extraction logic inline rather than calling a shared helper from runner.go. If the production code in removeUnstartedEtcdMembers changes, this test won't catch regressions. Consider extracting the IP parsing into a package-level helper and testing that directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/update-setup/runner_test.go` around lines 161 - 188, The test
duplicates the IP extraction logic instead of reusing the production
implementation; extract the peer-IP parsing into a package-level helper (e.g.,
parsePeerIP or extractIPFromPeerURL) in the same package where
removeUnstartedEtcdMembers lives, move the current parsing logic from
removeUnstartedEtcdMembers into that helper, update removeUnstartedEtcdMembers
to call the helper, and change the test to call the new helper directly (use the
same test vectors as before) so tests exercise the single source of truth for
parsing.

23-36: 💤 Low value

Duplicate test cases.

The test cases "Upgrade - both nodes in k8s and pacemaker with same IPs - no changes" (lines 23-36) and "No changes - k8s and pacemaker match (name+IP)" (lines 63-76) have identical inputs and expectations. Consider removing one to avoid redundancy.

Also applies to: 63-76

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/update-setup/runner_test.go` around lines 23 - 36, Remove the
duplicate table-driven test entry by deleting one of the identical cases —
either "Upgrade - both nodes in k8s and pacemaker with same IPs - no changes" or
"No changes - k8s and pacemaker match (name+IP)" — so only a single test with
those k8sNodes/pacemakerNodes maps and expectRemove/expectAdd=nil,
expectSkip=true remains; ensure the remaining test keeps the intended
descriptive name and that the test slice used by Test... (the table variable in
runner_test.go) is updated accordingly.
pkg/tnf/operator/nodehandler_test.go (1)

399-402: 💤 Low value

Unused label in test helper.

The app.kubernetes.io/name label added for after-setup jobs isn't used by tnfSetupJobsExist(), which only checks for app.kubernetes.io/component=two-node-fencing-setup. If this label is intended for future use or documentation, consider adding a comment. Otherwise, it can be removed to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/operator/nodehandler_test.go` around lines 399 - 402, The test helper
is adding labels["app.kubernetes.io/name"]="tnf-after-setup" inside the
strings.HasPrefix(name, "tnf-after-setup") branch but that label is unused by
tnfSetupJobsExist() (which only checks
app.kubernetes.io/component=two-node-fencing-setup); either remove the unused
assignment to avoid confusion or keep it but add a brief comment next to the
strings.HasPrefix(...) branch explaining the label is for future/documentation
purposes and not used by tnfSetupJobsExist().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/tnf/operator/nodehandler_test.go`:
- Around line 399-402: The test helper is adding
labels["app.kubernetes.io/name"]="tnf-after-setup" inside the
strings.HasPrefix(name, "tnf-after-setup") branch but that label is unused by
tnfSetupJobsExist() (which only checks
app.kubernetes.io/component=two-node-fencing-setup); either remove the unused
assignment to avoid confusion or keep it but add a brief comment next to the
strings.HasPrefix(...) branch explaining the label is for future/documentation
purposes and not used by tnfSetupJobsExist().

In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 161-188: The test duplicates the IP extraction logic instead of
reusing the production implementation; extract the peer-IP parsing into a
package-level helper (e.g., parsePeerIP or extractIPFromPeerURL) in the same
package where removeUnstartedEtcdMembers lives, move the current parsing logic
from removeUnstartedEtcdMembers into that helper, update
removeUnstartedEtcdMembers to call the helper, and change the test to call the
new helper directly (use the same test vectors as before) so tests exercise the
single source of truth for parsing.
- Around line 23-36: Remove the duplicate table-driven test entry by deleting
one of the identical cases — either "Upgrade - both nodes in k8s and pacemaker
with same IPs - no changes" or "No changes - k8s and pacemaker match (name+IP)"
— so only a single test with those k8sNodes/pacemakerNodes maps and
expectRemove/expectAdd=nil, expectSkip=true remains; ensure the remaining test
keeps the intended descriptive name and that the test slice used by Test... (the
table variable in runner_test.go) is updated accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 50b0a5a6-de76-4e7a-a702-d2e2fa448c0d

📥 Commits

Reviewing files that changed from the base of the PR and between 61229f1 and b496a74.

📒 Files selected for processing (4)
  • pkg/tnf/operator/nodehandler.go
  • pkg/tnf/operator/nodehandler_test.go
  • pkg/tnf/update-setup/runner.go
  • pkg/tnf/update-setup/runner_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/tnf/update-setup/runner.go

@jaypoulz jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch from b496a74 to 8f6ff88 Compare May 11, 2026 17:18
@openshift-ci

openshift-ci Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign clobrano for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/tnf/update-setup/runner.go (1)

113-119: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't silently succeed when the targeted node is not running pacemaker.

Now that only one node runs update-setup, returning nil here drops the reconcile entirely if that chosen node is rebooting or otherwise not running pcs yet. Bubble this up as a failure so the job retries instead of reporting success without reconciling membership.

Suggested change
 	command := "/usr/sbin/pcs cluster status"
 	_, _, err = exec.Execute(ctx, command)
 	if err != nil {
-		klog.Infof("Cluster not running (err: %v), skipping update-setup on this node", err)
-		return nil
+		return fmt.Errorf("cluster not running on targeted node %s: %w", currentNodeName, err)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tnf/update-setup/runner.go` around lines 113 - 119, The code currently
treats a failure running "/usr/sbin/pcs cluster status" as success by logging
and returning nil; change this so the failure bubbles up and the job retries: in
the block that runs exec.Execute(ctx, command) (the "/usr/sbin/pcs cluster
status" check) return the error (or a wrapped error using fmt.Errorf) instead of
nil and change the log to reflect an error return (e.g., klog.Errorf or
klog.Infof with the intent to return err). This ensures the reconcile in
runner.go requeues when pcs is not available rather than silently succeeding.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/tnf/operator/nodehandler.go`:
- Around line 147-165: Node Ready events currently always call handleNodeReady,
which re-triggers initial setup even after ExternalEtcdHasCompletedTransition;
change the switch so NodeEventTypeReady only calls handleNodeReady when
externalEtcdTransitionComplete is false (i.e., during initial bootstrap) and
otherwise skip/return no-op; update the same gating in the other occurrence of
this logic (the block covering lines 232-271) to use the same check; reference
externalEtcdTransitionComplete, ceohelpers.HasExternalEtcdCompletedTransition,
and handleNodeReady when locating and applying the change.
- Around line 337-396: The ConfigMap name derived from the process-local
updateSetupGeneration can collide across CEO restarts; change the cmName
generation in updateSetup (function updateSetup, variable cmName) to include a
restart-safe unique suffix (for example a timestamp or a UUID/random nonce) so
names are globally unique across process restarts; preserve the existing
"tnf-update-setup-<generation>" prefix and keep the numeric generation in the
ConfigMap Data/Labels (so determineUpdateSetupTargetNode and any cleanup logic
that reads the generation still work) and update any tests or cleanup code that
expect the exact name format to tolerate the new suffix.

---

Outside diff comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 113-119: The code currently treats a failure running
"/usr/sbin/pcs cluster status" as success by logging and returning nil; change
this so the failure bubbles up and the job retries: in the block that runs
exec.Execute(ctx, command) (the "/usr/sbin/pcs cluster status" check) return the
error (or a wrapped error using fmt.Errorf) instead of nil and change the log to
reflect an error return (e.g., klog.Errorf or klog.Infof with the intent to
return err). This ensures the reconcile in runner.go requeues when pcs is not
available rather than silently succeeding.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b461e84f-3a05-4de2-8a47-e5212c1140b9

📥 Commits

Reviewing files that changed from the base of the PR and between b496a74 and 8f6ff88.

📒 Files selected for processing (10)
  • bindata/tnfdeployment/role.yaml
  • pkg/tnf/auth/runner.go
  • pkg/tnf/operator/nodehandler.go
  • pkg/tnf/operator/nodehandler_test.go
  • pkg/tnf/operator/starter.go
  • pkg/tnf/operator/starter_test.go
  • pkg/tnf/pkg/pcs/auth.go
  • pkg/tnf/pkg/tools/tnf_labels.go
  • pkg/tnf/update-setup/runner.go
  • pkg/tnf/update-setup/runner_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/tnf/pkg/tools/tnf_labels.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/tnf/operator/nodehandler_test.go

Comment on lines +147 to +165
// Add/delete post-handoff paths consult ExternalEtcdHasCompletedTransition; Ready does not run update-setup.
var externalEtcdTransitionComplete bool
if eventType == NodeEventTypeAdd || eventType == NodeEventTypeDelete {
var err error
externalEtcdTransitionComplete, err = ceohelpers.HasExternalEtcdCompletedTransition(ctx, operatorClient)
if err != nil {
return fmt.Errorf("failed to check external etcd transition status: %w", err)
}
}

// Handle different event types
switch eventType {
case NodeEventTypeAdd:
return handleNodeAddition(nodeList, triggeringNode, externalEtcdTransitionComplete, ctx, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces, etcdInformer)
case NodeEventTypeReady:
return handleNodeReady(nodeList, ctx, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces, etcdInformer)
case NodeEventTypeDelete:
return handleNodeDeletion(nodeList, triggeringNode, externalEtcdTransitionComplete, ctx, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces, etcdInformer)
default:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate NodeEventTypeReady to initial bootstrap only.

After ExternalEtcdHasCompletedTransition becomes true, any later NotReady→Ready transition will still run handleNodeReady and kick startTnfJobcontrollersFunc again on an already-initialized cluster. That makes reboot recovery re-enter the initial setup path instead of leaving day-2 reconciliation to add/delete events.

Suggested change
-	// Add/delete post-handoff paths consult ExternalEtcdHasCompletedTransition; Ready does not run update-setup.
+	// Post-handoff paths consult ExternalEtcdHasCompletedTransition. Ready should stay bootstrap-only.
 	var externalEtcdTransitionComplete bool
-	if eventType == NodeEventTypeAdd || eventType == NodeEventTypeDelete {
+	if eventType == NodeEventTypeAdd || eventType == NodeEventTypeDelete || eventType == NodeEventTypeReady {
 		var err error
 		externalEtcdTransitionComplete, err = ceohelpers.HasExternalEtcdCompletedTransition(ctx, operatorClient)
 		if err != nil {
 			return fmt.Errorf("failed to check external etcd transition status: %w", err)
 		}
 	}
@@
 	case NodeEventTypeReady:
-		return handleNodeReady(nodeList, ctx, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces, etcdInformer)
+		return handleNodeReady(nodeList, externalEtcdTransitionComplete, ctx, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces, etcdInformer)
 func handleNodeReady(
 	nodeList []*corev1.Node,
+	externalEtcdTransitionComplete bool,
 	ctx context.Context,
 	controllerContext *controllercmd.ControllerContext,
 	operatorClient v1helpers.StaticPodOperatorClient,
 	kubeClient kubernetes.Interface,
 	kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces,
 	etcdInformer operatorv1informers.EtcdInformer,
 ) error {
+	if externalEtcdTransitionComplete {
+		klog.V(4).Info("ignoring Ready event after external etcd handoff")
+		return nil
+	}
+
 	// Initial setup: need 2 ready nodes

Also applies to: 232-271

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tnf/operator/nodehandler.go` around lines 147 - 165, Node Ready events
currently always call handleNodeReady, which re-triggers initial setup even
after ExternalEtcdHasCompletedTransition; change the switch so
NodeEventTypeReady only calls handleNodeReady when
externalEtcdTransitionComplete is false (i.e., during initial bootstrap) and
otherwise skip/return no-op; update the same gating in the other occurrence of
this logic (the block covering lines 232-271) to use the same check; reference
externalEtcdTransitionComplete, ceohelpers.HasExternalEtcdCompletedTransition,
and handleNodeReady when locating and applying the change.

Comment on lines +337 to +396
func getNextUpdateSetupGeneration() int64 {
updateSetupGenerationMutex.Lock()
defer updateSetupGenerationMutex.Unlock()
updateSetupGeneration++
return updateSetupGeneration
}

// updateSetup writes a snapshot ConfigMap, runs auth on all nodes, update-setup on one target, then after-setup on all.
func updateSetup(
nodeList []*corev1.Node,
triggeringNode *corev1.Node,
eventType string,
ctx context.Context,
controllerContext *controllercmd.ControllerContext,
operatorClient v1helpers.StaticPodOperatorClient,
kubeClient kubernetes.Interface,
kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces,
) error {

generation := getNextUpdateSetupGeneration()

targetNode := determineUpdateSetupTargetNode(nodeList, triggeringNode, eventType)
if targetNode == nil {
return fmt.Errorf("failed to determine target node for update-setup (event: %s, triggering: %s, generation: %d)",
eventType, triggeringNode.Name, generation)
}

klog.Infof("Generation %d: Event=%s, Triggering=%s, Target=%s, NodeList=%v",
generation, eventType, triggeringNode.Name, targetNode.Name, getNodeNames(nodeList))

// Snapshot nodeList into ConfigMap for the runner (see comment on updateSetup).
nodeListData, err := encodeNodeList(nodeList)
if err != nil {
return fmt.Errorf("failed to encode node list: %w", err)
}

cmName := fmt.Sprintf("tnf-update-setup-%d", generation)
cmData := map[string]string{
"nodes": nodeListData,
"generation": fmt.Sprintf("%d", generation),
"eventType": eventType,
"timestamp": time.Now().Format(time.RFC3339),
"targetNode": targetNode.Name,
}
cm := &corev1.ConfigMap{
ObjectMeta: v1.ObjectMeta{
Name: cmName,
Namespace: operatorclient.TargetNamespace,
Labels: map[string]string{
"app.kubernetes.io/component": tools.TnfUpdateSetupComponentValue,
"tnf.etcd.openshift.io/generation": fmt.Sprintf("%d", generation),
},
},
Data: cmData,
}

_, err = kubeClient.CoreV1().ConfigMaps(operatorclient.TargetNamespace).Create(ctx, cm, v1.CreateOptions{})
if err != nil {
return fmt.Errorf("failed to create ConfigMap %s: %w", cmName, err)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make update-setup generations restart-safe.

updateSetupGeneration is process-local, but the ConfigMap name is derived directly from it. If CEO restarts after creating tnf-update-setup-N and cleanup never happens, the next reconcile can reuse the same name and fail with AlreadyExists, blocking later update-setup runs.

Suggested change
 func getNextUpdateSetupGeneration() int64 {
 	updateSetupGenerationMutex.Lock()
 	defer updateSetupGenerationMutex.Unlock()
-	updateSetupGeneration++
+	next := time.Now().UnixNano()
+	if next <= updateSetupGeneration {
+		next = updateSetupGeneration + 1
+	}
+	updateSetupGeneration = next
 	return updateSetupGeneration
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func getNextUpdateSetupGeneration() int64 {
updateSetupGenerationMutex.Lock()
defer updateSetupGenerationMutex.Unlock()
updateSetupGeneration++
return updateSetupGeneration
}
// updateSetup writes a snapshot ConfigMap, runs auth on all nodes, update-setup on one target, then after-setup on all.
func updateSetup(
nodeList []*corev1.Node,
triggeringNode *corev1.Node,
eventType string,
ctx context.Context,
controllerContext *controllercmd.ControllerContext,
operatorClient v1helpers.StaticPodOperatorClient,
kubeClient kubernetes.Interface,
kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces,
) error {
generation := getNextUpdateSetupGeneration()
targetNode := determineUpdateSetupTargetNode(nodeList, triggeringNode, eventType)
if targetNode == nil {
return fmt.Errorf("failed to determine target node for update-setup (event: %s, triggering: %s, generation: %d)",
eventType, triggeringNode.Name, generation)
}
klog.Infof("Generation %d: Event=%s, Triggering=%s, Target=%s, NodeList=%v",
generation, eventType, triggeringNode.Name, targetNode.Name, getNodeNames(nodeList))
// Snapshot nodeList into ConfigMap for the runner (see comment on updateSetup).
nodeListData, err := encodeNodeList(nodeList)
if err != nil {
return fmt.Errorf("failed to encode node list: %w", err)
}
cmName := fmt.Sprintf("tnf-update-setup-%d", generation)
cmData := map[string]string{
"nodes": nodeListData,
"generation": fmt.Sprintf("%d", generation),
"eventType": eventType,
"timestamp": time.Now().Format(time.RFC3339),
"targetNode": targetNode.Name,
}
cm := &corev1.ConfigMap{
ObjectMeta: v1.ObjectMeta{
Name: cmName,
Namespace: operatorclient.TargetNamespace,
Labels: map[string]string{
"app.kubernetes.io/component": tools.TnfUpdateSetupComponentValue,
"tnf.etcd.openshift.io/generation": fmt.Sprintf("%d", generation),
},
},
Data: cmData,
}
_, err = kubeClient.CoreV1().ConfigMaps(operatorclient.TargetNamespace).Create(ctx, cm, v1.CreateOptions{})
if err != nil {
return fmt.Errorf("failed to create ConfigMap %s: %w", cmName, err)
}
func getNextUpdateSetupGeneration() int64 {
updateSetupGenerationMutex.Lock()
defer updateSetupGenerationMutex.Unlock()
next := time.Now().UnixNano()
if next <= updateSetupGeneration {
next = updateSetupGeneration + 1
}
updateSetupGeneration = next
return updateSetupGeneration
}
// updateSetup writes a snapshot ConfigMap, runs auth on all nodes, update-setup on one target, then after-setup on all.
func updateSetup(
nodeList []*corev1.Node,
triggeringNode *corev1.Node,
eventType string,
ctx context.Context,
controllerContext *controllercmd.ControllerContext,
operatorClient v1helpers.StaticPodOperatorClient,
kubeClient kubernetes.Interface,
kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces,
) error {
generation := getNextUpdateSetupGeneration()
targetNode := determineUpdateSetupTargetNode(nodeList, triggeringNode, eventType)
if targetNode == nil {
return fmt.Errorf("failed to determine target node for update-setup (event: %s, triggering: %s, generation: %d)",
eventType, triggeringNode.Name, generation)
}
klog.Infof("Generation %d: Event=%s, Triggering=%s, Target=%s, NodeList=%v",
generation, eventType, triggeringNode.Name, targetNode.Name, getNodeNames(nodeList))
// Snapshot nodeList into ConfigMap for the runner (see comment on updateSetup).
nodeListData, err := encodeNodeList(nodeList)
if err != nil {
return fmt.Errorf("failed to encode node list: %w", err)
}
cmName := fmt.Sprintf("tnf-update-setup-%d", generation)
cmData := map[string]string{
"nodes": nodeListData,
"generation": fmt.Sprintf("%d", generation),
"eventType": eventType,
"timestamp": time.Now().Format(time.RFC3339),
"targetNode": targetNode.Name,
}
cm := &corev1.ConfigMap{
ObjectMeta: v1.ObjectMeta{
Name: cmName,
Namespace: operatorclient.TargetNamespace,
Labels: map[string]string{
"app.kubernetes.io/component": tools.TnfUpdateSetupComponentValue,
"tnf.etcd.openshift.io/generation": fmt.Sprintf("%d", generation),
},
},
Data: cmData,
}
_, err = kubeClient.CoreV1().ConfigMaps(operatorclient.TargetNamespace).Create(ctx, cm, v1.CreateOptions{})
if err != nil {
return fmt.Errorf("failed to create ConfigMap %s: %w", cmName, err)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tnf/operator/nodehandler.go` around lines 337 - 396, The ConfigMap name
derived from the process-local updateSetupGeneration can collide across CEO
restarts; change the cmName generation in updateSetup (function updateSetup,
variable cmName) to include a restart-safe unique suffix (for example a
timestamp or a UUID/random nonce) so names are globally unique across process
restarts; preserve the existing "tnf-update-setup-<generation>" prefix and keep
the numeric generation in the ConfigMap Data/Labels (so
determineUpdateSetupTargetNode and any cleanup logic that reads the generation
still work) and update any tests or cleanup code that expect the exact name
format to tolerate the new suffix.

Fixed race conditions in update-setup during node replacement by using captured node state, sequential event processing, and single-job reconciliation. Also fixes install ordering where Ready and fencing paths could treat early TNF jobs as bootstrap progress and skip or race auth/setup. Fixes TNF bootstrap never starting when etcd-operator starts after both control-plane nodes are already Ready (informer never emits NotReady→Ready, while Add stays handoff-gated).

- Capture node state at event time in ConfigMap (source of truth, prevents fresh API call races)
- Use generation counter for sequential event processing (handles rapid add/delete scenarios)
- Run single update-setup job on target node based on event type (peer for add, survivor for delete)
- Split event handling into specialized handlers for add/ready/delete; Ready always runs startTnfJobcontrollers when two nodes are ready (no coarse Job-list gating: fencing can exist before auth/setup completes)
- Grant tnf-setup-manager get/list/watch on ConfigMaps in openshift-etcd so update-setup can list ConfigMaps labeled with the shared component value (tools.TnfUpdateSetupComponentValue; operator and runner stay in sync)
- Gate day-2 add/delete, fencing-secret-driven fencing restarts, and node Add handling on ExternalEtcdHasCompletedTransition instead of inferring bootstrap from any TNF Job (avoids update-setup racing initial auth/setup); enqueue Add for all control-plane nodes so day-2 work is not tied to the new node's first Ready
- After control-plane node informer registration, one-shot catch-up: if tnf-setup-job is missing and two control-plane nodes are Ready, run the NodeEventTypeReady path (covers operator restart and missed Ready edges); catch-up is not gated on ExternalEtcdHasCompletedTransition, which is set only after TNF drives etcd handoff
- Align nodehandler_test and starter_test with handleNodesWithRetry(triggeringNode, eventType) and the gating/Ready behavior above

Also:
- TNF auth tolerates a single control-plane node (pcs host auth one host).
- update-setup: continue on pcs remove output containing 401; ConfigureFencing after membership change; always pcs cluster sync/start at end.
- Tests: controlPlaneNodesExceptName; pcs 401 detection.
@jaypoulz jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch from 8f6ff88 to e53b901 Compare June 1, 2026 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants