OCPBUGS-84695: update node-to-pacemaker reconciliation logic to prevent pacemaker split-brain#1605
Conversation
|
@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (10)
🚧 Files skipped from review as they are similar to previous changes (10)
WalkthroughEvent-driven TNF node handling: per-event add/ready/delete dispatch, single-node update-setup runners using per-generation ConfigMap snapshots, pacemaker+etcd membership reconciliation by comparing Kubernetes node name+IP, staged removals/additions, and related tests, RBAC, and label support. ChangesTNF Node Lifecycle Refactor: Event-Driven Membership Reconciliation
Sequence DiagramsequenceDiagram
participant Node as Node (add/ready/delete)
participant Starter as Starter
participant NodeHandler as NodeHandler
participant UpdateSetup as UpdateSetup
participant Pacemaker as Pacemaker
participant Etcd as Etcd
participant JobCtrl as JobControllers
Node->>Starter: NodeAdded / NodeReady / NodeDeleted
Starter->>NodeHandler: handleNodesWithRetry(triggeringNode,eventType)
alt Add/Delete: wait ExternalEtcd transition
NodeHandler->>NodeHandler: Check ExternalEtcdCompletedTransition
end
NodeHandler->>NodeHandler: List control-plane nodes and check readiness
NodeHandler->>UpdateSetup: runUpdateSetupFunc(eventType, targetNode)
UpdateSetup->>UpdateSetup: Select highest-generation ConfigMap and decode node list
UpdateSetup->>Pacemaker: pcs cluster config show --output-format json
Pacemaker-->>UpdateSetup: pacemaker node map (name -> peer IP)
UpdateSetup->>UpdateSetup: compute nodesToRemove and nodesToAdd
UpdateSetup->>Pacemaker: pcs cluster node remove --force --skip-offline
UpdateSetup->>Etcd: etcdctl member list -w json
Etcd-->>UpdateSetup: member list
UpdateSetup->>Etcd: remove unstarted members (peer IP not in k8s nodes)
alt Nodes to Add
UpdateSetup->>Etcd: WaitForStableRevision
UpdateSetup->>Pacemaker: pcs cluster node add
UpdateSetup->>UpdateSetup: configure fencing and update etcd node_ip_map
UpdateSetup->>Pacemaker: pcs cluster force_new_cluster
UpdateSetup->>Pacemaker: pcs cluster enable/start
end
UpdateSetup->>Pacemaker: pcs cluster sync
UpdateSetup->>Pacemaker: pcs cluster start --all
UpdateSetup-->>NodeHandler: Reconciliation complete
NodeHandler->>JobCtrl: start TNF job controllers when ready
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 13 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (13 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
|
@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery |
|
@jaypoulz: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b3ba7540-43e1-11f1-8734-dc441d1b18ce-0 |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/tnf/update-setup/runner_test.go (1)
18-102: Tests validate API primitives, but not the runner branch decisions end-to-end.This table test confirms fake-client
Get/IsNotFoundbehavior, but it doesn’t exerciseRunTnfUpdateSetup(or a helper) to assert “skip vs proceed vs fail” return-path decisions directly. For this critical flow, consider extracting the node-existence decision into a small function and unit-test that function.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/update-setup/runner_test.go` around lines 18 - 102, The test is only exercising the fake client Get behavior, not the runner's branch decisions; extract the node-existence decision into a small function (e.g., func checkNodeExists(ctx context.Context, kube kubernetes.Interface, nodeName string) (exists bool, err error)) and update RunTnfUpdateSetup to call that helper; then replace the table test to unit-test checkNodeExists directly (using fake.NewClientset) to assert the three return paths (exists/skip, not found/proceed, unexpected error/fail). Ensure the helper is named uniquely (checkNodeExists) and referenced from RunTnfUpdateSetup so the new unit tests verify the runner branch logic end-to-end.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 124-126: The test passes a non-error value (&runtime.Unknown{}) to
apierrors.NewInternalError via simulateError; replace the runtime.Unknown value
with an actual error (e.g., errors.New(...) or fmt.Errorf(...)) or format the
runtime.Unknown into an error string so apierrors.NewInternalError receives an
error interface; update the simulateError assignment that calls
apierrors.NewInternalError in runner_test.go accordingly (referencing
simulateError, apierrors.NewInternalError, and runtime.Unknown).
---
Nitpick comments:
In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 18-102: The test is only exercising the fake client Get behavior,
not the runner's branch decisions; extract the node-existence decision into a
small function (e.g., func checkNodeExists(ctx context.Context, kube
kubernetes.Interface, nodeName string) (exists bool, err error)) and update
RunTnfUpdateSetup to call that helper; then replace the table test to unit-test
checkNodeExists directly (using fake.NewClientset) to assert the three return
paths (exists/skip, not found/proceed, unexpected error/fail). Ensure the helper
is named uniquely (checkNodeExists) and referenced from RunTnfUpdateSetup so the
new unit tests verify the runner branch logic end-to-end.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b46e7555-1b53-4882-b0b2-fa6f26787667
📒 Files selected for processing (2)
pkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
|
/jira refresh |
|
@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
e9ed644 to
0533a44
Compare
|
|
||
| // Check if offline node still exists in Kubernetes | ||
| // Node replacement requires user to DELETE the node from Kubernetes | ||
| // If the node still exists, it's likely just rebooted (e.g., during upgrade) |
There was a problem hiding this comment.
This job only runs when we have 2 ready nodes available... but yeah, pacemaker status might be behind, or reboot by upgrade just started... good catch 🙈
Might be good enough to compare offlineNodeName with currentNodeName and otherNodeName... when one of them matches, skip. But the result is the same I guess :)
There was a problem hiding this comment.
What I realized from this comment is that both approaches have the same problem. Same-name node replacement requires us to re-add the node. So the k8s node check doesn't work because it would block it on the same name, and the currentNodeName and otherNodeName check doesn't work they would match the replaced node.
I think we need to run update-setup to reconcile on node delete because offline is insufficient to distinguish between temporary outage and node-that-needs-replacement.
There was a problem hiding this comment.
OK - I'm pushing a new approach. Don't run setup until we have 2 nodes for initial setup, but after that, always reconcile pacemaker state to k8s state. This means that deleting the node in k8s results in us removing it from pacemaker.
This allows us to determine user intention based on the declarative k8s state. So now if you reboot, the k8s nodes are the same, so no-op. If you delete a node, that's an intentional change, so we clean it up on the backend.
It's more involved though, so I'll need to do some soak testing for this.
There was a problem hiding this comment.
Same-name node replacement
Ah yes, of course. Has been too long ago already, me dealing with TNF 😎
Don't run setup until we have 2 nodes for initial setup, but after that, always reconcile pacemaker state to k8s state
Hm, not sure about that tbh...
The idea still was: we have 2 ready nodes, but it's not the initial setup, so "update" pacemaker if needed to reflect reality. Changing the trigger would be a risky change IMHO, and can introduce regressions. I mean, this worked well so far for the intended use case, not?
IMHO we only need to better understand the situation during job execution: do we need to deal with the offline node or not? So: can we check something on the node resource to find out?
Pragmatic idea: check the creation date...? If the node is "new" run the setup, else skip it. 🤔
Maybe there is something better.
Or: I don't remember details, but I think I considered putting a label on nodes after they were successfully configured for pacemaker, but that didn't happen... could be useful in this case.
There was a problem hiding this comment.
I think we can afford the extra iteration time to get this right since the bug is easy enough to work around at upgrade time. Beyond it fixing this edge case, the other reason I like reworking the reconciliation approach is that I think we will need to in the future to support Adaptable topology.
Scaling from 1 to 2 to 3 nodes with automatic quorum recovery will require us to set up pacemaker on a single node cluster first, then scale up as we go. This would be one of multiple things that we would need to harden to handle that scale-up / scale-down logic.
There was a problem hiding this comment.
One thing is absolutely true though - we can't move forward with this patch before updated podman-etcd to handle the single-node case in the monitor operation first.
0533a44 to
3bf1712
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/tnf/operator/nodehandler.go`:
- Around line 121-139: The current logic uses jobsExist (from tnfSetupJobsExist)
as a post-setup signal which is set true as soon as startTnfJobcontrollers
creates setup/auth/after-setup jobs; change the gate to require a real
completion signal instead: replace or extend tnfSetupJobsExist with a check for
a TNF setup completion marker (for example a specific Job completion condition,
a "tnf-setup-completed" ConfigMap/CR, or a Job status check) and expose it as
something like tnfSetupCompleted(ctx, kubeClient) (or return a second boolean
from tnfSetupJobsExist indicating completion); then use that completion boolean
in the node count checks (i.e. when evaluating len(nodeList) < 2, consult
setupCompleted not jobsExist) so a partially-created setup after a CEO restart
is treated as not completed and the initial 2-node flow is preserved.
In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 149-215: The test in runner_test.go reimplements the
reconciliation logic instead of exercising the production code in runner.go;
extract the reconciliation decision into a single exported or package-private
helper (e.g., ReconcileNodes or decideNodeAddRemove) in runner.go that takes
k8sNodes []string and pacemakerOnline/offline []string and returns
(nodeToRemove, nodeToAdd, skip bool), then update the table-driven tests to call
that helper directly (or alternatively drive RunTnfUpdateSetup using
fakes/mocks) so the tests validate the actual implementation rather than a
duplicate inline algorithm.
In `@pkg/tnf/update-setup/runner.go`:
- Around line 313-326: Replace the unsafe substring check
strings.Contains(peerURL, nodeIP) with exact host comparison: parse each peerURL
(use url.Parse) inside the loop over member.PeerURLs, extract the host portion
(handle optional port with net.SplitHostPort and IPv6 brackets), then compare
the extracted host string exactly to nodeIP before proceeding to format
member.ID, log, and call exec.Execute to remove the member; only call member
remove when the host equals nodeIP to avoid accidental matches.
- Around line 168-176: The removal path must not abort when nodeToRemove isn't
present in cfg.NodeName1/NodeName2; instead, stop using
cfg.NodeName1/NodeName2/NodeIP1/NodeIP2 to derive the member to remove and query
Pacemaker/etcd state directly (e.g., list pacemaker nodes or etcd members) to
find the member ID and peerURL/IP for nodeToRemove, then use that ID to perform
the removal; update the logic in the block that references nodeToRemove (and the
similar code at the other instance around lines 239-248) to call the
pacemaker/etcd discovery functions, handle not-found as a no-op or
log-and-continue, and only fall back to cfg lookups if the live discovery fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3032f008-368a-4f16-b3ad-fdf0e5c3b3e4
📒 Files selected for processing (4)
pkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
78f3141 to
fad9968
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
pkg/tnf/update-setup/runner.go (2)
271-283:⚠️ Potential issue | 🔴 CriticalDeleted nodes cannot be resolved through the current cluster config.
nodeNameshere are the pacemaker members that Kubernetes no longer has. For that case,getNodeIP(nodeName)fails by definition, so the code aborts after pacemaker removal and leaves the stale etcd member behind.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/update-setup/runner.go` around lines 271 - 283, The loop in removeEtcdMembersForNodes currently aborts when getNodeIP(nodeName) fails (which happens for pacemaker-only/deleted nodes) and leaves stale etcd members; change the logic so a failed getNodeIP does not return but instead logs a warning and attempts removal by node identity: implement a fallback removal path (e.g., add removeEtcdMemberByName(nodeName) or enhance removeEtcdMemberByIP to accept a nodeName fallback) and call that when getNodeIP returns an error; ensure removeEtcdMembersForNodes still logs success/failure via klog and only returns on unrecoverable errors from the removal call (removeEtcdMemberByIP or new removeEtcdMemberByName).
311-324:⚠️ Potential issue | 🔴 CriticalMatch the etcd peer host exactly before removing a member.
strings.Contains(peerURL, nodeIP)is unsafe for values like10.0.0.1vs10.0.0.10, and it is brittle with ports/IPv6 formatting. Parse the peer URL and compare the hostname exactly before callingmember remove.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/update-setup/runner.go` around lines 311 - 324, The code in the loop over memberList.Members uses strings.Contains(peerURL, nodeIP) which can match substrings and fail on ports/IPv6; update the logic in the loop that iterates member.PeerURLs (the block around strings.Contains(peerURL, nodeIP)) to parse each peerURL (use url.Parse), extract the host component and if it contains a port use net.SplitHostPort to get the exact hostname (handling IPv6 brackets), then compare that hostname exactly to nodeIP before constructing memberIDHex and running exec.Execute for member remove; keep the existing memberIDHex formatting and logging but only call etcdctl member remove when the parsed host equals nodeIP.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 96-99: When building the k8sNodes set, skip empty node names so
the map does not contain "" which causes reconciliation to treat a non-existent
node as present; update the construction of k8sNodes to only insert
cfg.NodeName1 and cfg.NodeName2 when they are non-empty (e.g. check
len(cfg.NodeName1) > 0 and len(cfg.NodeName2) > 0 before adding), keeping the
map type and usage same so downstream logic that checks membership remains
unchanged.
---
Duplicate comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 271-283: The loop in removeEtcdMembersForNodes currently aborts
when getNodeIP(nodeName) fails (which happens for pacemaker-only/deleted nodes)
and leaves stale etcd members; change the logic so a failed getNodeIP does not
return but instead logs a warning and attempts removal by node identity:
implement a fallback removal path (e.g., add removeEtcdMemberByName(nodeName) or
enhance removeEtcdMemberByIP to accept a nodeName fallback) and call that when
getNodeIP returns an error; ensure removeEtcdMembersForNodes still logs
success/failure via klog and only returns on unrecoverable errors from the
removal call (removeEtcdMemberByIP or new removeEtcdMemberByName).
- Around line 311-324: The code in the loop over memberList.Members uses
strings.Contains(peerURL, nodeIP) which can match substrings and fail on
ports/IPv6; update the logic in the loop that iterates member.PeerURLs (the
block around strings.Contains(peerURL, nodeIP)) to parse each peerURL (use
url.Parse), extract the host component and if it contains a port use
net.SplitHostPort to get the exact hostname (handling IPv6 brackets), then
compare that hostname exactly to nodeIP before constructing memberIDHex and
running exec.Execute for member remove; keep the existing memberIDHex formatting
and logging but only call etcdctl member remove when the parsed host equals
nodeIP.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 975b05c2-c5db-40cd-ab9f-96db7d22ec11
📒 Files selected for processing (4)
pkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/tnf/operator/nodehandler_test.go
- pkg/tnf/update-setup/runner_test.go
- pkg/tnf/operator/nodehandler.go
fad9968 to
03a2a57
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
pkg/tnf/operator/nodehandler_test.go (1)
354-360: Consider using a prefix check instead of exact name matches.The hardcoded list of job names is fragile. If tests add new master names (e.g.,
tnf-after-setup-master-2), this condition would need updating.♻️ Suggested improvement
labels := map[string]string{ "app.kubernetes.io/component": "two-node-fencing-setup", } // Mark after-setup jobs to indicate initial setup completion - if name == "tnf-after-setup" || name == "tnf-after-setup-master-0" || name == "tnf-after-setup-master-1" { + if strings.HasPrefix(name, "tnf-after-setup") { labels["app.kubernetes.io/name"] = "tnf-after-setup" }This would require adding
"strings"to the imports.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/operator/nodehandler_test.go` around lines 354 - 360, The code currently checks exact job names when setting labels for after-setup jobs (the block that updates the labels map when name == "tnf-after-setup" || name == "tnf-after-setup-master-0" || name == "tnf-after-setup-master-1"); replace these exact-match checks with a prefix check (e.g., use strings.HasPrefix(name, "tnf-after-setup")) so new masters like "tnf-after-setup-master-2" are handled automatically, and add the "strings" import to the test file.pkg/tnf/update-setup/runner.go (1)
293-296: Unused parameters:nodeNamesandgetNodeIPare never used.The function removes all unstarted etcd members based on
member.Name == "", but the passed parameters are never referenced. This makes the function signature misleading and the caller's effort to passgetNodeIPpointless.♻️ Suggested signature simplification
If the intent is to remove all unstarted members regardless of which nodes were removed:
-func removeEtcdMembersForNodes(ctx context.Context, nodeNames []string, getNodeIP func(string) (string, error)) error { - if len(nodeNames) == 0 { - return nil - } +func removeUnstartedEtcdMembers(ctx context.Context) error {Update call sites at lines 180 and 230 accordingly:
- if err := removeEtcdMembersForNodes(ctx, nodesToRemove, getNodeIP); err != nil { + if len(nodesToRemove) > 0 { + if err := removeUnstartedEtcdMembers(ctx); err != nil { + return err + } + }Alternatively, if you want to keep the node-specific behavior for future use, consider using
getNodeIPto match members by IP viaremoveEtcdMemberByIP.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/update-setup/runner.go` around lines 293 - 296, The function removeEtcdMembersForNodes has unused parameters nodeNames and getNodeIP; either simplify the signature by removing those parameters and update callers to call removeEtcdMembersForNodes(ctx) (i.e., make it remove all unstarted members as it already does), or implement node-specific removal by iterating nodeNames, calling getNodeIP(node) to obtain IPs and then call removeEtcdMemberByIP(ctx, ip) for matches; update removeEtcdMembersForNodes to reference getNodeIP and nodeNames (or remove them) and adjust call sites accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 349-412: The function removeEtcdMemberByIP is dead code (never
called); either delete it or wire it into the existing workflow — specifically
replace or refactor the current removeEtcdMembersForNodes usage to call
removeEtcdMemberByIP for each nodeIP, or remove the entire removeEtcdMemberByIP
function and clean up any now-unused imports (e.g., url, net, strings) and
tests; locate the existing caller(s) of removeEtcdMembersForNodes and either
iterate node IPs invoking removeEtcdMemberByIP (preserving its parsing/hex-ID
logic) or remove the function and ensure no references remain.
---
Nitpick comments:
In `@pkg/tnf/operator/nodehandler_test.go`:
- Around line 354-360: The code currently checks exact job names when setting
labels for after-setup jobs (the block that updates the labels map when name ==
"tnf-after-setup" || name == "tnf-after-setup-master-0" || name ==
"tnf-after-setup-master-1"); replace these exact-match checks with a prefix
check (e.g., use strings.HasPrefix(name, "tnf-after-setup")) so new masters like
"tnf-after-setup-master-2" are handled automatically, and add the "strings"
import to the test file.
In `@pkg/tnf/update-setup/runner.go`:
- Around line 293-296: The function removeEtcdMembersForNodes has unused
parameters nodeNames and getNodeIP; either simplify the signature by removing
those parameters and update callers to call removeEtcdMembersForNodes(ctx)
(i.e., make it remove all unstarted members as it already does), or implement
node-specific removal by iterating nodeNames, calling getNodeIP(node) to obtain
IPs and then call removeEtcdMemberByIP(ctx, ip) for matches; update
removeEtcdMembersForNodes to reference getNodeIP and nodeNames (or remove them)
and adjust call sites accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 273e7f58-c686-46d5-b227-f828930b53ef
📒 Files selected for processing (4)
pkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/tnf/update-setup/runner_test.go
fonta-rh
left a comment
There was a problem hiding this comment.
Review: OCPBUGS-84695 — Fix TNF Pacemaker Split-Brain
The core reconciliation approach is solid — comparing k8s ConfigMap (desired state) vs pacemaker XML membership (actual state) is fundamentally better than the old grep Offline | awk shell parsing. Three findings from review:
1. Dead code: removeEtcdMemberByIP() + unused getNodeIP parameter
removeEtcdMemberByIP() (runner.go:472-535, 63 lines) is defined but never called anywhere in this PR. Additionally, the getNodeIP callback is passed to removeEtcdMembersForNodes() at lines 320 and 370, but it's never dereferenced inside the function body — removeEtcdMembersForNodes removes ALL unstarted members regardless of node identity.
Was removeEtcdMemberByIP intended to be wired in as a fallback (e.g., when unstarted-name detection fails) but forgotten? If it's not needed, removing both the function and the dead getNodeIP parameter would clean things up. If removing, consider renaming removeEtcdMembersForNodes → removeAllUnstartedEtcdMembers to match the actual behavior (blanket cleanup, not per-node targeting).
2. Missing tests for intermediate setup states
The tnfSetupJobsExist() semantics changed significantly — from "any TNF job exists" to "after-setup job completed successfully (Succeeded > 0)." This is the behavioral hinge of the fix, but two intermediate states aren't tested:
-
After-setup in progress:
tnf-after-setup-master-0exists withSucceeded: 0→tnfSetupJobsExist()returnsfalse. With 1 node present,handleNodesblocks. ThecreateTNFJob()fixture hardcodesSucceeded: 1, so this path can't currently be exercised. -
Partial setup (setup running, after-setup not yet created):
tnf-setupjob exists but notnf-after-setupjobs. The old test usedcreateTNFJob("tnf-setup")— that coverage was lost when fixtures were changed totnf-after-setup-*.
Both are realistic timing windows during normal cluster lifecycle. Suggested additions:
"1 node, after-setup job in progress (not completed) - skip reconciliation""2 nodes, setup jobs exist but no after-setup jobs - start controllers only"
3. removeEtcdMembersForNodes docstring vs behavior mismatch
The docstring says "removes unstarted etcd members after nodes are removed from pacemaker," implying per-node targeting. The actual behavior removes ALL unstarted members regardless of the nodeNames parameter (which is only used for logging). The behavior is correct and safe for a 2-node cluster, but the docstring could mislead a future engineer debugging member removal issues. Updating it to reflect the blanket-cleanup behavior would prevent misunderstandings.
Overall the approach is a clear improvement. The reconciliation logic in reconcileNodes() is clean, well-extracted for testability, and the test coverage for the core diff logic is thorough (14 cases). Nice work.
cb454cf to
1017cfc
Compare
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery |
|
@jaypoulz: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6124da80-44c4-11f1-9cb9-891f45aba7aa-0 |
|
@CodeRabbit resume |
✅ Actions performedReviews resumed. |
|
@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
♻️ Duplicate comments (2)
pkg/tnf/operator/nodehandler.go (1)
152-179:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDon’t use raw TNF job existence as the post-setup signal.
tnfSetupJobsExist()is still true as soon as any TNF job object exists, so after a CEO restart mid-initial-setup, Line 160 can treat a partially initialized cluster as “already set up” and allow the 1-node reconciliation path on Line 179. Please gate this on a real completion signal instead of any labeled job.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/operator/nodehandler.go` around lines 152 - 179, tnfSetupJobsExist() currently treats any TNF Job object as "setup done" which lets code at the decision point between startTnfJobcontrollersFunc(...) and updateSetupFunc(...) incorrectly take the update-setup path after a partial initial run; change the gating to check a real completion signal (e.g., a Job completion status/condition, a specific finished label/annotation on the setup Job, or a dedicated ConfigMap/CR/Status field) rather than mere existence: update the pre-check that uses tnfSetupJobsExist() to call a new/extended helper (or enhance tnfSetupJobsExist) that inspects Job.Status.CompletionTime or a designated completion marker, only return true when that completion marker is present, and keep the rest of the flow (startTnfJobcontrollersFunc(...) and updateSetupFunc(...)) unchanged so only fully completed initial setups take the update-setup path.pkg/tnf/update-setup/runner.go (1)
257-280:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftNames-only reconciliation misses same-name replacements.
reconcileNodesonly keys on node name. If a replacement reuses the Kubernetes node name but comes back with a different IP, this helper returns no changes, and Lines 142-145 exit before pacemaker membership ornode_ip_mapis refreshed. The reconciliation key needs to include node identity/IP, not just the name.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/update-setup/runner.go` around lines 257 - 280, reconcileNodes currently compares only node.Name and misses replacements that reuse a name but change IP; change reconcileNodes signature to accept k8sNodes as a map[string]string (name -> IP) or a small struct including IP, and use pacemaker.Node's IP field to build pacemakerNodesMap keyed by a combined identity e.g. name+"|"+ip (or a tuple-equivalent) instead of name alone; then compute nodesToRemove/nodesToAdd by comparing those combined keys; update callers to pass the refreshed node_ip_map (name->ip) into reconcileNodes and ensure node_ip_map is refreshed before any early exit that depends on reconciliation (so pacemaker membership vs k8s identity comparison uses name+IP).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/tnf/operator/nodehandler.go`:
- Around line 152-179: tnfSetupJobsExist() currently treats any TNF Job object
as "setup done" which lets code at the decision point between
startTnfJobcontrollersFunc(...) and updateSetupFunc(...) incorrectly take the
update-setup path after a partial initial run; change the gating to check a real
completion signal (e.g., a Job completion status/condition, a specific finished
label/annotation on the setup Job, or a dedicated ConfigMap/CR/Status field)
rather than mere existence: update the pre-check that uses tnfSetupJobsExist()
to call a new/extended helper (or enhance tnfSetupJobsExist) that inspects
Job.Status.CompletionTime or a designated completion marker, only return true
when that completion marker is present, and keep the rest of the flow
(startTnfJobcontrollersFunc(...) and updateSetupFunc(...)) unchanged so only
fully completed initial setups take the update-setup path.
In `@pkg/tnf/update-setup/runner.go`:
- Around line 257-280: reconcileNodes currently compares only node.Name and
misses replacements that reuse a name but change IP; change reconcileNodes
signature to accept k8sNodes as a map[string]string (name -> IP) or a small
struct including IP, and use pacemaker.Node's IP field to build
pacemakerNodesMap keyed by a combined identity e.g. name+"|"+ip (or a
tuple-equivalent) instead of name alone; then compute nodesToRemove/nodesToAdd
by comparing those combined keys; update callers to pass the refreshed
node_ip_map (name->ip) into reconcileNodes and ensure node_ip_map is refreshed
before any early exit that depends on reconciliation (so pacemaker membership vs
k8s identity comparison uses name+IP).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 45f6ce68-40be-46fd-97e4-b8f86da424ac
📒 Files selected for processing (4)
pkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
4831238 to
bb32848
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/tnf/update-setup/runner.go (1)
190-196: 💤 Low valueConsider individual error handling for node additions.
If
runCommandsfails partway through multiple node additions, earlier successful additions won't be rolled back, potentially leaving pacemaker in an inconsistent state. However, given the two-node cluster scope and that subsequent runs will re-reconcile, this is acceptable for the current use case.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/update-setup/runner.go` around lines 190 - 196, The current batch execution using runCommands over nodesToAdd can hide which node failed and doesn't allow per-node handling; change the logic to run each add command individually inside the loop (for each nodeToAdd build the "/usr/sbin/pcs cluster node add %s" command and call a single-command runner or runCommands with a single-item slice), check and return the error immediately including the failing node name (reference nodesToAdd and runCommands), so callers get per-node failure context and easier reconciliation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 190-196: The current batch execution using runCommands over
nodesToAdd can hide which node failed and doesn't allow per-node handling;
change the logic to run each add command individually inside the loop (for each
nodeToAdd build the "/usr/sbin/pcs cluster node add %s" command and call a
single-command runner or runCommands with a single-item slice), check and return
the error immediately including the failing node name (reference nodesToAdd and
runCommands), so callers get per-node failure context and easier reconciliation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0dd5a0b1-09dd-4b24-b159-4754dd93a074
📒 Files selected for processing (4)
pkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/tnf/update-setup/runner_test.go
bb32848 to
61229f1
Compare
|
verification failed, moving back to draft. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 178-180: The early return when len(nodesToAdd) == 0 skips the etcd
node_ip_map update and leaves a stale name→IP mapping; instead of returning
immediately in the block that checks nodesToAdd, invoke the same etcd
refresh/pcs update logic that runs on the add path (call the function or code
that performs the "pcs resource update etcd node_ip_map=..." update) so the
node_ip_map is always refreshed after member/pacemaker cleanup; remove the
premature return or move the node_ip_map update into a common post-cleanup step
so that both remove-only and add+remove flows update etcd (refer to the
nodesToAdd check and the code that performs the pcs resource update of etcd
node_ip_map to locate where to change).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 48976c7b-86a1-4c74-b132-783410e46ccb
📒 Files selected for processing (4)
pkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/tnf/update-setup/runner_test.go
61229f1 to
b496a74
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (3)
pkg/tnf/update-setup/runner_test.go (2)
161-188: ⚖️ Poor tradeoffIP extraction logic is duplicated instead of testing production code.
The test reimplements the IP extraction logic inline rather than calling a shared helper from
runner.go. If the production code inremoveUnstartedEtcdMemberschanges, this test won't catch regressions. Consider extracting the IP parsing into a package-level helper and testing that directly.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/update-setup/runner_test.go` around lines 161 - 188, The test duplicates the IP extraction logic instead of reusing the production implementation; extract the peer-IP parsing into a package-level helper (e.g., parsePeerIP or extractIPFromPeerURL) in the same package where removeUnstartedEtcdMembers lives, move the current parsing logic from removeUnstartedEtcdMembers into that helper, update removeUnstartedEtcdMembers to call the helper, and change the test to call the new helper directly (use the same test vectors as before) so tests exercise the single source of truth for parsing.
23-36: 💤 Low valueDuplicate test cases.
The test cases "Upgrade - both nodes in k8s and pacemaker with same IPs - no changes" (lines 23-36) and "No changes - k8s and pacemaker match (name+IP)" (lines 63-76) have identical inputs and expectations. Consider removing one to avoid redundancy.
Also applies to: 63-76
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/update-setup/runner_test.go` around lines 23 - 36, Remove the duplicate table-driven test entry by deleting one of the identical cases — either "Upgrade - both nodes in k8s and pacemaker with same IPs - no changes" or "No changes - k8s and pacemaker match (name+IP)" — so only a single test with those k8sNodes/pacemakerNodes maps and expectRemove/expectAdd=nil, expectSkip=true remains; ensure the remaining test keeps the intended descriptive name and that the test slice used by Test... (the table variable in runner_test.go) is updated accordingly.pkg/tnf/operator/nodehandler_test.go (1)
399-402: 💤 Low valueUnused label in test helper.
The
app.kubernetes.io/namelabel added for after-setup jobs isn't used bytnfSetupJobsExist(), which only checks forapp.kubernetes.io/component=two-node-fencing-setup. If this label is intended for future use or documentation, consider adding a comment. Otherwise, it can be removed to avoid confusion.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/operator/nodehandler_test.go` around lines 399 - 402, The test helper is adding labels["app.kubernetes.io/name"]="tnf-after-setup" inside the strings.HasPrefix(name, "tnf-after-setup") branch but that label is unused by tnfSetupJobsExist() (which only checks app.kubernetes.io/component=two-node-fencing-setup); either remove the unused assignment to avoid confusion or keep it but add a brief comment next to the strings.HasPrefix(...) branch explaining the label is for future/documentation purposes and not used by tnfSetupJobsExist().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/tnf/operator/nodehandler_test.go`:
- Around line 399-402: The test helper is adding
labels["app.kubernetes.io/name"]="tnf-after-setup" inside the
strings.HasPrefix(name, "tnf-after-setup") branch but that label is unused by
tnfSetupJobsExist() (which only checks
app.kubernetes.io/component=two-node-fencing-setup); either remove the unused
assignment to avoid confusion or keep it but add a brief comment next to the
strings.HasPrefix(...) branch explaining the label is for future/documentation
purposes and not used by tnfSetupJobsExist().
In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 161-188: The test duplicates the IP extraction logic instead of
reusing the production implementation; extract the peer-IP parsing into a
package-level helper (e.g., parsePeerIP or extractIPFromPeerURL) in the same
package where removeUnstartedEtcdMembers lives, move the current parsing logic
from removeUnstartedEtcdMembers into that helper, update
removeUnstartedEtcdMembers to call the helper, and change the test to call the
new helper directly (use the same test vectors as before) so tests exercise the
single source of truth for parsing.
- Around line 23-36: Remove the duplicate table-driven test entry by deleting
one of the identical cases — either "Upgrade - both nodes in k8s and pacemaker
with same IPs - no changes" or "No changes - k8s and pacemaker match (name+IP)"
— so only a single test with those k8sNodes/pacemakerNodes maps and
expectRemove/expectAdd=nil, expectSkip=true remains; ensure the remaining test
keeps the intended descriptive name and that the test slice used by Test... (the
table variable in runner_test.go) is updated accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 50b0a5a6-de76-4e7a-a702-d2e2fa448c0d
📒 Files selected for processing (4)
pkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/tnf/update-setup/runner.go
b496a74 to
8f6ff88
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tnf/update-setup/runner.go (1)
113-119:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't silently succeed when the targeted node is not running pacemaker.
Now that only one node runs
update-setup, returningnilhere drops the reconcile entirely if that chosen node is rebooting or otherwise not runningpcsyet. Bubble this up as a failure so the job retries instead of reporting success without reconciling membership.Suggested change
command := "/usr/sbin/pcs cluster status" _, _, err = exec.Execute(ctx, command) if err != nil { - klog.Infof("Cluster not running (err: %v), skipping update-setup on this node", err) - return nil + return fmt.Errorf("cluster not running on targeted node %s: %w", currentNodeName, err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/update-setup/runner.go` around lines 113 - 119, The code currently treats a failure running "/usr/sbin/pcs cluster status" as success by logging and returning nil; change this so the failure bubbles up and the job retries: in the block that runs exec.Execute(ctx, command) (the "/usr/sbin/pcs cluster status" check) return the error (or a wrapped error using fmt.Errorf) instead of nil and change the log to reflect an error return (e.g., klog.Errorf or klog.Infof with the intent to return err). This ensures the reconcile in runner.go requeues when pcs is not available rather than silently succeeding.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/tnf/operator/nodehandler.go`:
- Around line 147-165: Node Ready events currently always call handleNodeReady,
which re-triggers initial setup even after ExternalEtcdHasCompletedTransition;
change the switch so NodeEventTypeReady only calls handleNodeReady when
externalEtcdTransitionComplete is false (i.e., during initial bootstrap) and
otherwise skip/return no-op; update the same gating in the other occurrence of
this logic (the block covering lines 232-271) to use the same check; reference
externalEtcdTransitionComplete, ceohelpers.HasExternalEtcdCompletedTransition,
and handleNodeReady when locating and applying the change.
- Around line 337-396: The ConfigMap name derived from the process-local
updateSetupGeneration can collide across CEO restarts; change the cmName
generation in updateSetup (function updateSetup, variable cmName) to include a
restart-safe unique suffix (for example a timestamp or a UUID/random nonce) so
names are globally unique across process restarts; preserve the existing
"tnf-update-setup-<generation>" prefix and keep the numeric generation in the
ConfigMap Data/Labels (so determineUpdateSetupTargetNode and any cleanup logic
that reads the generation still work) and update any tests or cleanup code that
expect the exact name format to tolerate the new suffix.
---
Outside diff comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 113-119: The code currently treats a failure running
"/usr/sbin/pcs cluster status" as success by logging and returning nil; change
this so the failure bubbles up and the job retries: in the block that runs
exec.Execute(ctx, command) (the "/usr/sbin/pcs cluster status" check) return the
error (or a wrapped error using fmt.Errorf) instead of nil and change the log to
reflect an error return (e.g., klog.Errorf or klog.Infof with the intent to
return err). This ensures the reconcile in runner.go requeues when pcs is not
available rather than silently succeeding.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b461e84f-3a05-4de2-8a47-e5212c1140b9
📒 Files selected for processing (10)
bindata/tnfdeployment/role.yamlpkg/tnf/auth/runner.gopkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/operator/starter.gopkg/tnf/operator/starter_test.gopkg/tnf/pkg/pcs/auth.gopkg/tnf/pkg/tools/tnf_labels.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/tnf/pkg/tools/tnf_labels.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/tnf/operator/nodehandler_test.go
| // Add/delete post-handoff paths consult ExternalEtcdHasCompletedTransition; Ready does not run update-setup. | ||
| var externalEtcdTransitionComplete bool | ||
| if eventType == NodeEventTypeAdd || eventType == NodeEventTypeDelete { | ||
| var err error | ||
| externalEtcdTransitionComplete, err = ceohelpers.HasExternalEtcdCompletedTransition(ctx, operatorClient) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to check external etcd transition status: %w", err) | ||
| } | ||
| } | ||
|
|
||
| // Handle different event types | ||
| switch eventType { | ||
| case NodeEventTypeAdd: | ||
| return handleNodeAddition(nodeList, triggeringNode, externalEtcdTransitionComplete, ctx, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces, etcdInformer) | ||
| case NodeEventTypeReady: | ||
| return handleNodeReady(nodeList, ctx, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces, etcdInformer) | ||
| case NodeEventTypeDelete: | ||
| return handleNodeDeletion(nodeList, triggeringNode, externalEtcdTransitionComplete, ctx, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces, etcdInformer) | ||
| default: |
There was a problem hiding this comment.
Gate NodeEventTypeReady to initial bootstrap only.
After ExternalEtcdHasCompletedTransition becomes true, any later NotReady→Ready transition will still run handleNodeReady and kick startTnfJobcontrollersFunc again on an already-initialized cluster. That makes reboot recovery re-enter the initial setup path instead of leaving day-2 reconciliation to add/delete events.
Suggested change
- // Add/delete post-handoff paths consult ExternalEtcdHasCompletedTransition; Ready does not run update-setup.
+ // Post-handoff paths consult ExternalEtcdHasCompletedTransition. Ready should stay bootstrap-only.
var externalEtcdTransitionComplete bool
- if eventType == NodeEventTypeAdd || eventType == NodeEventTypeDelete {
+ if eventType == NodeEventTypeAdd || eventType == NodeEventTypeDelete || eventType == NodeEventTypeReady {
var err error
externalEtcdTransitionComplete, err = ceohelpers.HasExternalEtcdCompletedTransition(ctx, operatorClient)
if err != nil {
return fmt.Errorf("failed to check external etcd transition status: %w", err)
}
}
@@
case NodeEventTypeReady:
- return handleNodeReady(nodeList, ctx, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces, etcdInformer)
+ return handleNodeReady(nodeList, externalEtcdTransitionComplete, ctx, controllerContext, operatorClient, kubeClient, kubeInformersForNamespaces, etcdInformer) func handleNodeReady(
nodeList []*corev1.Node,
+ externalEtcdTransitionComplete bool,
ctx context.Context,
controllerContext *controllercmd.ControllerContext,
operatorClient v1helpers.StaticPodOperatorClient,
kubeClient kubernetes.Interface,
kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces,
etcdInformer operatorv1informers.EtcdInformer,
) error {
+ if externalEtcdTransitionComplete {
+ klog.V(4).Info("ignoring Ready event after external etcd handoff")
+ return nil
+ }
+
// Initial setup: need 2 ready nodesAlso applies to: 232-271
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/operator/nodehandler.go` around lines 147 - 165, Node Ready events
currently always call handleNodeReady, which re-triggers initial setup even
after ExternalEtcdHasCompletedTransition; change the switch so
NodeEventTypeReady only calls handleNodeReady when
externalEtcdTransitionComplete is false (i.e., during initial bootstrap) and
otherwise skip/return no-op; update the same gating in the other occurrence of
this logic (the block covering lines 232-271) to use the same check; reference
externalEtcdTransitionComplete, ceohelpers.HasExternalEtcdCompletedTransition,
and handleNodeReady when locating and applying the change.
| func getNextUpdateSetupGeneration() int64 { | ||
| updateSetupGenerationMutex.Lock() | ||
| defer updateSetupGenerationMutex.Unlock() | ||
| updateSetupGeneration++ | ||
| return updateSetupGeneration | ||
| } | ||
|
|
||
| // updateSetup writes a snapshot ConfigMap, runs auth on all nodes, update-setup on one target, then after-setup on all. | ||
| func updateSetup( | ||
| nodeList []*corev1.Node, | ||
| triggeringNode *corev1.Node, | ||
| eventType string, | ||
| ctx context.Context, | ||
| controllerContext *controllercmd.ControllerContext, | ||
| operatorClient v1helpers.StaticPodOperatorClient, | ||
| kubeClient kubernetes.Interface, | ||
| kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces, | ||
| ) error { | ||
|
|
||
| generation := getNextUpdateSetupGeneration() | ||
|
|
||
| targetNode := determineUpdateSetupTargetNode(nodeList, triggeringNode, eventType) | ||
| if targetNode == nil { | ||
| return fmt.Errorf("failed to determine target node for update-setup (event: %s, triggering: %s, generation: %d)", | ||
| eventType, triggeringNode.Name, generation) | ||
| } | ||
|
|
||
| klog.Infof("Generation %d: Event=%s, Triggering=%s, Target=%s, NodeList=%v", | ||
| generation, eventType, triggeringNode.Name, targetNode.Name, getNodeNames(nodeList)) | ||
|
|
||
| // Snapshot nodeList into ConfigMap for the runner (see comment on updateSetup). | ||
| nodeListData, err := encodeNodeList(nodeList) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to encode node list: %w", err) | ||
| } | ||
|
|
||
| cmName := fmt.Sprintf("tnf-update-setup-%d", generation) | ||
| cmData := map[string]string{ | ||
| "nodes": nodeListData, | ||
| "generation": fmt.Sprintf("%d", generation), | ||
| "eventType": eventType, | ||
| "timestamp": time.Now().Format(time.RFC3339), | ||
| "targetNode": targetNode.Name, | ||
| } | ||
| cm := &corev1.ConfigMap{ | ||
| ObjectMeta: v1.ObjectMeta{ | ||
| Name: cmName, | ||
| Namespace: operatorclient.TargetNamespace, | ||
| Labels: map[string]string{ | ||
| "app.kubernetes.io/component": tools.TnfUpdateSetupComponentValue, | ||
| "tnf.etcd.openshift.io/generation": fmt.Sprintf("%d", generation), | ||
| }, | ||
| }, | ||
| Data: cmData, | ||
| } | ||
|
|
||
| _, err = kubeClient.CoreV1().ConfigMaps(operatorclient.TargetNamespace).Create(ctx, cm, v1.CreateOptions{}) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to create ConfigMap %s: %w", cmName, err) | ||
| } |
There was a problem hiding this comment.
Make update-setup generations restart-safe.
updateSetupGeneration is process-local, but the ConfigMap name is derived directly from it. If CEO restarts after creating tnf-update-setup-N and cleanup never happens, the next reconcile can reuse the same name and fail with AlreadyExists, blocking later update-setup runs.
Suggested change
func getNextUpdateSetupGeneration() int64 {
updateSetupGenerationMutex.Lock()
defer updateSetupGenerationMutex.Unlock()
- updateSetupGeneration++
+ next := time.Now().UnixNano()
+ if next <= updateSetupGeneration {
+ next = updateSetupGeneration + 1
+ }
+ updateSetupGeneration = next
return updateSetupGeneration
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func getNextUpdateSetupGeneration() int64 { | |
| updateSetupGenerationMutex.Lock() | |
| defer updateSetupGenerationMutex.Unlock() | |
| updateSetupGeneration++ | |
| return updateSetupGeneration | |
| } | |
| // updateSetup writes a snapshot ConfigMap, runs auth on all nodes, update-setup on one target, then after-setup on all. | |
| func updateSetup( | |
| nodeList []*corev1.Node, | |
| triggeringNode *corev1.Node, | |
| eventType string, | |
| ctx context.Context, | |
| controllerContext *controllercmd.ControllerContext, | |
| operatorClient v1helpers.StaticPodOperatorClient, | |
| kubeClient kubernetes.Interface, | |
| kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces, | |
| ) error { | |
| generation := getNextUpdateSetupGeneration() | |
| targetNode := determineUpdateSetupTargetNode(nodeList, triggeringNode, eventType) | |
| if targetNode == nil { | |
| return fmt.Errorf("failed to determine target node for update-setup (event: %s, triggering: %s, generation: %d)", | |
| eventType, triggeringNode.Name, generation) | |
| } | |
| klog.Infof("Generation %d: Event=%s, Triggering=%s, Target=%s, NodeList=%v", | |
| generation, eventType, triggeringNode.Name, targetNode.Name, getNodeNames(nodeList)) | |
| // Snapshot nodeList into ConfigMap for the runner (see comment on updateSetup). | |
| nodeListData, err := encodeNodeList(nodeList) | |
| if err != nil { | |
| return fmt.Errorf("failed to encode node list: %w", err) | |
| } | |
| cmName := fmt.Sprintf("tnf-update-setup-%d", generation) | |
| cmData := map[string]string{ | |
| "nodes": nodeListData, | |
| "generation": fmt.Sprintf("%d", generation), | |
| "eventType": eventType, | |
| "timestamp": time.Now().Format(time.RFC3339), | |
| "targetNode": targetNode.Name, | |
| } | |
| cm := &corev1.ConfigMap{ | |
| ObjectMeta: v1.ObjectMeta{ | |
| Name: cmName, | |
| Namespace: operatorclient.TargetNamespace, | |
| Labels: map[string]string{ | |
| "app.kubernetes.io/component": tools.TnfUpdateSetupComponentValue, | |
| "tnf.etcd.openshift.io/generation": fmt.Sprintf("%d", generation), | |
| }, | |
| }, | |
| Data: cmData, | |
| } | |
| _, err = kubeClient.CoreV1().ConfigMaps(operatorclient.TargetNamespace).Create(ctx, cm, v1.CreateOptions{}) | |
| if err != nil { | |
| return fmt.Errorf("failed to create ConfigMap %s: %w", cmName, err) | |
| } | |
| func getNextUpdateSetupGeneration() int64 { | |
| updateSetupGenerationMutex.Lock() | |
| defer updateSetupGenerationMutex.Unlock() | |
| next := time.Now().UnixNano() | |
| if next <= updateSetupGeneration { | |
| next = updateSetupGeneration + 1 | |
| } | |
| updateSetupGeneration = next | |
| return updateSetupGeneration | |
| } | |
| // updateSetup writes a snapshot ConfigMap, runs auth on all nodes, update-setup on one target, then after-setup on all. | |
| func updateSetup( | |
| nodeList []*corev1.Node, | |
| triggeringNode *corev1.Node, | |
| eventType string, | |
| ctx context.Context, | |
| controllerContext *controllercmd.ControllerContext, | |
| operatorClient v1helpers.StaticPodOperatorClient, | |
| kubeClient kubernetes.Interface, | |
| kubeInformersForNamespaces v1helpers.KubeInformersForNamespaces, | |
| ) error { | |
| generation := getNextUpdateSetupGeneration() | |
| targetNode := determineUpdateSetupTargetNode(nodeList, triggeringNode, eventType) | |
| if targetNode == nil { | |
| return fmt.Errorf("failed to determine target node for update-setup (event: %s, triggering: %s, generation: %d)", | |
| eventType, triggeringNode.Name, generation) | |
| } | |
| klog.Infof("Generation %d: Event=%s, Triggering=%s, Target=%s, NodeList=%v", | |
| generation, eventType, triggeringNode.Name, targetNode.Name, getNodeNames(nodeList)) | |
| // Snapshot nodeList into ConfigMap for the runner (see comment on updateSetup). | |
| nodeListData, err := encodeNodeList(nodeList) | |
| if err != nil { | |
| return fmt.Errorf("failed to encode node list: %w", err) | |
| } | |
| cmName := fmt.Sprintf("tnf-update-setup-%d", generation) | |
| cmData := map[string]string{ | |
| "nodes": nodeListData, | |
| "generation": fmt.Sprintf("%d", generation), | |
| "eventType": eventType, | |
| "timestamp": time.Now().Format(time.RFC3339), | |
| "targetNode": targetNode.Name, | |
| } | |
| cm := &corev1.ConfigMap{ | |
| ObjectMeta: v1.ObjectMeta{ | |
| Name: cmName, | |
| Namespace: operatorclient.TargetNamespace, | |
| Labels: map[string]string{ | |
| "app.kubernetes.io/component": tools.TnfUpdateSetupComponentValue, | |
| "tnf.etcd.openshift.io/generation": fmt.Sprintf("%d", generation), | |
| }, | |
| }, | |
| Data: cmData, | |
| } | |
| _, err = kubeClient.CoreV1().ConfigMaps(operatorclient.TargetNamespace).Create(ctx, cm, v1.CreateOptions{}) | |
| if err != nil { | |
| return fmt.Errorf("failed to create ConfigMap %s: %w", cmName, err) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/operator/nodehandler.go` around lines 337 - 396, The ConfigMap name
derived from the process-local updateSetupGeneration can collide across CEO
restarts; change the cmName generation in updateSetup (function updateSetup,
variable cmName) to include a restart-safe unique suffix (for example a
timestamp or a UUID/random nonce) so names are globally unique across process
restarts; preserve the existing "tnf-update-setup-<generation>" prefix and keep
the numeric generation in the ConfigMap Data/Labels (so
determineUpdateSetupTargetNode and any cleanup logic that reads the generation
still work) and update any tests or cleanup code that expect the exact name
format to tolerate the new suffix.
Fixed race conditions in update-setup during node replacement by using captured node state, sequential event processing, and single-job reconciliation. Also fixes install ordering where Ready and fencing paths could treat early TNF jobs as bootstrap progress and skip or race auth/setup. Fixes TNF bootstrap never starting when etcd-operator starts after both control-plane nodes are already Ready (informer never emits NotReady→Ready, while Add stays handoff-gated). - Capture node state at event time in ConfigMap (source of truth, prevents fresh API call races) - Use generation counter for sequential event processing (handles rapid add/delete scenarios) - Run single update-setup job on target node based on event type (peer for add, survivor for delete) - Split event handling into specialized handlers for add/ready/delete; Ready always runs startTnfJobcontrollers when two nodes are ready (no coarse Job-list gating: fencing can exist before auth/setup completes) - Grant tnf-setup-manager get/list/watch on ConfigMaps in openshift-etcd so update-setup can list ConfigMaps labeled with the shared component value (tools.TnfUpdateSetupComponentValue; operator and runner stay in sync) - Gate day-2 add/delete, fencing-secret-driven fencing restarts, and node Add handling on ExternalEtcdHasCompletedTransition instead of inferring bootstrap from any TNF Job (avoids update-setup racing initial auth/setup); enqueue Add for all control-plane nodes so day-2 work is not tied to the new node's first Ready - After control-plane node informer registration, one-shot catch-up: if tnf-setup-job is missing and two control-plane nodes are Ready, run the NodeEventTypeReady path (covers operator restart and missed Ready edges); catch-up is not gated on ExternalEtcdHasCompletedTransition, which is set only after TNF drives etcd handoff - Align nodehandler_test and starter_test with handleNodesWithRetry(triggeringNode, eventType) and the gating/Ready behavior above Also: - TNF auth tolerates a single control-plane node (pcs host auth one host). - update-setup: continue on pcs remove output containing 401; ConfigureFencing after membership change; always pcs cluster sync/start at end. - Tests: controlPlaneNodesExceptName; pcs 401 detection.
8f6ff88 to
e53b901
Compare
The update-setup job has a rare chance of incorrectly treating upgrade-related node reboots as node replacement scenarios, causing pacemaker cluster split-brain.
When a node reboots during upgrade, it appears offline in pacemaker. The update-setup job detects this and runs node replacement logic (removing the offline node from cluster config), even though the node is just temporarily rebooting and will rejoin. This results in mismatched pacemaker configurations:
The fix adds a check to verify if the offline node still exists in Kubernetes before treating it as a replacement scenario. Node replacement requires the user to DELETE the node from Kubernetes, so:
If node offline in pacemaker BUT still exists in k8s → Skip replacement logic (node just rebooted, will rejoin naturally)
If node offline in pacemaker AND deleted from k8s → Proceed with replacement logic (actual replacement scenario)
This ensures update-setup only runs replacement logic when both conditions are met, preventing split-brain during upgrades.
Tests cover:
This PR works in tandem with ClusterLabs/resource-agents#2155
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores