Skip to content

OCPEDGE-2711 : register pacemaker alert agents for fencing events to taint/untaint nodes#1625

Open
vimauro wants to merge 5 commits into
openshift:mainfrom
vimauro:tnf-alerts-registration
Open

OCPEDGE-2711 : register pacemaker alert agents for fencing events to taint/untaint nodes#1625
vimauro wants to merge 5 commits into
openshift:mainfrom
vimauro:tnf-alerts-registration

Conversation

@vimauro

@vimauro vimauro commented May 29, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

  • New Features

    • Two pacemaker fencing alert agents are now automatically configured during cluster setup and update; setup/update will abort if alert registration fails. Existing alert definitions are validated and replaced as needed, including verification that alert scripts are present and executable and optional XML select-filters are applied.
  • Tests

    • Added unit test coverage verifying the two expected alert configurations.

@openshift-ci

openshift-ci Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 29, 2026
@coderabbitai

coderabbitai Bot commented May 29, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6bef9088-9203-49c0-b2f9-7c54882747dc

📥 Commits

Reviewing files that changed from the base of the PR and between 6944e1f and 982fa87.

📒 Files selected for processing (2)
  • pkg/tnf/pkg/pcs/alerts.go
  • pkg/tnf/pkg/pcs/alerts_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/tnf/pkg/pcs/alerts_test.go

Walkthrough

Adds ConfigureAlerts to register two pacemaker alert agents for fencing; includes helpers to check script executability, detect/create/update alerts via pcs and cibadmin, and inject optional CIB filters. TNF setup and update-setup call ConfigureAlerts after fencing. Changes Pacemaker Alert Agent Configuration Layer / File(s) Summary Alert configuration contract and test pkg/tnf/pkg/pcs/alerts.go, pkg/tnf/pkg/pcs/alerts_test.go Defines alertConfig and the static alertConfigs list containing two alert IDs, expected script filesystem paths, and optional <select> XML fragments. Adds TestAlertConfigs asserting exactly two entries and validating their fields. ConfigureAlerts entrypoint and runner integration pkg/tnf/pkg/pcs/alerts.go, pkg/tnf/setup/runner.go, pkg/tnf/update-setup/runner.go Adds ConfigureAlerts(ctx) which logs start/success and iterates configured alerts, returning on first error. RunTnfSetup and RunTnfUpdateSetup now call pcs.ConfigureAlerts(ctx) immediately after fencing and abort on error. Alert creation and helpers pkg/tnf/pkg/pcs/alerts.go configureAlert verifies the alert script is executable on the host, checks for existing alerts via pcs alert config --output-format json (unmarshals JSON), updates existing alerts via cibadmin --modify --xml-text or creates alerts with a fixed timeout and optionally applies a CIB <select> filter. Helper functions applyAlertSelectFilter, alertExists, pcsAlertConfigOutput, and fileExistsOnHost were added. Estimated code review effort 🎯 3 (Moderate) | ⏱️ ~20 minutes [!IMPORTANT] Pre-merge checks failed Please resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (2 errors, 1 warning) Check name Status Explanation Resolution Container-Privileges ❌ Error PR adds Kubernetes manifests with privileged: true, allowPrivilegeEscalation: true, and hostPID: true without justification in job.yaml manifest. Add justification comments to bindata/tnfdeployment/job.yaml explaining why these privileged settings are required, matching the pattern in cronjob.yaml. No-Sensitive-Data-In-Logs ❌ Error Code logs command stdout/stderr (lines 75, 105, 120, 137 in alerts.go) which could expose sensitive operational data, command output, or system details in error messages. Remove or sanitize logging of stdout/stderr from command executions; log only error codes or sanitized error messages without raw command output. Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold. ✅ Passed checks (12 passed) Check name Status Explanation Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled. Title check ✅ Passed The PR title accurately describes the main change: registering pacemaker alert agents for fencing events to taint/untaint nodes, which is reflected in the new ConfigureAlerts function and its integration into the setup runners. Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request. Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request. Stable And Deterministic Test Names ✅ Passed The PR adds no Ginkgo tests. The only new test is TestAlertConfigs, a standard Go test with no dynamic information in the test name or body. Test Structure And Quality ✅ Passed PR adds standard Go unit tests (not Ginkgo) with clear assertion messages and no resource cleanup issues required. Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. Only a standard Go unit test (TestAlertConfigs) is added, which uses testing.T and does not use Ginkgo patterns. Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR adds implementation code and a standard Go unit test (using *testing.T), not Ginkgo e2e tests. The custom check only applies to Ginkgo e2e tests, so it's not applicable here. Topology-Aware Scheduling Compatibility ✅ Passed PR adds backend pacemaker alert configuration via pcs/cibadmin commands, not Kubernetes manifests or pod scheduling specs. No topology constraints introduced. Ote Binary Stdout Contract ✅ Passed New ConfigureAlerts function uses only klog inside function bodies with no stdout writes; called only from tnf-setup-runner, not OTE binary, which is unaffected. Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. Only standard Go unit tests (TestAlertConfigs) and production code are modified, so the IPv6/disconnected network compatibility check does not apply. No-Weak-Crypto ✅ Passed No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or insecure secret comparisons found in the pull request code. ✨ Finishing Touches 🧪 Generate unit tests (beta) [ ] Create PR with unit tests [!WARNING] There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2) Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mshitrit for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/tnf/pkg/pcs/alerts.go`:
- Around line 84-89: The fileExistsOnHost function currently treats any
exec.Execute error as "file missing"; instead, update fileExistsOnHost to
distinguish a normal non-zero exit (command ran but file absent) from real
execution failures: call exec.Execute(ctx, cmd) and if err is nil return true,
nil; if err represents a process exit/non-zero status treat that as file not
present and return false, nil; otherwise return false, err to propagate real
runtime/exec errors. Ensure you reference the fileExistsOnHost function and the
exec.Execute call when making this change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 07b04fa5-83d2-4758-9ba1-dbe9f780a6a6

📥 Commits

Reviewing files that changed from the base of the PR and between 484906e and bfbfd25.

📒 Files selected for processing (4)
  • pkg/tnf/pkg/pcs/alerts.go
  • pkg/tnf/pkg/pcs/alerts_test.go
  • pkg/tnf/setup/runner.go
  • pkg/tnf/update-setup/runner.go

Comment thread pkg/tnf/pkg/pcs/alerts.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
pkg/tnf/pkg/pcs/alerts.go (1)

117-122: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only treat test -x exit code 1 as "not present".

Lines 117-120 still map every *os/exec.ExitError to false, nil. For test -x, exit code 1 means "missing/not executable", but exit code 2 means a real command error. Collapsing both can still hide broken command construction and turn an actionable failure into a misleading retry.

💡 Minimal fix
 	if err != nil {
 		var exitErr *osexec.ExitError
 		if errors.As(err, &exitErr) {
-			return false, nil
+			if exitErr.ExitCode() == 1 {
+				return false, nil
+			}
 		}
 		return false, fmt.Errorf("failed to check file %s on host: %w", path, err)
 	}
#!/bin/bash
set -euo pipefail

printf '\n== exec.Execute implementation ==\n'
rg -n -C4 --type go '^func Execute\('

printf '\n== current ExitError handling ==\n'
rg -n -C3 --type go 'ExitError|ExitCode\(' pkg/tnf/pkg

As per coding guidelines, **/*.go: Go security (prodsec-skills): Never ignore error returns.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tnf/pkg/pcs/alerts.go` around lines 117 - 122, The current errors.As
block in the file-checking logic erroneously treats all *os/exec.ExitError
values as success (return false, nil); change the handling in the function that
checks a path (uses variables path, err and the exitErr *osexec.ExitError) to
inspect the exit code via exitErr.ExitCode(): if ExitCode() == 1 return false,
nil (file missing/not executable), but for any other exit code return false with
the wrapped error (fmt.Errorf("failed to check file %s on host: %w", path, err))
so real command failures (e.g., exit 2) are surfaced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@pkg/tnf/pkg/pcs/alerts.go`:
- Around line 117-122: The current errors.As block in the file-checking logic
erroneously treats all *os/exec.ExitError values as success (return false, nil);
change the handling in the function that checks a path (uses variables path, err
and the exitErr *osexec.ExitError) to inspect the exit code via
exitErr.ExitCode(): if ExitCode() == 1 return false, nil (file missing/not
executable), but for any other exit code return false with the wrapped error
(fmt.Errorf("failed to check file %s on host: %w", path, err)) so real command
failures (e.g., exit 2) are surfaced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 982cc16f-9d35-4a3b-bd33-435735d38f94

📥 Commits

Reviewing files that changed from the base of the PR and between bfbfd25 and e1885d4.

📒 Files selected for processing (2)
  • pkg/tnf/pkg/pcs/alerts.go
  • pkg/tnf/pkg/pcs/alerts_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/tnf/pkg/pcs/alerts_test.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/tnf/pkg/pcs/alerts.go (1)

110-118: ⚡ Quick win

Include command output in alertExists failures.

If pcs emits unexpected text, the setup flows fail here but the returned error drops the stdout that explains what actually came back. Include both streams in the exec-failure and JSON-parse paths.

💡 Proposed fix
 	stdOut, stdErr, err := exec.Execute(ctx, cmd)
 	if err != nil {
-		return false, fmt.Errorf("pcs alert config failed: stderr=%s: %w", stdErr, err)
+		return false, fmt.Errorf("pcs alert config failed: stdout=%q stderr=%q: %w", stdOut, stdErr, err)
 	}
 
 	var config pcsAlertConfigOutput
 	if err := json.Unmarshal([]byte(stdOut), &config); err != nil {
-		return false, fmt.Errorf("failed to parse pcs alert config JSON: %w", err)
+		return false, fmt.Errorf("failed to parse pcs alert config JSON: stdout=%q stderr=%q: %w", stdOut, stdErr, err)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tnf/pkg/pcs/alerts.go` around lines 110 - 118, In alertExists, include
both stdOut and stdErr in the returned errors: when exec.Execute(ctx, cmd)
returns err, wrap it with both stdout and stderr (use stdOut and stdErr) instead
of only stderr; similarly, when json.Unmarshal into pcsAlertConfigOutput fails,
return an error that contains the raw stdOut and stdErr so callers see what pcs
actually emitted. Update the error messages around exec.Execute and
json.Unmarshal to concatenate or format stdOut/stdErr alongside the original
error.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/tnf/pkg/pcs/alerts.go`:
- Around line 110-118: In alertExists, include both stdOut and stdErr in the
returned errors: when exec.Execute(ctx, cmd) returns err, wrap it with both
stdout and stderr (use stdOut and stdErr) instead of only stderr; similarly,
when json.Unmarshal into pcsAlertConfigOutput fails, return an error that
contains the raw stdOut and stdErr so callers see what pcs actually emitted.
Update the error messages around exec.Execute and json.Unmarshal to concatenate
or format stdOut/stdErr alongside the original error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d6df571c-91a5-4551-9020-ba890f61b515

📥 Commits

Reviewing files that changed from the base of the PR and between e1885d4 and 9e5ade1.

📒 Files selected for processing (1)
  • pkg/tnf/pkg/pcs/alerts.go

@vimauro

vimauro commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

/label tide/merge-method-squash

@vimauro vimauro marked this pull request as ready for review June 4, 2026 14:51
@openshift-ci openshift-ci Bot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jun 4, 2026
@openshift-ci openshift-ci Bot requested review from eggfoobar and fonta-rh June 4, 2026 14:52

@fonta-rh fonta-rh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cross-repo review of CEO #1625 + MCO #6092 (fencing taint/untaint alert agents). One finding for this PR survived adversarial vetting (four others are on MCO #6092).

Comment thread pkg/tnf/pkg/pcs/alerts.go
@openshift-ci

openshift-ci Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

@vimauro: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-agnostic-ovn 982fa87 link true /test e2e-agnostic-ovn

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants