Skip to content

SREP-4577: Add keep_firing_for to MHC short circuit alert to reduce flapping#2709

Open
dustman9000 wants to merge 1 commit into
openshift:masterfrom
dustman9000:drow/mhc-alert-tuning
Open

SREP-4577: Add keep_firing_for to MHC short circuit alert to reduce flapping#2709
dustman9000 wants to merge 1 commit into
openshift:masterfrom
dustman9000:drow/mhc-alert-tuning

Conversation

@dustman9000
Copy link
Copy Markdown
Member

@dustman9000 dustman9000 commented Apr 18, 2026

Summary

  • Adds keep_firing_for: 1h to the MachineHealthCheckUnterminatedShortCircuitSRE alert rule
  • Keeps the original for: 30m detection threshold unchanged
  • Reduces PagerDuty noise from flapping MHC alerts by keeping the alert firing through brief recovery windows

Problem

Observed 40+ PagerDuty incidents in 29 hours on dapa-c1b.zl8c.p1 due to machines cycling between unhealthy/healthy states every ~45 minutes. Each cycle briefly cleared the alert (5 min recovery) before it re-fired, creating a new PD incident each time.

Solution

keep_firing_for: 1h means once the alert fires (after the 30m for: threshold), it stays firing through recovery windows up to 1 hour. This collapses the flap into a single PD incident while preserving detection sensitivity for new occurrences.

Note on hack/ changes

The hack/ template diffs include re-ordering of osd-cluster-acks-* SelectorSyncSet resources. This is a pre-existing make idempotency issue from the recent 4.22 credential policies merge (#2707), which did not regenerate hack/ templates. Running make on a clean upstream/master checkout produces the same re-ordering. The only actual change in this PR is the keep_firing_for: 1h addition.

Jira: https://redhat.atlassian.net/browse/SREP-4577

Test plan

  • Verify keep_firing_for is supported on OCP Prometheus (available since Prometheus 2.42 / OCP 4.13+)
  • Deploy to stage and verify alert behavior on a cluster with MHC flapping
  • Confirm no regression in alert detection for persistent MHC short-circuit events

Summary by CodeRabbit

  • Chores
    • Extended a Prometheus alert’s post-evaluation firing retention by 1 hour for improved monitoring persistence.
    • Retargeted multiple cluster selector sync sets to new OpenShift minor/patch versions for compatibility.
    • Updated cloud credential patch annotations to align with the revised selector version mappings.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 18, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 18, 2026

@dustman9000: This pull request references SREP-4577 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Adds keep_firing_for: 1h to the MachineHealthCheckUnterminatedShortCircuitSRE alert rule
  • Keeps the original for: 30m detection threshold unchanged
  • Reduces PagerDuty noise from flapping MHC alerts by keeping the alert firing through brief recovery windows

Problem

Observed 40+ PagerDuty incidents in 29 hours on dapa-c1b.zl8c.p1 due to machines cycling between unhealthy/healthy states every ~45 minutes. Each cycle briefly cleared the alert (5 min recovery) before it re-fired, creating a new PD incident each time.

Solution

keep_firing_for: 1h means once the alert fires (after the 30m for: threshold), it stays firing through recovery windows up to 1 hour. This collapses the flap into a single PD incident while preserving detection sensitivity for new occurrences.

Jira: https://redhat.atlassian.net/browse/SREP-4577

Test plan

  • Verify keep_firing_for is supported on OCP Prometheus (available since Prometheus 2.42 / OCP 4.13+)
  • Deploy to stage and verify alert behavior on a cluster with MHC flapping
  • Confirm no regression in alert detection for persistent MHC short-circuit events

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from Tof1973 and bergmannf April 18, 2026 21:49
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 18, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dustman9000

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 18, 2026

Walkthrough

Added keep_firing_for: 1h to a Prometheus alert and remapped multiple Hive SelectorSyncSet entries plus corresponding CloudCredential patch upgradeable-to values across integration, production, and stage OSD-managed cluster templates.

Changes

Cohort / File(s) Summary
Prometheus alert
deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml
Added keep_firing_for: 1h to the MachineHealthCheckUnterminatedShortCircuitSRE alert (existing for: 30m retained).
OSD-managed cluster SelectorSyncSet & CloudCredential templates
hack/00-osd-managed-cluster-config-integration.yaml.tmpl, hack/00-osd-managed-cluster-config-production.yaml.tmpl, hack/00-osd-managed-cluster-config-stage.yaml.tmpl
Renamed and retargeted multiple SelectorSyncSet entries (STS/WIF) by shifting version identifiers (e.g., osd-cluster-acks-sts-4.9...-4.22, WIF entries shifted down one patch version), adjusted clusterDeploymentSelector.matchExpressions (hive.openshift.io/version-major-minor, api.openshift.com/gate-*, `api.openshift.com/{wif

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 10
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding keep_firing_for to a specific alert rule to reduce flapping.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Stable And Deterministic Test Names ✅ Passed The custom check for Ginkgo test names is not applicable. This PR modifies only YAML configuration files (Prometheus rules and Kubernetes manifests) in a configuration management repository with no Go source code or test files.
Test Structure And Quality ✅ Passed PR contains only infrastructure configuration files with no Ginkgo test code to review.
Microshift Test Compatibility ✅ Passed The PR modifies only YAML configuration files and does not add any Ginkgo e2e tests, so this MicroShift compatibility check is not applicable and passes by default.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Repository is cluster configuration management with only YAML modifications; no Go test files added or modified, so SNO compatibility check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed This pull request does not introduce or modify deployment manifests, operator code, or controllers subject to topology-aware scheduling constraints.
Ote Binary Stdout Contract ✅ Passed The OTE Binary Stdout Contract check is not applicable to this PR. The PR only modifies YAML configuration files and templates (Prometheus rules and Kubernetes manifests), which are purely declarative and do not contain any Go code, main functions, test setup, or any executable code that could write to stdout.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This pull request contains no new Ginkgo e2e tests; changes consist entirely of YAML configuration files outside the scope of the IPv6 and disconnected network test compatibility check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
hack/00-osd-managed-cluster-config-integration.yaml.tmpl (1)

40987-41001: ⚠️ Potential issue | 🟠 Major

Restore the gate-wif guard on the 4.21 selector.

Line 40991 drops api.openshift.com/gate-wif, so this osd-cluster-acks-wif-4.21 syncset now matches every WIF 4.20 cluster and applies upgradeable-to: v4.21 unconditionally. That is a broader rollout than the adjacent WIF selectors, which are still gate-controlled.

Suggested fix
       - key: hive.openshift.io/version-major-minor
         operator: In
         values:
         - '4.20'
+      - key: api.openshift.com/gate-wif
+        operator: In
+        values:
+        - '4.21'
       - key: api.openshift.com/wif
         operator: In
         values:
         - 'true'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/00-osd-managed-cluster-config-integration.yaml.tmpl` around lines 40987
- 41001, The osd-cluster-acks-wif-4.21 SyncSet selector lost the gate check and
now matches all WIF 4.20 clusters; restore the missing gate by adding the
api.openshift.com/gate-wif selector entry (operator: In, values: ['true'])
alongside api.openshift.com/wif under the same selector block so the SyncSet
only targets clusters that have gate-wif=true; update the selector that precedes
resourceApplyMode/Patches in the osd-cluster-acks-wif-4.21 manifest to include
the gate-wif key.
hack/00-osd-managed-cluster-config-production.yaml.tmpl (1)

49797-49898: ⚠️ Potential issue | 🔴 Critical

Remove or version-gate keep_firing_for field – unsupported on OCP 4.12 and earlier.

The keep_firing_for field added at line 49898 is not supported in OpenShift 4.12 and earlier versions. Since this template references older OCP minors (4.8, 4.9), this PrometheusRule will fail validation on clusters running those versions. Either confirm the manifest targets only OCP 4.13+ clusters, or remove keep_firing_for and add version-specific gating if the field is required for newer versions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/00-osd-managed-cluster-config-production.yaml.tmpl` around lines 49797 -
49898, The PrometheusRule for MachineHealthCheckUnterminatedShortCircuitSRE
contains the unsupported keep_firing_for field; remove the keep_firing_for: 1h
line from the PrometheusRule spec (the rule block under kind: PrometheusRule
with name sre-machine-health-check-unterminated-short-circuit-alert and alert:
MachineHealthCheckUnterminatedShortCircuitSRE) or wrap that field in a version
gate so it only renders for OCP >= 4.13 (e.g., conditional templating around
keep_firing_for tied to cluster minor version); ensure the resulting manifest
validates on OCP 4.12 and earlier by either omitting the field or only emitting
it for 4.13+.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hack/00-osd-managed-cluster-config-integration.yaml.tmpl`:
- Around line 49897-49898: The PrometheusRule
deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml
uses the field keep_firing_for: 1h which is incompatible with OCP 4.12; either
gate this field to OCP 4.13+ or remove it: implement a version-specific variant
using the template's config.yaml selectors (add a 4.13+ variant that includes
keep_firing_for and keep the existing rule for older versions), or simply delete
the keep_firing_for line from the rule so it remains compatible with 4.12;
update the template for hack/00-osd-managed-cluster-config-integration.yaml.tmpl
accordingly and ensure the rule name and selectors in the PrometheusRule
manifest
(100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml) are
adjusted to reflect the versioned variants.

---

Outside diff comments:
In `@hack/00-osd-managed-cluster-config-integration.yaml.tmpl`:
- Around line 40987-41001: The osd-cluster-acks-wif-4.21 SyncSet selector lost
the gate check and now matches all WIF 4.20 clusters; restore the missing gate
by adding the api.openshift.com/gate-wif selector entry (operator: In, values:
['true']) alongside api.openshift.com/wif under the same selector block so the
SyncSet only targets clusters that have gate-wif=true; update the selector that
precedes resourceApplyMode/Patches in the osd-cluster-acks-wif-4.21 manifest to
include the gate-wif key.

In `@hack/00-osd-managed-cluster-config-production.yaml.tmpl`:
- Around line 49797-49898: The PrometheusRule for
MachineHealthCheckUnterminatedShortCircuitSRE contains the unsupported
keep_firing_for field; remove the keep_firing_for: 1h line from the
PrometheusRule spec (the rule block under kind: PrometheusRule with name
sre-machine-health-check-unterminated-short-circuit-alert and alert:
MachineHealthCheckUnterminatedShortCircuitSRE) or wrap that field in a version
gate so it only renders for OCP >= 4.13 (e.g., conditional templating around
keep_firing_for tied to cluster minor version); ensure the resulting manifest
validates on OCP 4.12 and earlier by either omitting the field or only emitting
it for 4.13+.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b622c142-16c9-4a79-a108-04ae02d7c8fc

📥 Commits

Reviewing files that changed from the base of the PR and between e2cf895 and 9bdc085.

📒 Files selected for processing (4)
  • deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml
  • hack/00-osd-managed-cluster-config-integration.yaml.tmpl
  • hack/00-osd-managed-cluster-config-production.yaml.tmpl
  • hack/00-osd-managed-cluster-config-stage.yaml.tmpl

Comment thread hack/00-osd-managed-cluster-config-integration.yaml.tmpl
The MachineHealthCheckUnterminatedShortCircuitSRE alert was generating
40+ PagerDuty incidents in 29 hours on a single cluster (dapa-c1b) due
to machines cycling between unhealthy/healthy states every ~45 minutes.
Each cycle briefly cleared the alert before it re-fired, creating a new
PD incident each time.

Adding keep_firing_for: 1h keeps the alert firing through brief recovery
windows (up to 1 hour), collapsing the flap into a single PD incident
while preserving the original 30m detection sensitivity.
@dustman9000 dustman9000 force-pushed the drow/mhc-alert-tuning branch from 9bdc085 to 08fa188 Compare April 18, 2026 21:54
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hack/00-osd-managed-cluster-config-stage.yaml.tmpl`:
- Line 40819: Update the PR description to explicitly document the
SelectorSyncSet version mapping changes included in this commit (e.g., the STS
entries like "osd-cluster-acks-sts-4.22" and WIF changes for 4.18–4.21) and
state whether these mapping updates are intentional dependencies required for
the keep_firing_for: 1h change or unrelated refactors that should be split into
a separate PR; mention the established convention that sync set version N
targets clusters on version N-1 and list the affected environment templates
(stage, production, integration) so reviewers can verify the scope and intent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 1944c992-4e8b-4d90-b2cf-2474d8bfdaa1

📥 Commits

Reviewing files that changed from the base of the PR and between 9bdc085 and 08fa188.

📒 Files selected for processing (4)
  • deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml
  • hack/00-osd-managed-cluster-config-integration.yaml.tmpl
  • hack/00-osd-managed-cluster-config-production.yaml.tmpl
  • hack/00-osd-managed-cluster-config-stage.yaml.tmpl
🚧 Files skipped from review as they are similar to previous changes (2)
  • deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml
  • hack/00-osd-managed-cluster-config-production.yaml.tmpl

Comment thread hack/00-osd-managed-cluster-config-stage.yaml.tmpl
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 18, 2026

@dustman9000: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@joshbranham
Copy link
Copy Markdown
Contributor

Fixing the template issue in #2711 standalone since I also have to open a PR and can't merge without it

@dustman9000
Copy link
Copy Markdown
Member Author

/rebase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants