SREP-4577: Add keep_firing_for to MHC short circuit alert to reduce flapping by dustman9000 · Pull Request #2709 · openshift/managed-cluster-config

dustman9000 · 2026-04-18T21:49:33Z

Summary

Adds keep_firing_for: 1h to the MachineHealthCheckUnterminatedShortCircuitSRE alert rule
Keeps the original for: 30m detection threshold unchanged
Reduces PagerDuty noise from flapping MHC alerts by keeping the alert firing through brief recovery windows

Problem

Observed 40+ PagerDuty incidents in 29 hours on dapa-c1b.zl8c.p1 due to machines cycling between unhealthy/healthy states every ~45 minutes. Each cycle briefly cleared the alert (5 min recovery) before it re-fired, creating a new PD incident each time.

Solution

keep_firing_for: 1h means once the alert fires (after the 30m for: threshold), it stays firing through recovery windows up to 1 hour. This collapses the flap into a single PD incident while preserving detection sensitivity for new occurrences.

Note on hack/ changes

The hack/ template diffs include re-ordering of osd-cluster-acks-* SelectorSyncSet resources. This is a pre-existing make idempotency issue from the recent 4.22 credential policies merge (#2707), which did not regenerate hack/ templates. Running make on a clean upstream/master checkout produces the same re-ordering. The only actual change in this PR is the keep_firing_for: 1h addition.

Jira: https://redhat.atlassian.net/browse/SREP-4577

Test plan

Verify keep_firing_for is supported on OCP Prometheus (available since Prometheus 2.42 / OCP 4.13+)
Deploy to stage and verify alert behavior on a cluster with MHC flapping
Confirm no regression in alert detection for persistent MHC short-circuit events

Summary by CodeRabbit

Chores
- Extended a Prometheus alert’s post-evaluation firing retention by 1 hour for improved monitoring persistence.
- Retargeted multiple cluster selector sync sets to new OpenShift minor/patch versions for compatibility.
- Updated cloud credential patch annotations to align with the revised selector version mappings.

openshift-ci-robot · 2026-04-18T21:49:38Z

@dustman9000: This pull request references SREP-4577 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Adds keep_firing_for: 1h to the MachineHealthCheckUnterminatedShortCircuitSRE alert rule

Keeps the original for: 30m detection threshold unchanged

Reduces PagerDuty noise from flapping MHC alerts by keeping the alert firing through brief recovery windows

Problem

Observed 40+ PagerDuty incidents in 29 hours on dapa-c1b.zl8c.p1 due to machines cycling between unhealthy/healthy states every ~45 minutes. Each cycle briefly cleared the alert (5 min recovery) before it re-fired, creating a new PD incident each time.

Solution

keep_firing_for: 1h means once the alert fires (after the 30m for: threshold), it stays firing through recovery windows up to 1 hour. This collapses the flap into a single PD incident while preserving detection sensitivity for new occurrences.

Jira: https://redhat.atlassian.net/browse/SREP-4577

Test plan

Verify keep_firing_for is supported on OCP Prometheus (available since Prometheus 2.42 / OCP 4.13+)

Deploy to stage and verify alert behavior on a cluster with MHC flapping

Confirm no regression in alert detection for persistent MHC short-circuit events

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-04-18T21:49:46Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dustman9000

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~deploy/sre-prometheus/OWNERS~~ [dustman9000]
~~hack/OWNERS~~ [dustman9000]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-04-18T21:49:48Z

Walkthrough

Added keep_firing_for: 1h to a Prometheus alert and remapped multiple Hive SelectorSyncSet entries plus corresponding CloudCredential patch upgradeable-to values across integration, production, and stage OSD-managed cluster templates.

Changes

Cohort / File(s)	Summary
Prometheus alert `deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml`	Added `keep_firing_for: 1h` to the `MachineHealthCheckUnterminatedShortCircuitSRE` alert (existing `for: 30m` retained).
OSD-managed cluster SelectorSyncSet & CloudCredential templates `hack/00-osd-managed-cluster-config-integration.yaml.tmpl`, `hack/00-osd-managed-cluster-config-production.yaml.tmpl`, `hack/00-osd-managed-cluster-config-stage.yaml.tmpl`	Renamed and retargeted multiple `SelectorSyncSet` entries (STS/WIF) by shifting version identifiers (e.g., `osd-cluster-acks-sts-4.9` → `...-4.22`, WIF entries shifted down one patch version), adjusted `clusterDeploymentSelector.matchExpressions` (`hive.openshift.io/version-major-minor`, `api.openshift.com/gate-*`, `api.openshift.com/{wif

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 10

✅ Passed checks (10 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: adding keep_firing_for to a specific alert rule to reduce flapping.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Stable And Deterministic Test Names	✅ Passed	The custom check for Ginkgo test names is not applicable. This PR modifies only YAML configuration files (Prometheus rules and Kubernetes manifests) in a configuration management repository with no Go source code or test files.
Test Structure And Quality	✅ Passed	PR contains only infrastructure configuration files with no Ginkgo test code to review.
Microshift Test Compatibility	✅ Passed	The PR modifies only YAML configuration files and does not add any Ginkgo e2e tests, so this MicroShift compatibility check is not applicable and passes by default.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	Repository is cluster configuration management with only YAML modifications; no Go test files added or modified, so SNO compatibility check is not applicable.
Topology-Aware Scheduling Compatibility	✅ Passed	This pull request does not introduce or modify deployment manifests, operator code, or controllers subject to topology-aware scheduling constraints.
Ote Binary Stdout Contract	✅ Passed	The OTE Binary Stdout Contract check is not applicable to this PR. The PR only modifies YAML configuration files and templates (Prometheus rules and Kubernetes manifests), which are purely declarative and do not contain any Go code, main functions, test setup, or any executable code that could write to stdout.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	This pull request contains no new Ginkgo e2e tests; changes consist entirely of YAML configuration files outside the scope of the IPv6 and disconnected network test compatibility check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

hack/00-osd-managed-cluster-config-integration.yaml.tmpl (1)
40987-41001: ⚠️ Potential issue | 🟠 Major

Restore the gate-wif guard on the 4.21 selector.

Line 40991 drops api.openshift.com/gate-wif, so this osd-cluster-acks-wif-4.21 syncset now matches every WIF 4.20 cluster and applies upgradeable-to: v4.21 unconditionally. That is a broader rollout than the adjacent WIF selectors, which are still gate-controlled.
Suggested fix
       - key: hive.openshift.io/version-major-minor
         operator: In
         values:
         - '4.20'
+      - key: api.openshift.com/gate-wif
+        operator: In
+        values:
+        - '4.21'
       - key: api.openshift.com/wif
         operator: In
         values:
         - 'true'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/00-osd-managed-cluster-config-integration.yaml.tmpl` around lines 40987
- 41001, The osd-cluster-acks-wif-4.21 SyncSet selector lost the gate check and
now matches all WIF 4.20 clusters; restore the missing gate by adding the
api.openshift.com/gate-wif selector entry (operator: In, values: ['true'])
alongside api.openshift.com/wif under the same selector block so the SyncSet
only targets clusters that have gate-wif=true; update the selector that precedes
resourceApplyMode/Patches in the osd-cluster-acks-wif-4.21 manifest to include
the gate-wif key.
hack/00-osd-managed-cluster-config-production.yaml.tmpl (1)
49797-49898: ⚠️ Potential issue | 🔴 Critical

Remove or version-gate keep_firing_for field – unsupported on OCP 4.12 and earlier.

The keep_firing_for field added at line 49898 is not supported in OpenShift 4.12 and earlier versions. Since this template references older OCP minors (4.8, 4.9), this PrometheusRule will fail validation on clusters running those versions. Either confirm the manifest targets only OCP 4.13+ clusters, or remove keep_firing_for and add version-specific gating if the field is required for newer versions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/00-osd-managed-cluster-config-production.yaml.tmpl` around lines 49797 -
49898, The PrometheusRule for MachineHealthCheckUnterminatedShortCircuitSRE
contains the unsupported keep_firing_for field; remove the keep_firing_for: 1h
line from the PrometheusRule spec (the rule block under kind: PrometheusRule
with name sre-machine-health-check-unterminated-short-circuit-alert and alert:
MachineHealthCheckUnterminatedShortCircuitSRE) or wrap that field in a version
gate so it only renders for OCP >= 4.13 (e.g., conditional templating around
keep_firing_for tied to cluster minor version); ensure the resulting manifest
validates on OCP 4.12 and earlier by either omitting the field or only emitting
it for 4.13+.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hack/00-osd-managed-cluster-config-integration.yaml.tmpl`:
- Around line 49897-49898: The PrometheusRule
deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml
uses the field keep_firing_for: 1h which is incompatible with OCP 4.12; either
gate this field to OCP 4.13+ or remove it: implement a version-specific variant
using the template's config.yaml selectors (add a 4.13+ variant that includes
keep_firing_for and keep the existing rule for older versions), or simply delete
the keep_firing_for line from the rule so it remains compatible with 4.12;
update the template for hack/00-osd-managed-cluster-config-integration.yaml.tmpl
accordingly and ensure the rule name and selectors in the PrometheusRule
manifest
(100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml) are
adjusted to reflect the versioned variants.

---

Outside diff comments:
In `@hack/00-osd-managed-cluster-config-integration.yaml.tmpl`:
- Around line 40987-41001: The osd-cluster-acks-wif-4.21 SyncSet selector lost
the gate check and now matches all WIF 4.20 clusters; restore the missing gate
by adding the api.openshift.com/gate-wif selector entry (operator: In, values:
['true']) alongside api.openshift.com/wif under the same selector block so the
SyncSet only targets clusters that have gate-wif=true; update the selector that
precedes resourceApplyMode/Patches in the osd-cluster-acks-wif-4.21 manifest to
include the gate-wif key.

In `@hack/00-osd-managed-cluster-config-production.yaml.tmpl`:
- Around line 49797-49898: The PrometheusRule for
MachineHealthCheckUnterminatedShortCircuitSRE contains the unsupported
keep_firing_for field; remove the keep_firing_for: 1h line from the
PrometheusRule spec (the rule block under kind: PrometheusRule with name
sre-machine-health-check-unterminated-short-circuit-alert and alert:
MachineHealthCheckUnterminatedShortCircuitSRE) or wrap that field in a version
gate so it only renders for OCP >= 4.13 (e.g., conditional templating around
keep_firing_for tied to cluster minor version); ensure the resulting manifest
validates on OCP 4.12 and earlier by either omitting the field or only emitting
it for 4.13+.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b622c142-16c9-4a79-a108-04ae02d7c8fc

📥 Commits

Reviewing files that changed from the base of the PR and between e2cf895 and 9bdc085.

📒 Files selected for processing (4)

deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml
hack/00-osd-managed-cluster-config-integration.yaml.tmpl
hack/00-osd-managed-cluster-config-production.yaml.tmpl
hack/00-osd-managed-cluster-config-stage.yaml.tmpl

The MachineHealthCheckUnterminatedShortCircuitSRE alert was generating 40+ PagerDuty incidents in 29 hours on a single cluster (dapa-c1b) due to machines cycling between unhealthy/healthy states every ~45 minutes. Each cycle briefly cleared the alert before it re-fired, creating a new PD incident each time. Adding keep_firing_for: 1h keeps the alert firing through brief recovery windows (up to 1 hour), collapsing the flap into a single PD incident while preserving the original 30m detection sensitivity.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hack/00-osd-managed-cluster-config-stage.yaml.tmpl`:
- Line 40819: Update the PR description to explicitly document the
SelectorSyncSet version mapping changes included in this commit (e.g., the STS
entries like "osd-cluster-acks-sts-4.22" and WIF changes for 4.18–4.21) and
state whether these mapping updates are intentional dependencies required for
the keep_firing_for: 1h change or unrelated refactors that should be split into
a separate PR; mention the established convention that sync set version N
targets clusters on version N-1 and list the affected environment templates
(stage, production, integration) so reviewers can verify the scope and intent.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 1944c992-4e8b-4d90-b2cf-2474d8bfdaa1

📥 Commits

Reviewing files that changed from the base of the PR and between 9bdc085 and 08fa188.

📒 Files selected for processing (4)

deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml
hack/00-osd-managed-cluster-config-integration.yaml.tmpl
hack/00-osd-managed-cluster-config-production.yaml.tmpl
hack/00-osd-managed-cluster-config-stage.yaml.tmpl

🚧 Files skipped from review as they are similar to previous changes (2)

deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml
hack/00-osd-managed-cluster-config-production.yaml.tmpl

openshift-ci · 2026-04-18T21:59:50Z

@dustman9000: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

joshbranham · 2026-04-20T15:19:49Z

Fixing the template issue in #2711 standalone since I also have to open a PR and can't merge without it

dustman9000 · 2026-04-30T14:22:44Z

/rebase

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 18, 2026

openshift-ci Bot requested review from Tof1973 and bergmannf April 18, 2026 21:49

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2026

coderabbitai Bot reviewed Apr 18, 2026

View reviewed changes

Comment thread hack/00-osd-managed-cluster-config-integration.yaml.tmpl

dustman9000 force-pushed the drow/mhc-alert-tuning branch from 9bdc085 to 08fa188 Compare April 18, 2026 21:54

coderabbitai Bot reviewed Apr 18, 2026

View reviewed changes

Comment thread hack/00-osd-managed-cluster-config-stage.yaml.tmpl

joshbranham reviewed Apr 18, 2026

View reviewed changes

Comment thread deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SREP-4577: Add keep_firing_for to MHC short circuit alert to reduce flapping#2709

SREP-4577: Add keep_firing_for to MHC short circuit alert to reduce flapping#2709
dustman9000 wants to merge 1 commit into
openshift:masterfrom
dustman9000:drow/mhc-alert-tuning

dustman9000 commented Apr 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

openshift-ci-robot commented Apr 18, 2026 •

edited by openshift-ci Bot

Loading

Summary

Problem

Solution

Test plan

Uh oh!

openshift-ci Bot commented Apr 18, 2026

Uh oh!

coderabbitai Bot commented Apr 18, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

openshift-ci Bot commented Apr 18, 2026

Uh oh!

Uh oh!

joshbranham commented Apr 20, 2026

Uh oh!

dustman9000 commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dustman9000 commented Apr 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Note on hack/ changes

Test plan

Summary by CodeRabbit

Uh oh!

openshift-ci-robot commented Apr 18, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Test plan

Uh oh!

openshift-ci Bot commented Apr 18, 2026

Uh oh!

coderabbitai Bot commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

openshift-ci Bot commented Apr 18, 2026

Uh oh!

Uh oh!

joshbranham commented Apr 20, 2026

Uh oh!

dustman9000 commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dustman9000 commented Apr 18, 2026 •

edited by coderabbitai Bot

Loading

openshift-ci-robot commented Apr 18, 2026 •

edited by openshift-ci Bot

Loading

coderabbitai Bot commented Apr 18, 2026 •

edited

Loading