SREP-4577: Add keep_firing_for to MHC short circuit alert to reduce flapping#2709
SREP-4577: Add keep_firing_for to MHC short circuit alert to reduce flapping#2709dustman9000 wants to merge 1 commit into
Conversation
|
@dustman9000: This pull request references SREP-4577 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dustman9000 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughAdded Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 10✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
hack/00-osd-managed-cluster-config-integration.yaml.tmpl (1)
40987-41001:⚠️ Potential issue | 🟠 MajorRestore the
gate-wifguard on the 4.21 selector.Line 40991 drops
api.openshift.com/gate-wif, so thisosd-cluster-acks-wif-4.21syncset now matches every WIF 4.20 cluster and appliesupgradeable-to: v4.21unconditionally. That is a broader rollout than the adjacent WIF selectors, which are still gate-controlled.Suggested fix
- key: hive.openshift.io/version-major-minor operator: In values: - '4.20' + - key: api.openshift.com/gate-wif + operator: In + values: + - '4.21' - key: api.openshift.com/wif operator: In values: - 'true'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hack/00-osd-managed-cluster-config-integration.yaml.tmpl` around lines 40987 - 41001, The osd-cluster-acks-wif-4.21 SyncSet selector lost the gate check and now matches all WIF 4.20 clusters; restore the missing gate by adding the api.openshift.com/gate-wif selector entry (operator: In, values: ['true']) alongside api.openshift.com/wif under the same selector block so the SyncSet only targets clusters that have gate-wif=true; update the selector that precedes resourceApplyMode/Patches in the osd-cluster-acks-wif-4.21 manifest to include the gate-wif key.hack/00-osd-managed-cluster-config-production.yaml.tmpl (1)
49797-49898:⚠️ Potential issue | 🔴 CriticalRemove or version-gate
keep_firing_forfield – unsupported on OCP 4.12 and earlier.The
keep_firing_forfield added at line 49898 is not supported in OpenShift 4.12 and earlier versions. Since this template references older OCP minors (4.8, 4.9), this PrometheusRule will fail validation on clusters running those versions. Either confirm the manifest targets only OCP 4.13+ clusters, or removekeep_firing_forand add version-specific gating if the field is required for newer versions.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hack/00-osd-managed-cluster-config-production.yaml.tmpl` around lines 49797 - 49898, The PrometheusRule for MachineHealthCheckUnterminatedShortCircuitSRE contains the unsupported keep_firing_for field; remove the keep_firing_for: 1h line from the PrometheusRule spec (the rule block under kind: PrometheusRule with name sre-machine-health-check-unterminated-short-circuit-alert and alert: MachineHealthCheckUnterminatedShortCircuitSRE) or wrap that field in a version gate so it only renders for OCP >= 4.13 (e.g., conditional templating around keep_firing_for tied to cluster minor version); ensure the resulting manifest validates on OCP 4.12 and earlier by either omitting the field or only emitting it for 4.13+.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hack/00-osd-managed-cluster-config-integration.yaml.tmpl`:
- Around line 49897-49898: The PrometheusRule
deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml
uses the field keep_firing_for: 1h which is incompatible with OCP 4.12; either
gate this field to OCP 4.13+ or remove it: implement a version-specific variant
using the template's config.yaml selectors (add a 4.13+ variant that includes
keep_firing_for and keep the existing rule for older versions), or simply delete
the keep_firing_for line from the rule so it remains compatible with 4.12;
update the template for hack/00-osd-managed-cluster-config-integration.yaml.tmpl
accordingly and ensure the rule name and selectors in the PrometheusRule
manifest
(100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml) are
adjusted to reflect the versioned variants.
---
Outside diff comments:
In `@hack/00-osd-managed-cluster-config-integration.yaml.tmpl`:
- Around line 40987-41001: The osd-cluster-acks-wif-4.21 SyncSet selector lost
the gate check and now matches all WIF 4.20 clusters; restore the missing gate
by adding the api.openshift.com/gate-wif selector entry (operator: In, values:
['true']) alongside api.openshift.com/wif under the same selector block so the
SyncSet only targets clusters that have gate-wif=true; update the selector that
precedes resourceApplyMode/Patches in the osd-cluster-acks-wif-4.21 manifest to
include the gate-wif key.
In `@hack/00-osd-managed-cluster-config-production.yaml.tmpl`:
- Around line 49797-49898: The PrometheusRule for
MachineHealthCheckUnterminatedShortCircuitSRE contains the unsupported
keep_firing_for field; remove the keep_firing_for: 1h line from the
PrometheusRule spec (the rule block under kind: PrometheusRule with name
sre-machine-health-check-unterminated-short-circuit-alert and alert:
MachineHealthCheckUnterminatedShortCircuitSRE) or wrap that field in a version
gate so it only renders for OCP >= 4.13 (e.g., conditional templating around
keep_firing_for tied to cluster minor version); ensure the resulting manifest
validates on OCP 4.12 and earlier by either omitting the field or only emitting
it for 4.13+.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: b622c142-16c9-4a79-a108-04ae02d7c8fc
📒 Files selected for processing (4)
deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yamlhack/00-osd-managed-cluster-config-integration.yaml.tmplhack/00-osd-managed-cluster-config-production.yaml.tmplhack/00-osd-managed-cluster-config-stage.yaml.tmpl
The MachineHealthCheckUnterminatedShortCircuitSRE alert was generating 40+ PagerDuty incidents in 29 hours on a single cluster (dapa-c1b) due to machines cycling between unhealthy/healthy states every ~45 minutes. Each cycle briefly cleared the alert before it re-fired, creating a new PD incident each time. Adding keep_firing_for: 1h keeps the alert firing through brief recovery windows (up to 1 hour), collapsing the flap into a single PD incident while preserving the original 30m detection sensitivity.
9bdc085 to
08fa188
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hack/00-osd-managed-cluster-config-stage.yaml.tmpl`:
- Line 40819: Update the PR description to explicitly document the
SelectorSyncSet version mapping changes included in this commit (e.g., the STS
entries like "osd-cluster-acks-sts-4.22" and WIF changes for 4.18–4.21) and
state whether these mapping updates are intentional dependencies required for
the keep_firing_for: 1h change or unrelated refactors that should be split into
a separate PR; mention the established convention that sync set version N
targets clusters on version N-1 and list the affected environment templates
(stage, production, integration) so reviewers can verify the scope and intent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 1944c992-4e8b-4d90-b2cf-2474d8bfdaa1
📒 Files selected for processing (4)
deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yamlhack/00-osd-managed-cluster-config-integration.yaml.tmplhack/00-osd-managed-cluster-config-production.yaml.tmplhack/00-osd-managed-cluster-config-stage.yaml.tmpl
🚧 Files skipped from review as they are similar to previous changes (2)
- deploy/sre-prometheus/100-machine-health-check-unterminated-short-circuit.PrometheusRule.yaml
- hack/00-osd-managed-cluster-config-production.yaml.tmpl
|
@dustman9000: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Fixing the template issue in #2711 standalone since I also have to open a PR and can't merge without it |
|
/rebase |
Summary
Problem
Observed 40+ PagerDuty incidents in 29 hours on dapa-c1b.zl8c.p1 due to machines cycling between unhealthy/healthy states every ~45 minutes. Each cycle briefly cleared the alert (5 min recovery) before it re-fired, creating a new PD incident each time.
Solution
keep_firing_for: 1h means once the alert fires (after the 30m for: threshold), it stays firing through recovery windows up to 1 hour. This collapses the flap into a single PD incident while preserving detection sensitivity for new occurrences.
Note on hack/ changes
The hack/ template diffs include re-ordering of osd-cluster-acks-* SelectorSyncSet resources. This is a pre-existing make idempotency issue from the recent 4.22 credential policies merge (#2707), which did not regenerate hack/ templates. Running make on a clean upstream/master checkout produces the same re-ordering. The only actual change in this PR is the keep_firing_for: 1h addition.
Jira: https://redhat.atlassian.net/browse/SREP-4577
Test plan
Summary by CodeRabbit