[DRAFT] - SREP-3200: Add PrometheusRule for IAM ELB ingress LoadBalancer degradation #2722
anispate wants to merge 4 commits into
Conversation
…ation

Adds a PrometheusRule (monitoring.rhobs/v1) to the OBO monitoring stack that fires when ROSA HCP ingress has been unable to provision a LoadBalancer for 30+ minutes, indicating likely AWS IAM ELB permission issues. Also adds the companion ManagedFleetNotification CR to deliver a customer-facing notification via OCM Agent with remediation guidance.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
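For orientation, here is a minimal sketch of the rule's shape as it entered review, reconstructed from the fragments quoted in the comments below rather than the verbatim file. The metadata name and the managed_notification_template value are assumptions, and the as-submitted rule also carried a status="True" filter and a LoadBalancerReady clause that later commits removed:

```yaml
apiVersion: monitoring.rhobs/v1
kind: PrometheusRule
metadata:
  name: iam-elb-ingress-loadbalancer-degraded   # assumed from the file name
  namespace: openshift-observability-rhobs
spec:
  groups:
    - name: IAMELBIngressLoadBalancerDegraded
      rules:
        - alert: IngressLoadBalancerProvisioningFailedNotification
          expr: |
            max by (exported_namespace, _id, namespace) (
              cluster_operator_conditions{name="ingress", condition="Degraded"}
            ) == 1
            unless on (exported_namespace) (
              hypershift_cluster_waiting_initial_avaibility_duration_seconds
              or hypershift_cluster_deleting_duration_seconds
            )
          for: 30m
          labels:
            send_managed_notification: "true"
            managed_notification_template: ingress-loadbalancer-provisioning-failed  # assumed value
```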
@anispate: This pull request references SREP-3200, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. Details: in response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration: Repository openshift/coderabbit/.coderabbit.yaml; Review profile: CHILL; Plan: Enterprise
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
Walkthrough: This PR adds a new OpenShift OCM agent …
Changes: Ingress LoadBalancer Provisioning Failure Notification
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 11 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it is a critical failure.
🔧 Microsoft Presidio Analyzer (2.2.362) failed to scan hack/00-osd-managed-cluster-config-production.yaml.tmpl and hack/00-osd-managed-cluster-config-stage.yaml.tmpl.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: anispate
The full list of commands accepted by this bot can be found here.
Details: needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml (1)
15-15: Rename the alert to drop the `Notification` suffix for consistency. The `IngressLoadBalancerProvisioningFailedNotification` alert uses `send_managed_notification: "true"` and `managed_notification_template`, which is the standard routing mechanism across this repo. Alerts with the `Notification` suffix are a small minority of PrometheusRules; the idiomatic convention in this codebase names alerts after the condition (e.g., `IngressLoadBalancerProvisioningFailed`) and relies on these routing labels instead. Renaming would improve consistency and make alerts easier to search and deduplicate in Alertmanager.
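A minimal sketch of the suggested rename, assuming the routing labels stay exactly as they are today (the template value shown is illustrative):

```yaml
- alert: IngressLoadBalancerProvisioningFailed   # was IngressLoadBalancerProvisioningFailedNotification
  labels:
    send_managed_notification: "true"            # routing behavior unchanged
    managed_notification_template: ingress-loadbalancer-provisioning-failed
```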
In `deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml`, line 46: the alert combines `cluster_operator_conditions{condition="Degraded"}` with `for: 30m`, which effectively doubles the wait, because the ingress operator applies its own internal degraded threshold before the condition flips. Either shorten the `for:` to a smaller window (e.g., `for: 5m` or `for: 10m`) if the alert should fire roughly ~30 minutes after LoadBalancer provisioning begins failing, or keep `for: 30m` and update the alert description to state explicitly that it fires after the operator's internal degraded threshold plus an additional ~30 minutes (i.e., ~30+ minutes total).
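A hedged sketch of the first option, shortening the hold so total time-to-fire stays near ~30 minutes (selector details elided):

```yaml
- alert: IngressLoadBalancerProvisioningFailedNotification
  expr: cluster_operator_conditions{name="ingress", condition="Degraded"} == 1
  for: 5m   # the ingress ClusterOperator only flips Degraded after its own internal threshold
```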
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ad26325f-46cf-45f8-a832-456532bef424
📒 Files selected for processing (2)
- deploy/ocm-agent-operator-managedfleetnotifications/ingress-loadbalancer-provisioning-failed.yaml
- deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml
> - VPC or subnet configuration preventing LoadBalancer provisioning
>
> To resolve this issue:
> 1. Review the IAM permissions attached to your cluster's worker node role and ensure ELB permissions are present.
Remediation guidance points at the wrong IAM principal for ROSA HCP.
For ROSA HyperShift, ELB provisioning for the default IngressController is performed by the control-plane ingress operator using its STS/IAM role (e.g., the ingress-operator-managed role assumed via the cluster's OIDC provider), not by the worker node instance role. Advising customers to "review the IAM permissions attached to your cluster's worker node role" is likely to send them down the wrong path and delay remediation. Consider rewording to point at the ingress operator's IAM role / STS trust policy and the account-level ELB service-linked role.
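As a hedged illustration of where to point customers instead, the following AWS CLI calls inspect an ingress operator role; the role name pattern is an assumption based on the ROSA docs cited later in this review, so substitute the actual role for the cluster:

```bash
# Inspect the ingress operator's STS role and its attached policies
# (role name pattern assumed; not the worker node instance role).
aws iam get-role \
  --role-name "<cluster>-openshift-ingress-operator-cloud-credentials"
aws iam list-attached-role-policies \
  --role-name "<cluster>-openshift-ingress-operator-cloud-credentials"
```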
> }
> ) == 1
> unless on (exported_namespace) (
>   hypershift_cluster_waiting_initial_avaibility_duration_seconds
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check how this metric is spelled elsewhere in the repo.
rg -nP 'hypershift_cluster_(waiting_initial_ava[il]+bility|deleting)_duration_seconds'

Repository: openshift/managed-cluster-config
Length of output: 2924
🌐 Web query:
hypershift operator metric cluster waiting initial availability duration seconds name
💡 Result:
The metric name is hypershift_cluster_waiting_initial_availability_duration_seconds. It is emitted by the HyperShift operator's HostedCluster controller. It measures the time in seconds it is taking for the HostedClusterAvailable condition to become true since the creation of the HostedCluster. The metric is undefined if the condition has already become true once or if the cluster no longer exists. This metric was recently fixed for a typo (from "avaibility" to "availability") in February 2026 via openshift/hypershift PR #7730. It is part of the per-HostedCluster metrics exposed by the operator, which can be scraped via Prometheus and included in monitoring dashboards when enabled.
Citations:
- 1: NO-JIRA: bug(metrics): fix typo in `hypershift_cluster_waiting_initial_availability_duration_seconds` metric (hypershift#7730)
- 2: openshift/hypershift@8b3a734...7ce27e1
Fix metric name typo affecting alert suppression: update avaibility to availability.
The metric name hypershift_cluster_waiting_initial_avaibility_duration_seconds contains a typo. The upstream hypershift operator fixed this to hypershift_cluster_waiting_initial_availability_duration_seconds in February 2026. Without this correction, the unless clause will fail to match the metric, breaking alert suppression for newly-provisioned clusters and causing spurious alerts.
Update the metric name in:
- deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml:43
- deploy/sre-prometheus/ocm-agent/obo-monitoring/100-oidc-missing.PrometheusRule.yaml:20-23
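A minimal sketch of the corrected clause, assuming the rest of the expression is unchanged; guarding both spellings (as a later comment on this PR also suggests) keeps suppression working on fleets running either operator build:

```promql
unless on (exported_namespace) (
  hypershift_cluster_waiting_initial_availability_duration_seconds
  or hypershift_cluster_waiting_initial_avaibility_duration_seconds # old spelling, during the transition
  or hypershift_cluster_deleting_duration_seconds
)
```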
…alert

Run generate_template.py to include the new PrometheusRule and ManagedFleetNotification in the hack/00-osd-managed-cluster-config-*.yaml.tmpl templates, as required by the pr-check CI gate.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 9
In `hack/00-osd-managed-cluster-config-production.yaml.tmpl`, around lines 52755-52762: the suppression clause for the hypershift metrics uses the wrong join label. Change `unless on (exported_namespace)` to `unless on (_id)` so the exclusion correctly matches `hypershift_cluster_waiting_initial_avaibility_duration_seconds` and `hypershift_cluster_deleting_duration_seconds`, which expose `_id`, `namespace`, and `name` but not `exported_namespace`. Keep the surrounding aggregation `max by (exported_namespace, _id, namespace)` for `cluster_operator_conditions{name="ingress", condition="Degraded"/"LoadBalancerReady", ...}` intact, and apply `unless on (_id)` to those hypershift metric names exactly as in the 100-nodes-not-joining-nodepool rule.
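A minimal sketch of the corrected join, with the full label selectors elided:

```promql
max by (exported_namespace, _id, namespace) (
  cluster_operator_conditions{name="ingress", condition="Degraded"}
) == 1
unless on (_id) (
  hypershift_cluster_waiting_initial_avaibility_duration_seconds
  or hypershift_cluster_deleting_duration_seconds
)
```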
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 79221509-2d1e-45f7-86a7-fb1e2b219411
📒 Files selected for processing (3)
- hack/00-osd-managed-cluster-config-integration.yaml.tmpl
- hack/00-osd-managed-cluster-config-production.yaml.tmpl
- hack/00-osd-managed-cluster-config-stage.yaml.tmpl
> summary: Action Required - Ingress LoadBalancer cannot be provisioned due to IAM permission issues
> notificationMessage: 'Your ROSA HyperShift cluster''s ingress LoadBalancer
>   has failed to provision for more than 30 minutes. This prevents external
>   traffic from reaching your cluster and is likely caused by missing or
>   insufficient AWS IAM permissions for Elastic Load Balancing (ELB).
>
>   Common causes include:
>   - Missing IAM permissions for ELB management (e.g., elasticloadbalancing:CreateLoadBalancer, elasticloadbalancing:DescribeLoadBalancers)
>   - Service-linked role for ELB not created in your AWS account
>   - VPC or subnet configuration preventing LoadBalancer provisioning
>
>   To resolve this issue:
>   1. Review the IAM permissions attached to your cluster''s worker node role and ensure ELB permissions are present.
>   2. Verify the AWS service-linked role for ELB exists: aws iam get-role --role-name AWSServiceRoleForElasticLoadBalancing
>   3. If the role is missing, create it: aws iam create-service-linked-role --aws-service-name elasticloadbalancing.amazonaws.com
>
>   For detailed troubleshooting steps and required IAM policies, please refer to: https://docs.openshift.com/rosa/rosa_architecture/rosa-sts-about-iam-resources.html
>
>   If you need assistance, please open a support case.'
Customer-facing remediation guidance is inaccurate for ROSA HCP.
Two concerns in the notification message:
- Step 1 instructs customers to review "your cluster's worker node role" for ELB permissions. For ROSA HyperShift (STS), Service LoadBalancers are reconciled by the in-cluster Ingress/cloud-controller components using a dedicated STS role (the ingress-operator role and/or the service-linked role), not the worker node instance role. Pointing customers at the worker node role will likely send them in the wrong direction, and they will not find the relevant ELB permissions there.
- The linked troubleshooting doc (rosa_architecture/rosa-sts-about-iam-resources.html) is ROSA classic STS content. ROSA HCP has its own IAM/STS resource model; consider linking to the HCP-specific IAM documentation instead so customers see the roles that actually apply to their cluster.
Also note this message is duplicated across the integration/stage/production templates via the source manifest deploy/ocm-agent-operator-managedfleetnotifications/ingress-loadbalancer-provisioning-failed.yaml; please update at the source rather than the generated template.
| False\"\n }\n) == 1\nunless on (exported_namespace) (\n hypershift_cluster_waiting_initial_avaibility_duration_seconds\n\ | ||
| \ or hypershift_cluster_deleting_duration_seconds\n)" |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Double-check we're using the same spelling that appears in other places of this repo
# (other rules/dashboards) as a sanity check against prod scrape target.
rg -nP -C1 'hypershift_cluster_waiting_initial_(avaibility|availability)_duration_seconds'

Repository: openshift/managed-cluster-config
Length of output: 4958
🌐 Web query:
Is the Prometheus metric exported by the openshift/hypershift operator named "hypershift_cluster_waiting_initial_avaibility_duration_seconds" (with the avaibility typo), and is there any plan to rename it to "availability"?
💡 Result:
Yes. The HyperShift operator metric was (at least at one point) exported with the misspelling “avaibility”: hypershift_cluster_waiting_initial_avaibility_duration_seconds. A merged HyperShift PR in Feb 2026 explicitly fixes that typo by changing the metric name constant from hypershift_cluster_waiting_initial_avaibility_duration_seconds to hypershift_cluster_waiting_initial_availability_duration_seconds. That PR also notes that consumers will need to adjust alerting. As for a plan to rename it to “availability”: the evidence found indicates the rename happened as a typo fix to “availability” (correct spelling), not that a separate future rename is planned. So, in current code after the Feb 2026 fix, the metric is hypershift_cluster_waiting_initial_availability_duration_seconds, and there is no separate “plan” found beyond that typo correction.
Citations:
- 1: NO-JIRA: bug(metrics): fix typo in `hypershift_cluster_waiting_initial_availability_duration_seconds` metric (hypershift#7730)
- 2: openshift/hypershift@8b3a734...7ce27e1
🏁 Script executed:
# Confirm how many files need updating
rg -l 'hypershift_cluster_waiting_initial_avaibility_duration_seconds' | wc -l

Repository: openshift/managed-cluster-config
Length of output: 78
Update metric name from avaibility to availability across all PrometheusRule and template files.
The HyperShift operator fixed this typo in upstream PR #7730 (merged Feb 2026), renaming the metric from hypershift_cluster_waiting_initial_avaibility_duration_seconds to hypershift_cluster_waiting_initial_availability_duration_seconds. All 5 instances in this repo (PrometheusRule.yaml files and stage/integration/production .yaml.tmpl files) must be updated to the corrected spelling, otherwise the rules will silently fail when scraping from HyperShift clusters with the updated operator.
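One mechanical way to apply this, sketched here assuming ripgrep and GNU sed are available:

```bash
# Replace the misspelled metric name everywhere it appears, then verify
# no residual references remain.
rg -l 'hypershift_cluster_waiting_initial_avaibility_duration_seconds' \
  | xargs sed -i 's/waiting_initial_avaibility_duration_seconds/waiting_initial_availability_duration_seconds/g'
rg -n 'avaibility' || echo "no residual references"
```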
> resendWait: 24
> severity: Warning
> limitedSupport: false
> - apiVersion: ocmagent.managed.openshift.io/v1alpha1
>   kind: ManagedFleetNotification
>   metadata:
>     name: ingress-loadbalancer-provisioning-failed
>     namespace: openshift-ocm-agent-operator
>   spec:
>     fleetNotification:
>       name: ingress-loadbalancer-provisioning-failed
>       summary: Action Required - Ingress LoadBalancer cannot be provisioned due to IAM permission issues
>       notificationMessage: 'Your ROSA HyperShift cluster''s ingress LoadBalancer has failed to provision for more than 30 minutes. …' (same notificationMessage as quoted above)
>       resendWait: 24
>       severity: Warning
>       limitedSupport: false
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Locate the source manifest for the new notification and PrometheusRule so fixes are applied at the right layer.
fd -t f 'ingress-loadbalancer-provisioning-failed' || true
fd -t f 'iam-elb-ingress-loadbalancer-degraded' || true
rg -n 'ingress-loadbalancer-provisioning-failed' -g '!hack/**' -C2
rg -n 'iam-elb-ingress-loadbalancer-degraded' -g '!hack/**' -C2
fd -t f 'generate_template.py' -x cat {}

Repository: openshift/managed-cluster-config
Length of output: 14415
🏁 Script executed:
cat "deploy/ocm-agent-operator-managedfleetnotifications/ingress-loadbalancer-provisioning-failed.yaml"Repository: openshift/managed-cluster-config
Length of output: 1756
🏁 Script executed:
cat "deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml"Repository: openshift/managed-cluster-config
Length of output: 2087
Edit source manifests in deploy/, not the generated template. This file is generated by generate_template.py from source files:
- deploy/ocm-agent-operator-managedfleetnotifications/ingress-loadbalancer-provisioning-failed.yaml
- deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml
ManagedFleetNotification: Step 1 gives misleading guidance for ROSA HyperShift. The notification message targets "ROSA HyperShift cluster" but instructs customers to "Review the IAM permissions attached to your cluster's worker node role." For HyperShift, the ingress operator runs in the control plane (using Red Hat's STS role), not the customer's worker/node role. This step should clarify that customers should check the control plane ingress operator's IAM role instead, or remove the worker node guidance entirely. Additionally, step 2-3 about creating the AWSServiceRoleForElasticLoadBalancing SLR may not be appropriate for ROSA STS clusters (typically Red Hat-managed). Update the remediation steps to align with HyperShift/STS reality, and consider pointing to HyperShift-specific troubleshooting documentation instead of the generic ROSA STS guide.
PrometheusRule PromQL: Verify label matching in the exclusion filter. The unless on (exported_namespace) (hypershift_cluster_waiting_initial_avaibility_duration_seconds or hypershift_cluster_deleting_duration_seconds) construct assumes the hypershift_* metrics carry the exported_namespace label. If they don't, the filter fails silently and the alert would fire during cluster initialization/deletion windows despite the intended safeguard. Confirm that these metrics include exported_namespace in their label set.
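One way to check, sketched as a PromQL query against the RHOBS datasource: the `group` aggregator lists the label combinations actually present on the series, so a missing `exported_namespace` shows up immediately as an empty label on every result:

```promql
group by (_id, namespace, exported_namespace) (
  hypershift_cluster_waiting_initial_avaibility_duration_seconds
  or hypershift_cluster_deleting_duration_seconds
)
```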
> notificationMessage: 'Your ROSA HyperShift cluster''s ingress LoadBalancer has failed to provision for more than 30 minutes. …' (same notificationMessage block as quoted above)
Remediation guidance is misleading for ROSA HCP customers.
This notification is customer-facing and targets ROSA HyperShift, but the troubleshooting steps describe the classic/standalone ROSA model:
- Step 1, "Review the IAM permissions attached to your cluster's worker node role", is incorrect for HCP. Ingress Service `type=LoadBalancer` reconciliation on HCP is performed by the hosted control-plane components (cloud controller / ingress operator) using their own STS role, not by the data-plane workers. Pointing customers at the worker/nodepool role will send them down the wrong path.
- Steps 2-3, instructing customers to run `aws iam get-role AWSServiceRoleForElasticLoadBalancing` and `aws iam create-service-linked-role`, are questionable for STS-based ROSA HCP, where the SLR lifecycle and the operator roles are managed as part of the ROSA install flow. Most customers will not have (or want) the IAM privileges to do this ad hoc.
- The linked doc (rosa_architecture/rosa-sts-about-iam-resources.html) is the classic ROSA STS IAM resources page; consider linking an HCP-specific troubleshooting/IAM page instead.
Because SRE may close the support path based on whether the customer followed these steps, getting this right matters. Please have the ingress/HCP team review the message text before un-drafting the PR.
> - apiVersion: ocmagent.managed.openshift.io/v1alpha1
>   kind: ManagedFleetNotification
>   … (same ManagedFleetNotification manifest as quoted above)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Locate the source ManagedFleetNotification and PrometheusRule manifests this template was generated from.
fd -t f -e yaml -e yml ingress-loadbalancer-provisioning-failed
fd -t f -e yaml -e yml iam-elb-ingress-loadbalancer-degraded
# Fall back to a content search if filenames differ.
rg -n --no-heading 'ingress-loadbalancer-provisioning-failed' -g '!hack/**'
rg -n --no-heading 'iam-elb-ingress-loadbalancer-degraded' -g '!hack/**'
# Confirm the generator and which tree it walks.
fd -t f 'generate_template.py'

Repository: openshift/managed-cluster-config
Length of output: 1130
🏁 Script executed:
# Read the source ManagedFleetNotification to check the notificationMessage content
cat "deploy/ocm-agent-operator-managedfleetnotifications/ingress-loadbalancer-provisioning-failed.yaml"Repository: openshift/managed-cluster-config
Length of output: 1756
🏁 Script executed:
# Read the source PrometheusRule to check the metric name spelling
cat "deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml"Repository: openshift/managed-cluster-config
Length of output: 2087
🌐 Web query:
hypershift metric name hypershift_cluster_waiting_initial_availability
💡 Result:
The HyperShift metric name you asked about is: hypershift_cluster_waiting_initial_availability_duration_seconds. It is defined in HyperShift's hosted cluster metrics code as the "Time in seconds it is taking to get the HostedClusterAvailable condition becoming true since the creation of the HostedCluster." Important: there was an earlier misspelled metric name where "availability" was misspelled as "avaibility/availbility" in the metric string; HyperShift introduced a fix to correct the metric identifier to hypershift_cluster_waiting_initial_availability_duration_seconds, and any dashboards/alerts using the old misspelling need to be updated.
Citations:
- 1: NO-JIRA: bug(metrics): fix typo in `hypershift_cluster_waiting_initial_availability_duration_seconds` metric (hypershift#7730)
- 2: openshift/hypershift@8b3a734...7ce27e1
🌐 Web query:
ROSA STS ELB Ingress IAM permissions documentation redhat
💡 Result:
For ROSA clusters on AWS classic that use AWS STS, the “ROSA STS ELB/Ingress” component is handled by the ROSA Ingress Operator. The specific IAM permission set the operator needs is the cluster-specific operator role named like <cluster_name>--openshift-ingress-operator-cloud-credentials, which is attached to the AWS-managed policy ROSAIngressOperatorPolicy (or the classic equivalent policy/policy JSON generated by ROSA). 1) Which IAM role/policy is used for Ingress/ELB? - Red Hat ROSA (STS, classic) documents that each STS-based ROSA cluster requires cluster-specific Operator IAM roles, including an IAM role required by the “ROSA Ingress Operator” to manage external access to the cluster (the Ingress cloud credentials operator role). - The AWS-managed policy for that operator is ROSAIngressOperatorPolicy, described by AWS as allowing the OpenShift Ingress Operator to provision/manage load balancers and Route 53 DNS configurations for ROSA clusters. 2) Required actions/permissions (what the IAM policy allows) ROSAIngressOperatorPolicy JSON (AWS managed policy) allows: - elasticloadbalancing:DescribeLoadBalancers - route53:ListHostedZones - tag:GetResources - route53:ChangeResourceRecordSets (with a condition restricting normalized record names to specific allowed domain patterns) 3) Where this is documented (official) - Red Hat (STS clusters, classic) IAM resources for STS clusters: reference for the cluster-specific Ingress Operator role and the policy name/role name pattern. - AWS managed policy reference: ROSAIngressOperatorPolicy description and JSON policy document. If you tell me whether you’re using ROSA on AWS classic (STS) or ROSA with hosted control planes (HCP), and whether you mean AWS ALB (Ingress Operator external access) vs a Kubernetes service LoadBalancer, I can point to the exact role/policy name and how it’s created in that specific flow.
Citations:
- 1: https://docs.openshift.com/rosa/rosa_architecture/rosa-sts-about-iam-resources.html
- 2: https://docs.aws.amazon.com/rosa/latest/userguide/security-iam-awsmanpol.html
- 3: https://docs.aws.amazon.com/rosa/latest/userguide/security-iam-rosa-classic-operator-policies.html
Edit the source manifest, not the generated template.
Per the commit message, this file is produced by generate_template.py from source manifests under deploy/ and resources/. Any fixes must be made in the source YAML so they survive re-generation; the generated *.yaml.tmpl files should only ever be touched by the generator.
Critical: Fix metric name spelling in the PrometheusRule source.
In deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml, the metric is spelled hypershift_cluster_waiting_initial_avaibility_duration_seconds (misspelled). HyperShift has since corrected this to hypershift_cluster_waiting_initial_availability_duration_seconds. Using the misspelled metric name means the unless clause silently fails to match, and the alert will incorrectly fire during initial cluster provisioning and deletion, exactly the windows this rule intends to exclude.
Fix inaccurate IAM remediation guidance in ManagedFleetNotification source.
In deploy/ocm-agent-operator-managedfleetnotifications/ingress-loadbalancer-provisioning-failed.yaml, step 1 of the remediation states: "Review the IAM permissions attached to your cluster's worker node role and ensure ELB permissions are present." For ROSA STS clusters, the Ingress Operator (not worker nodes) manages ELB provisioning via the cluster-specific Ingress Operator IAM role (<cluster>-openshift-ingress-operator-cloud-credentials), which requires the AWS-managed policy ROSAIngressOperatorPolicy. The current guidance will misdirect customers to the wrong IAM role.
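A hedged sketch of what a corrected step 1 could look like in the source manifest; the wording is illustrative, not final customer-facing text:

```yaml
notificationMessage: |
  ...
  To resolve this issue:
  1. Review the IAM permissions attached to your cluster's Ingress Operator role
     (<cluster>-openshift-ingress-operator-cloud-credentials) and confirm the
     AWS-managed ROSAIngressOperatorPolicy (or an equivalent policy) is attached.
  ...
```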
> notificationMessage: 'Your ROSA HyperShift cluster''s ingress LoadBalancer has failed to provision for more than 30 minutes. …' (same notificationMessage block as quoted above)
Remediation guidance is inaccurate for ROSA HCP/STS and risks sending customers down the wrong path.
Two issues in notificationMessage:
- Step 1 tells customers to "Review the IAM permissions attached to your cluster's worker node role and ensure ELB permissions are present." On ROSA HCP with STS, the default IngressController's AWS LoadBalancer is provisioned via the Ingress Operator's IAM role, not the worker-node instance role. Pointing customers at the worker role will waste time and won't fix the actual issue. Please direct them to the Ingress Operator role (e.g. `*-openshift-ingress-operator-cloud-credentials` or the STS role ARN configured for ingress) and the policy containing `elasticloadbalancing:*` actions.
- Steps 2-3 describe the AWS-managed `AWSServiceRoleForElasticLoadBalancing` SLR, which is a separate concern from the STS role permissions in step 1 and is usually already present. Consider either dropping these steps or clearly framing them as a secondary check.
Also, this is an SOP-style message that will be shown to customers verbatim; a product/docs reviewer should sign off on the final wording before merge.
> For detailed troubleshooting steps and required IAM policies, please refer
> to: https://docs.openshift.com/rosa/rosa_architecture/rosa-sts-about-iam-resources.html
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check that the customer-facing docs URL is live and not a redirect/404.
curl -sSIL -o /dev/null -w 'final_url=%{url_effective}\nhttp_code=%{http_code}\n' \
  'https://docs.openshift.com/rosa/rosa_architecture/rosa-sts-about-iam-resources.html'

Repository: openshift/managed-cluster-config
Length of output: 252
Update the documentation URL to the canonical docs.redhat.com location.
The provided URL https://docs.openshift.com/rosa/rosa_architecture/rosa-sts-about-iam-resources.html is currently live (HTTP 200), but it redirects to the canonical ROSA documentation on docs.redhat.com. Since this URL is customer-facing, update it directly to the target: https://docs.redhat.com/en/documentation/red_hat_openshift_service_on_aws_classic_architecture/4/html/introduction_to_rosa/rosa-sts-about-iam-resources
This avoids relying on a redirect that may be removed as the migration completes.
| False\"\n }\n) == 1\nunless on (exported_namespace) (\n hypershift_cluster_waiting_initial_avaibility_duration_seconds\n\ | ||
| \ or hypershift_cluster_deleting_duration_seconds\n)" |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# 1. See how this metric is spelled elsewhere in the repo (other alerts, recording rules, tests).
rg -nP '\bhypershift_cluster_waiting_initial_(avaibility|availability)_duration_seconds\b'
# 2. Check if a parallel PR/commit in this repo already renamed it.
rg -nP --no-heading 'SREP-3883|Resolve typo.*hypershift_cluster_waiting_initial'

Repository: openshift/managed-cluster-config
Length of output: 2154
🌐 Web query:
In openshift/hypershift, what is the current exported metric name for the "waiting for initial availability" duration — is it hypershift_cluster_waiting_initial_avaibility_duration_seconds or hypershift_cluster_waiting_initial_availability_duration_seconds? Has a rename landed?
💡 Result:
The current exported metric name for the "waiting for initial availability" duration in OpenShift/HyperShift is hypershift_cluster_waiting_initial_availability_duration_seconds (with two 'i's in "availability"). A rename has landed to fix a previous typo from hypershift_cluster_waiting_initial_avaibility_duration_seconds (one 'i').
Citations:
- 1: NO-JIRA: bug(metrics): fix typo in `hypershift_cluster_waiting_initial_availability_duration_seconds` metric (hypershift#7730)
- 2: openshift/hypershift@8b3a734...7ce27e1
- 3: https://pkg.go.dev/github.com/openshift/hypershift@v0.1.71
🏁 Script executed:
#!/bin/bash
# Search for any commits/references to SREP-3883 in this repo
git log --all --oneline --grep="SREP-3883" | head -20
# Also check for any recent changes to the metric names
git log -p --all -- '*osd-managed-cluster-config*.yaml.tmpl' | \
grep -A 5 -B 5 'hypershift_cluster_waiting_initial_avaibility\|hypershift_cluster_waiting_initial_availability' | \
head -100
# Check if there's any reference to HyperShift version pinning
rg -nP 'hypershift.*version|version.*hypershift' hack/ deploy/ | head -20

Repository: openshift/managed-cluster-config
Length of output: 58
🌐 Web query:
HyperShift PR #7730 - when did it merge and what version was it released in?
💡 Result:
No evidence found of HyperShift PR #7730 existing or being merged. Searches across GitHub, including direct URL attempts, pull request lists, and targeted queries for mentions of "#7730" or "PR #7730" in the openshift/hypershift repository returned no matching results. PRs around that number (e.g., #7711 merged 2026-03-07, #7774 merged 2026-03-27) exist, but #7730 specifically does not appear in search results or recent activity. It may have been closed without merging, deleted, or the number misremembered. HyperShift does not follow strict semantic versioning aligned with OCP releases; versions are tagged periodically from main for Red Hat services, with supported OCP versions up to 4.17.
Citations:
- 1: CNTRLPLANE-2551: Hypershift integration for API Auth Config changes hypershift#7774
- 2: openshift/hypershift@806fc3b
- 3: https://github.com/openshift/hypershift/pulls
- 4: OCPBUGS-76645: Make generated container names k8s compliant hypershift#7711
- 5: https://github.com/openshift/hypershift/tree/main
🏁 Script executed:
#!/bin/bash
# Search for both metric name spellings in HyperShift repo context
# First, let's try to fetch and search the HyperShift repository
# Look for references to the metric in code or comments
# Check if we have the hypershift repo available or can search online
# Let's search for commits mentioning the typo
git ls-remote https://github.com/openshift/hypershift.git | head -5
# Try to get recent commits that might mention this metric
curl -s 'https://api.github.com/repos/openshift/hypershift/commits?per_page=30&grep=avaibility\|availability' | \
jq -r '.[] | "\(.commit.message | split("\n")[0]) - \(.commit.author.date)"' 2>/dev/null | head -10
# Also search for issues/PRs mentioning this specific typo
curl -s 'https://api.github.com/search/issues?q=repo:openshift/hypershift+avaibility' | \
  jq -r '.items[] | "\(.title) (\(.number))"' 2>/dev/null | head -5

Repository: openshift/managed-cluster-config
Length of output: 1574
🌐 Web query:
openshift/hypershift metric name typo fix availability avaibility - which spelling is currently emitted?
💡 Result:
Currently emitted metric spelling: hypershift_cluster_waiting_initial_availability_duration_seconds (correct spelling with “availability”). Previously, HyperShift emitted a typo: hypershift_cluster_waiting_initial_avaibility_duration_seconds / hypershift_cluster_waiting_initial_availbility_duration_seconds (the PR fixes the typo and notes consumers must adjust alerting/queries).
Update metric name to account for HyperShift typo fix — critical correctness issue.
HyperShift has merged a fix renaming hypershift_cluster_waiting_initial_avaibility_duration_seconds (misspelled) to hypershift_cluster_waiting_initial_availability_duration_seconds (correct spelling). This repository has not been updated. If the fleet is running the newer HyperShift version, the old metric name no longer exists, causing the unless on (exported_namespace) ( ... ) filter to silently evaluate empty. The alert will then fire during normal cluster provisioning and deletion — the exact windows this filter is meant to suppress — and page customers unnecessarily.
To safely handle the transition, guard against both metric spellings:
Suggested fix:

  unless on (exported_namespace) (
    hypershift_cluster_waiting_initial_avaibility_duration_seconds
+   or hypershift_cluster_waiting_initial_availability_duration_seconds
    or hypershift_cluster_deleting_duration_seconds
  )
Verify each finding against the current code and only fix it if needed.
In `@hack/00-osd-managed-cluster-config-stage.yaml.tmpl` around lines 52761 -
52762, The metric name in the unless filter is misspelled; update the expression
inside the current unless on (exported_namespace) (...) to check for both the
old misspelled metric
hypershift_cluster_waiting_initial_avaibility_duration_seconds and the corrected
hypershift_cluster_waiting_initial_availability_duration_seconds (alongside the
existing hypershift_cluster_deleting_duration_seconds) so the filter suppresses
alerts during provisioning/deletion regardless of which metric name is present;
modify the predicate used by the unless (the part containing
hypershift_cluster_waiting_initial_avaibility_duration_seconds or
hypershift_cluster_deleting_duration_seconds) to include an OR for the corrected
metric name.
> namespace: openshift-observability-rhobs
> spec:
>   groups:
>   - name: IAMELBIngressLoadBalancerDegraded
These should be defined in the rhobs configuration repo and/or dynatrace-config, not in MCC.
- Remove status="True" filter (not a label on cluster_operator_conditions)
- Remove LoadBalancerReady AND clause (an IngressController condition, not a ClusterOperator condition; never present in cluster_operator_conditions)
- Fix unless join: exported_namespace → _id (the shared label between cluster_operator_conditions and hypershift_cluster_* metrics)
- Fix max by: remove exported_namespace (not a label on this metric)
- Update description annotation to reflect simplified detection logic

Validated against production RHOBS: the query returns 24 real ingress-degraded clusters, and the unless clause correctly excludes clusters in startup/deletion windows. Regenerate SelectorSyncSet templates.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
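A hedged reconstruction of the simplified expression after these fixes, based only on the bullet list above; the metric-spelling correction discussed earlier in the review would apply on top of this, and this is not the verbatim file contents:

```promql
max by (_id, namespace) (
  cluster_operator_conditions{name="ingress", condition="Degraded"}
) == 1
unless on (_id) (
  hypershift_cluster_waiting_initial_avaibility_duration_seconds
  or hypershift_cluster_deleting_duration_seconds
)
```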
The monitoring.rhobs/v1 PrometheusRule belongs in the rhobs configuration repo (dynatrace-config), not in MCC. The ManagedFleetNotification CR remains here as it is deployed via SelectorSyncSet to the cluster. Regenerate SelectorSyncSet templates. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@anispate: all tests passed! Full PR test history. Your PR dashboard.
Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Adds a PrometheusRule (monitoring.rhobs/v1) to the OBO monitoring stack that fires when ROSA HCP ingress has been unable to provision a LoadBalancer for 30+ minutes, indicating likely AWS IAM ELB permission issues.
Also adds the companion ManagedFleetNotification CR to deliver a customer-facing notification via OCM Agent with remediation guidance.
What type of PR is this?
(bug/feature/cleanup/documentation)
What this PR does / why we need it?
Which Jira/Github issue(s) this PR fixes?
Fixes #
Special notes for your reviewer:
Pre-checks (if applicable):
Tested latest changes against a cluster
Included documentation changes with PR
If this is a new object that is not intended for the FedRAMP environment (if unsure, please reach out to team FedRAMP), please exclude it with:
Summary by CodeRabbit