[DRAFT] - SREP-3200: Add PrometheusRule for IAM ELB ingress LoadBalancer degradation #2722
anispate wants to merge 4 commits into
Conversation
…ation

Adds a PrometheusRule (monitoring.rhobs/v1) to the OBO monitoring stack that fires when ROSA HCP ingress has been unable to provision a LoadBalancer for 30+ minutes, indicating likely AWS IAM ELB permission issues. Also adds the companion ManagedFleetNotification CR to deliver a customer-facing notification via OCM Agent with remediation guidance.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
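For orientation, here is a minimal sketch of the rule's shape as it entered review, reconstructed from the fragments quoted in the comments below rather than the verbatim file. The metadata name and the managed_notification_template value are assumptions, and the as-submitted rule also carried a status="True" filter and a LoadBalancerReady clause that later commits removed:

```yaml
apiVersion: monitoring.rhobs/v1
kind: PrometheusRule
metadata:
  name: iam-elb-ingress-loadbalancer-degraded   # assumed from the file name
  namespace: openshift-observability-rhobs
spec:
  groups:
    - name: IAMELBIngressLoadBalancerDegraded
      rules:
        - alert: IngressLoadBalancerProvisioningFailedNotification
          expr: |
            max by (exported_namespace, _id, namespace) (
              cluster_operator_conditions{name="ingress", condition="Degraded"}
            ) == 1
            unless on (exported_namespace) (
              hypershift_cluster_waiting_initial_avaibility_duration_seconds
              or hypershift_cluster_deleting_duration_seconds
            )
          for: 30m
          labels:
            send_managed_notification: "true"
            managed_notification_template: ingress-loadbalancer-provisioning-failed  # assumed value
```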
@anispate: This pull request references SREP-3200, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. Details: in response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration: Repository openshift/coderabbit/.coderabbit.yaml; Review profile: CHILL; Plan: Enterprise
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
Walkthrough: This PR adds a new OpenShift OCM agent …
Changes: Ingress LoadBalancer Provisioning Failure Notification
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 11 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it is a critical failure.
🔧 Microsoft Presidio Analyzer (2.2.362) failed to scan hack/00-osd-managed-cluster-config-production.yaml.tmpl and hack/00-osd-managed-cluster-config-stage.yaml.tmpl.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: anispate
The full list of commands accepted by this bot can be found here.
Details: needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml (1)
15-15: Rename the alert to drop the `Notification` suffix for consistency. The `IngressLoadBalancerProvisioningFailedNotification` alert uses `send_managed_notification: "true"` and `managed_notification_template`, which is the standard routing mechanism across this repo. Alerts with the `Notification` suffix are a small minority of PrometheusRules; the idiomatic convention in this codebase names alerts after the condition (e.g., `IngressLoadBalancerProvisioningFailed`) and relies on these routing labels instead. Renaming would improve consistency and make alerts easier to search and deduplicate in Alertmanager.
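A minimal sketch of the suggested rename, assuming the routing labels stay exactly as they are today (the template value shown is illustrative):

```yaml
- alert: IngressLoadBalancerProvisioningFailed   # was IngressLoadBalancerProvisioningFailedNotification
  labels:
    send_managed_notification: "true"            # routing behavior unchanged
    managed_notification_template: ingress-loadbalancer-provisioning-failed
```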
In `deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml`, line 46: the alert combines `cluster_operator_conditions{condition="Degraded"}` with `for: 30m`, which effectively doubles the wait, because the ingress operator applies its own internal degraded threshold before the condition flips. Either shorten the `for:` to a smaller window (e.g., `for: 5m` or `for: 10m`) if the alert should fire roughly ~30 minutes after LoadBalancer provisioning begins failing, or keep `for: 30m` and update the alert description to state explicitly that it fires after the operator's internal degraded threshold plus an additional ~30 minutes (i.e., ~30+ minutes total).
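A hedged sketch of the first option, shortening the hold so total time-to-fire stays near ~30 minutes (selector details elided):

```yaml
- alert: IngressLoadBalancerProvisioningFailedNotification
  expr: cluster_operator_conditions{name="ingress", condition="Degraded"} == 1
  for: 5m   # the ingress ClusterOperator only flips Degraded after its own internal threshold
```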
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ad26325f-46cf-45f8-a832-456532bef424
📒 Files selected for processing (2)
- deploy/ocm-agent-operator-managedfleetnotifications/ingress-loadbalancer-provisioning-failed.yaml
- deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml
> - VPC or subnet configuration preventing LoadBalancer provisioning
>
> To resolve this issue:
> 1. Review the IAM permissions attached to your cluster's worker node role and ensure ELB permissions are present.
Remediation guidance points at the wrong IAM principal for ROSA HCP.
For ROSA HyperShift, ELB provisioning for the default IngressController is performed by the control-plane ingress operator using its STS/IAM role (e.g., the ingress-operator-managed role assumed via the cluster's OIDC provider), not by the worker node instance role. Advising customers to "review the IAM permissions attached to your cluster's worker node role" is likely to send them down the wrong path and delay remediation. Consider rewording to point at the ingress operator's IAM role / STS trust policy and the account-level ELB service-linked role.
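As a hedged illustration of where to point customers instead, the following AWS CLI calls inspect an ingress operator role; the role name pattern is an assumption based on the ROSA docs cited later in this review, so substitute the actual role for the cluster:

```bash
# Inspect the ingress operator's STS role and its attached policies
# (role name pattern assumed; not the worker node instance role).
aws iam get-role \
  --role-name "<cluster>-openshift-ingress-operator-cloud-credentials"
aws iam list-attached-role-policies \
  --role-name "<cluster>-openshift-ingress-operator-cloud-credentials"
```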
> }
> ) == 1
> unless on (exported_namespace) (
>   hypershift_cluster_waiting_initial_avaibility_duration_seconds
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check how this metric is spelled elsewhere in the repo.
rg -nP 'hypershift_cluster_(waiting_initial_ava[il]+bility|deleting)_duration_seconds'

Repository: openshift/managed-cluster-config
Length of output: 2924
🌐 Web query:
hypershift operator metric cluster waiting initial availability duration seconds name
💡 Result:
The metric name is hypershift_cluster_waiting_initial_availability_duration_seconds. It is emitted by the HyperShift operator's HostedCluster controller. It measures the time in seconds it is taking for the HostedClusterAvailable condition to become true since the creation of the HostedCluster. The metric is undefined if the condition has already become true once or if the cluster no longer exists. This metric was recently fixed for a typo (from "avaibility" to "availability") in February 2026 via openshift/hypershift PR #7730. It is part of the per-HostedCluster metrics exposed by the operator, which can be scraped via Prometheus and included in monitoring dashboards when enabled.
Citations:
- 1: NO-JIRA: bug(metrics): fix typo in `hypershift_cluster_waiting_initial_availability_duration_seconds` metric (hypershift#7730)
- 2: openshift/hypershift@8b3a734...7ce27e1
Fix metric name typo affecting alert suppression: update avaibility to availability.
The metric name hypershift_cluster_waiting_initial_avaibility_duration_seconds contains a typo. The upstream hypershift operator fixed this to hypershift_cluster_waiting_initial_availability_duration_seconds in February 2026. Without this correction, the unless clause will fail to match the metric, breaking alert suppression for newly-provisioned clusters and causing spurious alerts.
Update the metric name in:
- deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml:43
- deploy/sre-prometheus/ocm-agent/obo-monitoring/100-oidc-missing.PrometheusRule.yaml:20-23
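A minimal sketch of the corrected clause, assuming the rest of the expression is unchanged; guarding both spellings (as a later comment on this PR also suggests) keeps suppression working on fleets running either operator build:

```promql
unless on (exported_namespace) (
  hypershift_cluster_waiting_initial_availability_duration_seconds
  or hypershift_cluster_waiting_initial_avaibility_duration_seconds # old spelling, during the transition
  or hypershift_cluster_deleting_duration_seconds
)
```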
…alert

Run generate_template.py to include the new PrometheusRule and ManagedFleetNotification in the hack/00-osd-managed-cluster-config-*.yaml.tmpl templates, as required by the pr-check CI gate.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 9
In `hack/00-osd-managed-cluster-config-production.yaml.tmpl`, around lines 52755-52762: the suppression clause for the hypershift metrics uses the wrong join label. Change `unless on (exported_namespace)` to `unless on (_id)` so the exclusion correctly matches `hypershift_cluster_waiting_initial_avaibility_duration_seconds` and `hypershift_cluster_deleting_duration_seconds`, which expose `_id`, `namespace`, and `name` but not `exported_namespace`. Keep the surrounding aggregation `max by (exported_namespace, _id, namespace)` for `cluster_operator_conditions{name="ingress", condition="Degraded"/"LoadBalancerReady", ...}` intact, and apply `unless on (_id)` to those hypershift metric names exactly as in the 100-nodes-not-joining-nodepool rule.
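A minimal sketch of the corrected join, with the full label selectors elided:

```promql
max by (exported_namespace, _id, namespace) (
  cluster_operator_conditions{name="ingress", condition="Degraded"}
) == 1
unless on (_id) (
  hypershift_cluster_waiting_initial_avaibility_duration_seconds
  or hypershift_cluster_deleting_duration_seconds
)
```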
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 79221509-2d1e-45f7-86a7-fb1e2b219411
📒 Files selected for processing (3)
- hack/00-osd-managed-cluster-config-integration.yaml.tmpl
- hack/00-osd-managed-cluster-config-production.yaml.tmpl
- hack/00-osd-managed-cluster-config-stage.yaml.tmpl
> summary: Action Required - Ingress LoadBalancer cannot be provisioned due to IAM permission issues
> notificationMessage: 'Your ROSA HyperShift cluster''s ingress LoadBalancer
>   has failed to provision for more than 30 minutes. This prevents external
>   traffic from reaching your cluster and is likely caused by missing or
>   insufficient AWS IAM permissions for Elastic Load Balancing (ELB).
>
>   Common causes include:
>   - Missing IAM permissions for ELB management (e.g., elasticloadbalancing:CreateLoadBalancer, elasticloadbalancing:DescribeLoadBalancers)
>   - Service-linked role for ELB not created in your AWS account
>   - VPC or subnet configuration preventing LoadBalancer provisioning
>
>   To resolve this issue:
>   1. Review the IAM permissions attached to your cluster''s worker node role and ensure ELB permissions are present.
>   2. Verify the AWS service-linked role for ELB exists: aws iam get-role --role-name AWSServiceRoleForElasticLoadBalancing
>   3. If the role is missing, create it: aws iam create-service-linked-role --aws-service-name elasticloadbalancing.amazonaws.com
>
>   For detailed troubleshooting steps and required IAM policies, please refer to: https://docs.openshift.com/rosa/rosa_architecture/rosa-sts-about-iam-resources.html
>
>   If you need assistance, please open a support case.'
Customer-facing remediation guidance is inaccurate for ROSA HCP.
Two concerns in the notification message:
- Step 1 instructs customers to review "your cluster's worker node role" for ELB permissions. For ROSA HyperShift (STS), Service LoadBalancers are reconciled by the in-cluster Ingress/cloud-controller components using a dedicated STS role (the ingress-operator role and/or the service-linked role), not the worker node instance role. Pointing customers at the worker node role will likely send them in the wrong direction, and they will not find the relevant ELB permissions there.
- The linked troubleshooting doc (rosa_architecture/rosa-sts-about-iam-resources.html) is ROSA classic STS content. ROSA HCP has its own IAM/STS resource model; consider linking to the HCP-specific IAM documentation instead so customers see the roles that actually apply to their cluster.
Also note this message is duplicated across the integration/stage/production templates via the source manifest deploy/ocm-agent-operator-managedfleetnotifications/ingress-loadbalancer-provisioning-failed.yaml; please update at the source rather than the generated template.
| False\"\n }\n) == 1\nunless on (exported_namespace) (\n hypershift_cluster_waiting_initial_avaibility_duration_seconds\n\ | ||
| \ or hypershift_cluster_deleting_duration_seconds\n)" |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Double-check we're using the same spelling that appears in other places of this repo
# (other rules/dashboards) as a sanity check against prod scrape target.
rg -nP -C1 'hypershift_cluster_waiting_initial_(avaibility|availability)_duration_seconds'

Repository: openshift/managed-cluster-config
Length of output: 4958
🌐 Web query:
Is the Prometheus metric exported by the openshift/hypershift operator named "hypershift_cluster_waiting_initial_avaibility_duration_seconds" (with the avaibility typo), and is there any plan to rename it to "availability"?
💡 Result:
Yes. The HyperShift operator metric was (at least at one point) exported with the misspelling “avaibility”: hypershift_cluster_waiting_initial_avaibility_duration_seconds. A merged HyperShift PR in Feb 2026 explicitly fixes that typo by changing the metric name constant from hypershift_cluster_waiting_initial_avaibility_duration_seconds to hypershift_cluster_waiting_initial_availability_duration_seconds. That PR also notes that consumers will need to adjust alerting. As for a plan to rename it to “availability”: the evidence found indicates the rename happened as a typo fix to “availability” (correct spelling), not that a separate future rename is planned. So, in current code after the Feb 2026 fix, the metric is hypershift_cluster_waiting_initial_availability_duration_seconds, and there is no separate “plan” found beyond that typo correction.
Citations:
- 1: NO-JIRA: bug(metrics): fix typo in `hypershift_cluster_waiting_initial_availability_duration_seconds` metric (hypershift#7730)
- 2: openshift/hypershift@8b3a734...7ce27e1
🏁 Script executed:
# Confirm how many files need updating
rg -l 'hypershift_cluster_waiting_initial_avaibility_duration_seconds' | wc -l

Repository: openshift/managed-cluster-config
Length of output: 78
Update metric name from avaibility to availability across all PrometheusRule and template files.
The HyperShift operator fixed this typo in upstream PR #7730 (merged Feb 2026), renaming the metric from hypershift_cluster_waiting_initial_avaibility_duration_seconds to hypershift_cluster_waiting_initial_availability_duration_seconds. All 5 instances in this repo (PrometheusRule.yaml files and stage/integration/production .yaml.tmpl files) must be updated to the corrected spelling, otherwise the rules will silently fail when scraping from HyperShift clusters with the updated operator.
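One mechanical way to apply this, sketched here assuming ripgrep and GNU sed are available:

```bash
# Replace the misspelled metric name everywhere it appears, then verify
# no residual references remain.
rg -l 'hypershift_cluster_waiting_initial_avaibility_duration_seconds' \
  | xargs sed -i 's/waiting_initial_avaibility_duration_seconds/waiting_initial_availability_duration_seconds/g'
rg -n 'avaibility' || echo "no residual references"
```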
> resendWait: 24
> severity: Warning
> limitedSupport: false
> - apiVersion: ocmagent.managed.openshift.io/v1alpha1
>   kind: ManagedFleetNotification
>   metadata:
>     name: ingress-loadbalancer-provisioning-failed
>     namespace: openshift-ocm-agent-operator
>   spec:
>     fleetNotification:
>       name: ingress-loadbalancer-provisioning-failed
>       summary: Action Required - Ingress LoadBalancer cannot be provisioned due to IAM permission issues
>       notificationMessage: 'Your ROSA HyperShift cluster''s ingress LoadBalancer has failed to provision for more than 30 minutes. …' (same notificationMessage as quoted above)
>       resendWait: 24
>       severity: Warning
>       limitedSupport: false
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Locate the source manifest for the new notification and PrometheusRule so fixes are applied at the right layer.
fd -t f 'ingress-loadbalancer-provisioning-failed' || true
fd -t f 'iam-elb-ingress-loadbalancer-degraded' || true
rg -n 'ingress-loadbalancer-provisioning-failed' -g '!hack/**' -C2
rg -n 'iam-elb-ingress-loadbalancer-degraded' -g '!hack/**' -C2
fd -t f 'generate_template.py' -x cat {}

Repository: openshift/managed-cluster-config
Length of output: 14415
🏁 Script executed:
cat "deploy/ocm-agent-operator-managedfleetnotifications/ingress-loadbalancer-provisioning-failed.yaml"Repository: openshift/managed-cluster-config
Length of output: 1756
🏁 Script executed:
cat "deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml"Repository: openshift/managed-cluster-config
Length of output: 2087
Edit source manifests in deploy/, not the generated template. This file is generated by generate_template.py from source files:
- deploy/ocm-agent-operator-managedfleetnotifications/ingress-loadbalancer-provisioning-failed.yaml
- deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml
ManagedFleetNotification: Step 1 gives misleading guidance for ROSA HyperShift. The notification message targets "ROSA HyperShift cluster" but instructs customers to "Review the IAM permissions attached to your cluster's worker node role." For HyperShift, the ingress operator runs in the control plane (using Red Hat's STS role), not the customer's worker/node role. This step should clarify that customers should check the control plane ingress operator's IAM role instead, or remove the worker node guidance entirely. Additionally, step 2-3 about creating the AWSServiceRoleForElasticLoadBalancing SLR may not be appropriate for ROSA STS clusters (typically Red Hat-managed). Update the remediation steps to align with HyperShift/STS reality, and consider pointing to HyperShift-specific troubleshooting documentation instead of the generic ROSA STS guide.
PrometheusRule PromQL: Verify label matching in the exclusion filter. The unless on (exported_namespace) (hypershift_cluster_waiting_initial_avaibility_duration_seconds or hypershift_cluster_deleting_duration_seconds) construct assumes the hypershift_* metrics carry the exported_namespace label. If they don't, the filter fails silently and the alert would fire during cluster initialization/deletion windows despite the intended safeguard. Confirm that these metrics include exported_namespace in their label set.
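One way to check, sketched as a PromQL query against the RHOBS datasource: the `group` aggregator lists the label combinations actually present on the series, so a missing `exported_namespace` shows up immediately as an empty label on every result:

```promql
group by (_id, namespace, exported_namespace) (
  hypershift_cluster_waiting_initial_avaibility_duration_seconds
  or hypershift_cluster_deleting_duration_seconds
)
```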
> notificationMessage: 'Your ROSA HyperShift cluster''s ingress LoadBalancer has failed to provision for more than 30 minutes. …' (same notificationMessage block as quoted above)
Remediation guidance is misleading for ROSA HCP customers.
This notification is customer-facing and targets ROSA HyperShift, but the troubleshooting steps describe the classic/standalone ROSA model:
- Step 1, "Review the IAM permissions attached to your cluster's worker node role", is incorrect for HCP. Ingress Service `type=LoadBalancer` reconciliation on HCP is performed by the hosted control-plane components (cloud controller / ingress operator) using their own STS role, not by the data-plane workers. Pointing customers at the worker/nodepool role will send them down the wrong path.
- Steps 2-3, instructing customers to run `aws iam get-role AWSServiceRoleForElasticLoadBalancing` and `aws iam create-service-linked-role`, are questionable for STS-based ROSA HCP, where the SLR lifecycle and the operator roles are managed as part of the ROSA install flow. Most customers will not have (or want) the IAM privileges to do this ad hoc.
- The linked doc (rosa_architecture/rosa-sts-about-iam-resources.html) is the classic ROSA STS IAM resources page; consider linking an HCP-specific troubleshooting/IAM page instead.
Because SRE may close the support path based on whether the customer followed these steps, getting this right matters. Please have the ingress/HCP team review the message text before un-drafting the PR.
> - apiVersion: ocmagent.managed.openshift.io/v1alpha1
>   kind: ManagedFleetNotification
>   … (same ManagedFleetNotification manifest as quoted above)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Locate the source ManagedFleetNotification and PrometheusRule manifests this template was generated from.
fd -t f -e yaml -e yml ingress-loadbalancer-provisioning-failed
fd -t f -e yaml -e yml iam-elb-ingress-loadbalancer-degraded
# Fall back to a content search if filenames differ.
rg -n --no-heading 'ingress-loadbalancer-provisioning-failed' -g '!hack/**'
rg -n --no-heading 'iam-elb-ingress-loadbalancer-degraded' -g '!hack/**'
# Confirm the generator and which tree it walks.
fd -t f 'generate_template.py'

Repository: openshift/managed-cluster-config
Length of output: 1130
🏁 Script executed:
# Read the source ManagedFleetNotification to check the notificationMessage content
cat "deploy/ocm-agent-operator-managedfleetnotifications/ingress-loadbalancer-provisioning-failed.yaml"Repository: openshift/managed-cluster-config
Length of output: 1756
🏁 Script executed:
# Read the source PrometheusRule to check the metric name spelling
cat "deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml"Repository: openshift/managed-cluster-config
Length of output: 2087
🌐 Web query:
hypershift metric name hypershift_cluster_waiting_initial_availability
💡 Result:
The HyperShift metric name you asked about is: hypershift_cluster_waiting_initial_availability_duration_seconds. It is defined in HyperShift's hosted cluster metrics code as the "Time in seconds it is taking to get the HostedClusterAvailable condition becoming true since the creation of the HostedCluster." Important: there was an earlier misspelled metric name where "availability" was misspelled as "avaibility/availbility" in the metric string; HyperShift introduced a fix to correct the metric identifier to hypershift_cluster_waiting_initial_availability_duration_seconds, and any dashboards/alerts using the old misspelling need to be updated.
Citations:
- 1: NO-JIRA: bug(metrics): fix typo in `hypershift_cluster_waiting_initial_availability_duration_seconds` metric (hypershift#7730)
- 2: openshift/hypershift@8b3a734...7ce27e1
🌐 Web query:
ROSA STS ELB Ingress IAM permissions documentation redhat
💡 Result:
For ROSA clusters on AWS classic that use AWS STS, the “ROSA STS ELB/Ingress” component is handled by the ROSA Ingress Operator. The specific IAM permission set the operator needs is the cluster-specific operator role named like <cluster_name>--openshift-ingress-operator-cloud-credentials, which is attached to the AWS-managed policy ROSAIngressOperatorPolicy (or the classic equivalent policy/policy JSON generated by ROSA). 1) Which IAM role/policy is used for Ingress/ELB? - Red Hat ROSA (STS, classic) documents that each STS-based ROSA cluster requires cluster-specific Operator IAM roles, including an IAM role required by the “ROSA Ingress Operator” to manage external access to the cluster (the Ingress cloud credentials operator role). - The AWS-managed policy for that operator is ROSAIngressOperatorPolicy, described by AWS as allowing the OpenShift Ingress Operator to provision/manage load balancers and Route 53 DNS configurations for ROSA clusters. 2) Required actions/permissions (what the IAM policy allows) ROSAIngressOperatorPolicy JSON (AWS managed policy) allows: - elasticloadbalancing:DescribeLoadBalancers - route53:ListHostedZones - tag:GetResources - route53:ChangeResourceRecordSets (with a condition restricting normalized record names to specific allowed domain patterns) 3) Where this is documented (official) - Red Hat (STS clusters, classic) IAM resources for STS clusters: reference for the cluster-specific Ingress Operator role and the policy name/role name pattern. - AWS managed policy reference: ROSAIngressOperatorPolicy description and JSON policy document. If you tell me whether you’re using ROSA on AWS classic (STS) or ROSA with hosted control planes (HCP), and whether you mean AWS ALB (Ingress Operator external access) vs a Kubernetes service LoadBalancer, I can point to the exact role/policy name and how it’s created in that specific flow.
Citations:
- 1: https://docs.openshift.com/rosa/rosa_architecture/rosa-sts-about-iam-resources.html
- 2: https://docs.aws.amazon.com/rosa/latest/userguide/security-iam-awsmanpol.html
- 3: https://docs.aws.amazon.com/rosa/latest/userguide/security-iam-rosa-classic-operator-policies.html
Edit the source manifest, not the generated template.
Per the commit message, this file is produced by generate_template.py from source manifests under deploy/ and resources/. Any fixes must be made in the source YAML so they survive re-generation; the generated *.yaml.tmpl files should only ever be touched by the generator.
Critical: Fix metric name spelling in the PrometheusRule source.
In deploy/sre-prometheus/ocm-agent/obo-monitoring/100-iam-elb-ingress-loadbalancer-degraded.PrometheusRule.yaml, the metric is spelled hypershift_cluster_waiting_initial_avaibility_duration_seconds (misspelled). HyperShift has since corrected this to hypershift_cluster_waiting_initial_availability_duration_seconds. Using the misspelled metric name means the unless clause silently fails to match, and the alert will incorrectly fire during initial cluster provisioning and deletion, exactly the windows this rule intends to exclude.
Fix inaccurate IAM remediation guidance in ManagedFleetNotification source.
In deploy/ocm-agent-operator-managedfleetnotifications/ingress-loadbalancer-provisioning-failed.yaml, step 1 of the remediation states: "Review the IAM permissions attached to your cluster's worker node role and ensure ELB permissions are present." For ROSA STS clusters, the Ingress Operator (not worker nodes) manages ELB provisioning via the cluster-specific Ingress Operator IAM role (<cluster>-openshift-ingress-operator-cloud-credentials), which requires the AWS-managed policy ROSAIngressOperatorPolicy. The current guidance will misdirect customers to the wrong IAM role.
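A hedged sketch of what a corrected step 1 could look like in the source manifest; the wording is illustrative, not final customer-facing text:

```yaml
notificationMessage: |
  ...
  To resolve this issue:
  1. Review the IAM permissions attached to your cluster's Ingress Operator role
     (<cluster>-openshift-ingress-operator-cloud-credentials) and confirm the
     AWS-managed ROSAIngressOperatorPolicy (or an equivalent policy) is attached.
  ...
```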
> notificationMessage: 'Your ROSA HyperShift cluster''s ingress LoadBalancer has failed to provision for more than 30 minutes. …' (same notificationMessage block as quoted above)
Remediation guidance is inaccurate for ROSA HCP/STS and risks sending customers down the wrong path.
Two issues in notificationMessage:
- Step 1 tells customers to "Review the IAM permissions attached to your cluster's worker node role and ensure ELB permissions are present." On ROSA HCP with STS, the default IngressController's AWS LoadBalancer is provisioned via the Ingress Operator's IAM role, not the worker-node instance role. Pointing customers at the worker role will waste time and won't fix the actual issue. Please direct them to the Ingress Operator role (e.g. `*-openshift-ingress-operator-cloud-credentials` or the STS role ARN configured for ingress) and the policy containing `elasticloadbalancing:*` actions.
- Steps 2-3 describe the AWS-managed `AWSServiceRoleForElasticLoadBalancing` SLR, which is a separate concern from the STS role permissions in step 1 and is usually already present. Consider either dropping these steps or clearly framing them as a secondary check.
Also, this is an SOP-style message that will be shown to customers verbatim; a product/docs reviewer should sign off on the final wording before merge.
> For detailed troubleshooting steps and required IAM policies, please refer
> to: https://docs.openshift.com/rosa/rosa_architecture/rosa-sts-about-iam-resources.html
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check that the customer-facing docs URL is live and not a redirect/404.
curl -sSIL -o /dev/null -w 'final_url=%{url_effective}\nhttp_code=%{http_code}\n' \
  'https://docs.openshift.com/rosa/rosa_architecture/rosa-sts-about-iam-resources.html'

Repository: openshift/managed-cluster-config
Length of output: 252
Update the documentation URL to the canonical docs.redhat.com location.
The provided URL https://docs.openshift.com/rosa/rosa_architecture/rosa-sts-about-iam-resources.html is currently live (HTTP 200), but it redirects to the canonical ROSA documentation on docs.redhat.com. Since this URL is customer-facing, update it directly to the target: https://docs.redhat.com/en/documentation/red_hat_openshift_service_on_aws_classic_architecture/4/html/introduction_to_rosa/rosa-sts-about-iam-resources
This avoids relying on a redirect that may be removed as the migration completes.
| False\"\n }\n) == 1\nunless on (exported_namespace) (\n hypershift_cluster_waiting_initial_avaibility_duration_seconds\n\ | ||
| \ or hypershift_cluster_deleting_duration_seconds\n)" |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# 1. See how this metric is spelled elsewhere in the repo (other alerts, recording rules, tests).
rg -nP '\bhypershift_cluster_waiting_initial_(avaibility|availability)_duration_seconds\b'
# 2. Check if a parallel PR/commit in this repo already renamed it.
rg -nP --no-heading 'SREP-3883|Resolve typo.*hypershift_cluster_waiting_initial'

Repository: openshift/managed-cluster-config
Length of output: 2154
🌐 Web query:
In openshift/hypershift, what is the current exported metric name for the "waiting for initial availability" duration — is it hypershift_cluster_waiting_initial_avaibility_duration_seconds or hypershift_cluster_waiting_initial_availability_duration_seconds? Has a rename landed?
💡 Result:
The current exported metric name for the "waiting for initial availability" duration in OpenShift/HyperShift is hypershift_cluster_waiting_initial_availability_duration_seconds (with two 'i's in "availability"). A rename has landed to fix a previous typo from hypershift_cluster_waiting_initial_avaibility_duration_seconds (one 'i').
Citations:
- 1: NO-JIRA: bug(metrics): fix typo in `hypershift_cluster_waiting_initial_availability_duration_seconds` metric (hypershift#7730)
- 2: openshift/hypershift@8b3a734...7ce27e1
- 3: https://pkg.go.dev/github.com/openshift/hypershift@v0.1.71
🏁 Script executed:
#!/bin/bash
# Search for any commits/references to SREP-3883 in this repo
git log --all --oneline --grep="SREP-3883" | head -20
# Also check for any recent changes to the metric names
git log -p --all -- '*osd-managed-cluster-config*.yaml.tmpl' | \
grep -A 5 -B 5 'hypershift_cluster_waiting_initial_avaibility\|hypershift_cluster_waiting_initial_availability' | \
head -100
# Check if there's any reference to HyperShift version pinning
rg -nP 'hypershift.*version|version.*hypershift' hack/ deploy/ | head -20

Repository: openshift/managed-cluster-config
Length of output: 58
🌐 Web query:
HyperShift PR #7730 - when did it merge and what version was it released in?
💡 Result:
No evidence found of HyperShift PR #7730 existing or being merged. Searches across GitHub, including direct URL attempts, pull request lists, and targeted queries for mentions of "#7730" or "PR #7730" in the openshift/hypershift repository returned no matching results. PRs around that number (e.g., #7711 merged 2026-03-07, #7774 merged 2026-03-27) exist, but #7730 specifically does not appear in search results or recent activity. It may have been closed without merging, deleted, or the number misremembered. HyperShift does not follow strict semantic versioning aligned with OCP releases; versions are tagged periodically from main for Red Hat services, with supported OCP versions up to 4.17.
Citations:
- 1: CNTRLPLANE-2551: Hypershift integration for API Auth Config changes hypershift#7774
- 2: openshift/hypershift@806fc3b
- 3: https://github.com/openshift/hypershift/pulls
- 4: OCPBUGS-76645: Make generated container names k8s compliant hypershift#7711
- 5: https://github.com/openshift/hypershift/tree/main
🏁 Script executed:
#!/bin/bash
# Search for both metric name spellings in HyperShift repo context
# First, let's try to fetch and search the HyperShift repository
# Look for references to the metric in code or comments
# Check if we have the hypershift repo available or can search online
# Let's search for commits mentioning the typo
git ls-remote https://github.com/openshift/hypershift.git | head -5
# Try to get recent commits that might mention this metric
curl -s 'https://api.github.com/repos/openshift/hypershift/commits?per_page=30&grep=avaibility\|availability' | \
jq -r '.[] | "\(.commit.message | split("\n")[0]) - \(.commit.author.date)"' 2>/dev/null | head -10
# Also search for issues/PRs mentioning this specific typo
curl -s 'https://api.github.com/search/issues?q=repo:openshift/hypershift+avaibility' | \
  jq -r '.items[] | "\(.title) (\(.number))"' 2>/dev/null | head -5

Repository: openshift/managed-cluster-config
Length of output: 1574
🌐 Web query:
openshift/hypershift metric name typo fix availability avaibility - which spelling is currently emitted?
💡 Result:
Currently emitted metric spelling: hypershift_cluster_waiting_initial_availability_duration_seconds (correct spelling with “availability”). Previously, HyperShift emitted a typo: hypershift_cluster_waiting_initial_avaibility_duration_seconds / hypershift_cluster_waiting_initial_availbility_duration_seconds (the PR fixes the typo and notes consumers must adjust alerting/queries).
Update metric name to account for HyperShift typo fix — critical correctness issue.
HyperShift has merged a fix renaming hypershift_cluster_waiting_initial_avaibility_duration_seconds (misspelled) to hypershift_cluster_waiting_initial_availability_duration_seconds (correct spelling). This repository has not been updated. If the fleet is running the newer HyperShift version, the old metric name no longer exists, causing the unless on (exported_namespace) ( ... ) filter to silently evaluate empty. The alert will then fire during normal cluster provisioning and deletion — the exact windows this filter is meant to suppress — and page customers unnecessarily.
To safely handle the transition, guard against both metric spellings:
Suggested fix:

  unless on (exported_namespace) (
    hypershift_cluster_waiting_initial_avaibility_duration_seconds
+   or hypershift_cluster_waiting_initial_availability_duration_seconds
    or hypershift_cluster_deleting_duration_seconds
  )
Verify each finding against the current code and only fix it if needed.
In `@hack/00-osd-managed-cluster-config-stage.yaml.tmpl` around lines 52761 -
52762, The metric name in the unless filter is misspelled; update the expression
inside the current unless on (exported_namespace) (...) to check for both the
old misspelled metric
hypershift_cluster_waiting_initial_avaibility_duration_seconds and the corrected
hypershift_cluster_waiting_initial_availability_duration_seconds (alongside the
existing hypershift_cluster_deleting_duration_seconds) so the filter suppresses
alerts during provisioning/deletion regardless of which metric name is present;
modify the predicate used by the unless (the part containing
hypershift_cluster_waiting_initial_avaibility_duration_seconds or
hypershift_cluster_deleting_duration_seconds) to include an OR for the corrected
metric name.
> namespace: openshift-observability-rhobs
> spec:
>   groups:
>   - name: IAMELBIngressLoadBalancerDegraded
These should be defined in the rhobs configuration repo and/or dynatrace-config, not in MCC.
- Remove status="True" filter (not a label on cluster_operator_conditions)
- Remove LoadBalancerReady AND clause (an IngressController condition, not a ClusterOperator condition; never present in cluster_operator_conditions)
- Fix unless join: exported_namespace → _id (the shared label between cluster_operator_conditions and hypershift_cluster_* metrics)
- Fix max by: remove exported_namespace (not a label on this metric)
- Update description annotation to reflect simplified detection logic

Validated against production RHOBS: the query returns 24 real ingress-degraded clusters, and the unless clause correctly excludes clusters in startup/deletion windows. Regenerate SelectorSyncSet templates.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
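A hedged reconstruction of the simplified expression after these fixes, based only on the bullet list above; the metric-spelling correction discussed earlier in the review would apply on top of this, and this is not the verbatim file contents:

```promql
max by (_id, namespace) (
  cluster_operator_conditions{name="ingress", condition="Degraded"}
) == 1
unless on (_id) (
  hypershift_cluster_waiting_initial_avaibility_duration_seconds
  or hypershift_cluster_deleting_duration_seconds
)
```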
The monitoring.rhobs/v1 PrometheusRule belongs in the rhobs configuration repo (dynatrace-config), not in MCC. The ManagedFleetNotification CR remains here as it is deployed via SelectorSyncSet to the cluster. Regenerate SelectorSyncSet templates. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@anispate: all tests passed! Full PR test history. Your PR dashboard.
Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Adds a PrometheusRule (monitoring.rhobs/v1) to the OBO monitoring stack that fires when ROSA HCP ingress has been unable to provision a LoadBalancer for 30+ minutes, indicating likely AWS IAM ELB permission issues.
Also adds the companion ManagedFleetNotification CR to deliver a customer-facing notification via OCM Agent with remediation guidance.
What type of PR is this?
(bug/feature/cleanup/documentation)
What this PR does / why we need it?
Which Jira/Github issue(s) this PR fixes?
Fixes #
Special notes for your reviewer:
Pre-checks (if applicable):
Tested latest changes against a cluster
Included documentation changes with PR
If this is a new object that is not intended for the FedRAMP environment (if unsure, please reach out to team FedRAMP), please exclude it with:
Summary by CodeRabbit