-
Notifications
You must be signed in to change notification settings - Fork 11
Hyperfleet-557 : Document Sentinel Reliability and Observability #73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
tirthct
wants to merge
3
commits into
openshift-hyperfleet:main
Choose a base branch
from
tirthct:hyperfleet-557
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,288 @@ | ||
| # HyperFleet Sentinel Alerts | ||
|
|
||
| **Status**: Active | ||
| **Owner**: HyperFleet Team | ||
| **Last Updated**: 2026-03-12 | ||
|
|
||
| > **Audience:** Developers and SREs setting up monitoring for HyperFleet Sentinel. | ||
|
|
||
| --- | ||
|
|
||
| ## Purpose | ||
|
|
||
| This document provides ready alert rules that integrate with Prometheus to ensure reliable monitoring and incident response. | ||
|
|
||
| --- | ||
|
|
||
| ## Alert Rules Reference | ||
|
|
||
| The following 8 alert rules provide comprehensive monitoring for production Sentinel deployments. | ||
|
|
||
| ### Critical Alerts | ||
|
|
||
| #### SentinelDown | ||
| ```yaml | ||
| alert: SentinelDown | ||
| expr: absent(up{service="sentinel"}) or up{service="sentinel"} == 0 | ||
| for: 5m | ||
| labels: | ||
| severity: critical | ||
| component: sentinel | ||
| annotations: | ||
| summary: "Sentinel service is down" | ||
| description: "Sentinel metrics endpoint is not responding. Service may be down or unreachable." | ||
| ``` | ||
| **Impact**: Resource reconciliation stopped completely. | ||
|
|
||
| **Response**: Check pod status, logs, and resource constraints. | ||
|
|
||
| #### SentinelAPIErrorRateHigh | ||
| ```yaml | ||
| alert: SentinelAPIErrorRateHigh | ||
| expr: rate(hyperfleet_sentinel_api_errors_total[5m]) > 0.1 | ||
| for: 5m | ||
| labels: | ||
| severity: critical | ||
| component: sentinel | ||
| annotations: | ||
| summary: "High API error rate in Sentinel" | ||
| description: "Sentinel is experiencing {{ $value }} API errors/sec for resource_type {{ $labels.resource_type }}. Check HyperFleet API availability." | ||
| ``` | ||
| **Impact**: Unable to fetch resource status, reconciliation decisions based on stale data. | ||
|
|
||
| **Response**: Check HyperFleet API service health and network connectivity. | ||
|
|
||
| #### SentinelBrokerErrorRateHigh | ||
| ```yaml | ||
| alert: SentinelBrokerErrorRateHigh | ||
| expr: rate(hyperfleet_sentinel_broker_errors_total[5m]) > 0.05 | ||
| for: 5m | ||
| labels: | ||
| severity: critical | ||
| component: sentinel | ||
| annotations: | ||
| summary: "High broker error rate in Sentinel" | ||
| description: "Sentinel is experiencing {{ $value }} broker errors/sec for resource_type {{ $labels.resource_type }}. Check message broker connectivity." | ||
| ``` | ||
| **Impact**: Events not reaching adapters, reconciliation loops broken. | ||
|
|
||
| **Response**: Check message broker health and Sentinel broker configuration. | ||
|
|
||
| #### SentinelPollStale | ||
| ```yaml | ||
| alert: SentinelPollStale | ||
| expr: | | ||
| hyperfleet_sentinel_last_successful_poll_timestamp_seconds > 0 | ||
| and time() - hyperfleet_sentinel_last_successful_poll_timestamp_seconds > 60 | ||
| for: 1m | ||
| labels: | ||
| severity: critical | ||
| component: sentinel | ||
| annotations: | ||
| summary: "Sentinel poll loop is stale" | ||
| description: "Sentinel has not completed a successful poll cycle in over 60 seconds. The service may be hung or unable to poll." | ||
| ``` | ||
| **Impact**: Complete polling failure, no reconciliation events generated. | ||
|
|
||
| **Response**: Check Sentinel logs and restart if necessary. | ||
|
|
||
| ### Warning Alerts | ||
|
|
||
| #### SentinelSlowPolling | ||
| ```yaml | ||
| alert: SentinelSlowPolling | ||
| expr: histogram_quantile(0.95, rate(hyperfleet_sentinel_poll_duration_seconds_bucket[5m])) > 5 | ||
| for: 10m | ||
| labels: | ||
| severity: warning | ||
| component: sentinel | ||
| annotations: | ||
| summary: "Sentinel polling cycles are slow" | ||
| description: "95th percentile poll duration is {{ $value }}s for {{ $labels.resource_type }}. This may indicate API latency or processing issues." | ||
| ``` | ||
| **Impact**: Delayed reconciliation, potentially missing max age intervals. | ||
|
|
||
| **Response**: Check resource count growth and API performance. | ||
|
|
||
| #### SentinelNoEventsPublished | ||
| ```yaml | ||
| alert: SentinelNoEventsPublished | ||
| expr: | | ||
| hyperfleet_sentinel_pending_resources > 0 | ||
| unless on(resource_type, resource_selector) | ||
| rate(hyperfleet_sentinel_events_published_total[15m]) > 0 | ||
| for: 15m | ||
| labels: | ||
| severity: warning | ||
| component: sentinel | ||
| annotations: | ||
| summary: "Sentinel not publishing events" | ||
| description: "Sentinel has pending resources but hasn't published any events in 15 minutes. Service may be stuck." | ||
| ``` | ||
| **Impact**: Resources may be stuck without reconciliation events. | ||
|
|
||
| **Response**: Check decision engine logic and adapter status updates. | ||
|
|
||
| #### SentinelHighPendingResources | ||
| ```yaml | ||
| alert: SentinelHighPendingResources | ||
| expr: sum(hyperfleet_sentinel_pending_resources) > 100 | ||
| for: 10m | ||
| labels: | ||
| severity: warning | ||
| component: sentinel | ||
| annotations: | ||
| summary: "High number of pending resources in Sentinel" | ||
| description: "{{ $value }} resources are pending reconciliation for more than 10 minutes. This may indicate processing bottleneck or API issues." | ||
| ``` | ||
| **Impact**: May indicate capacity issues or API problems. | ||
|
|
||
| **Response**: Check resource count growth and consider horizontal scaling. | ||
|
|
||
| ### Informational Alerts | ||
|
|
||
| #### SentinelHighSkipRatio | ||
| ```yaml | ||
| alert: SentinelHighSkipRatio | ||
| expr: | | ||
| ( | ||
| rate(hyperfleet_sentinel_resources_skipped_total[10m]) / | ||
| (rate(hyperfleet_sentinel_resources_skipped_total[10m]) + | ||
| rate(hyperfleet_sentinel_events_published_total[10m])) | ||
| ) > 0.95 | ||
| for: 30m | ||
| labels: | ||
| severity: info | ||
| component: sentinel | ||
| annotations: | ||
| summary: "High resource skip ratio in Sentinel" | ||
| description: "{{ $value | humanizePercentage }} of resources are being skipped. This may indicate max_age configuration issues." | ||
| ``` | ||
| **Impact**: May indicate max age intervals too long or adapter status update issues. | ||
|
|
||
| **Response**: Review max age configuration and adapter health. | ||
|
|
||
| --- | ||
|
|
||
| ### Alerting with Google Cloud Managed Prometheus | ||
|
|
||
| **Note**: Google Cloud Managed Prometheus uses Google Cloud Alerting for alert management, not PrometheusRule CRDs. The alerting rules below are provided as PromQL expressions that you can configure in Google Cloud Console → Monitoring → Alerting. | ||
|
|
||
| For Prometheus Operator compatibility, the Helm chart can optionally deploy a PrometheusRule resource when `monitoring.prometheusRule.enabled=true` (disabled by default for GMP). The following alert examples can be configured in either system: | ||
|
|
||
| ```yaml | ||
| apiVersion: monitoring.coreos.com/v1 | ||
| kind: PrometheusRule | ||
| metadata: | ||
| name: sentinel-alerts | ||
| namespace: hyperfleet-system | ||
| spec: | ||
| groups: | ||
| - name: sentinel.rules | ||
| interval: 30s | ||
| rules: | ||
| - alert: SentinelHighPendingResources | ||
| expr: | | ||
| sum(hyperfleet_sentinel_pending_resources) > 100 | ||
| for: 10m | ||
| labels: | ||
| severity: warning | ||
rafabene marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| component: sentinel | ||
| annotations: | ||
| summary: "High number of pending resources" | ||
| description: "{{ $value }} resources are pending reconciliation for more than 10 minutes." | ||
|
|
||
| - alert: SentinelAPIErrorRateHigh | ||
| expr: | | ||
| rate(hyperfleet_sentinel_api_errors_total[5m]) > 0.1 | ||
| for: 5m | ||
| labels: | ||
| severity: critical | ||
| component: sentinel | ||
| annotations: | ||
| summary: "High API error rate detected" | ||
| description: "API error rate is {{ $value | humanize }} errors/sec for resource_type {{ $labels.resource_type }}." | ||
|
|
||
| - alert: SentinelBrokerErrorRateHigh | ||
| expr: | | ||
| rate(hyperfleet_sentinel_broker_errors_total[5m]) > 0.05 | ||
| for: 5m | ||
| labels: | ||
| severity: critical | ||
| component: sentinel | ||
| annotations: | ||
| summary: "High broker error rate detected" | ||
| description: "Broker error rate is {{ $value | humanize }} errors/sec for resource_type {{ $labels.resource_type }}." | ||
|
|
||
| - alert: SentinelSlowPolling | ||
| expr: | | ||
| histogram_quantile(0.95, | ||
| rate(hyperfleet_sentinel_poll_duration_seconds_bucket[5m])) > 5 | ||
| for: 10m | ||
| labels: | ||
| severity: warning | ||
| component: sentinel | ||
| annotations: | ||
| summary: "Polling cycles are slow" | ||
| description: "95th percentile poll duration is {{ $value | humanize }}s for {{ $labels.resource_type }}." | ||
|
|
||
| - alert: SentinelNoEventsPublished | ||
| expr: | | ||
| hyperfleet_sentinel_pending_resources > 0 | ||
| unless on(resource_type, resource_selector) | ||
| rate(hyperfleet_sentinel_events_published_total[15m]) > 0s | ||
| for: 15m | ||
| labels: | ||
| severity: warning | ||
| component: sentinel | ||
| annotations: | ||
| summary: "No events published despite pending resources" | ||
| description: "Sentinel has pending resources but hasn't published events in 15 minutes." | ||
|
|
||
| - alert: SentinelPollStale | ||
| expr: | | ||
| hyperfleet_sentinel_last_successful_poll_timestamp_seconds > 0 | ||
| and time() - hyperfleet_sentinel_last_successful_poll_timestamp_seconds > 60 | ||
| for: 1m | ||
| labels: | ||
| severity: critical | ||
| component: sentinel | ||
| annotations: | ||
| summary: "Sentinel poll loop is stale" | ||
| description: "Sentinel has not completed a successful poll cycle in over 60 seconds." | ||
| - alert: SentinelDown | ||
| expr: absent(up{service="sentinel"}) or up{service="sentinel"} == 0 | ||
| for: 5m | ||
| labels: | ||
| severity: critical | ||
| component: sentinel | ||
| annotations: | ||
| summary: "Sentinel service is down" | ||
| description: "Sentinel metrics endpoint is not responding. Service may be down or unreachable." | ||
| - alert: SentinelHighSkipRatio | ||
| expr: | | ||
| ( | ||
| rate(hyperfleet_sentinel_resources_skipped_total[10m]) / | ||
| (rate(hyperfleet_sentinel_resources_skipped_total[10m]) + | ||
| rate(hyperfleet_sentinel_events_published_total[10m])) | ||
| ) > 0.95 | ||
| for: 30m | ||
| labels: | ||
| severity: info | ||
| component: sentinel | ||
| annotations: | ||
| summary: "High resource skip ratio in Sentinel" | ||
| description: "{{ $value | humanizePercentage }} of resources are being skipped. This may indicate max_age configuration issues." | ||
| ``` | ||
|
|
||
| To configure these alerts in **Google Cloud Console**: | ||
| 1. Navigate to **Monitoring → Alerting** | ||
| 2. Click **Create Policy** | ||
| 3. Use the PromQL expressions above in the condition configuration | ||
| 4. Configure notification channels and documentation | ||
|
|
||
| For Prometheus Operator users, enable PrometheusRule in values.yaml and verify: | ||
|
|
||
| ```bash | ||
| kubectl get prometheusrule -n hyperfleet-system | ||
| ``` | ||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc says "8 alert rules" but the PrometheusRule manifest only includes 6. Missing:
SentinelDown and SentinelHighSkipRatio. An operator deploying from this manifest will have
incomplete coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this snippet from the document served as "alert examples" (mentioned above) as we have alerts yaml defined under
charts/templates/prometheusrule.yamlbut yes, this can be confusing if someone tries to use the same manifest. Added the remaining rules here