From 7b937d043f4724c3c0337421b2dfd6b6d31e47d3 Mon Sep 17 00:00:00 2001 From: tithakka Date: Thu, 12 Mar 2026 13:22:51 -0500 Subject: [PATCH 1/3] Hyperfleet-542 : HYPERFLEET-557: Document Sentinel Reliability and Observability --- docs/alerts.md | 258 +++++++++++++++++++++++++++++ docs/metrics.md | 225 +------------------------ docs/multi-instance-deployment.md | 149 +++++++++++++++++ docs/runbook.md | 266 ++++++++++++++++++++++++++++++ docs/running-sentinel.md | 81 +++++++++ docs/sentinel-operator-guide.md | 4 + 6 files changed, 764 insertions(+), 219 deletions(-) create mode 100644 docs/alerts.md create mode 100644 docs/runbook.md diff --git a/docs/alerts.md b/docs/alerts.md new file mode 100644 index 0000000..a992a19 --- /dev/null +++ b/docs/alerts.md @@ -0,0 +1,258 @@ +# HyperFleet Sentinel Alerts + +**Status**: Active +**Owner**: HyperFleet Team +**Last Updated**: 2026-03-12 + +> **Audience:** Developers and SREs setting up monitoring for HyperFleet Sentinel. + +--- + +## Purpose + +This document provides ready alert rules that integrate with Prometheus to ensure reliable monitoring and incident response. + +--- + +## Alert Rules Reference + +The following 8 alert rules provide comprehensive monitoring for production Sentinel deployments. + +### Critical Alerts + +#### SentinelDown +```yaml +alert: SentinelDown +expr: absent(up{service="sentinel"}) or up{service="sentinel"} == 0 +for: 5m +labels: + severity: critical + component: sentinel +annotations: + summary: "Sentinel service is down" + description: "Sentinel metrics endpoint is not responding. Service may be down or unreachable." +``` +**Impact**: Resource reconciliation stopped completely. + +**Response**: Check pod status, logs, and resource constraints. 
+ +#### SentinelAPIErrorRateHigh +```yaml +alert: SentinelAPIErrorRateHigh +expr: rate(hyperfleet_sentinel_api_errors_total[5m]) > 0.1 +for: 5m +labels: + severity: critical + component: sentinel +annotations: + summary: "High API error rate in Sentinel" + description: "Sentinel is experiencing {{ $value }} API errors/sec for resource_type {{ $labels.resource_type }}. Check HyperFleet API availability." +``` +**Impact**: Unable to fetch resource status, reconciliation decisions based on stale data. + +**Response**: Check HyperFleet API service health and network connectivity. + +#### SentinelBrokerErrorRateHigh +```yaml +alert: SentinelBrokerErrorRateHigh +expr: rate(hyperfleet_sentinel_broker_errors_total[5m]) > 0.05 +for: 5m +labels: + severity: critical + component: sentinel +annotations: + summary: "High broker error rate in Sentinel" + description: "Sentinel is experiencing {{ $value }} broker errors/sec for resource_type {{ $labels.resource_type }}. Check message broker connectivity." +``` +**Impact**: Events not reaching adapters, reconciliation loops broken. + +**Response**: Check message broker health and Sentinel broker configuration. + +#### SentinelPollStale +```yaml +alert: SentinelPollStale +expr: | + hyperfleet_sentinel_last_successful_poll_timestamp_seconds > 0 + and time() - hyperfleet_sentinel_last_successful_poll_timestamp_seconds > 60 +for: 1m +labels: + severity: critical + component: sentinel +annotations: + summary: "Sentinel poll loop is stale" + description: "Sentinel has not completed a successful poll cycle in over 60 seconds. The service may be hung or unable to poll." +``` +**Impact**: Complete polling failure, no reconciliation events generated. + +**Response**: Check Sentinel logs and restart if necessary. 
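The `SentinelPollStale` expression above carries a `> 0` guard so the alert cannot fire before the first successful poll has ever been recorded (while the gauge is still zero). A sketch of that evaluation (illustrative Python, not Sentinel code):

```python
def poll_is_stale(last_success_ts: float, now: float, max_age: float = 60.0) -> bool:
    """Mirror of the SentinelPollStale expression: only report staleness
    once at least one successful poll has been recorded (timestamp > 0)."""
    return last_success_ts > 0 and now - last_success_ts > max_age

assert poll_is_stale(0.0, now=1000.0) is False    # never polled: suppressed
assert poll_is_stale(910.0, now=1000.0) is True   # last success 90s ago: stale
assert poll_is_stale(970.0, now=1000.0) is False  # last success 30s ago: healthy
```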
+ +### Warning Alerts + +#### SentinelSlowPolling +```yaml +alert: SentinelSlowPolling +expr: histogram_quantile(0.95, rate(hyperfleet_sentinel_poll_duration_seconds_bucket[5m])) > 5 +for: 10m +labels: + severity: warning + component: sentinel +annotations: + summary: "Sentinel polling cycles are slow" + description: "95th percentile poll duration is {{ $value }}s for {{ $labels.resource_type }}. This may indicate API latency or processing issues." +``` +**Impact**: Delayed reconciliation, potentially missing max age intervals. + +**Response**: Check resource count growth and API performance. + +#### SentinelNoEventsPublished +```yaml +alert: SentinelNoEventsPublished +expr: | + hyperfleet_sentinel_pending_resources > 0 + unless on(resource_type, resource_selector) + rate(hyperfleet_sentinel_events_published_total[15m]) > 0 +for: 15m +labels: + severity: warning + component: sentinel +annotations: + summary: "Sentinel not publishing events" + description: "Sentinel has pending resources but hasn't published any events in 15 minutes. Service may be stuck." +``` +**Impact**: Resources may be stuck without reconciliation events. + +**Response**: Check decision engine logic and adapter status updates. + +#### SentinelHighPendingResources +```yaml +alert: SentinelHighPendingResources +expr: sum(hyperfleet_sentinel_pending_resources) > 100 +for: 10m +labels: + severity: warning + component: sentinel +annotations: + summary: "High number of pending resources in Sentinel" + description: "{{ $value }} resources are pending reconciliation for more than 10 minutes. This may indicate processing bottleneck or API issues." +``` +**Impact**: May indicate capacity issues or API problems. + +**Response**: Check resource count growth and consider horizontal scaling. 
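When `SentinelHighPendingResources` fires, the per-selector breakdown can be confirmed from a raw `/metrics` scrape. A minimal sketch of summing the gauge across label sets, mirroring the `sum(...)` in the expression above (illustrative Python; the sample label values are hypothetical, and timestamped samples are not handled):

```python
def sum_gauge(metrics_text: str, name: str) -> float:
    """Sum all samples of one gauge across label sets in Prometheus text
    exposition format; comments and other metric families are ignored."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Match `name{labels} value` or `name value`, not e.g. name_total.
        if line.startswith(name + "{") or line.startswith(name + " "):
            total += float(line.rsplit(" ", 1)[1])
    return total

scrape = """\
# TYPE hyperfleet_sentinel_pending_resources gauge
hyperfleet_sentinel_pending_resources{resource_type="clusters",resource_selector="region=us-east"} 72
hyperfleet_sentinel_pending_resources{resource_type="clusters",resource_selector="region=us-west"} 41
"""
assert sum_gauge(scrape, "hyperfleet_sentinel_pending_resources") == 113.0
# 113 > 100, so SentinelHighPendingResources would fire after its 10m hold.
```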
+ +### Informational Alerts + +#### SentinelHighSkipRatio +```yaml +alert: SentinelHighSkipRatio +expr: | + ( + rate(hyperfleet_sentinel_resources_skipped_total[10m]) / + (rate(hyperfleet_sentinel_resources_skipped_total[10m]) + + rate(hyperfleet_sentinel_events_published_total[10m])) + ) > 0.95 +for: 30m +labels: + severity: info + component: sentinel +annotations: + summary: "High resource skip ratio in Sentinel" + description: "{{ $value | humanizePercentage }} of resources are being skipped. This may indicate max_age configuration issues." +``` +**Impact**: May indicate max age intervals too long or adapter status update issues. + +**Response**: Review max age configuration and adapter health. + +--- + +### Alerting with Google Cloud Managed Prometheus + +**Note**: Google Cloud Managed Prometheus uses Google Cloud Alerting for alert management, not PrometheusRule CRDs. The alerting rules below are provided as PromQL expressions that you can configure in Google Cloud Console → Monitoring → Alerting. + +For Prometheus Operator compatibility, the Helm chart can optionally deploy a PrometheusRule resource when `monitoring.prometheusRule.enabled=true` (disabled by default for GMP). The following alert examples can be configured in either system: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: sentinel-alerts + namespace: hyperfleet-system +spec: + groups: + - name: sentinel.rules + interval: 30s + rules: + - alert: SentinelHighPendingResources + expr: | + sum(hyperfleet_sentinel_pending_resources) > 100 + for: 10m + labels: + severity: warning + annotations: + summary: "High number of pending resources" + description: "{{ $value }} resources are pending reconciliation for more than 10 minutes." 
+ + - alert: SentinelAPIErrorRateHigh + expr: | + rate(hyperfleet_sentinel_api_errors_total[5m]) > 0.1 + for: 5m + labels: + severity: critical + annotations: + summary: "High API error rate detected" + description: "API error rate is {{ $value | humanize }} errors/sec for resource_type {{ $labels.resource_type }}." + + - alert: SentinelBrokerErrorRateHigh + expr: | + rate(hyperfleet_sentinel_broker_errors_total[5m]) > 0.05 + for: 5m + labels: + severity: critical + annotations: + summary: "High broker error rate detected" + description: "Broker error rate is {{ $value | humanize }} errors/sec for resource_type {{ $labels.resource_type }}." + + - alert: SentinelSlowPolling + expr: | + histogram_quantile(0.95, + rate(hyperfleet_sentinel_poll_duration_seconds_bucket[5m])) > 5 + for: 10m + labels: + severity: warning + annotations: + summary: "Polling cycles are slow" + description: "95th percentile poll duration is {{ $value | humanize }}s for {{ $labels.resource_type }}." + + - alert: SentinelNoEventsPublished + expr: | + rate(hyperfleet_sentinel_events_published_total[15m]) == 0 + AND hyperfleet_sentinel_pending_resources > 0 + for: 15m + labels: + severity: warning + annotations: + summary: "No events published despite pending resources" + description: "Sentinel has pending resources but hasn't published events in 15 minutes." + + - alert: SentinelPollStale + expr: | + hyperfleet_sentinel_last_successful_poll_timestamp_seconds > 0 + and time() - hyperfleet_sentinel_last_successful_poll_timestamp_seconds > 60 + for: 1m + labels: + severity: critical + annotations: + summary: "Sentinel poll loop is stale" + description: "Sentinel has not completed a successful poll cycle in over 60 seconds." +``` + +To configure these alerts in **Google Cloud Console**: +1. Navigate to **Monitoring → Alerting** +2. Click **Create Policy** +3. Use the PromQL expressions above in the condition configuration +4. 
Configure notification channels and documentation + +For Prometheus Operator users, enable PrometheusRule in values.yaml and verify: + +```bash +kubectl get prometheusrule -n hyperfleet-system +``` diff --git a/docs/metrics.md b/docs/metrics.md index 501d205..12bbadf 100644 --- a/docs/metrics.md +++ b/docs/metrics.md @@ -1,5 +1,11 @@ # HyperFleet Sentinel Metrics +**Status**: Active +**Owner**: HyperFleet Team +**Last Updated**: 2026-03-12 + +> **Audience:** Developers and SREs setting up monitoring for HyperFleet Sentinel. + This document describes the Prometheus metrics exposed by the HyperFleet Sentinel service for monitoring and observability. ## Metrics Overview @@ -270,225 +276,6 @@ sum by (error_type) (rate(hyperfleet_broker_errors_total{component="sentinel"}[5 --- -## Recommended Alerting Rules - -### Alerting with Google Cloud Managed Prometheus - -**Note**: Google Cloud Managed Prometheus uses Google Cloud Alerting for alert management, not PrometheusRule CRDs. The alerting rules below are provided as PromQL expressions that you can configure in Google Cloud Console → Monitoring → Alerting. - -For Prometheus Operator compatibility, the Helm chart can optionally deploy a PrometheusRule resource when `monitoring.prometheusRule.enabled=true` (disabled by default for GMP). The following alert examples can be configured in either system: - -```yaml -apiVersion: monitoring.coreos.com/v1 -kind: PrometheusRule -metadata: - name: sentinel-alerts - namespace: hyperfleet-system -spec: - groups: - - name: sentinel.rules - interval: 30s - rules: - - alert: SentinelHighPendingResources - expr: | - sum(hyperfleet_sentinel_pending_resources) > 100 - for: 10m - labels: - severity: warning - annotations: - summary: "High number of pending resources" - description: "{{ $value }} resources are pending reconciliation for more than 10 minutes." 
- - - alert: SentinelAPIErrorRateHigh - expr: | - rate(hyperfleet_sentinel_api_errors_total[5m]) > 0.1 - for: 5m - labels: - severity: critical - annotations: - summary: "High API error rate detected" - description: "API error rate is {{ $value | humanize }} errors/sec for resource_type {{ $labels.resource_type }}." - - - alert: SentinelBrokerErrorRateHigh - expr: | - rate(hyperfleet_sentinel_broker_errors_total[5m]) > 0.05 - for: 5m - labels: - severity: critical - annotations: - summary: "High broker error rate detected" - description: "Broker error rate is {{ $value | humanize }} errors/sec for resource_type {{ $labels.resource_type }}." - - - alert: SentinelSlowPolling - expr: | - histogram_quantile(0.95, - rate(hyperfleet_sentinel_poll_duration_seconds_bucket[5m])) > 5 - for: 10m - labels: - severity: warning - annotations: - summary: "Polling cycles are slow" - description: "95th percentile poll duration is {{ $value | humanize }}s for {{ $labels.resource_type }}." - - - alert: SentinelNoEventsPublished - expr: | - rate(hyperfleet_sentinel_events_published_total[15m]) == 0 - AND hyperfleet_sentinel_pending_resources > 0 - for: 15m - labels: - severity: warning - annotations: - summary: "No events published despite pending resources" - description: "Sentinel has pending resources but hasn't published events in 15 minutes." - - - alert: SentinelPollStale - expr: | - hyperfleet_sentinel_last_successful_poll_timestamp_seconds > 0 - and time() - hyperfleet_sentinel_last_successful_poll_timestamp_seconds > 60 - for: 1m - labels: - severity: critical - annotations: - summary: "Sentinel poll loop is stale" - description: "Sentinel has not completed a successful poll cycle in over 60 seconds." -``` - -To configure these alerts in **Google Cloud Console**: -1. Navigate to **Monitoring → Alerting** -2. Click **Create Policy** -3. Use the PromQL expressions above in the condition configuration -4. 
Configure notification channels and documentation - -For Prometheus Operator users, enable PrometheusRule in values.yaml and verify: - -```bash -kubectl get prometheusrule -n hyperfleet-system -``` - -### Individual Alert Rules (YAML Format) - -For reference, here are the individual alert rules in YAML format: - -#### High Pending Resources - -Alert when the number of pending resources exceeds a threshold for an extended period. - -```yaml -- alert: SentinelHighPendingResources - expr: | - sum(hyperfleet_sentinel_pending_resources) > 100 - for: 10m - labels: - severity: warning - annotations: - summary: "High number of pending resources" - description: "{{ $value }} resources are pending reconciliation for more than 10 minutes." -``` - -### API Error Rate High - -Alert when the API error rate exceeds acceptable limits. - -```yaml -- alert: SentinelAPIErrorRateHigh - expr: | - rate(hyperfleet_sentinel_api_errors_total[5m]) > 0.1 - for: 5m - labels: - severity: critical - annotations: - summary: "High API error rate detected" - description: "API error rate is {{ $value | humanize }} errors/sec for resource_type {{ $labels.resource_type }}." -``` - -### Broker Error Rate High - -Alert when broker errors indicate message delivery issues. - -```yaml -- alert: SentinelBrokerErrorRateHigh - expr: | - rate(hyperfleet_sentinel_broker_errors_total[5m]) > 0.05 - for: 5m - labels: - severity: critical - annotations: - summary: "High broker error rate detected" - description: "Broker error rate is {{ $value | humanize }} errors/sec for resource_type {{ $labels.resource_type }}." -``` - -### Slow Poll Duration - -Alert when polling cycles are taking too long, indicating performance issues. 
- -```yaml -- alert: SentinelSlowPolling - expr: | - histogram_quantile(0.95, - rate(hyperfleet_sentinel_poll_duration_seconds_bucket[5m])) > 5 - for: 10m - labels: - severity: warning - annotations: - summary: "Polling cycles are slow" - description: "95th percentile poll duration is {{ $value | humanize }}s for {{ $labels.resource_type }}." -``` - -### No Events Published - -Alert when no events have been published recently, which may indicate a stuck sentinel. - -```yaml -- alert: SentinelNoEventsPublished - expr: | - rate(hyperfleet_sentinel_events_published_total[15m]) == 0 - AND hyperfleet_sentinel_pending_resources > 0 - for: 15m - labels: - severity: warning - annotations: - summary: "No events published despite pending resources" - description: "Sentinel has {{ $value }} pending resources but hasn't published events in 15 minutes." -``` - -### High Skip Ratio - -Alert when too many resources are being skipped, which may indicate configuration issues. - -```yaml -- alert: SentinelHighSkipRatio - expr: | - rate(hyperfleet_sentinel_resources_skipped_total[10m]) / - (rate(hyperfleet_sentinel_resources_skipped_total[10m]) + - rate(hyperfleet_sentinel_events_published_total[10m])) > 0.95 - for: 30m - labels: - severity: info - annotations: - summary: "High resource skip ratio" - description: "{{ $value | humanizePercentage }} of resources are being skipped." -``` - -### Poll Stale - -Alert when Sentinel has not completed a successful poll cycle recently, indicating the service may be hung. - -```yaml -- alert: SentinelPollStale - expr: | - hyperfleet_sentinel_last_successful_poll_timestamp_seconds > 0 - and time() - hyperfleet_sentinel_last_successful_poll_timestamp_seconds > 60 - for: 1m - labels: - severity: critical - component: sentinel - annotations: - summary: "Sentinel poll loop is stale" - description: "Sentinel has not completed a successful poll cycle in over 60 seconds." 
- runbook_url: "https://github.com/openshift-hyperfleet/hyperfleet-sentinel/blob/main/docs/metrics.md#poll-stale" -``` ---- - ## Grafana Dashboard A pre-built Grafana dashboard is available at `deployments/dashboards/sentinel-metrics.json`. The dashboard includes: diff --git a/docs/multi-instance-deployment.md b/docs/multi-instance-deployment.md index 75e850e..641ab35 100644 --- a/docs/multi-instance-deployment.md +++ b/docs/multi-instance-deployment.md @@ -1,4 +1,8 @@ # Deploying Multiple Sentinel Instances +**Status**: Active +**Owner**: HyperFleet Team +**Last Updated**: 2026-03-12 +> **Audience:** Operations teams deploying Sentinel at scale. Sentinel supports horizontal scaling through multiple dimensions: by resource type (separate instances for clusters vs nodepools) and by label-based resource filtering within the same resource type. Deploy multiple Sentinel instances with different `resource_selector` values to distribute the workload. @@ -110,6 +114,151 @@ config: Scale to multiple instances as your cluster count grows or when you need regional isolation. +--- + +## PodDisruptionBudget + +**What**: Ensures minimum Sentinel availability during cluster maintenance. 
+ +**Configuration for Single-Replica Deployments** (typical topology): +```yaml +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + name: sentinel-pdb + namespace: hyperfleet-system +spec: + minAvailable: 1 + selector: + matchLabels: + app.kubernetes.io/name: hyperfleet-sentinel +``` + +**Operational Impact**: +- **Single replica protection**: `minAvailable: 1` blocks voluntary pod eviction when only 1 replica exists +- **Maintenance blocking**: Node drains will be delayed until Sentinel pods are manually drained or scaled up +- **Multiple Sentinels**: Each Sentinel deployment (per resource selector) can have its own PDB +- **Trade-off**: Maintenance operations may require manual intervention for single-replica Sentinels + +> **Note**: Cluster maintenance operations respect Sentinel availability requirements. + +--- + +## Operational Guidance + +### Resource Requirements + +#### Production Recommendations +```yaml +resources: + requests: + cpu: 100m # Baseline for polling every 5s + memory: 128Mi # Baseline for ~1000 resources + limits: + cpu: 500m # Handle traffic spikes + memory: 512Mi # Memory for large resource sets +``` + +> **Note**: Resource requirements will be validated and updated based on actual consumption profiling in HYPERFLEET-556. + +#### Scaling Guidelines + +**CPU Scaling**: +- **Base load**: 50-100m for basic polling +- **Per 1000 resources**: Additional 50m CPU +- **High churn environments**: Additional 100m for frequent events + +**Memory Scaling**: +- **Base load**: 64Mi for service overhead +- **Per 1000 resources**: Additional 32Mi memory +- **Complex resource selectors**: Additional 16Mi per selector rule + +**Example Calculation**: +``` +5000 resources + complex selectors: +CPU: 100m + (5 × 50m) + 100m = 450m +Memory: 64Mi + (5 × 32Mi) + 16Mi = 240Mi +``` + +### Scaling Strategy + +#### Horizontal Scaling (Label Partitioning) + +**Approach**: Deploy multiple Sentinel instances with different `resource_selector` configurations. 
+ +**Benefits**: +- Linear performance scaling +- Fault isolation (one failure doesn't affect all resources) +- Regional deployment (Sentinel near managed resources) +- Different configurations per environment + +**Example Multi-Instance Deployment**: +``` + ┌───────────────────┐ + │ HyperFleet API │ + └─────────┬─────────┘ + │ + Step 1: fetch resources + │ + ▼ +┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐ +│ Sentinel US-East │ │ Sentinel US-West │ │ Sentinel EU-West │ +│ resource_selector: │ │ resource_selector: │ │ resource_selector: │ +│ - label: region │ │ - label: region │ │ - label: region │ +│ value: us-east │ │ value: us-west │ │ value: eu-west │ +│ max_age_ready=30m │ │ max_age_ready=1h │ │ max_age_ready=45m │ +└──────────┬──────────┘ └──────────┬──────────┘ └──────────┬──────────┘ + │ │ │ + │ ▼ │ + └────────────► Step 2: publish events ◄───────────┘ + │ + ▼ + ┌───────────────────┐ + │ Message Broker │ + └───────────────────┘ +``` + +**Important**: This is **NOT leader election**. Multiple Sentinels can overlap resource selectors if needed. Operators must ensure appropriate coverage. + +#### Resource Selector Strategies + +**Regional Partitioning**: +```yaml +# Sentinel A +resource_selector: + - label: region + value: us-east + +# Sentinel B +resource_selector: + - label: region + value: us-west +``` + +**Environment Partitioning**: +```yaml +# Production Sentinel +resource_selector: + - label: environment + value: production + +# Development Sentinel +resource_selector: + - label: environment + value: development +``` + +**Hybrid Partitioning**: +```yaml +# Production US-East +resource_selector: + - label: region + value: us-east + - label: environment + value: production +``` + + ## Architecture Reference For more details on the Sentinel architecture and resource filtering design, see the [architecture documentation](https://github.com/openshift-hyperfleet/architecture/tree/main/hyperfleet/components/sentinel). 
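The scaling guidelines above reduce to simple arithmetic. A quick sizing sketch follows — one interpretation of these rules of thumb, in which the extra 100m CPU is attributed to high churn and the extra 16Mi to a selector rule, matching the worked example; treat the output as a starting point, not an official formula:

```python
def sentinel_sizing(n_resources: int, selector_rules: int = 0,
                    high_churn: bool = False) -> tuple[int, int]:
    """Return (cpu_millicores, memory_mib) from the rules of thumb above:
    CPU: 100m base + 50m per 1000 resources + 100m for high-churn environments.
    Memory: 64Mi base + 32Mi per 1000 resources + 16Mi per selector rule."""
    thousands = n_resources // 1000
    cpu = 100 + 50 * thousands + (100 if high_churn else 0)
    mem = 64 + 32 * thousands + 16 * selector_rules
    return cpu, mem

# The worked example in this document: 5000 resources with one complex
# selector rule in a high-churn environment -> 450m CPU, 240Mi memory.
assert sentinel_sizing(5000, selector_rules=1, high_churn=True) == (450, 240)
```

Actual consumption should still be validated against profiling (per the note above referencing HYPERFLEET-556) before committing these numbers to a production deployment.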
diff --git a/docs/runbook.md b/docs/runbook.md new file mode 100644 index 0000000..6575744 --- /dev/null +++ b/docs/runbook.md @@ -0,0 +1,266 @@ +# HyperFleet Sentinel Runbook + +**Status**: Active +**Owner**: HyperFleet Team +**Last Updated**: 2026-03-12 + +> **Audience:** Platform Operations teams and SREs responsible for HyperFleet Sentinel deployments. + +--- + +## Purpose + +This runbook provides operational guidance for teams deploying and managing HyperFleet Sentinel in production environments. It serves as the primary reference for the service's reliability features, health checks, and recovery procedures for common failure modes. + +--- + +## Reliability Features + +The Sentinel service is designed with multiple layers of reliability to ensure continuous reconciliation of HyperFleet resources. + +### Stateless Design + +**What**: Sentinel maintains no persistent state between polling cycles. + +**Implementation**: +- All reconciliation decisions are made based on current resource state from the HyperFleet API +- No local databases or persistent storage requirements +- Configuration loaded once at startup from YAML files and environment variables +- Each polling cycle starts fresh from API data + +**Benefits**: +- Simple horizontal scaling (no state coordination needed) +- Fast recovery after restarts (no state reconstruction) +- Eliminates state corruption issues +- Simplified deployment (no persistent volumes) + +**Operational Impact**: Sentinel instances can be stopped/started without data loss. Resource reconciliation continues from the last adapter-reported status. + +### Graceful Shutdown + +**What**: Sentinel responds to SIGTERM/SIGINT signals with controlled shutdown.
+ +**Implementation**: +- Listens for termination signals during main polling loop +- Completes current polling cycle before exiting +- Maximum shutdown time: 20 seconds for HTTP server shutdown +- Publishes any pending events before shutdown +- Cleans up broker connections gracefully + +**Configuration**: +```yaml +spec: + template: + spec: + terminationGracePeriodSeconds: 30 +``` + +**Operational Impact**: Graceful shutdown minimizes event loss by attempting to publish pending events before exit, subject to the grace period. + +### API Retry Logic + +**What**: Automatic retry with exponential backoff for HyperFleet API calls. + +**Implementation**: +- **Timeout**: 5 seconds per API call (configurable via `hyperfleet_api.timeout`) +- **Initial interval**: 500ms (first retry after 500ms) +- **Max interval**: 8 seconds (maximum retry interval) +- **Multiplier**: 2.0 (doubles interval each retry: 500ms → 1s → 2s → 4s → 8s) +- **Randomization**: 10% jitter added to prevent thundering herd +- **Max elapsed time**: 30 seconds total (time-based retry, not attempt-based) +- **Failure handling**: Logs errors, continues with next resource after max elapsed time + +**Configuration**: +```yaml +hyperfleet_api: + endpoint: http://hyperfleet-api.hyperfleet-system.svc.cluster.local:8000 + timeout: 5s +``` + +**Metrics**: Failed API calls tracked via `hyperfleet_sentinel_api_errors_total` metric. + +**Operational Impact**: Transient API issues don't stop reconciliation. Service continues polling after API recovery. + +### Broker Publish Retry + +**What**: Automatic retry for message broker publishing failures. 
+ +**Implementation**: +- **External library**: Retry behavior handled by `hyperfleet-broker` library +- **Broker support**: GCP Pub/Sub and RabbitMQ with library-managed retry logic +- **Failure isolation**: Failed events logged but don't stop processing of other resources +- **Error handling**: Log error, record metric, continue to next resource + +> **Note**: Specific retry parameters (attempts, timeouts, backoff strategy) are implemented in the external [hyperfleet-broker](https://github.com/openshift-hyperfleet/hyperfleet-broker) library and not configurable at the Sentinel level. + +**Configuration Example (GCP Pub/Sub)**: +```yaml +# Via environment variables or ConfigMap +BROKER_TYPE: "pubsub" +BROKER_PROJECT_ID: "hyperfleet-prod" +``` + +**Metrics**: Publishing failures tracked via `hyperfleet_sentinel_broker_errors_total` metric. + +**Operational Impact**: Temporary broker outages don't cause event loss. Events are retried by the broker library, but durability depends on broker availability and Sentinel remaining active. + +### Per-Resource Error Isolation + +**What**: Failures processing one resource don't affect processing of other resources. + +**Implementation**: +- Each resource evaluated independently in the polling loop +- Decision engine errors logged but processing continues +- Event publishing failures logged but don't stop the polling cycle +- API errors for specific resources don't abort the entire fetch operation + +**Example Flow**: +``` +Polling Cycle: +├── Fetch 100 clusters from API +├── Process cluster-1 → Event published +├── Process cluster-2 → Log error, continue +├── Process cluster-3 → Event published +└── Complete cycle, sleep, repeat +``` + +**Operational Impact**: Problematic resources (e.g., malformed data) don't prevent reconciliation of healthy resources. + +## Health Checks + +**What**: Kubernetes readiness and liveness probes that verify actual service functionality. 
+ +**Implementation**: + +**Liveness Probe** (`/healthz`): +- Verifies main polling goroutine is running +- Checks broker connection status +- Returns 200 OK if service can perform reconciliation +- **Failure threshold**: 3 consecutive failures +- **Period**: 20 seconds + +**Readiness Probe** (`/readyz`): +- Verifies configuration loaded successfully +- Validates HyperFleet API connectivity +- Confirms broker configuration is valid +- Returns 200 OK when ready to process traffic +- **Period**: 10 seconds + +**Configuration**: +```yaml +livenessProbe: + httpGet: + path: /healthz + port: 8080 + initialDelaySeconds: 15 + periodSeconds: 20 +readinessProbe: + httpGet: + path: /readyz + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 10 +``` + +**Operational Impact**: Kubernetes automatically restarts unhealthy pods and removes unready pods from service. + +## Common Failure Modes and Recovery Procedures + +### 1. Sentinel Pod Crash Loop +**Symptoms**: Pod restart count increasing, CrashLoopBackOff status + +**Diagnosis**: Check pod logs, resource constraints, configuration errors + +**Recovery**: +1. Check logs: `kubectl logs -l app.kubernetes.io/name=sentinel --previous` +2. Verify resource limits: `kubectl describe pods -l app.kubernetes.io/name=sentinel` +3. Validate configuration: `kubectl get configmap -l app.kubernetes.io/name=sentinel -o yaml` + +**Alternative commands for specific deployment:** +```bash +# If you know the Helm release name (e.g., "my-sentinel") +kubectl logs deployment/my-sentinel-sentinel --previous +kubectl get configmap my-sentinel-sentinel-config -o yaml +``` + +### 2. API Connectivity Loss +**Symptoms**: High API error rate, no events published + +**Diagnosis**: API health, network connectivity, authentication + +**Recovery**: +1. Test API connectivity: `kubectl exec -l app.kubernetes.io/name=sentinel -- curl hyperfleet-api:8000/health` +2. Check API pod status: `kubectl get pods -l app.kubernetes.io/name=hyperfleet-api` +3. 
Verify service endpoints: `kubectl get endpoints hyperfleet-api` +4. Check API service: `kubectl get service hyperfleet-api` + +**Note**: API endpoint uses port 8000 as configured in values.yaml + +### 3. Broker Publishing Failures +**Symptoms**: High broker error rate, events not reaching adapters + +**Diagnosis**: Broker connectivity, credentials, topic configuration + +**Recovery**: +1. Check broker credentials: `kubectl get secret -l app.kubernetes.io/name=sentinel -o yaml` +2. Test RabbitMQ connectivity: `kubectl exec -l app.kubernetes.io/name=sentinel -- nslookup rabbitmq.hyperfleet-system.svc.cluster.local` +3. Check broker health: `kubectl get pods -l app.kubernetes.io/name=rabbitmq` +4. Validate broker config: `kubectl exec -l app.kubernetes.io/name=sentinel -- cat /etc/sentinel/broker.yaml` + +**For specific secret (if you know the Helm release name):** +```bash +kubectl get secret my-sentinel-sentinel-broker-credentials -o yaml +``` + +## Additional Diagnostic Commands + +### Check Overall Sentinel Health +```bash +# View all Sentinel resources +kubectl get all -l app.kubernetes.io/name=sentinel + +# Check health endpoints +kubectl exec -l app.kubernetes.io/name=sentinel -- wget -qO- http://localhost:8080/healthz +kubectl exec -l app.kubernetes.io/name=sentinel -- wget -qO- http://localhost:8080/readyz + +# View metrics endpoint +kubectl exec -l app.kubernetes.io/name=sentinel -- curl -s http://localhost:9090/metrics +``` + +### Monitor Sentinel Activity +```bash +# Follow logs in real-time +kubectl logs -l app.kubernetes.io/name=sentinel -f + +# Check recent polling cycles +kubectl logs -l app.kubernetes.io/name=sentinel --tail=50 | grep -E "(polling|published|error)" + +# Monitor resource usage +kubectl top pods -l app.kubernetes.io/name=sentinel +``` + +### Validate Configuration +```bash +# Check mounted config +kubectl exec -l app.kubernetes.io/name=sentinel -- cat /etc/sentinel/config.yaml + +# Check broker configuration +kubectl exec -l 
app.kubernetes.io/name=sentinel -- cat /etc/sentinel/broker.yaml + +# Verify environment variables +kubectl exec -l app.kubernetes.io/name=sentinel -- env | grep -E "(OTEL|TRACING|BROKER)" +``` + +### Network Connectivity Tests +```bash +# Test HyperFleet API connectivity +kubectl exec -l app.kubernetes.io/name=sentinel -- wget -qO- http://hyperfleet-api:8000/health + +# Test DNS resolution +kubectl exec -l app.kubernetes.io/name=sentinel -- nslookup hyperfleet-api +kubectl exec -l app.kubernetes.io/name=sentinel -- nslookup rabbitmq.hyperfleet-system.svc.cluster.local + +# Check cluster networking +kubectl get endpoints hyperfleet-api +kubectl get service hyperfleet-api +``` diff --git a/docs/running-sentinel.md b/docs/running-sentinel.md index 2ef753c..fdc8137 100644 --- a/docs/running-sentinel.md +++ b/docs/running-sentinel.md @@ -1,4 +1,8 @@ # Running Sentinel +**Status**: Active +**Owner**: HyperFleet Team +**Last Updated**: 2026-03-12 +> **Audience:** Developers running Sentinel for development and testing purposes. > **IMPORTANT**: This documentation covers running Sentinel for **development and testing purposes**. Production deployments are handled via CI/CD pipelines. 
@@ -22,6 +26,10 @@ This guide enables developers to run Sentinel both locally (for development) and - [Helm Deployment](#6-helm-deployment) - [Verification Steps](#7-verification-steps) - [Cleanup](#8-cleanup) +- [Deployment Configuration](#deployment-configuration) + - [Basic Production Configuration](#basic-production-configuration) + - [Multi-Region Configuration](#multi-region-configuration) + - [Development Environment Configuration](#development-environment-configuration) - [Troubleshooting](#troubleshooting) --- @@ -514,6 +522,79 @@ gcloud projects remove-iam-policy-binding ${GCP_PROJECT} \ --role="roles/pubsub.publisher" \ --member="principal://iam.googleapis.com/projects/${GCP_PROJECT_NUMBER}/locations/global/workloadIdentityPools/${GCP_PROJECT}.svc.id.goog/subject/ns/${NAMESPACE}/sa/sentinel-test" ``` +--- + + +## Deployment Configuration + +### Basic Production Configuration +```yaml +# sentinel-config.yaml +resource_type: clusters +poll_interval: 5s +max_age_not_ready: 10s +max_age_ready: 30m + +# Watch all clusters (no filtering) +resource_selector: [] + +hyperfleet_api: + endpoint: http://hyperfleet-api.hyperfleet-system.svc.cluster.local:8000 + timeout: 5s + +# CloudEvent data payload using CEL expressions +message_data: + resource_id: "resource.id" # CEL expression accessing resource.id field + resource_type: "resource.kind" # CEL expression accessing resource.kind field + generation: "resource.generation" # CEL expression accessing resource.generation field + region: "resource.labels.region" # CEL expression accessing nested labels.region field +``` + +### Multi-Region Configuration +```yaml +# sentinel-us-east-config.yaml +resource_type: clusters +poll_interval: 5s +max_age_not_ready: 10s +max_age_ready: 30m + +resource_selector: + - label: region + value: us-east + +hyperfleet_api: + endpoint: http://hyperfleet-api.hyperfleet-system.svc.cluster.local:8000 + timeout: 5s + +message_data: + resource_id: "resource.id" + resource_type: "resource.kind" 
+ generation: "resource.generation" + region: "resource.labels.region" +``` + +### Development Environment Configuration +```yaml +# sentinel-dev-config.yaml +resource_type: clusters +poll_interval: 10s # Slower polling for dev +max_age_not_ready: 30s # Longer intervals for dev +max_age_ready: 2h + +resource_selector: + - label: environment + value: development + +hyperfleet_api: + endpoint: http://hyperfleet-api.hyperfleet-system.svc.cluster.local:8000 + timeout: 5s + +message_data: + resource_id: "resource.id" + resource_type: "resource.kind" + generation: "resource.generation" + environment: "resource.labels.environment" +``` --- diff --git a/docs/sentinel-operator-guide.md b/docs/sentinel-operator-guide.md index 0cdb0f4..1bf0ae3 100644 --- a/docs/sentinel-operator-guide.md +++ b/docs/sentinel-operator-guide.md @@ -1,4 +1,8 @@ # HyperFleet Sentinel Operator Guide +**Status**: Active +**Owner**: HyperFleet Team +**Last Updated**: 2026-03-12 +> **Audience:** Operators deploying and configuring Sentinel service. This comprehensive guide teaches operators how to deploy, configure, and operate the HyperFleet Sentinel service—a polling-based event publisher that drives cluster lifecycle orchestration. 
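The `message_data` entries in the deployment configurations above are CEL expressions evaluated against each polled resource to build the CloudEvent payload. As a rough reading aid for the field access those expressions perform (Sentinel's actual evaluator is CEL; the `resolve` function below is a hypothetical stand-in that only handles dotted paths), a sketch:

```python
# Illustration only: Sentinel evaluates message_data values with CEL.
# resolve() is a hypothetical stand-in that mimics the simple dotted-path
# field access used in the configuration examples above.
def resolve(expr: str, scope: dict) -> object:
    """Resolve a dotted path such as 'resource.labels.region' against a scope."""
    node = scope
    for part in expr.split("."):
        node = node[part]  # descend one field per path segment
    return node

# Sample resource shaped like the clusters the configurations above poll.
resource = {
    "id": "cluster-123",
    "kind": "clusters",
    "generation": 4,
    "labels": {"region": "us-east", "environment": "development"},
}

# message_data from the basic production configuration.
message_data = {
    "resource_id": "resource.id",
    "resource_type": "resource.kind",
    "generation": "resource.generation",
    "region": "resource.labels.region",
}

payload = {key: resolve(expr, {"resource": resource}) for key, expr in message_data.items()}
print(payload)
# {'resource_id': 'cluster-123', 'resource_type': 'clusters', 'generation': 4, 'region': 'us-east'}
```

Real CEL supports more than field access (conditionals, string functions, and so on), so treat this purely as an illustration of what the configuration keys select.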
From ee12151a6a3f01d3f60e69af591bcb91e6990685 Mon Sep 17 00:00:00 2001 From: tithakka Date: Thu, 12 Mar 2026 13:26:33 -0500 Subject: [PATCH 2/3] HYPERFLEET-557: Document Remove unnecessary section --- docs/runbook.md | 53 ------------------------------------------------- 1 file changed, 53 deletions(-) diff --git a/docs/runbook.md b/docs/runbook.md index 6575744..6964f35 100644 --- a/docs/runbook.md +++ b/docs/runbook.md @@ -211,56 +211,3 @@ kubectl get configmap my-sentinel-sentinel-config -o yaml ```bash kubectl get secret my-sentinel-sentinel-broker-credentials -o yaml ``` - -## Additional Diagnostic Commands - -### Check Overall Sentinel Health -```bash -# View all Sentinel resources -kubectl get all -l app.kubernetes.io/name=sentinel - -# Check health endpoints -kubectl exec -l app.kubernetes.io/name=sentinel -- wget -qO- http://localhost:8080/healthz -kubectl exec -l app.kubernetes.io/name=sentinel -- wget -qO- http://localhost:8080/readyz - -# View metrics endpoint -kubectl exec -l app.kubernetes.io/name=sentinel -- curl -s http://localhost:9090/metrics -``` - -### Monitor Sentinel Activity -```bash -# Follow logs in real-time -kubectl logs -l app.kubernetes.io/name=sentinel -f - -# Check recent polling cycles -kubectl logs -l app.kubernetes.io/name=sentinel --tail=50 | grep -E "(polling|published|error)" - -# Monitor resource usage -kubectl top pods -l app.kubernetes.io/name=sentinel -``` - -### Validate Configuration -```bash -# Check mounted config -kubectl exec -l app.kubernetes.io/name=sentinel -- cat /etc/sentinel/config.yaml - -# Check broker configuration -kubectl exec -l app.kubernetes.io/name=sentinel -- cat /etc/sentinel/broker.yaml - -# Verify environment variables -kubectl exec -l app.kubernetes.io/name=sentinel -- env | grep -E "(OTEL|TRACING|BROKER)" -``` - -### Network Connectivity Tests -```bash -# Test HyperFleet API connectivity -kubectl exec -l app.kubernetes.io/name=sentinel -- wget -qO- http://hyperfleet-api:8000/health - -# Test DNS 
resolution -kubectl exec -l app.kubernetes.io/name=sentinel -- nslookup hyperfleet-api -kubectl exec -l app.kubernetes.io/name=sentinel -- nslookup rabbitmq.hyperfleet-system.svc.cluster.local - -# Check cluster networking -kubectl get endpoints hyperfleet-api -kubectl get service hyperfleet-api -``` From f075f949af3f3d3f467ec903bbad071cf47b025a Mon Sep 17 00:00:00 2001 From: tithakka Date: Thu, 12 Mar 2026 17:28:03 -0500 Subject: [PATCH 3/3] HYPERFLEET-557: Remove incorrect healthz and authz statement, fix promQL query and add component to each alert --- docs/alerts.md | 34 ++++++++++++++++++++++++++++++++-- docs/runbook.md | 19 ++++++++++--------- 2 files changed, 42 insertions(+), 11 deletions(-) diff --git a/docs/alerts.md b/docs/alerts.md index a992a19..c3e1cb3 100644 --- a/docs/alerts.md +++ b/docs/alerts.md @@ -187,6 +187,7 @@ spec: for: 10m labels: severity: warning + component: sentinel annotations: summary: "High number of pending resources" description: "{{ $value }} resources are pending reconciliation for more than 10 minutes." @@ -197,6 +198,7 @@ spec: for: 5m labels: severity: critical + component: sentinel annotations: summary: "High API error rate detected" description: "API error rate is {{ $value | humanize }} errors/sec for resource_type {{ $labels.resource_type }}." @@ -207,6 +209,7 @@ spec: for: 5m labels: severity: critical + component: sentinel annotations: summary: "High broker error rate detected" description: "Broker error rate is {{ $value | humanize }} errors/sec for resource_type {{ $labels.resource_type }}." @@ -218,17 +221,20 @@ spec: for: 10m labels: severity: warning + component: sentinel annotations: summary: "Polling cycles are slow" description: "95th percentile poll duration is {{ $value | humanize }}s for {{ $labels.resource_type }}." 
      - alert: SentinelNoEventsPublished
        expr: |
-          rate(hyperfleet_sentinel_events_published_total[15m]) == 0
-          AND hyperfleet_sentinel_pending_resources > 0
+          hyperfleet_sentinel_pending_resources > 0
+          unless on(resource_type, resource_selector)
+          rate(hyperfleet_sentinel_events_published_total[15m]) > 0
        for: 15m
        labels:
          severity: warning
+          component: sentinel
        annotations:
          summary: "No events published despite pending resources"
          description: "Sentinel has pending resources but hasn't published events in 15 minutes."
@@ -240,9 +246,33 @@ spec:
        for: 1m
        labels:
          severity: critical
+          component: sentinel
        annotations:
          summary: "Sentinel poll loop is stale"
          description: "Sentinel has not completed a successful poll cycle in over 60 seconds."
+      - alert: SentinelDown
+        expr: absent(up{service="sentinel"}) or up{service="sentinel"} == 0
+        for: 5m
+        labels:
+          severity: critical
+          component: sentinel
+        annotations:
+          summary: "Sentinel service is down"
+          description: "Sentinel metrics endpoint is not responding. Service may be down or unreachable."
+      - alert: SentinelHighSkipRatio
+        expr: |
+          (
+            rate(hyperfleet_sentinel_resources_skipped_total[10m]) /
+            (rate(hyperfleet_sentinel_resources_skipped_total[10m]) +
+             rate(hyperfleet_sentinel_events_published_total[10m]))
+          ) > 0.95
+        for: 30m
+        labels:
+          severity: info
+          component: sentinel
+        annotations:
+          summary: "High resource skip ratio in Sentinel"
+          description: "{{ $value | humanizePercentage }} of resources are being skipped. This may indicate max_age configuration issues."
 ```
 
 To configure these alerts in **Google Cloud Console**:
diff --git a/docs/runbook.md b/docs/runbook.md
index 6964f35..67f98cd 100644
--- a/docs/runbook.md
+++ b/docs/runbook.md
@@ -11,7 +11,9 @@
 ## Purpose
 This runbook provides operational guidance for teams deploying and managing HyperFleet Sentinel in production environments.
It serves as the primary reference for: - +- Understanding built-in reliability features +- Configuring health probes and monitoring +- Diagnosing and recovering from common failure modes --- ## Reliability Features @@ -44,7 +46,6 @@ The Sentinel service is designed with multiple layers of reliability to ensure c - Listens for termination signals during main polling loop - Completes current polling cycle before exiting - Maximum shutdown time: 20 seconds for HTTP server shutdown -- Publishes any pending events before shutdown - Cleans up broker connections gracefully **Configuration**: @@ -112,7 +113,7 @@ BROKER_PROJECT_ID: "hyperfleet-prod" - Each resource evaluated independently in the polling loop - Decision engine errors logged but processing continues - Event publishing failures logged but don't stop the polling cycle -- API errors for specific resources don't abort the entire fetch operation +- A single API fetch failure affects the entire cycle, but the next cycle retries automatically **Example Flow**: ``` @@ -133,16 +134,16 @@ Polling Cycle: **Implementation**: **Liveness Probe** (`/healthz`): -- Verifies main polling goroutine is running -- Checks broker connection status -- Returns 200 OK if service can perform reconciliation +- Checks poll staleness (dead man's switch) +- Returns 200 OK if last successful poll is within threshold (3 × poll_interval) +- Returns 200 OK before first poll completes (grace period) - **Failure threshold**: 3 consecutive failures - **Period**: 20 seconds **Readiness Probe** (`/readyz`): -- Verifies configuration loaded successfully -- Validates HyperFleet API connectivity -- Confirms broker configuration is valid +- Checks broker connection health +- Verifies at least one successful poll cycle has completed +- Returns 200 OK when both checks pass - Returns 200 OK when ready to process traffic - **Period**: 10 seconds
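The liveness behavior described above (a dead man's switch on poll staleness, with a 3 × poll_interval threshold and a grace period before the first poll) can be sketched as follows. This is a minimal illustration of the documented rules, not the actual Sentinel implementation; the `Health` class and method names are hypothetical.

```python
# Sketch of the /healthz dead man's switch, assuming the documented behavior:
# healthy before the first poll (grace period), unhealthy once the last
# successful poll is older than 3 x poll_interval.
import time

class Health:
    def __init__(self, poll_interval_seconds: float):
        self.poll_interval = poll_interval_seconds
        self.last_successful_poll = None  # None until the first poll completes

    def record_poll(self):
        # Called at the end of each successful polling cycle.
        self.last_successful_poll = time.monotonic()

    def healthz(self) -> int:
        # Grace period: report healthy before the first poll completes.
        if self.last_successful_poll is None:
            return 200
        staleness = time.monotonic() - self.last_successful_poll
        # Dead man's switch: unhealthy once staleness exceeds 3 x poll_interval.
        return 200 if staleness <= 3 * self.poll_interval else 503

health = Health(poll_interval_seconds=5.0)
print(health.healthz())  # 200 (grace period, no poll yet)
health.record_poll()
print(health.healthz())  # 200 (fresh poll)
```

Wiring `healthz()` into an HTTP handler and calling `record_poll()` after each successful cycle reproduces the probe semantics described above: a hung poll loop stops refreshing the timestamp, and the liveness probe fails after the staleness threshold is crossed.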