Hyperfleet-557 : Document Sentinel Reliability and Observability#73
Hyperfleet-557 : Document Sentinel Reliability and Observability#73tirthct wants to merge 3 commits intoopenshift-hyperfleet:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
|
||
| **Implementation**: | ||
|
|
||
| **Liveness Probe** (`/healthz`): |
There was a problem hiding this comment.
The health check descriptions here don't match the actual implementation:
/healthz: The code (internal/health/health.go) only checks poll staleness (time since last
successful poll vs 3 * PollInterval). It does not check broker connection status — that check
is registered on /readyz. Suggestion:
- Checks poll staleness (dead man's switch)
- Returns 200 OK if last successful poll is within threshold (3 × poll_interval)
- Returns 200 OK before first poll completes (grace period)There was a problem hiding this comment.
Fixed the health check description
| - **Failure threshold**: 3 consecutive failures | ||
| - **Period**: 20 seconds | ||
|
|
||
| **Readiness Probe** (`/readyz`): |
There was a problem hiding this comment.
/readyz: The code only registers two checks: "broker" (calls pub.Health(ctx)) and
"sentinel_poll" (verifies at least one successful poll). There is no explicit "configuration
loaded" or "API connectivity" check. Suggestion:
- Checks broker connection health
- Verifies at least one successful poll cycle has completed
- Returns 200 OK when both checks passThere was a problem hiding this comment.
Ah, yes. Modified the description
|
|
||
| **What**: Sentinel responds to SIGTERM/SIGINT signals with controlled shutdown. | ||
|
|
||
| **Implementation**: |
There was a problem hiding this comment.
Priority: Bug
Line 46: "Publishes any pending events before shutdown" — this doesn't match the shutdown
implementation. The signal handler calls cancel() on the context immediately, then shuts down
the HTTP servers with a 20s timeout. There is no explicit event flush/drain step.
If a polling cycle is mid-execution when shutdown is triggered, in-flight publishes may
complete (since cancel() doesn't interrupt running goroutines mid-function), but they may also
fail because downstream calls receive a cancelled context. This is best-effort, not
guaranteed.
Suggested replacement:
- Listens for termination signals during main polling loop
- If a polling cycle is in progress, in-flight operations may complete (best-effort)
- Maximum shutdown time: 20 seconds for HTTP server shutdown
- Cleans up broker connections gracefullyThere was a problem hiding this comment.
Removed "Publishes any pending events before shutdown"
docs/alerts.md
Outdated
| - alert: SentinelNoEventsPublished | ||
| expr: | | ||
| rate(hyperfleet_sentinel_events_published_total[15m]) == 0 | ||
| AND hyperfleet_sentinel_pending_resources > 0 |
There was a problem hiding this comment.
| AND hyperfleet_sentinel_pending_resources > 0 | |
|
|
||
| ## Alert Rules Reference | ||
|
|
||
| The following 8 alert rules provide comprehensive monitoring for production Sentinel deployments. |
There was a problem hiding this comment.
The doc says "8 alert rules" but the PrometheusRule manifest only includes 6. Missing:
SentinelDown and SentinelHighSkipRatio. An operator deploying from this manifest will have
incomplete coverage.
There was a problem hiding this comment.
I thought this snippet from the document served as "alert examples" (mentioned above) as we have alerts yaml defined under charts/templates/prometheusrule.yaml but yes, this can be confusing if someone tries to use the same manifest. Added the remaining rules here
|
|
||
| ## Purpose | ||
|
|
||
| This runbook provides operational guidance for teams deploying and managing HyperFleet Sentinel in production environments. It serves as the primary reference for: |
There was a problem hiding this comment.
Priority: Pattern
The Purpose section has a dangling colon with no list items — looks like the content got
truncated. Either add the intended list or rephrase the sentence:
| This runbook provides operational guidance for teams deploying and managing HyperFleet Sentinel in production environments. It serves as the primary reference for: | |
| This runbook provides operational guidance for teams deploying and managing HyperFleet Sentinel in production environments. It serves as the primary reference for: | |
| - Understanding built-in reliability features | |
| - Configuring health probes and monitoring | |
| - Diagnosing and recovering from common failure modes |
There was a problem hiding this comment.
Added the description
…mQL query and add component to each alert
| region: "resource.labels.region" # CEL expression accessing nested labels.region field | ||
| ``` | ||
|
|
||
| ### Multi-Region Configuration |
There was a problem hiding this comment.
As the audience for this document is for developers that want to run it locally, I think we can drop or move this "Multi-region configuration" to maybe sentinel-operator-guide.md.
I think lines 553 to 573 of sentinel-operator.guide.md cover it
There was a problem hiding this comment.
In fact I see that this is already documented in multi-instance-deployment.md.
Maybe it worths a note in sentinel-operator-guide.md:
### Multi-Region Configuration
For multi-region deployment examples using `resource_selector`, see [Resource Selector
Strategies](multi-instance-deployment.md#resource-selector-strategies).
Changed