# Frame server analysis output using USE/RED/Golden Signals vocabulary #691
## Summary
PerformanceMonitor already collects the data that maps to the three established monitoring frameworks (USE, RED, Golden Signals), but doesn't organize or present findings using their vocabulary. Framing the analysis output this way would make the tool's insights immediately recognizable to anyone trained in SRE/DevOps practices and would provide a structured way to ensure coverage across all resource and service dimensions.
## The Frameworks
### USE Method (Brendan Gregg) — for every resource
| | CPU | Memory | Disk I/O | Workers | TempDB |
|---|---|---|---|---|---|
| Utilization | CPU % | Buffer pool vs max memory | File latency | Active vs max workers | Space used |
| Saturation | SOS_SCHEDULER_YIELD, runnable tasks | RESOURCE_SEMAPHORE waits, grant queuing | PAGEIOLATCH waits | Thread exhaustion | PAGELATCH allocation waits |
| Errors | Non-yielding schedulers | OOM, memory broker shrinks | 15-second I/O warnings | Timeout errors | Out-of-space |
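One way the saturation row of this grid could be operationalized is a simple lookup from wait types to USE cells. The wait types below are real SQL Server wait types drawn from the table above, but the classifier itself is an illustrative sketch, not PerformanceMonitor's implementation:

```python
# Sketch: classify SQL Server wait types into (resource, USE dimension) cells.
# The wait types are real; the mapping structure is hypothetical.
USE_SATURATION_WAITS = {
    "SOS_SCHEDULER_YIELD": ("CPU", "Saturation"),
    "RESOURCE_SEMAPHORE": ("Memory", "Saturation"),
    "PAGEIOLATCH_SH": ("Disk I/O", "Saturation"),
    "PAGEIOLATCH_EX": ("Disk I/O", "Saturation"),
    "THREADPOOL": ("Workers", "Saturation"),
    "PAGELATCH_UP": ("TempDB", "Saturation"),
}

def classify_wait(wait_type: str):
    """Map a wait type to its (resource, dimension) cell, or None if unmapped."""
    return USE_SATURATION_WAITS.get(wait_type)
```

A classifier like this keeps the USE framing in one place, so any wait-stats collector can emit framework-tagged findings without duplicating the mapping.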
### RED Method (Tom Wilkie) — for the database as a service
| Signal | SQL Server Metric |
|---|---|
| Rate | Batch Requests/sec (perfmon) |
| Errors | Failed queries, deadlocks, timeouts, connection errors |
| Duration | Query execution time distribution |
### Four Golden Signals (Google SRE) — the combined view
| Signal | SQL Server Metric |
|---|---|
| Latency | Query duration (distinguish successful vs error latency) |
| Traffic | Batch Requests/sec, Transactions/sec |
| Errors | Deadlocks, timeouts, failed queries |
| Saturation | Wait stats (the entire wait statistics framework is fundamentally a saturation measurement system) |
## Where This Applies
### Analysis Engine Output (both Dashboard and Lite)
The `analyze_server` inference engine produces findings with severity ratings and supporting evidence. These findings could be tagged or grouped by framework category:
- "CPU Saturation (USE): SOS_SCHEDULER_YIELD elevated, runnable task queue depth > 10"
- "Latency (Golden Signals): p95 query duration increased 3x vs baseline"
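A minimal sketch of what a framework-tagged finding could look like as a data model. The field and enum names here are assumptions for illustration, not PerformanceMonitor's actual schema:

```python
# Hypothetical data model for framework-tagged findings.
from dataclasses import dataclass, field
from enum import Enum

class Framework(Enum):
    USE = "USE"
    RED = "RED"
    GOLDEN_SIGNALS = "Golden Signals"

@dataclass
class Finding:
    title: str
    severity: str
    evidence: str
    tags: list = field(default_factory=list)  # (Framework, category) pairs

cpu_finding = Finding(
    title="CPU Saturation",
    severity="warning",
    evidence="SOS_SCHEDULER_YIELD elevated, runnable task queue depth > 10",
    tags=[(Framework.USE, "CPU/Saturation")],
)
```

Keeping tags as a list allows one finding to carry multiple framework labels, since a single symptom (e.g. elevated query duration) often maps to both a Golden Signal and a RED signal.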
### Progressive Summary View (issue #689)
The server landing summary could organize its ranked problems using USE categories for resource issues and RED/Golden Signals for workload issues. This gives users a mental model, not just a list.
### MCP Tool Output
The `analyze_server` and `get_analysis_findings` MCP tools could include framework tags in their output, making AI-assisted investigation framework-aware.
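As a sketch, framework tags could ride along in the tool's JSON output. The field names below are assumptions for illustration, not the tools' current schema:

```python
# Illustrative JSON shape for framework tags in MCP tool output.
import json

mcp_finding = {
    "finding": "Latency regression",
    "severity": "high",
    "evidence": "p95 query duration increased 3x vs baseline",
    "framework_tags": ["Golden Signals:Latency", "RED:Duration"],
}
print(json.dumps(mcp_finding, indent=2))
```

Flat `"Framework:Category"` strings keep the tags trivially filterable by an AI client without requiring it to understand a nested schema.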
## Design Notes
- This is primarily a framing/presentation change, not a new data collection effort
- All underlying data already exists in current collectors
- The frameworks provide a checklist for completeness: if the analysis engine never reports on "Memory Errors," that's a gap worth noticing
- The vocabulary is widely taught in SRE and DevOps contexts — using it makes PerformanceMonitor's output portable to incident reports, postmortems, and runbooks
- Applies to both Dashboard and Lite, plus MCP tool output
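The completeness-checklist idea above can be sketched mechanically: enumerate every cell of the USE grid and report the ones the analysis engine has never covered. The resource and dimension names mirror the USE table earlier in this issue; the helper itself is hypothetical:

```python
# Sketch: treat the USE grid as a coverage checklist and surface the gaps.
RESOURCES = ["CPU", "Memory", "Disk I/O", "Workers", "TempDB"]
DIMENSIONS = ["Utilization", "Saturation", "Errors"]

def coverage_gaps(reported: set) -> list:
    """Return every (resource, dimension) cell with no reported finding."""
    return [(r, d) for r in RESOURCES for d in DIMENSIONS if (r, d) not in reported]

# If only two cells have ever produced findings, thirteen cells are gaps
# worth examining.
gaps = coverage_gaps({("CPU", "Saturation"), ("Memory", "Errors")})
```

Run periodically against the set of emitted finding tags, this would turn "we never report on Memory Errors" from an accidental observation into an automated check.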