
Frame server analysis output using USE/RED/Golden Signals vocabulary #691

@erikdarlingdata

Description

Summary

PerformanceMonitor already collects the data that maps to the three established monitoring frameworks (USE, RED, Golden Signals), but doesn't organize or present findings using their vocabulary. Framing the analysis output this way would make the tool's insights immediately recognizable to anyone trained in SRE/DevOps practices and would provide a structured way to ensure coverage across all resource and service dimensions.

The Frameworks

USE Method (Brendan Gregg) — for every resource

|             | CPU | Memory | Disk I/O | Workers | TempDB |
|-------------|-----|--------|----------|---------|--------|
| Utilization | CPU % | Buffer pool vs max memory | File latency | Active vs max workers | Space used |
| Saturation  | SOS_SCHEDULER_YIELD, runnable tasks | RESOURCE_SEMAPHORE waits, grant queuing | PAGEIOLATCH waits | Thread exhaustion | PAGELATCH allocation waits |
| Errors      | Non-yielding schedulers | OOM, memory broker shrinks | 15-second I/O warnings | Timeout errors | Out-of-space |
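The matrix above can be encoded as a simple lookup structure. A minimal sketch — the resource, dimension, and metric names come from the table, but the structure itself is illustrative, not PerformanceMonitor's actual data model:

```python
# Sketch: the USE matrix as a lookup table. Names follow the table above;
# the dict shape and helper are assumptions for illustration only.
USE_MATRIX = {
    "CPU": {
        "utilization": ["CPU %"],
        "saturation": ["SOS_SCHEDULER_YIELD", "runnable tasks"],
        "errors": ["Non-yielding schedulers"],
    },
    "Memory": {
        "utilization": ["Buffer pool vs max memory"],
        "saturation": ["RESOURCE_SEMAPHORE waits", "grant queuing"],
        "errors": ["OOM", "memory broker shrinks"],
    },
    "Disk I/O": {
        "utilization": ["File latency"],
        "saturation": ["PAGEIOLATCH waits"],
        "errors": ["15-second I/O warnings"],
    },
    "Workers": {
        "utilization": ["Active vs max workers"],
        "saturation": ["Thread exhaustion"],
        "errors": ["Timeout errors"],
    },
    "TempDB": {
        "utilization": ["Space used"],
        "saturation": ["PAGELATCH allocation waits"],
        "errors": ["Out-of-space"],
    },
}

def metrics_for(resource: str, dimension: str) -> list[str]:
    """Look up which collected metrics cover a given USE cell."""
    return USE_MATRIX[resource][dimension]
```

Keeping the matrix as data rather than prose also makes the "checklist for completeness" idea below mechanically checkable.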

RED Method (Tom Wilkie) — for the database as a service

| Signal   | SQL Server Metric |
|----------|-------------------|
| Rate     | Batch Requests/sec (perfmon) |
| Errors   | Failed queries, deadlocks, timeouts, connection errors |
| Duration | Query execution time distribution |

Four Golden Signals (Google SRE) — the combined view

| Signal     | SQL Server Metric |
|------------|-------------------|
| Latency    | Query duration (distinguish successful vs error latency) |
| Traffic    | Batch Requests/sec, Transactions/sec |
| Errors     | Deadlocks, timeouts, failed queries |
| Saturation | Wait stats (the entire wait statistics framework is fundamentally a saturation measurement system) |

Where This Applies

Analysis Engine Output (both Dashboard and Lite)

The analyze_server inference engine produces findings with severity and evidence. These findings could be tagged or grouped by framework category:

  • "CPU Saturation (USE): SOS_SCHEDULER_YIELD elevated, runnable task queue depth > 10"
  • "Latency (Golden Signals): p95 query duration increased 3x vs baseline"
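A minimal sketch of such tagging. The `Finding` record and its field names are assumptions for illustration, not the inference engine's real types; only the "Category (Framework): evidence" headline style follows the examples above:

```python
from dataclasses import dataclass

# Hypothetical finding record; field names are illustrative, not the
# actual analyze_server output schema.
@dataclass
class Finding:
    framework: str   # "USE", "RED", or "Golden Signals"
    category: str    # e.g. "CPU Saturation", "Latency"
    evidence: str
    severity: str

    def headline(self) -> str:
        # Renders in the "Category (Framework): evidence" style shown above.
        return f"{self.category} ({self.framework}): {self.evidence}"

f = Finding(
    framework="USE",
    category="CPU Saturation",
    evidence="SOS_SCHEDULER_YIELD elevated, runnable task queue depth > 10",
    severity="warning",
)
print(f.headline())
```

Because the tag is a structured field rather than prose, the same finding can be grouped by framework, by resource, or by severity without re-parsing headlines.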

Progressive Summary View (issue #689)

The server landing summary could organize its ranked problems using USE categories for resource issues and RED/Golden Signals for workload issues. This gives users a mental model, not just a list.

MCP Tool Output

The analyze_server and get_analysis_findings MCP tools could include framework tags in their output, making AI-assisted investigation framework-aware.
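One possible shape for a framework-tagged finding in tool output — a sketch only; the actual `analyze_server` / `get_analysis_findings` response schema may differ, and every key name here is an assumption:

```python
import json

# Hypothetical framework-tagged finding; key names are illustrative,
# not the real MCP tool output schema.
finding = {
    "id": "cpu-saturation-001",
    "severity": "warning",
    "framework_tags": [
        {"framework": "USE", "resource": "CPU", "dimension": "saturation"},
        {"framework": "Golden Signals", "signal": "Saturation"},
    ],
    "evidence": "SOS_SCHEDULER_YIELD elevated vs baseline",
}
print(json.dumps(finding, indent=2))
```

A single finding can legitimately carry tags from more than one framework, since USE saturation and the Golden Signals' Saturation often describe the same symptom from different vantage points.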

Design Notes

  • This is primarily a framing/presentation change, not a new data collection effort
  • All underlying data already exists in current collectors
  • The frameworks provide a checklist for completeness: if the analysis engine never reports on "Memory Errors," that's a gap worth noticing
  • The vocabulary is widely taught in SRE and DevOps contexts — using it makes PerformanceMonitor's output portable to incident reports, postmortems, and runbooks
  • Applies to both Dashboard and Lite, plus MCP tool output
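The completeness checklist in the notes above can be mechanized. A sketch, assuming findings (or analysis rules) carry `(resource, dimension)` tags — the tagging scheme is an assumption, not an existing feature:

```python
# Sketch: detect USE cells no analysis rule ever reports on.
# The (resource, dimension) tag tuples are an assumed labeling scheme.
ALL_CELLS = {
    (resource, dimension)
    for resource in ("CPU", "Memory", "Disk I/O", "Workers", "TempDB")
    for dimension in ("utilization", "saturation", "errors")
}

def coverage_gaps(reported: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Return USE cells with no finding rule behind them."""
    return ALL_CELLS - reported

# Example: an engine that covers everything except memory errors.
reported = ALL_CELLS - {("Memory", "errors")}
print(coverage_gaps(reported))  # {('Memory', 'errors')}
```

Running this against the rule catalog at build time would turn "that's a gap worth noticing" into a failing check rather than a reviewer's hunch.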
