PerformanceMonitor's inference engine uses rule-based analysis: collect facts, score them with amplifiers, and traverse relationship graphs to produce findings. This works well for known patterns but can miss novel problems. Statistical anomaly detection complements rule-based analysis by answering a different question: rules say "this pattern means X"; anomaly detection says "this metric is behaving unusually" — even when no rule exists for that specific situation.
In practice, combining rule-based analysis with statistical anomaly detection tends to outperform either approach alone.
How It Would Work
Anomaly detection runs as an additional scoring input to the existing analysis engine, not as a replacement:
Collect metric snapshots (already happening)
Compare current values against historical baselines (ties into dynamic baselines — issue tracking that separately)
Score the degree of deviation (z-score, percentile rank, or similar)
Feed anomaly scores into the inference engine as facts alongside the existing rule-based facts
Amplify rule-based findings when anomaly detection confirms the deviation is statistically unusual
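The steps above can be sketched in Python (the project's actual language may differ; the `AnomalyFact` shape and function names here are illustrative, not the real API):

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class AnomalyFact:
    """Hypothetical fact shape fed to the inference engine."""
    metric: str
    current: float
    baseline_mean: float
    z_score: float

def score_anomalies(snapshot, history):
    """Steps 2-4: compare each metric in the current snapshot against
    its historical baseline and emit anomaly facts."""
    facts = []
    for metric, value in snapshot.items():
        samples = history.get(metric, [])
        if len(samples) < 2:
            continue  # not enough history to form a baseline yet
        mu, sigma = mean(samples), stdev(samples)
        z = (value - mu) / sigma if sigma > 0 else 0.0
        facts.append(AnomalyFact(metric, value, mu, z))
    return facts
```

The resulting facts sit alongside rule-based facts, so existing amplifiers can consume them without a new pipeline.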
Example Flow
Rule detects: "CXPACKET waits are > 30% of total waits" (known pattern, scores Medium)
Anomaly detection adds: "CXPACKET waits are 4.2 standard deviations above the baseline for this time window" (statistically unusual)
Combined: The finding's severity is amplified because both the rule and the anomaly detector agree this is significant
Contrast: "CXPACKET waits are > 30% of total waits" but the anomaly score is low (this level is normal for this workload) → severity stays Medium or is dampened
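The amplify-or-dampen decision in this flow could look like the following sketch (severity names match the example; the z-score thresholds are illustrative assumptions):

```python
SEVERITIES = ["Low", "Medium", "High", "Critical"]

def combine_severity(rule_severity, z_score):
    """Amplify a rule-based finding when the anomaly detector agrees,
    dampen it when the deviation is normal for this workload."""
    idx = SEVERITIES.index(rule_severity)
    if abs(z_score) >= 3.0:        # statistically unusual: amplify one level
        idx = min(idx + 1, len(SEVERITIES) - 1)
    elif abs(z_score) < 1.0:       # within normal variation: dampen one level
        idx = max(idx - 1, 0)
    return SEVERITIES[idx]
```

With the numbers from the example, a Medium CXPACKET finding at 4.2σ is promoted, while the same finding at a low anomaly score is demoted.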
What to Detect Anomalies On
Metric-level anomalies (is this value unusual?)
CPU utilization
Wait stat proportions and absolute values
Batch requests/sec (sudden drops or spikes)
Query duration aggregates
Memory utilization
I/O latency
Session counts
Blocking/deadlock event counts
Pattern-level anomalies (is this combination unusual?)
CPU high + batch requests low = something is stuck (not just busy)
Wait profile shift: top wait type changed from SOS_SCHEDULER_YIELD to LCK_M_X
Query mix change: execution count distribution across databases shifted
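A pattern-level check like the wait-profile shift above needs no statistics at all in its simplest form — just a comparison of dominant wait types between snapshots (a sketch; real wait-stat snapshots would come from the collectors):

```python
def wait_profile_shift(previous_waits, current_waits):
    """Report if the dominant wait type changed between two snapshots
    of wait-type -> percentage-of-total-waits."""
    top_prev = max(previous_waits, key=previous_waits.get)
    top_curr = max(current_waits, key=current_waits.get)
    if top_prev != top_curr:
        return f"top wait shifted from {top_prev} to {top_curr}"
    return None  # dominant wait type unchanged
```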
Statistical Approaches (start simple)
Phase 1: Z-score against rolling baseline
For each metric, maintain a rolling mean and standard deviation (last 30 days, same hour-of-day, same day-of-week)
Current z-score = (current_value - mean) / std_dev
Flag anything beyond ±3σ as anomalous
This is simple, interpretable, and effective for most time-series metrics
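The Phase 1 scheme above, with per-(day-of-week, hour-of-day) buckets, can be sketched as follows (rolling-window pruning, e.g. dropping samples older than 30 days, is omitted for brevity):

```python
from collections import defaultdict
from statistics import mean, stdev

class RollingBaseline:
    """Maintains samples per (weekday, hour) bucket and scores new
    values against that bucket's mean and standard deviation."""
    def __init__(self):
        self.buckets = defaultdict(list)  # (weekday, hour) -> samples

    def add(self, weekday, hour, value):
        self.buckets[(weekday, hour)].append(value)

    def z_score(self, weekday, hour, value):
        samples = self.buckets[(weekday, hour)]
        if len(samples) < 2:
            return None  # no baseline yet for this time slot
        mu, sigma = mean(samples), stdev(samples)
        return (value - mu) / sigma if sigma > 0 else 0.0
```

Anything whose returned z-score exceeds ±3 would be flagged as anomalous, per the rule above.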
Phase 2 (future): More sophisticated methods
Seasonal decomposition (STL) for metrics with strong weekly patterns
Isolation Forest for detecting multivariate anomalies (unusual combinations of metrics)
CUSUM or exponentially weighted moving average for detecting gradual drifts
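Of the Phase 2 methods, CUSUM is small enough to sketch here — it catches gradual drifts that a point-in-time z-score misses (the `slack` and `threshold` parameters are illustrative and would need tuning per metric):

```python
def cusum_drift(values, target, slack=0.5, threshold=5.0):
    """One-sided CUSUM: return the index at which cumulative upward
    drift from `target` first exceeds `threshold`, else None.
    `slack` is the per-sample allowance before drift accumulates."""
    s = 0.0
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return i  # drift confirmed at this sample
    return None
```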
Integration Points
Analysis Engine (both Dashboard and Lite)
Add an `AnomalyScorer` that runs alongside existing fact collectors
Produce `AnomalyFact` objects with the metric name, current value, baseline, z-score, and severity
Existing amplifier system can use anomaly scores to boost or dampen finding severity
MCP Tools
`analyze_server` output could include anomaly context: "CPU utilization 92% (4.1σ above baseline for Tuesday 2pm)"
`get_analysis_facts` could expose anomaly scores alongside rule-based scores
New tool or parameter: `get_anomalies` to list all currently anomalous metrics
UI (both Dashboard and Lite)
Trend charts could mark anomalous data points (dot color change or marker)
Pairs with baseline bands (from dynamic baselines issue) — anomalies are the points outside the band
Design Notes
This is an enhancement to the existing engine architecture, not a new system
Phase 1 (z-score) requires only basic statistics — no ML libraries needed
The key value is combining anomaly scores with rule-based findings, not replacing rules
False positive management: anomaly detection will flag things that are unusual but not problematic. The rule engine provides the "is this actually bad?" judgment. Anomalies without matching rules should be surfaced at lower severity (informational)
Requires sufficient historical data (2-4 weeks minimum) — same constraint as dynamic baselines
Applies to both Dashboard and Lite, plus MCP analysis tools
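The false-positive rule in the notes above — anomalies without a matching rule finding drop to informational — could be applied as a final triage pass (field names are hypothetical):

```python
def triage_anomalies(anomalies, rule_metrics):
    """Downgrade any anomaly whose metric has no matching rule finding
    to 'Informational' severity; leave confirmed anomalies unchanged."""
    return [
        dict(a, severity=a["severity"] if a["metric"] in rule_metrics
                         else "Informational")
        for a in anomalies
    ]
```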