Problem
`RegressionTracker.get_test_stats()` returns aggregate statistics (avg, min, max, pass_rate) for a test over N days. But it cannot answer: "Is this test's score consistently getting worse over the last 10 runs?"
A test can have:
- avg=0.72, min=0.65, max=0.80 (looks stable)
- But scores ordered chronologically: 0.80, 0.77, 0.74, 0.71, 0.68 (monotonic decline)
The min/max/avg aggregates miss the direction entirely. A monotonic regression hiding behind "avg=0.72" is the silent killer in long-running agent deployments.
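A minimal sketch of the gap (score values are illustrative): two series with identical aggregates, only one of which is degrading.

```python
from statistics import mean

# Same five scores, two different chronological orderings (illustrative values)
declining = [0.80, 0.77, 0.74, 0.71, 0.68]  # monotonic decline, oldest → newest
noisy = [0.74, 0.80, 0.68, 0.77, 0.71]      # same values, no trend

# Aggregate stats are identical for both series...
assert mean(declining) == mean(noisy)
assert (min(declining), max(declining)) == (min(noisy), max(noisy))

# ...yet only one of them is steadily getting worse.
print(mean(declining), min(declining), max(declining))  # 0.74 0.68 0.8
```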
Proposed Solution
Add `ScoreTrend` to the stats output using OLS linear regression (stdlib only — no new deps):

```python
from dataclasses import dataclass
from statistics import linear_regression


@dataclass
class ScoreTrend:
    slope: float       # score change per run (negative = degrading)
    direction: str     # "improving" | "worsening" | "stable"
    significant: bool  # |slope| > threshold (e.g. 0.01/run)
    run_count: int     # number of data points used


def _compute_trend(scores_ordered: list[float], threshold: float = 0.01) -> ScoreTrend:
    """Compute OLS trend over chronologically ordered scores."""
    if len(scores_ordered) < 3:
        return ScoreTrend(slope=0.0, direction="stable", significant=False, run_count=len(scores_ordered))
    xs = list(range(len(scores_ordered)))
    slope, _ = linear_regression(xs, scores_ordered)
    direction = "improving" if slope > threshold else ("worsening" if slope < -threshold else "stable")
    return ScoreTrend(slope=round(slope, 4), direction=direction, significant=abs(slope) > threshold, run_count=len(scores_ordered))
```
Updated `get_test_stats()` return:

```python
return {
    # ...existing fields...
    "score": {
        "current": scores[0],
        "avg": ...,
        "min": ...,
        "max": ...,
        "trend": {
            "slope": trend.slope,
            "direction": trend.direction,  # "worsening" | "improving" | "stable"
            "significant": trend.significant,
            "run_count": trend.run_count,
        },
    },
}
```
Note: `history` in `get_test_stats()` is returned newest-first from `db.get_test_history()`. Scores must be reversed to oldest→newest before regression for a correct slope sign.
CI Integration
This enables a new failure mode in CI:
```sh
# jq needs -r so the string compares without JSON quotes;
# trend is nested under the score dict in the stats output.
if [ "$(evalview check --json | jq -r '.score.trend.direction')" = "worsening" ]; then
  echo "⚠️ Score declining — investigate before merge"
  exit 1
fi
```
Scope
- Modify `get_test_stats()` in `evalview/tracking/regression.py`
- Add `ScoreTrend` dataclass (5 lines)
- Add `_compute_trend()` helper (10 lines)
- Zero new dependencies (`statistics.linear_regression` is Python 3.10+ stdlib)
- Non-breaking: new `trend` key added to existing `score` dict
- 2-3 new tests in `tests/tracking/`
Reference
This gap is documented across multiple agent evaluation/observability tools in PDR in Production (DOI: 10.5281/zenodo.19362461) — the pattern appears in every tool that captures per-run snapshots without longitudinal slope analysis.