
feat(tracking): add score trend slope to get_test_stats() for monotonic drift detection #148

@nanookclaw

Description

Problem

RegressionTracker.get_test_stats() returns aggregate statistics (avg, min, max, pass_rate) for a test over N days. But it cannot answer: "Is this test's score consistently getting worse over the last 10 runs?"

A test can have:

  • avg=0.72, min=0.65, max=0.80 (looks stable)
  • But scores ordered chronologically: 0.80, 0.77, 0.74, 0.71, 0.68 (monotonic decline)

Min, max, and avg miss the direction entirely. A monotonic regression hiding behind "avg=0.72" is the silent killer in long-running agent deployments.

Proposed Solution

Add ScoreTrend to the stats output using OLS linear regression (stdlib only — no new deps):

from statistics import linear_regression
from dataclasses import dataclass

@dataclass
class ScoreTrend:
    slope: float           # score change per run (negative = degrading)
    direction: str         # "improving" | "worsening" | "stable"
    significant: bool      # |slope| > threshold (e.g. 0.01/run)
    run_count: int         # number of data points used

def _compute_trend(scores_ordered: list[float], threshold: float = 0.01) -> ScoreTrend:
    """Compute OLS trend over chronologically ordered scores."""
    if len(scores_ordered) < 3:
        return ScoreTrend(slope=0.0, direction="stable", significant=False, run_count=len(scores_ordered))
    xs = list(range(len(scores_ordered)))
    slope, _ = linear_regression(xs, scores_ordered)
    direction = "improving" if slope > threshold else ("worsening" if slope < -threshold else "stable")
    return ScoreTrend(slope=round(slope, 4), direction=direction, significant=abs(slope) > threshold, run_count=len(scores_ordered))

Updated get_test_stats() return:

return {
    ...existing fields...,
    "score": {
        "current": scores[0],
        "avg": ...,
        "min": ...,
        "max": ...,
        "trend": {
            "slope": trend.slope,
            "direction": trend.direction,   # "worsening" | "improving" | "stable"
            "significant": trend.significant,
            "run_count": trend.run_count,
        }
    }
}

Note: history in get_test_stats() is returned newest-first from db.get_test_history(). Scores should be reversed before regression (oldest→newest) for correct slope sign.

CI Integration

This enables a new failure mode in CI:

if [ "$(evalview check --json | jq -r '.score.trend.direction')" = "worsening" ]; then
  echo "⚠️ Score declining — investigate before merge"
  exit 1
fi

Scope

  • Modify get_test_stats() in evalview/tracking/regression.py
  • Add ScoreTrend dataclass (5 lines)
  • Add _compute_trend() helper (10 lines)
  • Zero new dependencies (statistics.linear_regression is Python 3.10+ stdlib)
  • Non-breaking: new trend key added to existing score dict
  • 2-3 new tests in tests/tracking/

Reference

This gap is documented across multiple agent evaluation/observability tools in PDR in Production (DOI: 10.5281/zenodo.19362461) — the pattern appears in every tool that captures per-run snapshots without longitudinal slope analysis.
