
feat(tracking): add score trend slope to get_test_stats() for monotonic drift detection #148

@nanookclaw

Description

Problem

RegressionTracker.get_test_stats() returns aggregate statistics (avg, min, max, pass_rate) for a test over N days. But it cannot answer: "Is this test's score consistently getting worse over the last 10 runs?"

A test can have:

  • avg=0.72, min=0.65, max=0.80 (looks stable)
  • But scores ordered chronologically: 0.80, 0.77, 0.74, 0.71, 0.68 (monotonic decline)

Min, max, and avg miss the direction entirely. A monotonic regression hiding behind "avg=0.72" is the silent killer in long-running agent deployments.

Proposed Solution

Add ScoreTrend to the stats output using OLS linear regression (stdlib only — no new deps):

from statistics import linear_regression
from dataclasses import dataclass

@dataclass
class ScoreTrend:
    slope: float           # score change per run (negative = degrading)
    direction: str         # "improving" | "worsening" | "stable"
    significant: bool      # |slope| > threshold (e.g. 0.01/run)
    run_count: int         # number of data points used

def _compute_trend(scores_ordered: list[float], threshold: float = 0.01) -> ScoreTrend:
    """Compute OLS trend over chronologically ordered scores."""
    if len(scores_ordered) < 3:
        return ScoreTrend(slope=0.0, direction="stable", significant=False, run_count=len(scores_ordered))
    xs = list(range(len(scores_ordered)))
    slope, _ = linear_regression(xs, scores_ordered)
    direction = "improving" if slope > threshold else ("worsening" if slope < -threshold else "stable")
    return ScoreTrend(slope=round(slope, 4), direction=direction, significant=abs(slope) > threshold, run_count=len(scores_ordered))

Updated get_test_stats() return:

return {
    ...existing fields...,
    "score": {
        "current": scores[0],
        "avg": ...,
        "min": ...,
        "max": ...,
        "trend": {
            "slope": trend.slope,
            "direction": trend.direction,   # "worsening" | "improving" | "stable"
            "significant": trend.significant,
            "run_count": trend.run_count,
        }
    }
}

Note: history in get_test_stats() is returned newest-first from db.get_test_history(). Scores should be reversed before regression (oldest→newest) for correct slope sign.

CI Integration

This enables a new failure mode in CI:

if [ "$(evalview check --json | jq -r '.score.trend.direction')" = "worsening" ]; then
  echo "⚠️ Score declining — investigate before merge"
  exit 1
fi

Scope

  • Modify get_test_stats() in evalview/tracking/regression.py
  • Add ScoreTrend dataclass (5 lines)
  • Add _compute_trend() helper (10 lines)
  • Zero new dependencies (statistics.linear_regression is Python 3.10+ stdlib)
  • Non-breaking: new trend key added to existing score dict
  • 2-3 new tests in tests/tracking/

Reference

This gap is documented across multiple agent evaluation/observability tools in PDR in Production (DOI: 10.5281/zenodo.19362461) — the pattern appears in every tool that captures per-run snapshots without longitudinal slope analysis.
