Merged
6 changes: 4 additions & 2 deletions documentation/CICs/MetricCatalog.md
Original file line number Diff line number Diff line change
@@ -113,6 +113,7 @@ params = resolve_metric_params("MSE", {}, BASE_PROFILE)
- **Red:** `tests/test_metric_catalog.py` — unknown metrics, unimplemented metrics, missing params, None values, unknown overrides.
- **Red (bounds):** `tests/test_metric_catalog.py::TestResolveMetricParamsBoundsRed` — 7 tests for out-of-range alpha/quantile and crossed QIS quantiles.
- **Correctness:** `tests/test_metric_calculators.py::TestGoldenValues` — 17 golden-value tests for all implemented metrics.
- **Correctness (Brier/QS):** `tests/test_metric_calculators.py::TestBrierScore` + `TestQuantileScore` — 10 golden-value tests for the 3 Brier variants and 2 QS variants.

---

@@ -122,14 +123,15 @@ params = resolve_metric_params("MSE", {}, BASE_PROFILE)
- The legacy dispatch dicts were removed in Phase 3. `METRIC_MEMBERSHIP` is the single source of truth.
- Profile structure is stable; new profiles are added by creating a new file in `profiles/`.
- Bounds validation added for probability/proportion parameters (2026-04-04, C-18): `alpha`, `quantile`, `lower_quantile`, `upper_quantile` must be in (0, 1). Cross-parameter validation for QIS quantile ordering.
- Explicit Brier variants added (2026-04-09): `Brier_sample`/`Brier_point` replaced by three task-explicit variants: `Brier_cls_point`, `Brier_cls_sample`, `Brier_rgs_sample`. The `_cls_`/`_rgs_` infix denotes the task context (classification vs. regression). `Brier_rgs_point` is intentionally omitted — a regression point estimate is not a probability. `Brier_cls_sample` averages probability samples (`mean(y_pred)`); `Brier_rgs_sample` binarises count samples (`mean(y_pred > threshold)`). Catalog size: 24 → 25.
Copilot AI (Apr 9, 2026):
This contract doc explains the Brier metric rename, but other documentation in the repo still references the legacy names (e.g. documentation/standards/physical_architecture_standard.md lists Brier_point/Brier_sample). Please update those references (or add an explicit note/link here) so the documentation set stays internally consistent after the breaking rename.

Suggested change
- Explicit Brier variants added (2026-04-09): `Brier_sample`/`Brier_point` replaced by three task-explicit variants: `Brier_cls_point`, `Brier_cls_sample`, `Brier_rgs_sample`. The `_cls_`/`_rgs_` infix denotes the task context (classification vs. regression). `Brier_rgs_point` is intentionally omitted — a regression point estimate is not a probability. `Brier_cls_sample` averages probability samples (`mean(y_pred)`); `Brier_rgs_sample` binarises count samples (`mean(y_pred > threshold)`). Catalog size: 24 → 25.
- Explicit Brier variants added (2026-04-09): `Brier_sample`/`Brier_point` replaced by three task-explicit canonical variants: `Brier_cls_point`, `Brier_cls_sample`, `Brier_rgs_sample`. The `_cls_`/`_rgs_` infix denotes the task context (classification vs. regression). `Brier_rgs_point` is intentionally omitted — a regression point estimate is not a probability. `Brier_cls_sample` averages probability samples (`mean(y_pred)`); `Brier_rgs_sample` binarises count samples (`mean(y_pred > threshold)`). Legacy-name mapping: `Brier_point` => `Brier_cls_point`; `Brier_sample` => `Brier_cls_sample` for classification-sample contexts and `Brier_rgs_sample` for regression-sample contexts. Note: older repository documents may still mention `Brier_point`/`Brier_sample`; interpret those references using this mapping until all docs are updated. Catalog size: 24 → 25.

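The three variants described in the changelog bullet differ only in how the predicted probability `p_hat` is formed before the squared-error step. A minimal numpy sketch following those formulas and the golden values in `tests/test_metric_calculators.py` (the real calculators live in `views_evaluation` and may differ in signature and validation):

```python
import numpy as np

def brier_cls_point(y_true, y_pred, threshold):
    """Classification point: the single predicted probability is p_hat directly."""
    y_binary = (y_true > threshold).astype(float)
    p_hat = y_pred[:, 0]  # y_pred has shape (N, 1)
    return float(np.mean((p_hat - y_binary) ** 2))

def brier_cls_sample(y_true, y_pred, threshold):
    """Classification sample: average the ensemble of probability samples."""
    y_binary = (y_true > threshold).astype(float)
    p_hat = y_pred.mean(axis=1)  # mean(y_pred), samples already in [0, 1]
    return float(np.mean((p_hat - y_binary) ** 2))

def brier_rgs_sample(y_true, y_pred, threshold):
    """Regression sample: binarise count samples against the threshold."""
    y_binary = (y_true > threshold).astype(float)
    p_hat = (y_pred > threshold).mean(axis=1)  # mean(y_pred > threshold)
    return float(np.mean((p_hat - y_binary) ** 2))
```

On the hand-computed cases from the test suite, these reproduce 11/300, 0.0225, and 1/6 respectively; note that no `brier_rgs_point` exists, since a regression point estimate is not a probability.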

---

## 12. Known Deviations

- **No profile completeness validation:** There is no mechanism to verify that a profile provides values for all metrics with non-empty genomes. A profile missing a metric's params will only fail at evaluation time, not at profile registration.
- **Golden-value coverage complete:** 17 tests in `tests/test_metric_calculators.py::TestGoldenValues` plus 8 Brier/QS golden-value tests cover all implemented metrics (C-07 closed 2026-04-02).
- **Breaking rename:** The legacy `Brier` metric (unimplemented placeholder) was replaced by `Brier_sample` and `Brier_point` (implemented). The field in `ClassificationSampleEvaluationMetrics` was renamed from `Brier` to `Brier_sample`. External consumers accessing `.Brier` on classification sample results must update to `.Brier_sample`.
- **Golden-value coverage complete:** 17 tests in `tests/test_metric_calculators.py::TestGoldenValues` plus 10 Brier/QS golden-value tests cover all implemented metrics (C-07 closed 2026-04-02).
- **Breaking rename (2026-04-09):** `Brier_sample` and `Brier_point` were replaced by `Brier_cls_point`, `Brier_cls_sample`, and `Brier_rgs_sample`. Dataclass fields in `ClassificationPointEvaluationMetrics`, `ClassificationSampleEvaluationMetrics`, and `RegressionSampleEvaluationMetrics` were renamed/added accordingly. External consumers accessing `.Brier_sample` or `.Brier_point` must update.
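The consumer-facing effect of the rename can be sketched with a stand-in dataclass (field names from this doc; the real `ClassificationSampleEvaluationMetrics` in `views_evaluation/evaluation/metrics.py` carries more fields):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClassificationSampleEvaluationMetrics:  # stand-in; real class also has CRPS, twCRPS, Jeffreys
    Brier_cls_sample: Optional[float] = None  # was `Brier_sample` before 2026-04-09

m = ClassificationSampleEvaluationMetrics(Brier_cls_sample=0.0225)
value = m.Brier_cls_sample  # new field name
# m.Brier_sample now raises AttributeError — external consumers must update
```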

---

106 changes: 70 additions & 36 deletions tests/test_metric_calculators.py
@@ -15,8 +15,9 @@
calculate_mean_interval_score_native,
calculate_mtd_native,
calculate_mcr_native,
calculate_brier_sample_native,
calculate_brier_point_native,
calculate_brier_cls_point_native,
calculate_brier_cls_sample_native,
calculate_brier_rgs_sample_native,
calculate_qs_sample_native,
calculate_qs_point_native,
)
@@ -145,14 +146,14 @@ def test_metric_membership_classification_point():
"""METRIC_MEMBERSHIP contains expected classification point metrics."""
members = METRIC_MEMBERSHIP[("classification", "point")]
assert "AP" in members
assert "Brier_point" in members
assert "Brier_cls_point" in members
assert "RMSLE" not in members


def test_metric_membership_classification_sample():
"""METRIC_MEMBERSHIP contains expected classification sample metrics."""
members = METRIC_MEMBERSHIP[("classification", "sample")]
for m in ["CRPS", "twCRPS", "Brier_sample", "Jeffreys"]:
for m in ["CRPS", "twCRPS", "Brier_cls_sample", "Jeffreys"]:
assert m in members
assert "RMSLE" not in members

@@ -579,38 +580,55 @@ def test_qis_symmetric_equals_mis(self):

class TestBrierScore:

def test_brier_sample_golden_value(self):
"""Hand-computed Brier sample: threshold=1, mixed binary outcomes."""
def test_brier_rgs_sample_golden_value(self):
"""Hand-computed Brier rgs_sample: threshold=1, mixed binary outcomes."""
y_true = np.array([0.0, 2.0, 5.0])
y_pred = np.array([[0.5, 1.5], [0.5, 1.5], [4.0, 6.0]])
# y_binary = [0, 1, 1] (0 < 1, 2 > 1, 5 > 1)
# p_hat = [0.5, 0.5, 1.0] (fraction of ensemble > threshold)
# Brier = mean([(0.5-0)^2, (0.5-1)^2, (1.0-1)^2]) = mean([0.25, 0.25, 0]) = 1/6
result = calculate_brier_sample_native(y_true, y_pred, threshold=1.0)
result = calculate_brier_rgs_sample_native(y_true, y_pred, threshold=1.0)
assert result == pytest.approx(1.0 / 6.0, abs=1e-10)

def test_brier_point_golden_value(self):
"""Hand-computed Brier point: threshold=1, probabilities vs binary outcomes."""
def test_brier_cls_point_golden_value(self):
"""Hand-computed Brier cls_point: threshold=1, probabilities vs binary outcomes."""
y_true = np.array([0.0, 2.0, 5.0])
y_pred = np.array([[0.1], [0.7], [0.9]])
# y_binary = [0, 1, 1]
# p_hat = [0.1, 0.7, 0.9] (point prediction as probability)
# Brier = mean([(0.1-0)^2, (0.7-1)^2, (0.9-1)^2]) = mean([0.01, 0.09, 0.01]) = 11/300
result = calculate_brier_point_native(y_true, y_pred, threshold=1.0)
result = calculate_brier_cls_point_native(y_true, y_pred, threshold=1.0)
assert result == pytest.approx(11.0 / 300.0, abs=1e-10)

def test_brier_sample_perfect(self):
def test_brier_cls_sample_golden_value(self):
"""Hand-computed Brier cls_sample: average probability samples, threshold=0."""
y_true = np.array([1.0, 0.0])
y_pred = np.array([[0.9, 0.8], [0.1, 0.2]])
# y_binary = [1, 0] (1 > 0, 0 not > 0)
# p_hat = [mean(0.9, 0.8), mean(0.1, 0.2)] = [0.85, 0.15]
# Brier = mean([(0.85-1)^2, (0.15-0)^2]) = mean([0.0225, 0.0225]) = 0.0225
result = calculate_brier_cls_sample_native(y_true, y_pred, threshold=0.0)
assert result == pytest.approx(0.0225, abs=1e-10)

def test_brier_rgs_sample_perfect(self):
"""All above threshold, all ensemble members above → p_hat=1, y_binary=1, Brier=0."""
y_true = np.array([5.0, 10.0])
y_pred = np.array([[2.0, 3.0], [2.0, 3.0]])
result = calculate_brier_sample_native(y_true, y_pred, threshold=1.0)
result = calculate_brier_rgs_sample_native(y_true, y_pred, threshold=1.0)
assert result == pytest.approx(0.0, abs=1e-10)

def test_brier_point_perfect(self):
def test_brier_cls_point_perfect(self):
"""p_hat matches y_binary exactly → Brier=0."""
y_true = np.array([0.0, 2.0]) # binary=[0, 1] at threshold=1
y_pred = np.array([[0.0], [1.0]]) # perfect probability predictions
result = calculate_brier_point_native(y_true, y_pred, threshold=1.0)
result = calculate_brier_cls_point_native(y_true, y_pred, threshold=1.0)
assert result == pytest.approx(0.0, abs=1e-10)

def test_brier_cls_sample_perfect(self):
"""Perfect probability samples → Brier=0."""
y_true = np.array([1.0, 0.0])
y_pred = np.array([[1.0, 1.0], [0.0, 0.0]])
result = calculate_brier_cls_sample_native(y_true, y_pred, threshold=0.0)
assert result == pytest.approx(0.0, abs=1e-10)


@@ -757,40 +775,49 @@ def test_large_alpha(self):

class TestBrierScoreBeige:

def test_single_observation(self):
"""Brier handles N=1, S=1 without error."""
result = calculate_brier_sample_native(np.array([2.0]), np.array([[3.0]]), threshold=1.0)
def test_rgs_single_observation(self):
"""Brier rgs_sample handles N=1, S=1 without error."""
result = calculate_brier_rgs_sample_native(np.array([2.0]), np.array([[3.0]]), threshold=1.0)
assert np.isfinite(result)

def test_large_ensemble_stable(self):
"""Brier is stable with S=1000 samples."""
def test_rgs_large_ensemble_stable(self):
"""Brier rgs_sample is stable with S=1000 samples."""
rng = np.random.default_rng(42)
y_true = np.array([0.0, 5.0, 10.0])
y_pred = rng.normal(loc=y_true[:, None], scale=2.0, size=(3, 1000))
result = calculate_brier_sample_native(y_true, y_pred, threshold=1.0)
result = calculate_brier_rgs_sample_native(y_true, y_pred, threshold=1.0)
assert np.isfinite(result)
assert 0 <= result <= 1 # Brier is bounded [0, 1]

def test_threshold_at_exact_data_value(self):
def test_rgs_threshold_at_exact_data_value(self):
"""Threshold equals an observation — no crash."""
y_true = np.array([5.0, 5.0])
y_pred = np.array([[4.0, 6.0], [4.0, 6.0]])
result = calculate_brier_sample_native(y_true, y_pred, threshold=5.0)
result = calculate_brier_rgs_sample_native(y_true, y_pred, threshold=5.0)
assert np.isfinite(result)

def test_all_above_threshold(self):
def test_rgs_all_above_threshold(self):
"""All y_true above threshold — y_binary all 1, finite result."""
y_true = np.array([10.0, 20.0])
y_pred = np.array([[0.5, 1.5], [0.5, 1.5]])
result = calculate_brier_sample_native(y_true, y_pred, threshold=1.0)
result = calculate_brier_rgs_sample_native(y_true, y_pred, threshold=1.0)
assert np.isfinite(result)

def test_all_below_threshold(self):
def test_rgs_all_below_threshold(self):
"""All y_true below threshold — y_binary all 0, finite result."""
y_true = np.array([0.0, 0.5])
y_pred = np.array([[0.5, 1.5], [0.5, 1.5]])
result = calculate_brier_sample_native(y_true, y_pred, threshold=1.0)
result = calculate_brier_rgs_sample_native(y_true, y_pred, threshold=1.0)
assert np.isfinite(result)

def test_cls_sample_large_ensemble_stable(self):
"""Brier cls_sample is stable with S=1000 probability samples."""
rng = np.random.default_rng(42)
y_true = np.array([1.0, 0.0, 1.0])
y_pred = rng.beta(a=2, b=2, size=(3, 1000)) # probabilities in [0, 1]
result = calculate_brier_cls_sample_native(y_true, y_pred, threshold=0.0)
assert np.isfinite(result)
assert 0 <= result <= 1


class TestQuantileScoreBeige:
@@ -974,7 +1001,7 @@ def test_negative_y_true_valid(self):

class TestBrierScoreRed:

def test_nan_in_y_true_swallowed_by_comparison(self):
def test_rgs_nan_in_y_true_swallowed_by_comparison(self):
"""NaN in y_true is swallowed by '>' comparison (NaN > x → False).

Unlike arithmetic metrics, Brier's binarization step converts NaN to
@@ -984,24 +1011,31 @@ def test_nan_in_y_true_swallowed_by_comparison(self):
"""
y_true = np.array([np.nan, 1.0])
y_pred = np.array([[1.0], [1.0]])
result = calculate_brier_sample_native(y_true, y_pred, threshold=1.0)
# NaN is treated as below-threshold (False), so result is finite, not NaN
result = calculate_brier_rgs_sample_native(y_true, y_pred, threshold=1.0)
assert np.isfinite(result)

def test_nan_in_y_pred_swallowed_by_comparison(self):
def test_rgs_nan_in_y_pred_swallowed_by_comparison(self):
"""NaN in y_pred is swallowed by '>' comparison in p_hat computation."""
y_true = np.array([1.0, 1.0])
y_pred = np.array([[np.nan], [1.0]])
result = calculate_brier_sample_native(y_true, y_pred, threshold=1.0)
result = calculate_brier_rgs_sample_native(y_true, y_pred, threshold=1.0)
assert np.isfinite(result)

def test_negative_threshold_accepted(self):
"""Negative threshold is mathematically valid."""
def test_rgs_negative_threshold_accepted(self):
"""Negative threshold is mathematically valid for regression Brier."""
y_true = np.array([1.0, 2.0])
y_pred = np.array([[1.0, 2.0], [2.0, 3.0]])
result = calculate_brier_sample_native(y_true, y_pred, threshold=-5.0)
result = calculate_brier_rgs_sample_native(y_true, y_pred, threshold=-5.0)
assert np.isfinite(result)

def test_cls_sample_nan_in_y_pred_propagates(self):
"""NaN in probability samples propagates via mean() — not swallowed."""
y_true = np.array([1.0, 1.0])
y_pred = np.array([[np.nan], [0.5]])
result = calculate_brier_cls_sample_native(y_true, y_pred, threshold=0.0)
# mean([nan]) = nan, (nan - 1)^2 = nan → result is nan
assert np.isnan(result)


class TestQuantileScoreRed:

@@ -1040,11 +1074,11 @@ def test_crps_large_ensemble_values(self):
assert np.isfinite(result)
assert result >= 0

def test_brier_extreme_threshold(self):
def test_brier_rgs_extreme_threshold(self):
"""Threshold at 1e300: all values below → y_binary all 0, p_hat all 0, Brier = 0."""
y_true = np.array([1.0, 2.0])
y_pred = np.array([[0.5, 1.5], [0.5, 1.5]])
result = calculate_brier_sample_native(y_true, y_pred, threshold=1e300)
result = calculate_brier_rgs_sample_native(y_true, y_pred, threshold=1e300)
assert result == pytest.approx(0.0, abs=1e-10)

def test_coverage_tiny_ensemble_spread(self):
2 changes: 1 addition & 1 deletion tests/test_metric_catalog.py
@@ -386,7 +386,7 @@ def test_evaluator_rejects_unknown_profile(self):

def test_registry_snapshot_integrity(self):
"""Registries have expected sizes — catches accidental mutation or deletion."""
assert len(METRIC_CATALOG) == 24
assert len(METRIC_CATALOG) == 25
assert len(METRIC_MEMBERSHIP) == 4
assert len(PROFILES) >= 2

14 changes: 7 additions & 7 deletions tests/test_native_evaluator.py
@@ -289,11 +289,11 @@ def test_classification_target(self):
assert 'month100' in report.to_dict()['schemas']['month']

def test_classification_sample_brier(self):
"""Brier_sample and CRPS work for classification sample predictions."""
"""Brier_cls_sample and CRPS work for classification sample predictions."""
n = 6
ef = EvaluationFrame(
y_true=np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0]),
y_pred=np.random.default_rng(42).uniform(0, 2, size=(n, 20)),
y_pred=np.random.default_rng(42).uniform(0, 1, size=(n, 20)), # probability samples in [0,1]
identifiers={
'time': np.array([100, 100, 101, 101, 102, 102]),
'unit': np.array([1, 2, 1, 2, 1, 2]),
@@ -305,17 +305,17 @@ def test_classification_sample_brier(self):
config = {
'steps': [1, 2, 3],
'classification_targets': ['by_sb_best'],
'classification_sample_metrics': ['Brier_sample', 'CRPS'],
'classification_sample_metrics': ['Brier_cls_sample', 'CRPS'],
}
report = NativeEvaluator(config).evaluate(ef)
assert report.task == 'classification'
assert report.pred_type == 'sample'
d = report.to_dict()['schemas']
assert 'Brier_sample' in d['month']['month100']
assert 'Brier_cls_sample' in d['month']['month100']
assert 'CRPS' in d['month']['month100']

def test_classification_point_brier(self):
"""AP and Brier_point work together for classification point predictions."""
"""AP and Brier_cls_point work together for classification point predictions."""
n = 6
ef = EvaluationFrame(
y_true=np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0]),
@@ -331,12 +331,12 @@ def test_classification_point_brier(self):
config = {
'steps': [1, 2, 3],
'classification_targets': ['by_sb_best'],
'classification_point_metrics': ['AP', 'Brier_point'],
'classification_point_metrics': ['AP', 'Brier_cls_point'],
}
report = NativeEvaluator(config).evaluate(ef)
d = report.to_dict()['schemas']
assert 'AP' in d['step']['step01']
assert 'Brier_point' in d['step']['step01']
assert 'Brier_cls_point' in d['step']['step01']

def test_evaluate_twice_produces_identical_results(self):
"""NativeEvaluator is stateless — same input yields same output."""
18 changes: 10 additions & 8 deletions views_evaluation/evaluation/metric_catalog.py
@@ -31,8 +31,9 @@
calculate_coverage_native,
calculate_mean_interval_score_native,
calculate_ignorance_score_native,
calculate_brier_sample_native,
calculate_brier_point_native,
calculate_brier_cls_point_native,
calculate_brier_cls_sample_native,
calculate_brier_rgs_sample_native,
calculate_qs_sample_native,
calculate_qs_point_native,
calculate_sd_native,
@@ -91,9 +92,10 @@ class MetricSpec:
"QS_sample": MetricSpec(function=calculate_qs_sample_native, genome=("quantile",)),
"QS_point": MetricSpec(function=calculate_qs_point_native, genome=("quantile",)),

# ── Brier Score ───────────────────────────────────────────────────────
"Brier_sample": MetricSpec(function=calculate_brier_sample_native, genome=("threshold",)),
"Brier_point": MetricSpec(function=calculate_brier_point_native, genome=("threshold",)),
# ── Brier Score (3 explicit variants for the 2×2 matrix) ────────────
"Brier_cls_point": MetricSpec(function=calculate_brier_cls_point_native, genome=("threshold",)),
"Brier_cls_sample": MetricSpec(function=calculate_brier_cls_sample_native, genome=("threshold",)),
"Brier_rgs_sample": MetricSpec(function=calculate_brier_rgs_sample_native, genome=("threshold",)),

# ── Classification ────────────────────────────────────────────────────
"AP": MetricSpec(function=calculate_ap_native, genome=()),
Expand All @@ -107,9 +109,9 @@ class MetricSpec:
"SD", "pEMDiv", "Variogram"},
("regression", "sample"): {"CRPS", "twCRPS", "MIS", "QIS", "QS_sample",
"Coverage", "Ignorance",
"y_hat_bar", "MCR_sample"},
("classification", "point"): {"AP", "Brier_point"},
("classification", "sample"): {"CRPS", "twCRPS", "Brier_sample", "Jeffreys"},
"y_hat_bar", "MCR_sample", "Brier_rgs_sample"},
("classification", "point"): {"AP", "Brier_cls_point"},
("classification", "sample"): {"CRPS", "twCRPS", "Brier_cls_sample", "Jeffreys"},
}


15 changes: 8 additions & 7 deletions views_evaluation/evaluation/metrics.py
@@ -131,20 +131,21 @@ class RegressionSampleEvaluationMetrics(BaseEvaluationMetrics):
Coverage: Optional[float] = None
Ignorance: Optional[float] = None
y_hat_bar: Optional[float] = None
MCR_sample: Optional[float] = None
MCR_sample: Optional[float] = None
Brier_rgs_sample: Optional[float] = None


@dataclass
class ClassificationPointEvaluationMetrics(BaseEvaluationMetrics):
"""Metrics for classification targets evaluated with point (probability) predictions."""
AP: Optional[float] = None
Brier_point: Optional[float] = None
AP: Optional[float] = None
Brier_cls_point: Optional[float] = None


@dataclass
class ClassificationSampleEvaluationMetrics(BaseEvaluationMetrics):
"""Metrics for classification targets evaluated with sample-based predictions."""
CRPS: Optional[float] = None
twCRPS: Optional[float] = None
Brier_sample: Optional[float] = None
Jeffreys: Optional[float] = None
CRPS: Optional[float] = None
twCRPS: Optional[float] = None
Brier_cls_sample: Optional[float] = None
Jeffreys: Optional[float] = None