feat: explicit Brier score variants for 2×2 evaluation matrix #18
Polichinel merged 2 commits into development
Replace Brier_sample/Brier_point with three task-explicit variants:

- Brier_cls_point: classification point (y_pred is probability)
- Brier_cls_sample: classification sample (average MC Dropout probabilities)
- Brier_rgs_sample: regression sample (binarise count samples at threshold)

Brier_rgs_point intentionally omitted: regression point estimates are not probabilities. The _cls_/_rgs_ infix makes the task context self-documenting.

Critical fix: Brier_cls_sample uses mean(y_pred) instead of mean(y_pred > threshold), which was broken for probability samples (all probabilities > 0 → p_hat ≈ 1.0).

Profile defaults: classification Brier threshold=0.0 (PDF: event y > 0), regression Brier threshold=1.0 (binarise at 1 fatality).

Catalog size 24 → 25.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR refactors Brier score handling in views_evaluation to make the metric names explicit across the (task × prediction-type) matrix, and corrects classification-sample Brier computation for probability-sample inputs (e.g., MC Dropout).
Changes:
- Replaces `Brier_sample`/`Brier_point` with `Brier_cls_point`, `Brier_cls_sample`, and `Brier_rgs_sample` across the metric catalog, memberships, profiles, and result dataclasses.
- Fixes `Brier_cls_sample` to compute event probability as `mean(y_pred)` (probability samples) rather than `mean(y_pred > threshold)`.
- Updates tests and CIC documentation to reflect the renamed/added metrics and updated registry size.
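For illustration, the three variants and the effect of the fix can be sketched in plain Python. This is a minimal sketch, not the code in `native_metric_calculators.py`: the function names, the signatures, the per-observation list-of-samples layout, and the binarisation of `y_true` with `y > threshold` are all assumptions made for the example.

```python
from statistics import fmean

def _events(y_true, threshold):
    # Observed binary event: 1.0 where the outcome exceeds the threshold.
    return [1.0 if y > threshold else 0.0 for y in y_true]

def _brier(p_hat, events):
    # Mean squared difference between predicted probability and observed event.
    return fmean([(p - o) ** 2 for p, o in zip(p_hat, events)])

def brier_cls_point(y_true, y_pred, threshold=0.0):
    # Classification point: y_pred already holds one probability per observation.
    return _brier(list(y_pred), _events(y_true, threshold))

def brier_cls_sample(y_true, y_pred_samples, threshold=0.0):
    # Classification sample: average the probability samples per observation.
    p_hat = [fmean(row) for row in y_pred_samples]
    return _brier(p_hat, _events(y_true, threshold))

def brier_rgs_sample(y_true, y_pred_samples, threshold=0.0):
    # Regression sample: share of count samples that exceed the threshold.
    p_hat = [fmean([1.0 if s > threshold else 0.0 for s in row])
             for row in y_pred_samples]
    return _brier(p_hat, _events(y_true, threshold))

# Hypothetical data: two observations, three probability samples each.
y_true = [0, 1]
prob_samples = [[0.1, 0.2, 0.3], [0.7, 0.8, 0.9]]

fixed = brier_cls_sample(y_true, prob_samples)   # p_hat = [0.2, 0.8]
legacy = brier_rgs_sample(y_true, prob_samples)  # binarising probabilities: p_hat = [1.0, 1.0]
```

With these inputs the averaged probabilities separate the two observations, whereas binarising the same probability samples at 0.0 assigns both a predicted probability of 1.0, which is the loss of discrimination the fix addresses.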
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| views_evaluation/profiles/hydranet_ucdp.py | Updates profile docstring to reflect inheritance behavior (incl. Brier params). |
| views_evaluation/profiles/base.py | Updates base-profile genome defaults for the new Brier metric variants. |
| views_evaluation/evaluation/native_metric_calculators.py | Introduces the three explicit native Brier calculators and fixes cls-sample computation. |
| views_evaluation/evaluation/metrics.py | Renames/adds dataclass fields to match new metric names. |
| views_evaluation/evaluation/metric_catalog.py | Registers new metric names/functions and updates task/pred-type memberships. |
| tests/test_native_evaluator.py | Updates evaluator integration tests for renamed Brier metrics and probability-sample semantics. |
| tests/test_metric_catalog.py | Updates registry snapshot size assertion (24 → 25). |
| tests/test_metric_calculators.py | Updates imports and expands/renames Brier golden/edge-case tests for new variants. |
| documentation/CICs/MetricCatalog.md | Updates contract doc with rename note and updated Brier/QS test counts. |
- The legacy dispatch dicts were removed in Phase 3. `METRIC_MEMBERSHIP` is the single source of truth.
- Profile structure is stable; new profiles are added by creating a new file in `profiles/`.
- Bounds validation added for probability/proportion parameters (2026-04-04, C-18): `alpha`, `quantile`, `lower_quantile`, `upper_quantile` must be in (0, 1). Cross-parameter validation for QIS quantile ordering.
- Explicit Brier variants added (2026-04-09): `Brier_sample`/`Brier_point` replaced by three task-explicit variants: `Brier_cls_point`, `Brier_cls_sample`, `Brier_rgs_sample`. The `_cls_`/`_rgs_` infix denotes the task context (classification vs. regression). `Brier_rgs_point` is intentionally omitted: a regression point estimate is not a probability. `Brier_cls_sample` averages probability samples (`mean(y_pred)`); `Brier_rgs_sample` binarises count samples (`mean(y_pred > threshold)`). Catalog size: 24 → 25.
This contract doc explains the Brier metric rename, but other documentation in the repo still references the legacy names (e.g. documentation/standards/physical_architecture_standard.md lists Brier_point/Brier_sample). Please update those references (or add an explicit note/link here) so the documentation set stays internally consistent after the breaking rename.
Suggested change:
- Explicit Brier variants added (2026-04-09): `Brier_sample`/`Brier_point` replaced by three task-explicit canonical variants: `Brier_cls_point`, `Brier_cls_sample`, `Brier_rgs_sample`. The `_cls_`/`_rgs_` infix denotes the task context (classification vs. regression). `Brier_rgs_point` is intentionally omitted: a regression point estimate is not a probability. `Brier_cls_sample` averages probability samples (`mean(y_pred)`); `Brier_rgs_sample` binarises count samples (`mean(y_pred > threshold)`). Legacy-name mapping: `Brier_point` => `Brier_cls_point`; `Brier_sample` => `Brier_cls_sample` for classification-sample contexts and `Brier_rgs_sample` for regression-sample contexts. Note: older repository documents may still mention `Brier_point`/`Brier_sample`; interpret those references using this mapping until all docs are updated. Catalog size: 24 → 25.
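The legacy-name mapping in the suggestion could be expressed as a small lookup, which may help when reading older documents or configs. This is only a sketch: `LEGACY_BRIER_NAMES` and `resolve_brier_name` are hypothetical names invented for the example and are not part of `views_evaluation`, and the task labels `"classification"`/`"regression"` are assumed spellings.

```python
# Hypothetical lookup: (legacy metric name, task) -> explicit metric name.
LEGACY_BRIER_NAMES = {
    ("Brier_point", "classification"): "Brier_cls_point",
    ("Brier_sample", "classification"): "Brier_cls_sample",
    ("Brier_sample", "regression"): "Brier_rgs_sample",
}

def resolve_brier_name(legacy_name, task):
    """Translate a legacy Brier metric name into its explicit variant."""
    try:
        return LEGACY_BRIER_NAMES[(legacy_name, task)]
    except KeyError:
        # Covers ("Brier_point", "regression"): Brier_rgs_point is
        # intentionally omitted, so there is nothing to map to.
        raise ValueError(
            f"no explicit Brier variant for {legacy_name!r} with task {task!r}"
        ) from None

resolved = resolve_brier_name("Brier_sample", "regression")
```

Raising on the regression-point combination mirrors the contract note that a regression point estimate is not a probability, instead of silently picking another variant.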
All three Brier variants now default to threshold=0.0 in the base profile, matching the Pre-Release Note 05 definition: Brier evaluates the binary event "any fatality occurred" (y > 0). On integer-valued UCDP data, y > 0 and y >= 1 are equivalent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
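The equivalence claimed here is easy to check on a few integer counts. The sketch below assumes binarisation uses a strict `>` comparison, as in the `mean(y_pred > threshold)` expression quoted in this PR; `binarise` and the sample counts are made up for the example.

```python
def binarise(values, threshold):
    # Strict comparison, matching the `y_pred > threshold` form used in the PR text.
    return [1 if v > threshold else 0 for v in values]

counts = [0, 1, 2, 7, 0, 13]  # hypothetical integer fatality counts

events_gt0 = binarise(counts, 0.0)                  # event: y > 0
events_ge1 = [1 if v >= 1 else 0 for v in counts]   # event: y >= 1
events_gt1 = binarise(counts, 1.0)                  # event: y > 1

# On integer data, y > 0 and y >= 1 mark the same observations.
same_event = events_gt0 == events_ge1
# Under a strict comparison, threshold=1.0 no longer counts y == 1 as an event.
differs_at_one = events_gt0 != events_gt1
```

Under that assumption, `threshold=0.0` is the setting that expresses "at least one fatality" on integer counts, while `threshold=1.0` would exclude observations with exactly one.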
Summary
- Replace `Brier_sample`/`Brier_point` with three task-explicit variants: `Brier_cls_point`, `Brier_cls_sample`, `Brier_rgs_sample`
- Fix `Brier_cls_sample` for MC Dropout probability samples: it now uses `mean(y_pred)` instead of `mean(y_pred > threshold)`, which destroyed discrimination
- Default all Brier variants to `threshold=0.0` (Pre-Release Note 05: event y > 0)
- Add `Brier_rgs_sample` to regression sample membership for the count-data binarisation use case

Test plan
🤖 Generated with Claude Code