feat: explicit Brier score variants for 2×2 evaluation matrix by Polichinel · Pull Request #18 · views-platform/views-evaluation

Polichinel · 2026-04-09T06:34:45Z

Summary

Replaces Brier_sample/Brier_point with three task-explicit variants: Brier_cls_point, Brier_cls_sample, Brier_rgs_sample
Fixes broken Brier_cls_sample for MC Dropout probability samples: it now uses mean(y_pred) instead of mean(y_pred > threshold). The old expression destroyed discrimination because nearly every probability sample exceeds the threshold.
Sets classification Brier profile default to threshold=0.0 (Pre-Release Note 05: event y > 0)
Adds Brier_rgs_sample to regression sample membership for count-data binarisation use case
Updates CIC MetricCatalog.md with new naming convention, test counts, and breaking rename notice

Test plan

249 tests passing (net +4 new Brier golden-value and edge-case tests)
Registry snapshot integrity updated (24 → 25 metrics)
NativeEvaluator integration tests updated for new metric names
CIC documentation aligned with code changes

🤖 Generated with Claude Code

Replace Brier_sample/Brier_point with three task-explicit variants:

- Brier_cls_point: classification point (y_pred is probability)
- Brier_cls_sample: classification sample (average MC Dropout probabilities)
- Brier_rgs_sample: regression sample (binarise count samples at threshold)

Brier_rgs_point intentionally omitted: regression point estimates are not probabilities. The _cls_/_rgs_ infix makes the task context self-documenting.

Critical fix: Brier_cls_sample uses mean(y_pred) instead of mean(y_pred > threshold), which was broken for probability samples (all probabilities > 0 → p_hat ≈ 1.0).

Profile defaults: classification Brier threshold=0.0 (PDF: event y > 0), regression Brier threshold=1.0 (binarise at 1 fatality).

Catalog size 24 → 25.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
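To make the critical fix concrete, here is a minimal NumPy sketch of the failure mode. It is not the code from `native_metric_calculators.py`. The array shapes, the synthetic MC Dropout samples, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MC Dropout output: 4 observations x 200 probability samples.
# Two observations are clearly low-risk, two are clearly high-risk.
centres = np.array([0.05, 0.10, 0.85, 0.95])
noise = rng.normal(0.0, 0.02, size=(4, 200))
y_pred = np.clip(centres[:, None] + noise, 1e-6, 1 - 1e-6)

# Observed counts; the binary event is y > 0.
y_true = np.array([0, 0, 3, 12])
event = (y_true > 0).astype(float)

# Old behaviour: binarise the probability samples at threshold 0.0.
# Every probability is > 0, so p_hat collapses to 1.0 for all observations.
p_hat_old = (y_pred > 0.0).mean(axis=1)

# Fixed behaviour: average the probability samples directly.
p_hat_new = y_pred.mean(axis=1)

brier_old = float(np.mean((p_hat_old - event) ** 2))
brier_new = float(np.mean((p_hat_new - event) ** 2))

print("p_hat (old):", p_hat_old)
print("p_hat (new):", np.round(p_hat_new, 3))
print("Brier (old):", brier_old, "Brier (new):", round(brier_new, 4))
```

With the old expression every observation receives the same predicted probability, so the score depends only on the base rate of events and no longer on what the model predicted.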

Copilot

Pull request overview

This PR refactors Brier score handling in views_evaluation to make the metric names explicit across the (task × prediction-type) matrix, and corrects classification-sample Brier computation for probability-sample inputs (e.g., MC Dropout).

Changes:

Replaces Brier_sample / Brier_point with Brier_cls_point, Brier_cls_sample, and Brier_rgs_sample across the metric catalog, memberships, profiles, and result dataclasses.
Fixes Brier_cls_sample to compute event probability as mean(y_pred) (probability samples) rather than mean(y_pred > threshold).
Updates tests and CIC documentation to reflect the renamed/added metrics and updated registry size.
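For orientation, the three variants can be sketched as follows. This is a sketch only: the function names, signatures, and the `(n_obs, n_samples)` array layout are assumptions, not the repository's actual calculators. It also assumes that `threshold` binarises `y_true` in every variant and binarises `y_pred` only in the regression-sample variant.

```python
import numpy as np


def _event(y_true, threshold):
    # Binary event indicator derived from the observed values (assumed: y > threshold).
    return (np.asarray(y_true, dtype=float) > threshold).astype(float)


def brier_cls_point(y_true, y_pred, threshold=0.0):
    # Classification point: y_pred already holds one probability per observation.
    p_hat = np.asarray(y_pred, dtype=float)
    return float(np.mean((p_hat - _event(y_true, threshold)) ** 2))


def brier_cls_sample(y_true, y_pred, threshold=0.0):
    # Classification sample: y_pred holds probability samples, shape (n_obs, n_samples).
    # The event probability is the mean of the samples.
    p_hat = np.asarray(y_pred, dtype=float).mean(axis=1)
    return float(np.mean((p_hat - _event(y_true, threshold)) ** 2))


def brier_rgs_sample(y_true, y_pred, threshold=0.0):
    # Regression sample: y_pred holds count samples, shape (n_obs, n_samples).
    # The event probability is the share of samples above the threshold.
    p_hat = (np.asarray(y_pred, dtype=float) > threshold).mean(axis=1)
    return float(np.mean((p_hat - _event(y_true, threshold)) ** 2))


y_true = [0, 2]
print(brier_cls_point(y_true, [0.1, 0.8]))
print(brier_cls_sample(y_true, [[0.0, 0.2], [0.7, 0.9]]))
print(brier_rgs_sample(y_true, [[0, 0, 0, 1], [0, 2, 5, 1]]))
```

No `brier_rgs_point` appears in the sketch, mirroring the PR: a single regression point estimate gives no event probability to score.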

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| views_evaluation/profiles/hydranet_ucdp.py | Updates profile docstring to reflect inheritance behavior (incl. Brier params). |
| views_evaluation/profiles/base.py | Updates base-profile genome defaults for the new Brier metric variants. |
| views_evaluation/evaluation/native_metric_calculators.py | Introduces the three explicit native Brier calculators and fixes cls-sample computation. |
| views_evaluation/evaluation/metrics.py | Renames/adds dataclass fields to match new metric names. |
| views_evaluation/evaluation/metric_catalog.py | Registers new metric names/functions and updates task/pred-type memberships. |
| tests/test_native_evaluator.py | Updates evaluator integration tests for renamed Brier metrics and probability-sample semantics. |
| tests/test_metric_catalog.py | Updates registry snapshot size assertion (24 → 25). |
| tests/test_metric_calculators.py | Updates imports and expands/renames Brier golden/edge-case tests for new variants. |
| documentation/CICs/MetricCatalog.md | Updates contract doc with rename note and updated Brier/QS test counts. |

Copilot · 2026-04-09T06:42:35Z

documentation/CICs/MetricCatalog.md

 - The legacy dispatch dicts were removed in Phase 3. `METRIC_MEMBERSHIP` is the single source of truth.
 - Profile structure is stable; new profiles are added by creating a new file in `profiles/`.
 - Bounds validation added for probability/proportion parameters (2026-04-04, C-18): `alpha`, `quantile`, `lower_quantile`, `upper_quantile` must be in (0, 1). Cross-parameter validation for QIS quantile ordering.
+- Explicit Brier variants added (2026-04-09): `Brier_sample`/`Brier_point` replaced by three task-explicit variants: `Brier_cls_point`, `Brier_cls_sample`, `Brier_rgs_sample`. The `_cls_`/`_rgs_` infix denotes the task context (classification vs. regression). `Brier_rgs_point` is intentionally omitted — a regression point estimate is not a probability. `Brier_cls_sample` averages probability samples (`mean(y_pred)`); `Brier_rgs_sample` binarises count samples (`mean(y_pred > threshold)`). Catalog size: 24 → 25.


This contract doc explains the Brier metric rename, but other documentation in the repo still references the legacy names (e.g. documentation/standards/physical_architecture_standard.md lists Brier_point/Brier_sample). Please update those references (or add an explicit note/link here) so the documentation set stays internally consistent after the breaking rename.

Suggested change

- Explicit Brier variants added (2026-04-09): `Brier_sample`/`Brier_point` replaced by three task-explicit variants: `Brier_cls_point`, `Brier_cls_sample`, `Brier_rgs_sample`. The `_cls_`/`_rgs_` infix denotes the task context (classification vs. regression). `Brier_rgs_point` is intentionally omitted — a regression point estimate is not a probability. `Brier_cls_sample` averages probability samples (`mean(y_pred)`); `Brier_rgs_sample` binarises count samples (`mean(y_pred > threshold)`). Catalog size: 24 → 25.

- Explicit Brier variants added (2026-04-09): `Brier_sample`/`Brier_point` replaced by three task-explicit canonical variants: `Brier_cls_point`, `Brier_cls_sample`, `Brier_rgs_sample`. The `_cls_`/`_rgs_` infix denotes the task context (classification vs. regression). `Brier_rgs_point` is intentionally omitted — a regression point estimate is not a probability. `Brier_cls_sample` averages probability samples (`mean(y_pred)`); `Brier_rgs_sample` binarises count samples (`mean(y_pred > threshold)`). Legacy-name mapping: `Brier_point` => `Brier_cls_point`; `Brier_sample` => `Brier_cls_sample` for classification-sample contexts and `Brier_rgs_sample` for regression-sample contexts. Note: older repository documents may still mention `Brier_point`/`Brier_sample`; interpret those references using this mapping until all docs are updated. Catalog size: 24 → 25.
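The legacy-name mapping in the suggestion could be applied mechanically when migrating configs or older documents. The helper below is hypothetical (nothing like it is shown in the PR) and assumes the caller knows the task context of each legacy reference.

```python
# Hypothetical migration helper; the name and signature are illustrative only.
_LEGACY_BRIER_MAP = {
    ("Brier_point", "classification"): "Brier_cls_point",
    ("Brier_sample", "classification"): "Brier_cls_sample",
    ("Brier_sample", "regression"): "Brier_rgs_sample",
}


def migrate_brier_name(legacy_name: str, task: str) -> str:
    """Map a legacy Brier metric name to its task-explicit replacement.

    `task` must be "classification" or "regression". There is no entry for
    ("Brier_point", "regression") because Brier_rgs_point is intentionally
    omitted, so that combination raises.
    """
    try:
        return _LEGACY_BRIER_MAP[(legacy_name, task)]
    except KeyError:
        raise ValueError(
            f"No replacement for {legacy_name!r} in task context {task!r}"
        ) from None


print(migrate_brier_name("Brier_sample", "regression"))  # Brier_rgs_sample
```

Raising on the regression-point combination, rather than silently picking a fallback, keeps the omission of `Brier_rgs_point` visible during migration.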

All three Brier variants now default to threshold=0.0 in the base profile, matching the Pre-Release Note 05 definition: Brier evaluates the binary event "any fatality occurred" (y > 0). On integer-valued UCDP data, y > 0 and y >= 1 are equivalent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
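The equivalence claimed in this commit message is easy to check. The sketch below assumes the strict `>` comparison that appears in `mean(y_pred > threshold)`; the `binarise` helper and the sample values are illustrative only.

```python
import numpy as np


def binarise(y, threshold):
    # Strict comparison, matching the mean(y_pred > threshold) expression.
    return np.asarray(y) > threshold


counts = np.array([0, 1, 2, 7, 130])

# On integer counts, y > 0 and y >= 1 select the same observations.
same_on_integers = np.array_equal(binarise(counts, 0.0), counts >= 1)  # True

# With a strict comparison, threshold=1.0 excludes a count of exactly 1,
# whereas threshold=0.0 includes it.
one_fatality_at_1 = bool(binarise([1], 1.0)[0])  # False
one_fatality_at_0 = bool(binarise([1], 0.0)[0])  # True

# The equivalence is specific to integers: a fractional value such as 0.4
# is above 0 but below 1.
fractional_at_0 = bool(binarise([0.4], 0.0)[0])  # True

print(same_on_integers, one_fatality_at_1, one_fatality_at_0, fractional_at_0)
```

Under that strict comparison, a threshold of 1.0 would score the event "more than one fatality", while 0.0 scores "any fatality occurred".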

Polichinel requested a review from Copilot April 9, 2026 06:37

Copilot started reviewing on behalf of Polichinel April 9, 2026 06:38

Copilot AI reviewed Apr 9, 2026

Polichinel merged commit 2534075 into development Apr 9, 2026
4 checks passed

Polichinel deleted the feature/brier_variants branch April 9, 2026 07:49

Polichinel merged 2 commits into development from feature/brier_variants
