feat: explicit Brier score variants for 2×2 evaluation matrix #18

Merged
Polichinel merged 2 commits into development from feature/brier_variants
Apr 9, 2026


@Polichinel
Collaborator

Summary

  • Replaces Brier_sample/Brier_point with three task-explicit variants: Brier_cls_point, Brier_cls_sample, Brier_rgs_sample
  • Fixes broken Brier_cls_sample for MC Dropout probability samples: it now uses mean(y_pred) instead of mean(y_pred > threshold), which destroyed discrimination
  • Sets classification Brier profile default to threshold=0.0 (Pre-Release Note 05: event y > 0)
  • Adds Brier_rgs_sample to regression sample membership for count-data binarisation use case
  • Updates CIC MetricCatalog.md with new naming convention, test counts, and breaking rename notice

Test plan

  • 249 tests passing (net +4 new Brier golden-value and edge-case tests)
  • Registry snapshot integrity updated (24 → 25 metrics)
  • NativeEvaluator integration tests updated for new metric names
  • CIC documentation aligned with code changes

🤖 Generated with Claude Code

Replace Brier_sample/Brier_point with three task-explicit variants:
- Brier_cls_point: classification point (y_pred is probability)
- Brier_cls_sample: classification sample (average MC Dropout probabilities)
- Brier_rgs_sample: regression sample (binarise count samples at threshold)

Brier_rgs_point intentionally omitted — regression point estimates are not
probabilities. The _cls_/_rgs_ infix makes the task context self-documenting.
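
A minimal sketch of how the three variants differ, assuming plain Python lists as inputs. The function names, signatures, and the use of `threshold` to binarise `y_true` are illustrative assumptions, not the calculators in native_metric_calculators.py:

```python
from statistics import mean


def _events(y_true, threshold):
    # Binary outcome: 1.0 if the observed value exceeds the threshold.
    return [1.0 if y > threshold else 0.0 for y in y_true]


def _brier(p_hat, events):
    # Mean squared difference between event probability and outcome.
    return mean((p - o) ** 2 for p, o in zip(p_hat, events))


def brier_cls_point(y_true, y_pred, threshold=0.0):
    # y_pred[i] is already a probability (hypothetical signature).
    return _brier(list(y_pred), _events(y_true, threshold))


def brier_cls_sample(y_true, y_pred, threshold=0.0):
    # y_pred[i] is a list of probability samples: average them.
    p_hat = [mean(samples) for samples in y_pred]
    return _brier(p_hat, _events(y_true, threshold))


def brier_rgs_sample(y_true, y_pred, threshold=0.0):
    # y_pred[i] is a list of count samples: the event probability is
    # the share of samples that exceed the threshold.
    p_hat = [
        mean(1.0 if s > threshold else 0.0 for s in samples)
        for samples in y_pred
    ]
    return _brier(p_hat, _events(y_true, threshold))
```

There is deliberately no `brier_rgs_point` in this sketch: a single regression point estimate gives no event probability to score.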

Critical fix: Brier_cls_sample uses mean(y_pred) instead of mean(y_pred > threshold),
which was broken for probability samples (all probabilities > 0 → p_hat ≈ 1.0).
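
To illustrate why the old computation lost discrimination, here is a small sketch with made-up MC Dropout probability samples. The helper names and sample values are hypothetical:

```python
from statistics import mean


def p_hat_binarised(samples, threshold=0.0):
    # Old behaviour: share of samples above the threshold.
    return mean(1.0 if s > threshold else 0.0 for s in samples)


def p_hat_averaged(samples):
    # Fixed behaviour: average the probability samples directly.
    return mean(samples)


low_risk = [0.02, 0.05, 0.01, 0.03]   # illustrative samples only
high_risk = [0.85, 0.90, 0.80, 0.95]  # illustrative samples only

# Every probability is > 0, so binarising maps both cases to 1.0.
print(p_hat_binarised(low_risk), p_hat_binarised(high_risk))

# Averaging keeps the low-risk and high-risk cases apart.
print(p_hat_averaged(low_risk), p_hat_averaged(high_risk))
```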

Profile defaults: classification Brier threshold=0.0 (PDF: event y > 0),
regression Brier threshold=1.0 (binarise at 1 fatality). Catalog size 24 → 25.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR refactors Brier score handling in views_evaluation to make the metric names explicit across the (task × prediction-type) matrix, and corrects classification-sample Brier computation for probability-sample inputs (e.g., MC Dropout).

Changes:

  • Replaces Brier_sample / Brier_point with Brier_cls_point, Brier_cls_sample, and Brier_rgs_sample across the metric catalog, memberships, profiles, and result dataclasses.
  • Fixes Brier_cls_sample to compute event probability as mean(y_pred) (probability samples) rather than mean(y_pred > threshold).
  • Updates tests and CIC documentation to reflect the renamed/added metrics and updated registry size.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| views_evaluation/profiles/hydranet_ucdp.py | Updates profile docstring to reflect inheritance behavior (incl. Brier params). |
| views_evaluation/profiles/base.py | Updates base-profile genome defaults for the new Brier metric variants. |
| views_evaluation/evaluation/native_metric_calculators.py | Introduces the three explicit native Brier calculators and fixes cls-sample computation. |
| views_evaluation/evaluation/metrics.py | Renames/adds dataclass fields to match new metric names. |
| views_evaluation/evaluation/metric_catalog.py | Registers new metric names/functions and updates task/pred-type memberships. |
| tests/test_native_evaluator.py | Updates evaluator integration tests for renamed Brier metrics and probability-sample semantics. |
| tests/test_metric_catalog.py | Updates registry snapshot size assertion (24 → 25). |
| tests/test_metric_calculators.py | Updates imports and expands/renames Brier golden/edge-case tests for new variants. |
| documentation/CICs/MetricCatalog.md | Updates contract doc with rename note and updated Brier/QS test counts. |

- The legacy dispatch dicts were removed in Phase 3. `METRIC_MEMBERSHIP` is the single source of truth.
- Profile structure is stable; new profiles are added by creating a new file in `profiles/`.
- Bounds validation added for probability/proportion parameters (2026-04-04, C-18): `alpha`, `quantile`, `lower_quantile`, `upper_quantile` must be in (0, 1). Cross-parameter validation for QIS quantile ordering.
- Explicit Brier variants added (2026-04-09): `Brier_sample`/`Brier_point` replaced by three task-explicit variants: `Brier_cls_point`, `Brier_cls_sample`, `Brier_rgs_sample`. The `_cls_`/`_rgs_` infix denotes the task context (classification vs. regression). `Brier_rgs_point` is intentionally omitted — a regression point estimate is not a probability. `Brier_cls_sample` averages probability samples (`mean(y_pred)`); `Brier_rgs_sample` binarises count samples (`mean(y_pred > threshold)`). Catalog size: 24 → 25.
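
The bounds validation described in the notes above could look roughly like the following sketch. The function names are hypothetical, and the ordering rule (`lower_quantile` strictly below `upper_quantile`) is an assumption about what the QIS cross-parameter check enforces:

```python
UNIT_INTERVAL_PARAMS = ("alpha", "quantile", "lower_quantile", "upper_quantile")


def validate_params(params):
    # Each probability/proportion parameter must lie strictly inside (0, 1).
    for name in UNIT_INTERVAL_PARAMS:
        if name in params and not (0.0 < params[name] < 1.0):
            raise ValueError(f"{name} must be in (0, 1), got {params[name]!r}")

    # Assumed cross-parameter rule: the lower quantile sits below the upper one.
    lower = params.get("lower_quantile")
    upper = params.get("upper_quantile")
    if lower is not None and upper is not None and not lower < upper:
        raise ValueError("lower_quantile must be smaller than upper_quantile")


validate_params({"alpha": 0.1, "lower_quantile": 0.05, "upper_quantile": 0.95})
```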

Copilot AI Apr 9, 2026


This contract doc explains the Brier metric rename, but other documentation in the repo still references the legacy names (e.g. documentation/standards/physical_architecture_standard.md lists Brier_point/Brier_sample). Please update those references (or add an explicit note/link here) so the documentation set stays internally consistent after the breaking rename.

Suggested change
- Explicit Brier variants added (2026-04-09): `Brier_sample`/`Brier_point` replaced by three task-explicit variants: `Brier_cls_point`, `Brier_cls_sample`, `Brier_rgs_sample`. The `_cls_`/`_rgs_` infix denotes the task context (classification vs. regression). `Brier_rgs_point` is intentionally omitted — a regression point estimate is not a probability. `Brier_cls_sample` averages probability samples (`mean(y_pred)`); `Brier_rgs_sample` binarises count samples (`mean(y_pred > threshold)`). Catalog size: 24 → 25.
- Explicit Brier variants added (2026-04-09): `Brier_sample`/`Brier_point` replaced by three task-explicit canonical variants: `Brier_cls_point`, `Brier_cls_sample`, `Brier_rgs_sample`. The `_cls_`/`_rgs_` infix denotes the task context (classification vs. regression). `Brier_rgs_point` is intentionally omitted — a regression point estimate is not a probability. `Brier_cls_sample` averages probability samples (`mean(y_pred)`); `Brier_rgs_sample` binarises count samples (`mean(y_pred > threshold)`). Legacy-name mapping: `Brier_point` => `Brier_cls_point`; `Brier_sample` => `Brier_cls_sample` for classification-sample contexts and `Brier_rgs_sample` for regression-sample contexts. Note: older repository documents may still mention `Brier_point`/`Brier_sample`; interpret those references using this mapping until all docs are updated. Catalog size: 24 → 25.
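
For readers migrating configuration by hand, the legacy-name mapping in the suggested text can be sketched as a lookup keyed on metric name and task. The table and helper here are hypothetical and not part of views_evaluation:

```python
# Hypothetical migration table built from the mapping in the suggestion.
LEGACY_BRIER_NAMES = {
    ("Brier_point", "classification"): "Brier_cls_point",
    ("Brier_sample", "classification"): "Brier_cls_sample",
    ("Brier_sample", "regression"): "Brier_rgs_sample",
}

LEGACY_NAMES = {"Brier_point", "Brier_sample"}


def resolve_brier_name(name, task):
    # Translate a legacy name for the given task; pass other names through.
    if (name, task) in LEGACY_BRIER_NAMES:
        return LEGACY_BRIER_NAMES[(name, task)]
    if name in LEGACY_NAMES:
        # e.g. Brier_point under regression: Brier_rgs_point does not exist.
        raise ValueError(f"no replacement for {name!r} in task {task!r}")
    return name


print(resolve_brier_name("Brier_sample", "regression"))
```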

All three Brier variants now default to threshold=0.0 in the base profile,
matching the Pre-Release Note 05 definition: Brier evaluates the binary
event "any fatality occurred" (y > 0). On integer-valued UCDP data,
y > 0 and y >= 1 are equivalent.
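
A quick check of that equivalence, using made-up fatality counts. The last two lines show that it depends on the data being integer-valued:

```python
counts = [0, 1, 2, 0, 7, 15, 0]  # illustrative integer counts only

events_gt_zero = [int(y > 0.0) for y in counts]
events_ge_one = [int(y >= 1) for y in counts]

# On integer counts the two binarisations agree.
assert events_gt_zero == events_ge_one

# On non-integer values they can differ: 0.4 is above 0 but below 1.
fractional = 0.4
assert (fractional > 0.0) != (fractional >= 1)
```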

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Polichinel merged commit 2534075 into development on Apr 9, 2026
4 checks passed
@Polichinel deleted the feature/brier_variants branch on April 9, 2026 07:49