
feat!: threshold metrics, Phase 3 purge, and governance adoption#16

Merged
Polichinel merged 14 commits into development from feature/thresholds00 on Apr 2, 2026

Conversation

@Polichinel
Collaborator

Summary

  • New metrics: Brier Score (sample/point) and Quantile Score (sample/point) — 4 threshold-dependent metrics with full catalog registration, profile defaults, and 22 dedicated tests
  • Phase 3 executed: Removed EvaluationManager, PandasAdapter, pandas runtime dependency, and all legacy dispatch dicts (-2,984 lines). Pure Math Engine achieved.
  • Governance adoption: base_docs ADR template, standardized headers on all 17 ADRs, CIC sections 9-12 (Incorrect Usage, Test Alignment, Evolution, Known Deviations), technical risk register (ADR-023), hardened protocol, physical architecture standard
  • Test gaps closed: 25 new tests from test-review audit (golden values, classification evaluation, NaN defense-in-depth, extreme values)
  • Tech debt cleanup: Dead code removal, y_pred shape invariant enforcement, stale reference cleanup
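The point-variant Brier score named above has a simple core: binarize the target at a threshold and take the mean squared error against the predicted probabilities. A minimal sketch (the function name and default threshold here are illustrative, not the catalog's actual signature):

```python
import numpy as np

def brier_point(y_true, y_pred, threshold=1.0):
    """Illustrative sketch: binarize y_true at the threshold, then take
    the mean squared error against predicted probabilities in [0, 1]."""
    y_bin = (np.asarray(y_true, dtype=float) > threshold).astype(float)
    return float(np.mean((np.asarray(y_pred, dtype=float) - y_bin) ** 2))

# Perfectly confident correct predictions score 0; a uniform 0.5 guess scores 0.25.
print(brier_point([0, 2, 0, 3], [0.0, 1.0, 0.0, 1.0]))  # → 0.0
print(brier_point([0, 2, 0, 3], [0.5, 0.5, 0.5, 0.5]))  # → 0.25
```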

Breaking Changes

  • EvaluationManager and PandasAdapter removed from public API
  • pandas no longer a runtime dependency
  • legacy_compatibility default flipped to False in NativeEvaluator.evaluate()
  • Brier field renamed to Brier_sample in ClassificationSampleEvaluationMetrics

Risk Register

6 concerns closed (C-01, C-03, C-04, C-06, C-08, C-09). 3 remain open:

  • C-02: NativeEvaluator config validation at init (design decision needed)
  • C-05: sklearn/scipy in pure-math core (future work)
  • C-07: Golden-value coverage (partially addressed)

Test plan

  • 228 tests passing (conda run --name views_pipeline pytest tests/ -v)
  • 0 lint errors (conda run --name views_pipeline ruff check .)
  • validate_docs.sh passes
  • All 4 model repos verified for transformation handling (r2darts2, stepshifter, baseline, hydranet)
  • views-pipeline-core Phase 2 confirmed complete (EvaluationAdapter mirrored, parity verified)

🤖 Generated with Claude Code

Polichinel and others added 14 commits on March 14, 2026 at 18:31
…nd guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sk register, hardened protocol

- Standardize all 17 ADR headers to base_docs format (Status/Date/Deciders/Consulted/Informed)
- Remove silicon-based agents from Deciders across all ADRs
- Convert table-format headers (030-042) to YAML-style format
- Replace ADR template with decision-focused base_docs template
- Add sections 9-12 (Incorrect Usage, Test Alignment, Evolution, Known Deviations) to 4 existing CICs
- Create MetricCatalog CIC documenting genome registry and resolver
- Create ADR-023 (Technical Risk Register) with tier/trigger/source format
- Add hardened protocol for numerical evaluation contributors
- Add physical architecture standard with critical bundling assessment
- Add INSTANTIATION_CHECKLIST.md and validate_docs.sh
- Update ADR and CIC READMEs with governance structure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ics with full test coverage

Add 4 new threshold-dependent metric functions to the evaluation framework:
- Brier_sample: binary classification metric for ensemble predictions
- Brier_point: binary classification metric for point probability predictions
- QS_sample: quantile score (pinball loss) for ensemble predictions
- QS_point: quantile score (pinball loss) for point predictions

All metrics registered in MetricCatalog with genome declarations, added to
METRIC_MEMBERSHIP and legacy dispatch dicts, with BASE_PROFILE defaults
(threshold=1.0, quantile=0.99).
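The pinball loss behind QS_sample/QS_point can be sketched as follows; the function name and exact signature are assumptions, but the asymmetry at the default quantile=0.99 is the point:

```python
import numpy as np

def quantile_score(y_true, y_pred, quantile=0.99):
    """Sketch of the pinball loss at a single quantile level (assumed form):
    q * (y - yhat) when y >= yhat, else (1 - q) * (yhat - y)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.where(diff >= 0, quantile * diff, (quantile - 1) * diff)))

# At q=0.99, under-prediction is penalized 99x more than over-prediction.
print(quantile_score([10.0], [8.0]))   # 0.99 * 2 = 1.98
print(quantile_score([8.0], [10.0]))   # 0.01 * 2 = 0.02
```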

Test coverage: 22 new tests (8 golden-value, 9 beige, 5 red) including a
finding that Brier's comparison-based binarization swallows NaN rather
than propagating it (documented in red tests).
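The NaN-swallowing finding is a direct consequence of NumPy comparison semantics, which a few lines demonstrate:

```python
import numpy as np

# NumPy comparisons involving NaN evaluate to False, so a threshold-based
# binarization silently maps NaN to class 0 instead of propagating it.
y_true = np.array([0.0, np.nan, 5.0])
y_bin = (y_true > 1.0).astype(float)
print(y_bin)  # [0. 0. 1.] — the NaN became 0.0, not NaN
```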

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ariant

- Remove deprecation_msgs.py (dead code: raise_legacy_scale_msg never called)
- Remove legacy PointEvaluationMetrics and SampleEvaluationMetrics (unused,
  replaced by 2×2 typed dataclasses)
- Add y_pred.ndim != 2 validation to EvaluationFrame._validate() (closes C-03)
- Add tests: test_y_pred_1d_raises, test_y_pred_3d_raises
- Fix lint: remove unused variable in TestQuantileScoreBeige

Risk register: C-03 and C-09 closed.
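The new shape invariant can be sketched like this; the function name and error message are illustrative stand-ins for what EvaluationFrame._validate() is described as doing:

```python
import numpy as np

def validate_y_pred(y_pred):
    """Illustrative sketch of the assumed invariant: y_pred must be a
    2-D array (n_rows, n_samples); 1-D and 3-D inputs are rejected."""
    y_pred = np.asarray(y_pred)
    if y_pred.ndim != 2:
        raise ValueError(
            f"y_pred must be 2-D (n_rows, n_samples), got ndim={y_pred.ndim}"
        )
    return y_pred

validate_y_pred(np.zeros((4, 100)))   # OK: 4 rows, 100 samples
# validate_y_pred(np.zeros(4))        # would raise ValueError (1-D)
# validate_y_pred(np.zeros((2,2,2)))  # would raise ValueError (3-D)
```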

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…name note

- F1: Add probability-range note to Brier_point docstring (y_pred should be [0,1])
- F2: Fix Brier_sample docstring: "on regression targets" → "binarized at a threshold"
- F3: Add NaN-swallowing note to both Brier docstrings (NumPy comparison semantics)
- F5: Document Brier → Brier_sample breaking rename in MetricCatalog CIC

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…efense-in-depth, extreme values

Close 7 gaps identified by test-review audit:

Step 1 [Critical/Green]: 15 golden-value tests for MSE, MSLE, RMSLE, EMD,
  Pearson, MTD, MCR, Coverage, MIS, CRPS, twCRPS, QIS in TestGoldenValues.
Step 2 [High/Beige]: Classification evaluation tests — Brier_sample, Brier_point,
  AP+Brier_point combined, classification sample with profile resolution.
Step 3 [High/Red]: NaN/Inf defense-in-depth integration tests proving
  EvaluationFrame rejects corrupted data before Brier's NaN-swallowing executes.
Step 4 [Medium/Beige]: Multi-target evaluation test (regression + classification
  in same config, evaluated via separate EvaluationFrames).
Step 5 [Medium/Green]: Stateless execution test — evaluate() twice produces
  identical results.
Step 6 [Medium/Red]: Extreme-value tests near float64 limits for MSE, CRPS,
  Brier, Coverage.
Step 7 [Low/Green]: Migrate 14 module-level tests from DataFrame fixtures to
  raw NumPy arrays. Remove pandas import from test_metric_calculators.py.

Test count: 266 → 291 (+25 new tests). Lint clean.
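The golden-value style used in Step 1 asserts against a hand-computed constant rather than re-deriving the metric in the test, so a bug in the implementation cannot hide in the test. A minimal illustration (not one of the actual 15 tests):

```python
import numpy as np

# Hand-computed golden value: MSE of [1, 2, 3] vs [1, 3, 5] is
# (0^2 + 1^2 + 2^2) / 3 = 5/3.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 3.0, 5.0])
mse = float(np.mean((y_pred - y_true) ** 2))
assert np.isclose(mse, 5.0 / 3.0)
```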

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Manager

- Remove filter_step_wise_evaluation() — defined but never called (-30 lines)
- Remove aggregate_month_wise_evaluation() — defined but never called (-83 lines)
- Remove unused BaseEvaluationMetrics import (was only used by aggregate)
- Remove vestigial self.is_sample assignment (set but never read)
- Retain self.actual/self.predictions (still tested by
  test_documentation_contracts.py reflective test)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dependency

BREAKING CHANGE: EvaluationManager and PandasAdapter have been removed.
Use NativeEvaluator with EvaluationFrame directly. Adapters belong in
the calling repository (e.g. views-pipeline-core's EvaluationAdapter).

Source deletions:
- views_evaluation/evaluation/evaluation_manager.py (607 lines)
- views_evaluation/adapters/pandas.py (150 lines)
- Legacy dispatch dicts and calculate_ap alias from native_metric_calculators.py

Test deletions (10 files, ~1800 lines):
- test_evaluation_manager.py, test_evaluation_schemas.py
- test_parity_green.py, test_parity_beige.py, test_parity_red.py
- test_parity_adapter_transfer.py, test_data_contract.py
- test_documentation_contracts.py, test_metric_correctness.py
- conftest.py (legacy fixtures)

Test migrations:
- test_adversarial_inputs.py: removed legacy TestAdversarialInputs class,
  kept TestAdversarialNativeInputs (9 tests)
- test_metric_calculators.py: replaced dispatch dict assertions with
  METRIC_MEMBERSHIP assertions; removed pandas import
- test_metric_catalog.py: removed dispatch dict sync test (single source
  of truth now)

Config:
- Removed pandas from pyproject.toml runtime dependencies
- Flipped legacy_compatibility default to False in NativeEvaluator.evaluate()

Documentation:
- Deleted CICs/PandasAdapter.md
- Updated README quick-start to native-only API
- Updated physical architecture standard (removed PHASE-3-DELETE entries)
- Updated ADR-042 (dispatch dicts note)

Preconditions confirmed:
- views-pipeline-core has EvaluationAdapter (mirrored PandasAdapter)
- Shadow parity verified and scaffolding removed (commit 84a997b)
- All model repos handle own inverse transformations (r2darts2, stepshifter,
  baseline, hydranet verified)

Result: 228 tests passing, 0 lint errors. Pure Math Engine achieved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…xamples

- Update examples/using_native_api.py and evaluate_native_prototype.py to
  use EvaluationFrame directly (removed PandasAdapter imports)
- Delete examples/quickstart.ipynb (entirely EvaluationManager-based)
- Update integration_guide.md: remove legacy API section, update architecture
  diagram, update code example to native-only path
- Update CIC Known Deviations: remove resolved C-01 references from
  NativeEvaluator.md and MetricCatalog.md
- Update risk register: close C-01, C-04, C-06, C-08 (3 open concerns remain)
- Update README: remove EvaluationManager from component table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address all findings from review-base-docs audit:

M1: ADR-011 — update 3 EvaluationManager references to "Pipeline Core (external)"
M2: ADR-040 — rename PandasAdapter section, update to EvaluationFrame construction
M3: evaluation_concepts.md — "EvaluationManager assesses" → "evaluation framework assesses"
L1: EvaluationFrame CIC — 3 PandasAdapter references → "external adapters"
L1: EvaluationReport CIC — remove EvaluationManager from consumer list
L2: logging standard — remove EvaluationManager from orchestration example
L3: checklist — "(PHASE-3-DELETE)" → "(removed in Phase 3)"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ion, add empty-config red test

- Ignorance: hand-computed golden value with known bin distribution (log2(8/3))
- AP: oracle test using sklearn.metrics.average_precision_score
- Fix NumPy deprecation: float(np.quantile(..., axis=1)) → .item() in QIS test
- Empty config red test: documents C-02 gap — NativeEvaluator({}) accepted at
  init but fails at evaluate() time

231 tests passing, 0 warnings.
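The AP oracle pattern mentioned above checks a value against an independent reference implementation rather than a hand-derived constant. A sketch using the sklearn oracle named in the commit (the input arrays are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Oracle-style check: trust sklearn's implementation as the reference
# value for average precision on a small labeled example.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
ap = average_precision_score(y_true, y_score)
print(round(ap, 4))  # → 0.8333 (i.e. 1/2 + 1/3 over the two recall steps)
```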

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- NativeEvaluator CIC: replace deleted parity test refs with adversarial test refs
- NativeEvaluator CIC: drop EvaluationManager comparison in Known Deviations
- ADR-012: "and PandasAdapter" → "and external adapters"
- ADR-014: "in EvaluationManager or Adapters" → "in EvaluationFrame constructor or NativeEvaluator"
- ADR-021: replace PandasAdapter with EvaluationReport in example list
- Update Last reviewed dates on 3 CICs to 2026-04-02

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump version 0.4.0 → 0.5.0 for Phase 3 breaking changes (closes C-11)
- Add pandas as optional dependency: `pip install views_evaluation[dataframe]`
  for to_dataframe() support (closes C-12)
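With pandas demoted to the `[dataframe]` extra, a guarded import is the usual way to keep the base install pandas-free while still supporting DataFrame export. A sketch of the pattern (the function name mirrors to_dataframe() from the commit, but the body and message are assumptions):

```python
def to_dataframe(results: dict):
    """Convert a results mapping to a DataFrame, failing with an
    actionable message when the optional extra is not installed."""
    try:
        import pandas as pd
    except ImportError as exc:
        raise ImportError(
            "to_dataframe() requires the optional 'dataframe' extra: "
            "pip install views_evaluation[dataframe]"
        ) from exc
    return pd.DataFrame(results)

df = to_dataframe({"metric": ["MSE"], "value": [1.5]})
print(df.shape)  # (1, 2): one row, two columns
```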

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Polichinel merged commit 1b0b549 into development on Apr 2, 2026
3 of 4 checks passed
Polichinel deleted the feature/thresholds00 branch on April 2, 2026 at 23:33