
feat!: threshold metrics, Phase 3 purge, and governance adoption#16

Merged
Polichinel merged 14 commits into development from feature/thresholds00 on Apr 2, 2026

Conversation

@Polichinel
Collaborator

Summary

  • New metrics: Brier Score (sample/point) and Quantile Score (sample/point) — 4 threshold-dependent metrics with full catalog registration, profile defaults, and 22 dedicated tests
  • Phase 3 executed: Removed EvaluationManager, PandasAdapter, pandas runtime dependency, and all legacy dispatch dicts (-2,984 lines). Pure Math Engine achieved.
  • Governance adoption: base_docs ADR template, standardized headers on all 17 ADRs, CIC sections 9-12 (Incorrect Usage, Test Alignment, Evolution, Known Deviations), technical risk register (ADR-023), hardened protocol, physical architecture standard
  • Test gaps closed: 25 new tests from test-review audit (golden values, classification evaluation, NaN defense-in-depth, extreme values)
  • Tech debt cleanup: Dead code removal, y_pred shape invariant enforcement, stale reference cleanup
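The point-variant Brier score named above has a simple core: binarize the target at a threshold and take the mean squared error against the predicted probabilities. A minimal sketch (the function name and default threshold here are illustrative, not the catalog's actual signature):

```python
import numpy as np

def brier_point(y_true, y_pred, threshold=1.0):
    """Illustrative sketch: binarize y_true at the threshold, then take
    the mean squared error against predicted probabilities in [0, 1]."""
    y_bin = (np.asarray(y_true, dtype=float) > threshold).astype(float)
    return float(np.mean((np.asarray(y_pred, dtype=float) - y_bin) ** 2))

# Perfectly confident correct predictions score 0; a uniform 0.5 guess scores 0.25.
print(brier_point([0, 2, 0, 3], [0.0, 1.0, 0.0, 1.0]))  # → 0.0
print(brier_point([0, 2, 0, 3], [0.5, 0.5, 0.5, 0.5]))  # → 0.25
```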

Breaking Changes

  • EvaluationManager and PandasAdapter removed from public API
  • pandas no longer a runtime dependency
  • legacy_compatibility default flipped to False in NativeEvaluator.evaluate()
  • Brier field renamed to Brier_sample in ClassificationSampleEvaluationMetrics

Risk Register

6 concerns closed (C-01, C-03, C-04, C-06, C-08, C-09). 3 remain open:

  • C-02: NativeEvaluator config validation at init (design decision needed)
  • C-05: sklearn/scipy in pure-math core (future work)
  • C-07: Golden-value coverage (partially addressed)

Test plan

  • 228 tests passing (conda run --name views_pipeline pytest tests/ -v)
  • 0 lint errors (conda run --name views_pipeline ruff check .)
  • validate_docs.sh passes
  • All 4 model repos verified for transformation handling (r2darts2, stepshifter, baseline, hydranet)
  • views-pipeline-core Phase 2 confirmed complete (EvaluationAdapter mirrored, parity verified)

🤖 Generated with Claude Code

Polichinel and others added 14 commits on March 14, 2026 at 18:31
…nd guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sk register, hardened protocol

- Standardize all 17 ADR headers to base_docs format (Status/Date/Deciders/Consulted/Informed)
- Remove silicon-based agents from Deciders across all ADRs
- Convert table-format headers (030-042) to YAML-style format
- Replace ADR template with decision-focused base_docs template
- Add sections 9-12 (Incorrect Usage, Test Alignment, Evolution, Known Deviations) to 4 existing CICs
- Create MetricCatalog CIC documenting genome registry and resolver
- Create ADR-023 (Technical Risk Register) with tier/trigger/source format
- Add hardened protocol for numerical evaluation contributors
- Add physical architecture standard with critical bundling assessment
- Add INSTANTIATION_CHECKLIST.md and validate_docs.sh
- Update ADR and CIC READMEs with governance structure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ics with full test coverage

Add 4 new threshold-dependent metric functions to the evaluation framework:
- Brier_sample: binary classification metric for ensemble predictions
- Brier_point: binary classification metric for point probability predictions
- QS_sample: quantile score (pinball loss) for ensemble predictions
- QS_point: quantile score (pinball loss) for point predictions

All metrics registered in MetricCatalog with genome declarations, added to
METRIC_MEMBERSHIP and legacy dispatch dicts, with BASE_PROFILE defaults
(threshold=1.0, quantile=0.99).
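The pinball loss behind QS_sample/QS_point can be sketched as follows; the function name and exact signature are assumptions, but the asymmetry at the default quantile=0.99 is the point:

```python
import numpy as np

def quantile_score(y_true, y_pred, quantile=0.99):
    """Sketch of the pinball loss at a single quantile level (assumed form):
    q * (y - yhat) when y >= yhat, else (1 - q) * (yhat - y)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.where(diff >= 0, quantile * diff, (quantile - 1) * diff)))

# At q=0.99, under-prediction is penalized 99x more than over-prediction.
print(quantile_score([10.0], [8.0]))   # 0.99 * 2 = 1.98
print(quantile_score([8.0], [10.0]))   # 0.01 * 2 = 0.02
```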

Test coverage: 22 new tests (8 golden-value, 9 beige, 5 red) including a
finding that Brier's comparison-based binarization swallows NaN rather
than propagating it (documented in red tests).
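The NaN-swallowing finding is a direct consequence of NumPy comparison semantics, which a few lines demonstrate:

```python
import numpy as np

# NumPy comparisons involving NaN evaluate to False, so a threshold-based
# binarization silently maps NaN to class 0 instead of propagating it.
y_true = np.array([0.0, np.nan, 5.0])
y_bin = (y_true > 1.0).astype(float)
print(y_bin)  # [0. 0. 1.] — the NaN became 0.0, not NaN
```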

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ariant

- Remove deprecation_msgs.py (dead code: raise_legacy_scale_msg never called)
- Remove legacy PointEvaluationMetrics and SampleEvaluationMetrics (unused,
  replaced by 2×2 typed dataclasses)
- Add y_pred.ndim != 2 validation to EvaluationFrame._validate() (closes C-03)
- Add tests: test_y_pred_1d_raises, test_y_pred_3d_raises
- Fix lint: remove unused variable in TestQuantileScoreBeige

Risk register: C-03 and C-09 closed.
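The new shape invariant can be sketched like this; the function name and error message are illustrative stand-ins for what EvaluationFrame._validate() is described as doing:

```python
import numpy as np

def validate_y_pred(y_pred):
    """Illustrative sketch of the assumed invariant: y_pred must be a
    2-D array (n_rows, n_samples); 1-D and 3-D inputs are rejected."""
    y_pred = np.asarray(y_pred)
    if y_pred.ndim != 2:
        raise ValueError(
            f"y_pred must be 2-D (n_rows, n_samples), got ndim={y_pred.ndim}"
        )
    return y_pred

validate_y_pred(np.zeros((4, 100)))   # OK: 4 rows, 100 samples
# validate_y_pred(np.zeros(4))        # would raise ValueError (1-D)
# validate_y_pred(np.zeros((2,2,2)))  # would raise ValueError (3-D)
```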

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…name note

- F1: Add probability-range note to Brier_point docstring (y_pred should be [0,1])
- F2: Fix Brier_sample docstring: "on regression targets" → "binarized at a threshold"
- F3: Add NaN-swallowing note to both Brier docstrings (NumPy comparison semantics)
- F5: Document Brier → Brier_sample breaking rename in MetricCatalog CIC

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…efense-in-depth, extreme values

Close 7 gaps identified by test-review audit:

Step 1 [Critical/Green]: 15 golden-value tests for MSE, MSLE, RMSLE, EMD,
  Pearson, MTD, MCR, Coverage, MIS, CRPS, twCRPS, QIS in TestGoldenValues.
Step 2 [High/Beige]: Classification evaluation tests — Brier_sample, Brier_point,
  AP+Brier_point combined, classification sample with profile resolution.
Step 3 [High/Red]: NaN/Inf defense-in-depth integration tests proving
  EvaluationFrame rejects corrupted data before Brier's NaN-swallowing executes.
Step 4 [Medium/Beige]: Multi-target evaluation test (regression + classification
  in same config, evaluated via separate EvaluationFrames).
Step 5 [Medium/Green]: Stateless execution test — evaluate() twice produces
  identical results.
Step 6 [Medium/Red]: Extreme-value tests near float64 limits for MSE, CRPS,
  Brier, Coverage.
Step 7 [Low/Green]: Migrate 14 module-level tests from DataFrame fixtures to
  raw NumPy arrays. Remove pandas import from test_metric_calculators.py.

Test count: 266 → 291 (+25 new tests). Lint clean.
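The golden-value style used in Step 1 asserts against a hand-computed constant rather than re-deriving the metric in the test, so a bug in the implementation cannot hide in the test. A minimal illustration (not one of the actual 15 tests):

```python
import numpy as np

# Hand-computed golden value: MSE of [1, 2, 3] vs [1, 3, 5] is
# (0^2 + 1^2 + 2^2) / 3 = 5/3.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 3.0, 5.0])
mse = float(np.mean((y_pred - y_true) ** 2))
assert np.isclose(mse, 5.0 / 3.0)
```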

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Manager

- Remove filter_step_wise_evaluation() — defined but never called (-30 lines)
- Remove aggregate_month_wise_evaluation() — defined but never called (-83 lines)
- Remove unused BaseEvaluationMetrics import (was only used by aggregate)
- Remove vestigial self.is_sample assignment (set but never read)
- Retain self.actual/self.predictions (still tested by
  test_documentation_contracts.py reflective test)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dependency

BREAKING CHANGE: EvaluationManager and PandasAdapter have been removed.
Use NativeEvaluator with EvaluationFrame directly. Adapters belong in
the calling repository (e.g. views-pipeline-core's EvaluationAdapter).

Source deletions:
- views_evaluation/evaluation/evaluation_manager.py (607 lines)
- views_evaluation/adapters/pandas.py (150 lines)
- Legacy dispatch dicts and calculate_ap alias from native_metric_calculators.py

Test deletions (10 files, ~1800 lines):
- test_evaluation_manager.py, test_evaluation_schemas.py
- test_parity_green.py, test_parity_beige.py, test_parity_red.py
- test_parity_adapter_transfer.py, test_data_contract.py
- test_documentation_contracts.py, test_metric_correctness.py
- conftest.py (legacy fixtures)

Test migrations:
- test_adversarial_inputs.py: removed legacy TestAdversarialInputs class,
  kept TestAdversarialNativeInputs (9 tests)
- test_metric_calculators.py: replaced dispatch dict assertions with
  METRIC_MEMBERSHIP assertions; removed pandas import
- test_metric_catalog.py: removed dispatch dict sync test (single source
  of truth now)

Config:
- Removed pandas from pyproject.toml runtime dependencies
- Flipped legacy_compatibility default to False in NativeEvaluator.evaluate()

Documentation:
- Deleted CICs/PandasAdapter.md
- Updated README quick-start to native-only API
- Updated physical architecture standard (removed PHASE-3-DELETE entries)
- Updated ADR-042 (dispatch dicts note)

Preconditions confirmed:
- views-pipeline-core has EvaluationAdapter (mirrored PandasAdapter)
- Shadow parity verified and scaffolding removed (commit 84a997b)
- All model repos handle own inverse transformations (r2darts2, stepshifter,
  baseline, hydranet verified)

Result: 228 tests passing, 0 lint errors. Pure Math Engine achieved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…xamples

- Update examples/using_native_api.py and evaluate_native_prototype.py to
  use EvaluationFrame directly (removed PandasAdapter imports)
- Delete examples/quickstart.ipynb (entirely EvaluationManager-based)
- Update integration_guide.md: remove legacy API section, update architecture
  diagram, update code example to native-only path
- Update CIC Known Deviations: remove resolved C-01 references from
  NativeEvaluator.md and MetricCatalog.md
- Update risk register: close C-01, C-04, C-06, C-08 (3 open concerns remain)
- Update README: remove EvaluationManager from component table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address all findings from review-base-docs audit:

M1: ADR-011 — update 3 EvaluationManager references to "Pipeline Core (external)"
M2: ADR-040 — rename PandasAdapter section, update to EvaluationFrame construction
M3: evaluation_concepts.md — "EvaluationManager assesses" → "evaluation framework assesses"
L1: EvaluationFrame CIC — 3 PandasAdapter references → "external adapters"
L1: EvaluationReport CIC — remove EvaluationManager from consumer list
L2: logging standard — remove EvaluationManager from orchestration example
L3: checklist — "(PHASE-3-DELETE)" → "(removed in Phase 3)"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ion, add empty-config red test

- Ignorance: hand-computed golden value with known bin distribution (log2(8/3))
- AP: oracle test using sklearn.metrics.average_precision_score
- Fix NumPy deprecation: float(np.quantile(..., axis=1)) → .item() in QIS test
- Empty config red test: documents C-02 gap — NativeEvaluator({}) accepted at
  init but fails at evaluate() time

231 tests passing, 0 warnings.
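The AP oracle pattern mentioned above checks a value against an independent reference implementation rather than a hand-derived constant. A sketch using the sklearn oracle named in the commit (the input arrays are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Oracle-style check: trust sklearn's implementation as the reference
# value for average precision on a small labeled example.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
ap = average_precision_score(y_true, y_score)
print(round(ap, 4))  # → 0.8333 (i.e. 1/2 + 1/3 over the two recall steps)
```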

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- NativeEvaluator CIC: replace deleted parity test refs with adversarial test refs
- NativeEvaluator CIC: drop EvaluationManager comparison in Known Deviations
- ADR-012: "and PandasAdapter" → "and external adapters"
- ADR-014: "in EvaluationManager or Adapters" → "in EvaluationFrame constructor or NativeEvaluator"
- ADR-021: replace PandasAdapter with EvaluationReport in example list
- Update Last reviewed dates on 3 CICs to 2026-04-02

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump version 0.4.0 → 0.5.0 for Phase 3 breaking changes (closes C-11)
- Add pandas as optional dependency: `pip install views_evaluation[dataframe]`
  for to_dataframe() support (closes C-12)
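With pandas demoted to the `[dataframe]` extra, a guarded import is the usual way to keep the base install pandas-free while still supporting DataFrame export. A sketch of the pattern (the function name mirrors to_dataframe() from the commit, but the body and message are assumptions):

```python
def to_dataframe(results: dict):
    """Convert a results mapping to a DataFrame, failing with an
    actionable message when the optional extra is not installed."""
    try:
        import pandas as pd
    except ImportError as exc:
        raise ImportError(
            "to_dataframe() requires the optional 'dataframe' extra: "
            "pip install views_evaluation[dataframe]"
        ) from exc
    return pd.DataFrame(results)

df = to_dataframe({"metric": ["MSE"], "value": [1.5]})
print(df.shape)  # (1, 2): one row, two columns
```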

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Polichinel merged commit 1b0b549 into development on Apr 2, 2026
3 of 4 checks passed
Polichinel deleted the feature/thresholds00 branch on April 2, 2026 at 23:33