feat!: threshold metrics, Phase 3 purge, and governance adoption#16
Merged
Polichinel merged 14 commits into development on Apr 2, 2026
Conversation
…nd guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sk register, hardened protocol

- Standardize all 17 ADR headers to base_docs format (Status/Date/Deciders/Consulted/Informed)
- Remove silicon-based agents from Deciders across all ADRs
- Convert table-format headers (030-042) to YAML-style format
- Replace ADR template with decision-focused base_docs template
- Add sections 9-12 (Incorrect Usage, Test Alignment, Evolution, Known Deviations) to 4 existing CICs
- Create MetricCatalog CIC documenting genome registry and resolver
- Create ADR-023 (Technical Risk Register) with tier/trigger/source format
- Add hardened protocol for numerical evaluation contributors
- Add physical architecture standard with critical bundling assessment
- Add INSTANTIATION_CHECKLIST.md and validate_docs.sh
- Update ADR and CIC READMEs with governance structure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ics with full test coverage

Add 4 new threshold-dependent metric functions to the evaluation framework:

- Brier_sample: binary classification metric for ensemble predictions
- Brier_point: binary classification metric for point probability predictions
- QS_sample: quantile score (pinball loss) for ensemble predictions
- QS_point: quantile score (pinball loss) for point predictions

All metrics are registered in MetricCatalog with genome declarations, added to METRIC_MEMBERSHIP and the legacy dispatch dicts, with BASE_PROFILE defaults (threshold=1.0, quantile=0.99).

Test coverage: 22 new tests (8 golden-value, 9 beige, 5 red), including a finding that Brier's comparison-based binarization swallows NaN rather than propagating it (documented in the red tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
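The four metrics above reduce to two well-known formulas. A minimal sketch under assumed conventions (function names and the threshold/quantile defaults mirror the commit message; the real implementations are in the package and are not reproduced here):

```python
import numpy as np

def brier_point(y_true, y_pred, threshold=1.0):
    """Brier score for point probabilities: binarize the observed values
    at the threshold, then take the mean squared error against y_pred."""
    y_bin = (y_true >= threshold).astype(float)  # note: NaN compares False here
    return float(np.mean((y_bin - y_pred) ** 2))

def quantile_score(y_true, y_pred, quantile=0.99):
    """Pinball loss for a single predicted quantile."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(quantile * diff, (quantile - 1.0) * diff)))

y_true = np.array([0.0, 2.0, 5.0])
print(brier_point(y_true, np.array([0.1, 0.9, 0.8])))          # ≈ 0.02
print(quantile_score(y_true, np.array([1.0, 1.0, 1.0]), 0.5))  # 1.0
```

The `# NaN compares False` comment is exactly the NaN-swallowing behavior the red tests document.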
…ariant

- Remove deprecation_msgs.py (dead code: raise_legacy_scale_msg never called)
- Remove legacy PointEvaluationMetrics and SampleEvaluationMetrics (unused, replaced by 2×2 typed dataclasses)
- Add y_pred.ndim != 2 validation to EvaluationFrame._validate() (closes C-03)
- Add tests: test_y_pred_1d_raises, test_y_pred_3d_raises
- Fix lint: remove unused variable in TestQuantileScoreBeige

Risk register: C-03 and C-09 closed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
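The shape invariant closed by C-03 can be sketched as follows (the class is stripped down to the check itself; the real EvaluationFrame carries more fields and validation):

```python
import numpy as np

class EvaluationFrame:
    """Minimal sketch of the y_pred.ndim != 2 invariant."""
    def __init__(self, y_true, y_pred):
        self.y_true = np.asarray(y_true)
        self.y_pred = np.asarray(y_pred)
        self._validate()

    def _validate(self):
        # y_pred must be 2-D; 1-D and 3-D inputs are rejected up front
        if self.y_pred.ndim != 2:
            raise ValueError(f"y_pred must be 2-D, got ndim={self.y_pred.ndim}")

# A 1-D y_pred raises, mirroring test_y_pred_1d_raises:
try:
    EvaluationFrame(np.zeros(3), np.zeros(3))
except ValueError as exc:
    print(exc)  # y_pred must be 2-D, got ndim=1
```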
…name note

- F1: Add probability-range note to Brier_point docstring (y_pred should be in [0, 1])
- F2: Fix Brier_sample docstring: "on regression targets" → "binarized at a threshold"
- F3: Add NaN-swallowing note to both Brier docstrings (NumPy comparison semantics)
- F5: Document the Brier → Brier_sample breaking rename in the MetricCatalog CIC

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
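The F3 note is about a NumPy footgun worth seeing concretely: elementwise comparisons never return NaN, so a NaN observation silently binarizes to class 0 instead of poisoning the score. A minimal demonstration:

```python
import numpy as np

y_true = np.array([0.5, np.nan, 2.0])
y_bin = (y_true >= 1.0).astype(float)  # comparison with NaN yields False
print(y_bin)  # [0. 0. 1.] -- the NaN row became class 0, not NaN
```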
…efense-in-depth, extreme values

Close 7 gaps identified by the test-review audit:

- Step 1 [Critical/Green]: 15 golden-value tests for MSE, MSLE, RMSLE, EMD, Pearson, MTD, MCR, Coverage, MIS, CRPS, twCRPS, QIS in TestGoldenValues.
- Step 2 [High/Beige]: Classification evaluation tests — Brier_sample, Brier_point, AP+Brier_point combined, classification sample with profile resolution.
- Step 3 [High/Red]: NaN/Inf defense-in-depth integration tests proving EvaluationFrame rejects corrupted data before Brier's NaN-swallowing executes.
- Step 4 [Medium/Beige]: Multi-target evaluation test (regression + classification in the same config, evaluated via separate EvaluationFrames).
- Step 5 [Medium/Green]: Stateless execution test — calling evaluate() twice produces identical results.
- Step 6 [Medium/Red]: Extreme-value tests near float64 limits for MSE, CRPS, Brier, Coverage.
- Step 7 [Low/Green]: Migrate 14 module-level tests from DataFrame fixtures to raw NumPy arrays; remove the pandas import from test_metric_calculators.py.

Test count: 266 → 291 (+25 new tests). Lint clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
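The Step 6 motivation is easy to see: squared-error metrics overflow long before their inputs reach the float64 maximum, because the residual is squared. An illustration of the failure mode (not one of the actual tests):

```python
import numpy as np

y_true = np.array([1e154])
y_pred = np.array([-1e154])
# residual is 2e154; squaring gives 4e308 > float64 max (~1.8e308)
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # inf
```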
…Manager

- Remove filter_step_wise_evaluation() — defined but never called (-30 lines)
- Remove aggregate_month_wise_evaluation() — defined but never called (-83 lines)
- Remove unused BaseEvaluationMetrics import (was only used by aggregate)
- Remove vestigial self.is_sample assignment (set but never read)
- Retain self.actual/self.predictions (still tested by test_documentation_contracts.py reflective test)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dependency

BREAKING CHANGE: EvaluationManager and PandasAdapter have been removed. Use NativeEvaluator with EvaluationFrame directly. Adapters belong in the calling repository (e.g. views-pipeline-core's EvaluationAdapter).

Source deletions:
- views_evaluation/evaluation/evaluation_manager.py (607 lines)
- views_evaluation/adapters/pandas.py (150 lines)
- Legacy dispatch dicts and the calculate_ap alias from native_metric_calculators.py

Test deletions (10 files, ~1800 lines):
- test_evaluation_manager.py, test_evaluation_schemas.py
- test_parity_green.py, test_parity_beige.py, test_parity_red.py
- test_parity_adapter_transfer.py, test_data_contract.py
- test_documentation_contracts.py, test_metric_correctness.py
- conftest.py (legacy fixtures)

Test migrations:
- test_adversarial_inputs.py: removed the legacy TestAdversarialInputs class, kept TestAdversarialNativeInputs (9 tests)
- test_metric_calculators.py: replaced dispatch dict assertions with METRIC_MEMBERSHIP assertions; removed the pandas import
- test_metric_catalog.py: removed the dispatch dict sync test (single source of truth now)

Config:
- Removed pandas from pyproject.toml runtime dependencies
- Flipped the legacy_compatibility default to False in NativeEvaluator.evaluate()

Documentation:
- Deleted CICs/PandasAdapter.md
- Updated the README quick-start to the native-only API
- Updated the physical architecture standard (removed PHASE-3-DELETE entries)
- Updated ADR-042 (dispatch dicts note)

Preconditions confirmed:
- views-pipeline-core has EvaluationAdapter (mirrored PandasAdapter)
- Shadow parity verified and scaffolding removed (commit 84a997b)
- All model repos handle their own inverse transformations (r2darts2, stepshifter, baseline, hydranet verified)

Result: 228 tests passing, 0 lint errors. Pure Math Engine achieved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
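The migration the BREAKING CHANGE implies can be sketched like this (constructor signatures are assumed, not taken from the released API; EvaluationFrame and NativeEvaluator are the names from this PR):

```python
import numpy as np

# Phase 3 moves the pandas boundary to the caller: the calling repo
# converts its own tabular data to raw arrays, then builds the frame
# directly instead of going through the removed PandasAdapter.

# Caller-side data, formerly DataFrame columns:
records = {"actuals": [0.0, 1.0, 3.0], "predictions": [0.1, 0.9, 2.8]}

y_true = np.asarray(records["actuals"], dtype=float)
y_pred = np.asarray(records["predictions"], dtype=float)[np.newaxis, :]  # 2-D

# With the real package this would be (kwargs assumed):
#   frame = EvaluationFrame(y_true=y_true, y_pred=y_pred)
#   report = NativeEvaluator(config).evaluate(frame)
print(y_pred.shape)  # (1, 3)
```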
…xamples

- Update examples/using_native_api.py and evaluate_native_prototype.py to use EvaluationFrame directly (removed PandasAdapter imports)
- Delete examples/quickstart.ipynb (entirely EvaluationManager-based)
- Update integration_guide.md: remove the legacy API section, update the architecture diagram, update the code example to the native-only path
- Update CIC Known Deviations: remove resolved C-01 references from NativeEvaluator.md and MetricCatalog.md
- Update the risk register: close C-01, C-04, C-06, C-08 (3 open concerns remain)
- Update README: remove EvaluationManager from the component table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address all findings from the review-base-docs audit:

- M1: ADR-011 — update 3 EvaluationManager references to "Pipeline Core (external)"
- M2: ADR-040 — rename the PandasAdapter section, update it to EvaluationFrame construction
- M3: evaluation_concepts.md — "EvaluationManager assesses" → "evaluation framework assesses"
- L1: EvaluationFrame CIC — 3 PandasAdapter references → "external adapters"
- L1: EvaluationReport CIC — remove EvaluationManager from the consumer list
- L2: logging standard — remove EvaluationManager from the orchestration example
- L3: checklist — "(PHASE-3-DELETE)" → "(removed in Phase 3)"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ion, add empty-config red test
- Ignorance: hand-computed golden value with known bin distribution (log2(8/3))
- AP: oracle test using sklearn.metrics.average_precision_score
- Fix NumPy deprecation: float(np.quantile(..., axis=1)) → .item() in QIS test
- Empty config red test: documents C-02 gap — NativeEvaluator({}) accepted at init but fails at evaluate() time
231 tests passing, 0 warnings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
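The np.quantile fix above concerns a NumPy deprecation: since NumPy 1.25, calling float() on a size-1 array that still has a dimension raises a DeprecationWarning, while .item() extracts the scalar cleanly. A minimal reproduction (values illustrative):

```python
import numpy as np

samples = np.array([[1.0, 2.0, 3.0, 4.0]])   # one target, four samples
q = np.quantile(samples, 0.5, axis=1)        # shape (1,), not a scalar
value = q.item()                             # preferred over float(q)
print(value)  # 2.5
```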
- NativeEvaluator CIC: replace deleted parity test refs with adversarial test refs
- NativeEvaluator CIC: drop the EvaluationManager comparison in Known Deviations
- ADR-012: "and PandasAdapter" → "and external adapters"
- ADR-014: "in EvaluationManager or Adapters" → "in EvaluationFrame constructor or NativeEvaluator"
- ADR-021: replace PandasAdapter with EvaluationReport in the example list
- Update "Last reviewed" dates on 3 CICs to 2026-04-02

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump version 0.4.0 → 0.5.0 for Phase 3 breaking changes (closes C-11)
- Add pandas as an optional dependency: `pip install views_evaluation[dataframe]` for to_dataframe() support (closes C-12)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
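The extras mechanism might be declared roughly like this in pyproject.toml (a hypothetical fragment; the actual file is not shown in this PR page):

```toml
[project.optional-dependencies]
# pandas is only needed for EvaluationReport.to_dataframe();
# install with: pip install views_evaluation[dataframe]
dataframe = ["pandas"]
```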
Summary
Removes EvaluationManager, PandasAdapter, the pandas runtime dependency, and all legacy dispatch dicts (-2,984 lines). Pure Math Engine achieved.

Breaking Changes
- EvaluationManager and PandasAdapter removed from the public API
- pandas no longer a runtime dependency
- legacy_compatibility default flipped to False in NativeEvaluator.evaluate()
- Brier field renamed to Brier_sample in ClassificationSampleEvaluationMetrics

Risk Register
6 concerns closed (C-01, C-03, C-04, C-06, C-08, C-09). 3 remain open.
Test plan
- All tests pass (`conda run --name views_pipeline pytest tests/ -v`)
- Lint clean (`conda run --name views_pipeline ruff check .`)
- `validate_docs.sh` passes

🤖 Generated with Claude Code