
release: v0.5.0 — EvaluationFrame architecture, metric catalog, Brier variants#19

Open
Polichinel wants to merge 67 commits into main from development

Conversation

@Polichinel
Collaborator

⚠️ COORDINATION REQUIRED

DO NOT MERGE until views-pipeline-core development branch is also ready to merge to main. This release removes EvaluationManager which pipeline-core's main branch still imports. Both repos must be released together.

Merge order:

  1. Merge views-pipeline-core development → main (contains NativeEvaluator migration)
  2. Merge this PR (views-evaluation development → main)
  3. Verify cross-repo integration

Summary

67 commits spanning the full EvaluationFrame refactor, metric catalog adoption, and Phase 3 purge.

Breaking changes

  • EvaluationManager and PandasAdapter deleted — replaced by NativeEvaluator + EvaluationReport
  • pandas is now an optional dependency ([dataframe] extra)
  • Brier_sample/Brier_point replaced by Brier_cls_point, Brier_cls_sample, Brier_rgs_sample
  • Version: 0.4.0 → 0.5.0

New features

  • EvaluationFrame: pure-NumPy validated container (Level 0 core)
  • NativeEvaluator: stateless evaluation engine with 3-schema support
  • MetricCatalog (ADR-042): genome registry + Chain of Responsibility resolver
  • Named evaluation profiles (base, hydranet_ucdp)
  • 25 metrics in catalog (17 implemented), including 3 Brier variants, 2 QS variants, MCR, MIS, QIS, Coverage, Ignorance
  • EvaluationReport with to_dict() and to_dataframe() export
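The MetricCatalog bullet describes a Chain-of-Responsibility resolver layered over named profiles. The following is a minimal sketch of that resolution order (call-site overrides win over the named profile, which wins over base defaults); the names `BASE_PROFILE`, `HYDRANET_UCDP_PROFILE`, and `resolve_metric_params` mirror terms used in this PR, but the real signatures in views-evaluation may differ.

```python
from typing import Any, Dict, Optional

# Hypothetical base defaults and profile overrides, keyed by metric name.
BASE_PROFILE: Dict[str, Dict[str, Any]] = {
    "QS_sample": {"quantile": 0.99},
    "Brier_rgs_sample": {"threshold": 1.0},
}

HYDRANET_UCDP_PROFILE: Dict[str, Dict[str, Any]] = {
    # A named profile only lists the keys it overrides.
    "Brier_rgs_sample": {"threshold": 0.0},
}

def resolve_metric_params(
    metric: str,
    profile: Optional[Dict[str, Dict[str, Any]]] = None,
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Walk the chain: base defaults -> named profile -> explicit overrides."""
    params: Dict[str, Any] = dict(BASE_PROFILE.get(metric, {}))
    if profile is not None:
        params.update(profile.get(metric, {}))
    if overrides:
        params.update(overrides)
    return params

print(resolve_metric_params("Brier_rgs_sample", HYDRANET_UCDP_PROFILE))
# -> {'threshold': 0.0}
```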

Governance

  • ADR suite renumbered (000–042)
  • CICs for EvaluationFrame, NativeEvaluator, MetricCatalog, EvaluationReport, PandasAdapter
  • Technical risk register: 3 open (C-02, C-05, C-13), 15 closed

Tests

  • 249 tests passing (up from ~77 pre-refactor)
  • Red/Beige/Green taxonomy (ADR-020)
  • Golden-value tests for all implemented metrics

Test plan

  • 249 tests passing locally
  • CI passes (pandas importorskip fix in place)
  • Cross-repo integration verified with updated views-pipeline-core

🤖 Generated with Claude Code

Polichinel and others added 30 commits January 23, 2026 11:00
Adds a new test suite in tests/test_documentation_contracts.py to
verify the contracts and claims made in the project's documentation.
These tests treat the documentation as hypotheses and verify them against
the actual behavior of the EvaluationManager.

Key findings from the tests:
- The EvaluationManager implicitly converts raw float point predictions
  to single-element numpy arrays, which contradicts the documentation's
  claim that this would cause an error.

The documentation has been updated to reflect this behavior:
- eval_lib_imp.md is updated to clarify the implicit conversion and
  change the 'Mandatory Reconciliation Step' to 'Recommended'.
- stepshifter_full_imp_report.md is updated with a final conclusion
  clarifying the EvaluationManager's actual behavior.

Also organizes the analysis reports into a new reports/ directory.
Adds a new document outlining the comprehensive plan for Phase 4:
Non-Functional & Operational Readiness testing. This includes detailed
sections on Performance & Scalability Benchmarking, Logging and
Observability Verification, Memory Profiling, and Concurrency/Parallelism
Safety as a future consideration. This plan aims to ensure the library's
suitability for critical infrastructure environments.
Adds a robust test suite covering adversarial inputs and metric correctness,
and generates a technical debt backlog document.

Phase 2 (Adversarial & Edge-Case Testing) findings:
- The  is not robust to non-finite numbers (, ),
  crashing with  from 's validation.
- It crashes with  on empty  lists (from ).
- It crashes with  on empty  DataFrames.
- It crashes with  on non-overlapping indices (from ).
- This highlights a lack of internal input validation and graceful error handling.

Phase 3 (Data-Centric & Metric-Specific Validation) findings:
- Verified numerical correctness of  with golden datasets.
- Confirmed  metric correctly uses  kwarg.
- Verified  for both point and uncertainty predictions against .

A  document has been created, detailing these
fragilities and recommending future improvements for robustness.
Moved  fixture to  for shared access.
Updates the  (VIEWS Evaluation Technical Integration Guide)
to incorporate critical findings from adversarial testing (Phase 2),
providing a clearer picture of the library's behavior and limitations.
This includes:
- A new section (3.5) detailing 'Robustness Limitations & Input Validation Responsibility',
  highlighting the library's fragility to non-finite numbers and malformed
  structural data, and emphasizing consumer responsibility for pre-validation.
- Enhanced Section 3.4 on 'Data-State Coherency' to clarify that the
   applies transformations without validating mathematical
  appropriateness.
- A cross-reference to  for a comprehensive
  list of known issues.

Updates the  (Forensic Analysis of
views-r2darts2 Evaluation Interface) with minor contextual notes:
- A clarification in Section 4 acknowledging that  has since
  been updated.
- A clarification in Section 5, Point 2, regarding 'Point Prediction Format Ambiguity',
  reflecting that  implicitly converts raw floats, making
  strict consumer-side reconciliation less critical for runtime.
Addressed linting errors in  and .
- : Replaced / with / for boolean comparisons.
- : Removed unused variable assignments for , , , , , , and .

These changes ensure adherence to linting standards within the test suite.
Removed  from  as it was an unused import,
identified by the ruff linter.
Applied automated  changes to files outside the  directory after confirming all tests pass.

- : Removed unused  import and fixed f-string formatting.
- : Removed unused  and  typing imports.
- : Removed unused , , and  typing imports.

These minor changes ensure code quality and adherence to linting standards throughout the project.
This commit introduces comprehensive documentation and a rigorous test suite to clarify and verify the core concepts of the views-evaluation library.

- Adds `documentation/evaluation_concepts.md` to clearly explain the differences between partitions, sets, and the three evaluation schemas (time-series-wise, step-wise, and month-wise).
- Adds `documentation/integration_guide.md`, a step-by-step guide for developers on how to format their data and integrate a new model with the library.
- Adds `tests/test_evaluation_schemas.py`, a permanent and rigorous test suite that programmatically verifies the grouping logic of the three evaluation schemas against the documentation.
- Fixes test pollution issues discovered during development by isolating mocks within the new test suite, ensuring the stability of the entire test run.
Adds a prominent note to ADR-001 to clarify that several documented metrics are not yet implemented in the code. This makes the discrepancy clear to developers and aligns the documentation with the current state of the project.
…n tests

- Hardens `EvaluationManager.validate_predictions` to strictly enforce the "exactly one column" contract, preventing crashes from duplicate or extra columns.
- Adds `tests/test_data_contract.py` to verify the single-target and single-column requirements.
- Updates `documentation/integration_guide.md` with a "Common Pitfalls" section to clarify MultiIndex and column usage.
- Updates `reports/technical_debt_backlog.md` to reflect resolved validation issues.
- Includes recent verification reports and drafts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… (v0.4.0)

- EvaluationManager now dispatches on {regression,classification} x {point,uncertainty}
- Task type declared explicitly in config; prediction type detected from data shape
- Config schema: regression_targets, regression_point_metrics,
  regression_uncertainty_metrics, classification_targets,
  classification_point_metrics, classification_uncertainty_metrics
- Legacy config keys (targets, metrics) accepted with loud deprecation warning
- _normalise_config() and _validate_config() enforce fail-loud-fail-fast contract
- calculate_ap() no longer applies internal threshold; expects pre-binarised actuals
- AP moved to CLASSIFICATION_POINT_METRIC_FUNCTIONS only
- CRPS moved to uncertainty dicts only (regression and classification)
- Four new metric dataclasses mirror the four dispatch dicts
- transform_data() crash on unknown prefix replaced with logger.warning + identity
- EvaluationManager.__init__ no longer accepts metrics_list (breaking change)
- 70 tests passing, ruff clean

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Renames all 'uncertainty' metrics and classes to 'sample' (e.g., regression_sample_metrics)
- Updates EvaluationManager to use the new terminology while maintaining legacy aliases
- Adds a prominent Migration Notice and Configuration Schema to README.md
- Updates all tests and documentation to align with the new ontology
- Adds MIT License
…fication-suite

Feature/documentation verification suite
- Defined EvaluationFrame contract and pure-numpy logic.
- Implemented PandasAdapter for backward-compatible alignment.
- Created Parity Test Campaign (Green/Beige/Red teams).
- Documented performance scaling (14x speedup for sample metrics).
- Identified and documented legacy bugs (step-wise truncation).
Polichinel and others added 25 commits March 13, 2026 12:13
Add Magnitude Calibration Ratio (MCR_point, MCR_sample) metric with full
catalog/dispatch/dataclass integration. Add hydranet_ucdp evaluation profile.
Fix EvaluationManager._validate_config to accept sample-only models. Add
beige+red tests for twCRPS, QIS, MIS, and MCR per ADR-020. Remove dead code
from NativeEvaluator (unused metrics_map and dispatch dict imports). Sync
Brier/Jeffreys fields into RegressionSampleEvaluationMetrics dataclass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… cleanup

- Fix EvaluationReport CIC (constructor signature, to_json→to_dict, FM1 guard)
- Fix integration guide broken API example (adapt→from_dataframes)
- Remap all stale ADR references (old flat 001-008 → grouped 010-042)
- Update PandasAdapter CIC to reflect deprecated status and silent-skip behavior
- Add config keys, properties, and ADR-042 to EvaluationFrame/NativeEvaluator CICs
- Remove stale CIC README entries (ModelRunner, VolumeHandler, BoundaryValidator)
- Overhaul README: add Quick Start, 2×2 metrics tables, 3-layer architecture,
  evaluation profiles, updated project structure
- Add EvaluationConfig TypedDict (config_schema.py), deprecate to_dataframe('raw')
- Remove Brier/Jeffreys from regression sample dispatch, remove 17 unused aliases
- Add structural consistency test (metric membership ↔ dataclass fields)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nd guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sk register, hardened protocol

- Standardize all 17 ADR headers to base_docs format (Status/Date/Deciders/Consulted/Informed)
- Remove silicon-based agents from Deciders across all ADRs
- Convert table-format headers (030-042) to YAML-style format
- Replace ADR template with decision-focused base_docs template
- Add sections 9-12 (Incorrect Usage, Test Alignment, Evolution, Known Deviations) to 4 existing CICs
- Create MetricCatalog CIC documenting genome registry and resolver
- Create ADR-023 (Technical Risk Register) with tier/trigger/source format
- Add hardened protocol for numerical evaluation contributors
- Add physical architecture standard with critical bundling assessment
- Add INSTANTIATION_CHECKLIST.md and validate_docs.sh
- Update ADR and CIC READMEs with governance structure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ics with full test coverage

Add 4 new threshold-dependent metric functions to the evaluation framework:
- Brier_sample: binary classification metric for ensemble predictions
- Brier_point: binary classification metric for point probability predictions
- QS_sample: quantile score (pinball loss) for ensemble predictions
- QS_point: quantile score (pinball loss) for point predictions

All metrics registered in MetricCatalog with genome declarations, added to
METRIC_MEMBERSHIP and legacy dispatch dicts, with BASE_PROFILE defaults
(threshold=1.0, quantile=0.99).

Test coverage: 22 new tests (8 golden-value, 9 beige, 5 red) including a
finding that Brier's comparison-based binarization swallows NaN rather
than propagating it (documented in red tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
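The QS metrics above are described as pinball loss with a default quantile of 0.99. A self-contained sketch of the textbook pinball loss follows; the catalog's actual `QS_point`/`QS_sample` signatures may differ.

```python
import numpy as np

def quantile_score(y_true: np.ndarray, y_pred_q: np.ndarray, quantile: float = 0.99) -> float:
    """Pinball loss for a single predicted quantile.

    Standard formula: tau * (y - q) when y >= q, else (1 - tau) * (q - y).
    """
    diff = y_true - y_pred_q
    return float(np.mean(np.where(diff >= 0, quantile * diff, (quantile - 1.0) * diff)))
```

At tau = 0.5 this reduces to half the mean absolute error, a handy sanity check.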
…ariant

- Remove deprecation_msgs.py (dead code: raise_legacy_scale_msg never called)
- Remove legacy PointEvaluationMetrics and SampleEvaluationMetrics (unused,
  replaced by 2×2 typed dataclasses)
- Add y_pred.ndim != 2 validation to EvaluationFrame._validate() (closes C-03)
- Add tests: test_y_pred_1d_raises, test_y_pred_3d_raises
- Fix lint: remove unused variable in TestQuantileScoreBeige

Risk register: C-03 and C-09 closed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…name note

- F1: Add probability-range note to Brier_point docstring (y_pred should be [0,1])
- F2: Fix Brier_sample docstring: "on regression targets" → "binarized at a threshold"
- F3: Add NaN-swallowing note to both Brier docstrings (NumPy comparison semantics)
- F5: Document Brier → Brier_sample breaking rename in MetricCatalog CIC

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
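The NaN-swallowing behavior noted in F3 comes straight from NumPy's comparison semantics, and can be demonstrated in two lines:

```python
import numpy as np

# NaN compares False against any threshold, so comparison-based
# binarisation silently maps NaN to the "no event" class instead of
# propagating it -- the behavior the red tests document.
samples = np.array([0.0, 2.0, np.nan])
binarised = samples > 1.0  # third element becomes False, not NaN
```

This is why the defense-in-depth tests later in this PR verify that corrupted data is rejected at the EvaluationFrame boundary, before any binarisation runs.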
…efense-in-depth, extreme values

Close 7 gaps identified by test-review audit:

Step 1 [Critical/Green]: 15 golden-value tests for MSE, MSLE, RMSLE, EMD,
  Pearson, MTD, MCR, Coverage, MIS, CRPS, twCRPS, QIS in TestGoldenValues.
Step 2 [High/Beige]: Classification evaluation tests — Brier_sample, Brier_point,
  AP+Brier_point combined, classification sample with profile resolution.
Step 3 [High/Red]: NaN/Inf defense-in-depth integration tests proving
  EvaluationFrame rejects corrupted data before Brier's NaN-swallowing executes.
Step 4 [Medium/Beige]: Multi-target evaluation test (regression + classification
  in same config, evaluated via separate EvaluationFrames).
Step 5 [Medium/Green]: Stateless execution test — evaluate() twice produces
  identical results.
Step 6 [Medium/Red]: Extreme-value tests near float64 limits for MSE, CRPS,
  Brier, Coverage.
Step 7 [Low/Green]: Migrate 14 module-level tests from DataFrame fixtures to
  raw NumPy arrays. Remove pandas import from test_metric_calculators.py.

Test count: 266 → 291 (+25 new tests). Lint clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
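Coverage and MIS, exercised by the golden-value tests above, follow the textbook interval-score definitions (Gneiting and Raftery); a sketch under the assumption that the catalog uses the standard (1 - alpha) parameterisation:

```python
import numpy as np

def mean_interval_score(y: np.ndarray, lower: np.ndarray, upper: np.ndarray,
                        alpha: float = 0.1) -> float:
    """Standard (1 - alpha) interval score, averaged over observations:
    interval width plus a 2/alpha penalty per unit of miss on either side."""
    width = upper - lower
    below = (2.0 / alpha) * (lower - y) * (y < lower)
    above = (2.0 / alpha) * (y - upper) * (y > upper)
    return float(np.mean(width + below + above))

def coverage(y: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> float:
    """Fraction of observations falling inside the interval."""
    return float(np.mean((y >= lower) & (y <= upper)))
```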
…Manager

- Remove filter_step_wise_evaluation() — defined but never called (-30 lines)
- Remove aggregate_month_wise_evaluation() — defined but never called (-83 lines)
- Remove unused BaseEvaluationMetrics import (was only used by aggregate)
- Remove vestigial self.is_sample assignment (set but never read)
- Retain self.actual/self.predictions (still tested by
  test_documentation_contracts.py reflective test)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dependency

BREAKING CHANGE: EvaluationManager and PandasAdapter have been removed.
Use NativeEvaluator with EvaluationFrame directly. Adapters belong in
the calling repository (e.g. views-pipeline-core's EvaluationAdapter).

Source deletions:
- views_evaluation/evaluation/evaluation_manager.py (607 lines)
- views_evaluation/adapters/pandas.py (150 lines)
- Legacy dispatch dicts and calculate_ap alias from native_metric_calculators.py

Test deletions (10 files, ~1800 lines):
- test_evaluation_manager.py, test_evaluation_schemas.py
- test_parity_green.py, test_parity_beige.py, test_parity_red.py
- test_parity_adapter_transfer.py, test_data_contract.py
- test_documentation_contracts.py, test_metric_correctness.py
- conftest.py (legacy fixtures)

Test migrations:
- test_adversarial_inputs.py: removed legacy TestAdversarialInputs class,
  kept TestAdversarialNativeInputs (9 tests)
- test_metric_calculators.py: replaced dispatch dict assertions with
  METRIC_MEMBERSHIP assertions; removed pandas import
- test_metric_catalog.py: removed dispatch dict sync test (single source
  of truth now)

Config:
- Removed pandas from pyproject.toml runtime dependencies
- Flipped legacy_compatibility default to False in NativeEvaluator.evaluate()

Documentation:
- Deleted CICs/PandasAdapter.md
- Updated README quick-start to native-only API
- Updated physical architecture standard (removed PHASE-3-DELETE entries)
- Updated ADR-042 (dispatch dicts note)

Preconditions confirmed:
- views-pipeline-core has EvaluationAdapter (mirrored PandasAdapter)
- Shadow parity verified and scaffolding removed (commit 84a997b)
- All model repos handle own inverse transformations (r2darts2, stepshifter,
  baseline, hydranet verified)

Result: 228 tests passing, 0 lint errors. Pure Math Engine achieved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…xamples

- Update examples/using_native_api.py and evaluate_native_prototype.py to
  use EvaluationFrame directly (removed PandasAdapter imports)
- Delete examples/quickstart.ipynb (entirely EvaluationManager-based)
- Update integration_guide.md: remove legacy API section, update architecture
  diagram, update code example to native-only path
- Update CIC Known Deviations: remove resolved C-01 references from
  NativeEvaluator.md and MetricCatalog.md
- Update risk register: close C-01, C-04, C-06, C-08 (3 open concerns remain)
- Update README: remove EvaluationManager from component table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address all findings from review-base-docs audit:

M1: ADR-011 — update 3 EvaluationManager references to "Pipeline Core (external)"
M2: ADR-040 — rename PandasAdapter section, update to EvaluationFrame construction
M3: evaluation_concepts.md — "EvaluationManager assesses" → "evaluation framework assesses"
L1: EvaluationFrame CIC — 3 PandasAdapter references → "external adapters"
L1: EvaluationReport CIC — remove EvaluationManager from consumer list
L2: logging standard — remove EvaluationManager from orchestration example
L3: checklist — "(PHASE-3-DELETE)" → "(removed in Phase 3)"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ion, add empty-config red test

- Ignorance: hand-computed golden value with known bin distribution (log2(8/3))
- AP: oracle test using sklearn.metrics.average_precision_score
- Fix NumPy deprecation: float(np.quantile(..., axis=1)) → .item() in QIS test
- Empty config red test: documents C-02 gap — NativeEvaluator({}) accepted at
  init but fails at evaluate() time

231 tests passing, 0 warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- NativeEvaluator CIC: replace deleted parity test refs with adversarial test refs
- NativeEvaluator CIC: drop EvaluationManager comparison in Known Deviations
- ADR-012: "and PandasAdapter" → "and external adapters"
- ADR-014: "in EvaluationManager or Adapters" → "in EvaluationFrame constructor or NativeEvaluator"
- ADR-021: replace PandasAdapter with EvaluationReport in example list
- Update Last reviewed dates on 3 CICs to 2026-04-02

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump version 0.4.0 → 0.5.0 for Phase 3 breaking changes (closes C-11)
- Add pandas as optional dependency: `pip install views_evaluation[dataframe]`
  for to_dataframe() support (closes C-12)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
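Making pandas an optional extra implies a lazy import guarded by a helpful error. A sketch of that pattern, not the actual EvaluationReport method, which may be shaped differently:

```python
def to_dataframe(report_rows):
    """Export rows as a DataFrame, importing pandas lazily.

    Sketch of the optional-dependency pattern: the import happens at call
    time, and a missing pandas produces an actionable error naming the extra.
    """
    try:
        import pandas as pd
    except ImportError as exc:
        raise ImportError(
            "pandas is required for to_dataframe(); "
            "install it with: pip install views_evaluation[dataframe]"
        ) from exc
    return pd.DataFrame(report_rows)
```

Callers that never touch the DataFrame path pay no import cost and need no pandas install.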
feat!: threshold metrics, Phase 3 purge, and governance adoption
…ntext, step sentinel

Address C-16, C-17, C-18 identified by risk register review with TDD:

- C-16: Wrap metric function calls in _calculate_metrics() with try/except
  that re-raises as ValueError naming the metric, task, and pred_type
- C-17: Replace hardcoded max_allowed_step=999 with float('inf') so steps
  >= 1000 are not silently dropped
- C-18: Add bounds validation in resolve_metric_params() for alpha, quantile,
  lower_quantile, upper_quantile — all must be in (0, 1). Cross-validation
  for QIS lower_quantile < upper_quantile

Also: update CICs (MetricCatalog, NativeEvaluator) and ADRs (011, 014) with
Known Deviations sections documenting C-02 and C-05. Close C-14 (stale
editable install metadata). Upgrade C-02 from Tier 3 to Tier 2.

9 new tests, 240 total passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s-schema consistency, C-10

- Add object-dtype rejection to EvaluationFrame._validate() (ADR-011 Pure NumPy contract)
- Remove 22 lines of dead pandas/object-dtype branches from _guard_shapes (closes C-10)
- Add 5 new tests: object-dtype rejection (2), malformed report dict (1),
  NaN metric detectability (1), cross-schema MSE consistency (1)
- 245 tests passing, risk register: 3 open concerns remain

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
test_evaluation_report.py imported pandas at module level, causing
a collection error in CI where pandas is not installed (optional
dependency via [dataframe] extra). Use pytest.importorskip to skip
gracefully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: close risk register concerns C-16, C-17, C-18
Replace Brier_sample/Brier_point with three task-explicit variants:
- Brier_cls_point: classification point (y_pred is probability)
- Brier_cls_sample: classification sample (average MC Dropout probabilities)
- Brier_rgs_sample: regression sample (binarise count samples at threshold)

Brier_rgs_point intentionally omitted — regression point estimates are not
probabilities. The _cls_/_rgs_ infix makes the task context self-documenting.

Critical fix: Brier_cls_sample uses mean(y_pred) instead of mean(y_pred > threshold),
which was broken for probability samples (all probabilities > 0 → p_hat ≈ 1.0).

Profile defaults: classification Brier threshold=0.0 (PDF: event y > 0),
regression Brier threshold=1.0 (binarise at 1 fatality). Catalog size 24 → 25.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All three Brier variants now default to threshold=0.0 in the base profile,
matching the Pre-Release Note 05 definition: Brier evaluates the binary
event "any fatality occurred" (y > 0). On integer-valued UCDP data,
y > 0 and y >= 1 are equivalent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
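The three Brier variants described above can be sketched in pure NumPy. Assumed shapes, `y_true` as `(N,)` and samples as `(N, S)`, and the function names mirror the PR text, but the catalog's real signatures may differ:

```python
import numpy as np

def brier_cls_point(y_true, p_pred, threshold=0.0):
    """Classification point: y_pred is already a probability."""
    o = (y_true > threshold).astype(float)
    return float(np.mean((p_pred - o) ** 2))

def brier_cls_sample(y_true, p_samples, threshold=0.0):
    """Classification sample: average the probability samples directly.
    Binarising probabilities here was the bug fixed in this release
    (all probabilities > 0 would collapse to p_hat near 1.0)."""
    p_hat = p_samples.mean(axis=1)
    o = (y_true > threshold).astype(float)
    return float(np.mean((p_hat - o) ** 2))

def brier_rgs_sample(y_true, count_samples, threshold=0.0):
    """Regression sample: binarise count samples at the threshold;
    the exceedance frequency is the forecast probability."""
    p_hat = (count_samples > threshold).mean(axis=1)
    o = (y_true > threshold).astype(float)
    return float(np.mean((p_hat - o) ** 2))
```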
feat: explicit Brier score variants for 2×2 evaluation matrix

Copilot AI left a comment


Pull request overview

Release v0.5.0 introduces the new EvaluationFrame-based “native” evaluation architecture, a MetricCatalog + named profiles for metric hyperparameters, and updated reporting/export APIs, while making pandas optional.

Changes:

  • Added EvaluationFrame + NativeEvaluator + EvaluationReport as the core native evaluation path with schema regrouping and catalog-driven metric dispatch.
  • Introduced MetricCatalog (genome + Chain-of-Responsibility param resolver) and named evaluation profiles (base, hydranet_ucdp).
  • Updated packaging/docs/tests for the new architecture (pandas optional extra, new/updated guides, extensive tests, removed outdated notebook/ADRs).

Reviewed changes

Copilot reviewed 88 out of 90 changed files in this pull request and generated 4 comments.

Per-file summary (path followed by description):
views_evaluation/profiles/hydranet_ucdp.py Adds HydraNet/UCDP profile overrides on top of base metric hyperparameters.
views_evaluation/profiles/base.py Defines the baseline, system-wide metric hyperparameters used by the catalog resolver.
views_evaluation/profiles/__init__.py Registers named evaluation profiles for config selection.
views_evaluation/evaluation/native_evaluator.py Implements schema regrouping (month/sequence/step) and metric dispatch using catalog + profiles.
views_evaluation/evaluation/metrics.py Refactors legacy metric dataclasses into a 2×2 task/prediction-type matrix; lazy pandas import for DataFrame export.
views_evaluation/evaluation/metric_catalog.py Adds metric registry, membership, and hyperparameter resolution + validation logic.
views_evaluation/evaluation/evaluation_report.py Adds structured report container with schema access + dict/dataframe export.
views_evaluation/evaluation/evaluation_frame.py Adds validated, pure-NumPy container + grouping/selection utilities.
views_evaluation/evaluation/config_schema.py Adds TypedDict documenting expected config keys for NativeEvaluator.
views_evaluation/adapters/__init__.py Keeps adapters package placeholder for future framework bridges.
views_evaluation/__init__.py Exposes the new public API surface (EvaluationFrame, NativeEvaluator, EvaluationReport, catalog utilities, profiles).
tests/test_evaluation_report.py Adds direct unit coverage for EvaluationReport APIs and edge/failure modes.
tests/test_adversarial_inputs.py Adds adversarial tests ensuring fail-loud behavior at the EvaluationFrame boundary and evaluator dispatch.
reports/technical_debt_backlog.md Adds/updates technical debt tracking and status notes.
reports/stepshifter_full_imp_report.md Adds forensic analysis of upstream evaluation interface expectations.
reports/proposal_manifest_driven_evaluation.md Adds proposal document for manifest-driven orchestration architecture.
reports/post_mortems/post_mortem_report.md Adds post-mortem documenting documentation/code verification work.
reports/post_mortems/2026-02-23_evaluation_ontology_liberation_post_mortem.md Adds post-mortem capturing ontology/design decisions and migration learnings.
reports/post_mortem_multi_target_investigation.md Adds post-mortem on multi-target support limits and contract hardening.
reports/phase_4_plan.md Adds operational readiness plan (benchmarking/logging/memory profiling).
reports/phase_2_adversarial_testing_report.md Adds adversarial testing findings and recommendations.
reports/documentation_discrepancy_report.md Adds documentation discrepancy summary and follow-up recommendations.
reports/2026-02-25_evaluation_frame_refactor/10_orchestrator_migration_plan.md Documents migration plan for moving alignment/adaptation upstream.
reports/2026-02-25_evaluation_frame_refactor/09_post_refactor_status.md Documents post-refactor status and next steps.
reports/2026-02-25_evaluation_frame_refactor/07_implementation_plan.md Documents phased implementation plan for EvaluationFrame migration.
reports/2026-02-25_evaluation_frame_refactor/06_investigation_summary.md Documents investigation findings and recommendations for the new boundary.
reports/2026-02-25_evaluation_frame_refactor/05_probabilistic_scaling_benchmark.md Adds benchmark results motivating dense NumPy representation.
reports/2026-02-25_evaluation_frame_refactor/04_parity_investigation_log.md Adds parity investigation log and identified legacy behaviors/bugs.
reports/2026-02-25_evaluation_frame_refactor/03_evaluation_frame_contract.md Adds EvaluationFrame contract specification document.
reports/2026-02-25_evaluation_frame_refactor/02_current_alignment_semantics.md Documents legacy alignment/regrouping semantics to preserve/replace.
reports/2026-02-25_evaluation_frame_refactor/01_investigation_plan.md Adds initial investigation plan for the refactor.
pyproject.toml Bumps version to 0.5.0; makes pandas optional via [dataframe] extra; moves dev deps into dev group.
LICENSE Adds MIT license text and copyright.
examples/using_native_api.py Adds example showing how to use the new native API.
examples/quickstart.ipynb Removes outdated notebook using the legacy API.
examples/evaluate_native_prototype.py Adds prototype/demo script for grouping semantics on EvaluationFrame.
examples/benchmark_probabilistic_scaling.py Adds benchmarking script comparing legacy vs native representations.
documentation/validate_docs.sh Adds a doc consistency validation script for governance artifacts.
documentation/standards/physical_architecture_standard.md Adds/updates physical architecture and layering/file-structure standard.
documentation/standards/logging_and_observability_standard.md Adds/updates logging/observability standard and scope guidance.
documentation/integration_guide.md Adds/updates integration guidance for the native API and data contract.
documentation/INSTANTIATION_CHECKLIST.md Adds adoption checklist for governance artifacts/standards.
documentation/evaluation_concepts.md Adds conceptual guide (schemas/parallelogram, partitions/sets).
documentation/contributor_protocols/silicon_based_agents.md Adds protocol governing AI-assisted changes and safety constraints.
documentation/contributor_protocols/hardened_protocol_template.md Adds hardened contributor protocol specific to numerical evaluation work.
documentation/contributor_protocols/carbon_based_agents.md Adds protocol defining human contributor responsibilities.
documentation/CICs/README.md Adds index for active Class Intent Contracts.
documentation/CICs/NativeEvaluator.md Adds intent contract for NativeEvaluator responsibilities/failure modes.
documentation/CICs/MetricCatalog.md Adds intent contract for MetricCatalog responsibilities/failure modes.
documentation/CICs/EvaluationReport.md Adds intent contract for EvaluationReport responsibilities/failure modes.
documentation/CICs/EvaluationFrame.md Adds intent contract for EvaluationFrame responsibilities/failure modes.
documentation/CICs/cic_template.md Adds CIC template for future intent contracts.
documentation/ADRs/README.md Reworks ADR index/numbering scheme and contribution guidance.
documentation/ADRs/adr_template.md Replaces ADR template with expanded structure and guidance.
documentation/ADRs/042_metric_catalog.md Adds/updates ADR for MetricCatalog + named profiles decision.
documentation/ADRs/041_evaluation_output_schema.md Adds/updates ADR for output schema direction and responsibilities.
documentation/ADRs/040_evaluation_input_schema.md Adds/updates ADR for input schema (native path) and identifier requirements.
documentation/ADRs/032_metric_calculation_schemas.md Adds/updates ADR describing month/step/sequence evaluation schemas.
documentation/ADRs/031_evaluation_metrics.md Adds/updates ADR describing metric set and implementation status notes.
documentation/ADRs/030_evaluation_strategy.md Adds/updates ADR for rolling-origin evaluation strategy.
documentation/ADRs/023_technical_risk_register.md Adds ADR formalizing the technical risk register artifact.
documentation/ADRs/022_evolution_and_stability.md Adds deferred ADR placeholder for evolution/stability rules.
documentation/ADRs/021_intent_contracts_for_classes.md Adds ADR requiring intent contracts for non-trivial classes.
documentation/ADRs/020_multi_perspective_testing.md Adds ADR for Green/Beige/Red testing taxonomy and parity mandate.
documentation/ADRs/014_boundary_contracts_and_validation.md Adds ADR for boundary contracts + config validation expectations (and deviations).
documentation/ADRs/013_observability_and_explicit_failure.md Adds ADR for fail-loud + persistent observability.
documentation/ADRs/012_authority_over_inference.md Adds ADR prohibiting semantic inference/sniffing across boundaries.
documentation/ADRs/011_topology_and_dependency_rules.md Adds ADR for strict layering and dependency rules.
documentation/ADRs/010_ontology_of_evaluation.md Adds ADR defining evaluation ontology and forbidden concepts.
documentation/ADRs/005_evaluation_output_schema.md Removes obsolete ADR superseded by new numbering/structure.
documentation/ADRs/004_evaluation_input_schema.md Removes obsolete ADR superseded by new numbering/structure.
documentation/ADRs/003_metric_calculation.md Removes obsolete ADR superseded by new numbering/structure.
documentation/ADRs/002_evaluation_strategy.md Removes obsolete ADR superseded by new numbering/structure.
documentation/ADRs/001_silicon_based_agent_protocol.md Adds ADR governing silicon-based agent usage and constraints.
documentation/ADRs/001_evaluation_metrics.md Removes obsolete ADR superseded by new numbering/structure.
documentation/ADRs/000_use_of_adrs.md Adds ADR formalizing ADR usage in this repo.
.gitignore Updates ignore patterns (notably adds reports/).


Comment on lines +14 to +18
Config dict for NativeEvaluator.

All keys are optional (total=False) to match the existing .get() patterns.
Downstream validators (EvaluationManager._validate_config) enforce
required-key semantics at runtime.

Copilot AI Apr 9, 2026

The EvaluationConfig docstring still references EvaluationManager._validate_config as the runtime validator, but this PR’s architecture removes EvaluationManager. Update the docstring to reflect the current reality (e.g., validation happens in NativeEvaluator.evaluate() / at the orchestration boundary, or is a known gap tracked by the risk register).

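A minimal sketch of what the corrected docstring could look like, assuming validation now happens at the NativeEvaluator boundary as the comment suggests. The key names `regression_targets`/`classification_targets` are taken from the snippet below in `_resolve_task_and_metrics`; the example target `"ln_ged_sb"` is a hypothetical placeholder:

```python
from typing import List, TypedDict


class EvaluationConfig(TypedDict, total=False):
    """Config dict for NativeEvaluator.

    All keys are optional (total=False) to match the existing .get()
    patterns. Required-key semantics are enforced at runtime at the
    orchestration boundary (NativeEvaluator.evaluate()); any remaining
    validation gap is tracked in the technical risk register.
    """

    regression_targets: List[str]
    classification_targets: List[str]


# TypedDict with total=False: missing keys simply fall back to .get() defaults.
cfg: EvaluationConfig = {"regression_targets": ["ln_ged_sb"]}
print(cfg.get("classification_targets", []))  # → []
```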
Comment on lines +26 to +29
def _validate(y_true: np.ndarray, y_pred: np.ndarray, identifiers: Dict[str, np.ndarray]):
n_rows = len(y_true)
if y_pred.shape[0] != n_rows:
raise ValueError(f"y_pred rows ({y_pred.shape[0]}) mismatch y_true ({n_rows})")

Copilot AI Apr 9, 2026

EvaluationFrame._validate() doesn’t enforce that y_true is 1D. A 2D array like shape (N, 1) would currently pass length checks and can cause subtle broadcasting/metric issues later. Consider adding an explicit y_true.ndim == 1 check (and a clear error message) to enforce the (N,) contract.

Comment on lines +41 to +50
def _resolve_task_and_metrics(self, ef: EvaluationFrame):
target = ef.metadata.get('target')
# Determine task from config
if target in self.config.get("regression_targets", []):
task = "regression"
elif target in self.config.get("classification_targets", []):
task = "classification"
else:
raise ValueError(f"Target {target} not found in config")


Copilot AI Apr 9, 2026

If ef.metadata lacks a 'target' key, target becomes None and the resulting error (Target None not found in config) is hard to diagnose. Consider explicitly validating that target is present/non-empty and raising a clearer error that mentions the required metadata key and shows available targets from the config.

Comment on lines +63 to +74
def to_dataframe(self, schema: str):
"""
Converts a specific schema's results into a Pandas DataFrame.
If schema='raw', returns the dictionary of mapped metrics dataclasses.
"""
if schema == "raw":
warnings.warn(
"to_dataframe(schema='raw') is deprecated. Use to_dict()['schemas'] instead.",
DeprecationWarning,
stacklevel=2,
)
return self._results

Copilot AI Apr 9, 2026

to_dataframe()’s docstring says schema='raw' returns “the dictionary of mapped metrics dataclasses”, but the implementation returns the raw internal _results dict (nested dicts of floats). Please update the docstring to match behavior (or change the behavior if the mapped-dataclass dict is what you intended).
