release: v0.5.0 — EvaluationFrame architecture, metric catalog, Brier variants #19
Polichinel wants to merge 67 commits into main from development
Conversation
Adds a new test suite in tests/test_documentation_contracts.py to verify the contracts and claims made in the project's documentation. These tests treat the documentation as hypotheses and verify them against the actual behavior of the EvaluationManager. Key findings from the tests: - The EvaluationManager implicitly converts raw float point predictions to single-element numpy arrays, which contradicts the documentation's claim that this would cause an error. The documentation has been updated to reflect this behavior: - eval_lib_imp.md is updated to clarify the implicit conversion and change the 'Mandatory Reconciliation Step' to 'Recommended'. - stepshifter_full_imp_report.md is updated with a final conclusion clarifying the EvaluationManager's actual behavior. Also organizes the analysis reports into a new reports/ directory.
Adds a new document outlining the comprehensive plan for Phase 4: Non-Functional & Operational Readiness testing. This includes detailed sections on Performance & Scalability Benchmarking, Logging and Observability Verification, Memory Profiling, and Concurrency/Parallelism Safety as a future consideration. This plan aims to ensure the library's suitability for critical infrastructure environments.
Adds a robust test suite covering adversarial inputs and metric correctness, and generates a technical debt backlog document. Phase 2 (Adversarial & Edge-Case Testing) findings: - The EvaluationManager is not robust to non-finite numbers (NaN, Inf), crashing in downstream validation rather than failing with a clear message. - It crashes on empty lists and empty DataFrames. - It crashes on non-overlapping indices. - This highlights a lack of internal input validation and graceful error handling. Phase 3 (Data-Centric & Metric-Specific Validation) findings: - Verified numerical correctness of the implemented metrics with golden datasets. - Confirmed that metric keyword arguments are passed through correctly. - Verified behavior for both point and uncertainty predictions against reference implementations. A technical debt backlog document has been created, detailing these fragilities and recommending future improvements for robustness. Moved a fixture to a shared location for cross-suite access.
Updates the VIEWS Evaluation Technical Integration Guide to incorporate critical findings from adversarial testing (Phase 2), providing a clearer picture of the library's behavior and limitations. This includes: - A new section (3.5) detailing 'Robustness Limitations & Input Validation Responsibility', highlighting the library's fragility to non-finite numbers and malformed structural data, and emphasizing consumer responsibility for pre-validation. - An enhanced Section 3.4 on 'Data-State Coherency', clarifying that the library applies transformations without validating their mathematical appropriateness. - A cross-reference to the technical debt backlog for a comprehensive list of known issues. Updates the Forensic Analysis of the views-r2darts2 Evaluation Interface with minor contextual notes: - A clarification in Section 4 acknowledging that the code has since been updated. - A clarification in Section 5, Point 2, regarding 'Point Prediction Format Ambiguity', reflecting that the EvaluationManager implicitly converts raw floats, making strict consumer-side reconciliation less critical at runtime.
Addressed linting errors in two test modules: replaced explicit boolean comparisons with direct truth checks, and removed unused variable assignments. These changes ensure adherence to linting standards within the test suite.
Removed an unused import identified by the ruff linter.
Applied automated lint fixes to files outside the tests directory after confirming all tests pass: removed unused imports (including unused typing imports) and fixed f-string formatting. These minor changes ensure code quality and adherence to linting standards throughout the project.
This commit introduces comprehensive documentation and a rigorous test suite to clarify and verify the core concepts of the views-evaluation library. - Adds `documentation/evaluation_concepts.md` to clearly explain the differences between partitions, sets, and the three evaluation schemas (time-series-wise, step-wise, and month-wise). - Adds `documentation/integration_guide.md`, a step-by-step guide for developers on how to format their data and integrate a new model with the library. - Adds `tests/test_evaluation_schemas.py`, a permanent and rigorous test suite that programmatically verifies the grouping logic of the three evaluation schemas against the documentation. - Fixes test pollution issues discovered during development by isolating mocks within the new test suite, ensuring the stability of the entire test run.
Adds a prominent note to ADR-001 to clarify that several documented metrics are not yet implemented in the code. This makes the discrepancy clear to developers and aligns the documentation with the current state of the project.
…n tests - Hardens `EvaluationManager.validate_predictions` to strictly enforce the "exactly one column" contract, preventing crashes from duplicate or extra columns. - Adds `tests/test_data_contract.py` to verify the single-target and single-column requirements. - Updates `documentation/integration_guide.md` with a "Common Pitfalls" section to clarify MultiIndex and column usage. - Updates `reports/technical_debt_backlog.md` to reflect resolved validation issues. - Includes recent verification reports and drafts.
Mean tweedie deviance
… evaluation proposal
…ntation-verification-suite
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… (v0.4.0)
- EvaluationManager now dispatches on {regression,classification} x {point,uncertainty}
- Task type declared explicitly in config; prediction type detected from data shape
- Config schema: regression_targets, regression_point_metrics,
regression_uncertainty_metrics, classification_targets,
classification_point_metrics, classification_uncertainty_metrics
- Legacy config keys (targets, metrics) accepted with loud deprecation warning
- _normalise_config() and _validate_config() enforce fail-loud-fail-fast contract
- calculate_ap() no longer applies internal threshold; expects pre-binarised actuals
- AP moved to CLASSIFICATION_POINT_METRIC_FUNCTIONS only
- CRPS moved to uncertainty dicts only (regression and classification)
- Four new metric dataclasses mirror the four dispatch dicts
- transform_data() crash on unknown prefix replaced with logger.warning + identity
- EvaluationManager.__init__ no longer accepts metrics_list (breaking change)
- 70 tests passing, ruff clean
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
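The `calculate_ap()` change above shifts binarisation to the caller. A minimal sketch of what pre-binarisation looks like — the helper and data are illustrative, not the library's `calculate_ap`:

```python
def average_precision(y_true_binary, scores):
    """Minimal AP computation: mean precision at each true positive,
    ranked by score descending. A sketch, not the library's code."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if y_true_binary[i] == 1:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / max(tp, 1)

# Since calculate_ap no longer binarises internally, actuals must be
# pre-binarised upstream (here: "any fatality", y > 0)
y_counts = [0, 3, 0, 12, 1]
y_binary = [1 if y > 0 else 0 for y in y_counts]
scores = [0.1, 0.8, 0.2, 0.9, 0.6]
print(average_precision(y_binary, scores))  # 1.0 — all positives outrank negatives
```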
- Renames all 'uncertainty' metrics and classes to 'sample' (e.g., regression_sample_metrics) - Updates EvaluationManager to use the new terminology while maintaining legacy aliases - Adds a prominent Migration Notice and Configuration Schema to README.md - Updates all tests and documentation to align with the new ontology - Adds MIT License
…fication-suite Feature/documentation verification suite
- Defined EvaluationFrame contract and pure-numpy logic. - Implemented PandasAdapter for backward-compatible alignment. - Created Parity Test Campaign (Green/Beige/Red teams). - Documented performance scaling (14x speedup for sample metrics). - Identified and documented legacy bugs (step-wise truncation).
Add Magnitude Calibration Ratio (MCR_point, MCR_sample) metric with full catalog/dispatch/dataclass integration. Add hydranet_ucdp evaluation profile. Fix EvaluationManager._validate_config to accept sample-only models. Add beige+red tests for twCRPS, QIS, MIS, and MCR per ADR-020. Remove dead code from NativeEvaluator (unused metrics_map and dispatch dict imports). Sync Brier/Jeffreys fields into RegressionSampleEvaluationMetrics dataclass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… cleanup
- Fix EvaluationReport CIC (constructor signature, to_json→to_dict, FM1 guard)
- Fix integration guide broken API example (adapt→from_dataframes)
- Remap all stale ADR references (old flat 001-008 → grouped 010-042)
- Update PandasAdapter CIC to reflect deprecated status and silent-skip behavior
- Add config keys, properties, and ADR-042 to EvaluationFrame/NativeEvaluator CICs
- Remove stale CIC README entries (ModelRunner, VolumeHandler, BoundaryValidator)
- Overhaul README: add Quick Start, 2×2 metrics tables, 3-layer architecture,
evaluation profiles, updated project structure
- Add EvaluationConfig TypedDict (config_schema.py), deprecate to_dataframe('raw')
- Remove Brier/Jeffreys from regression sample dispatch, remove 17 unused aliases
- Add structural consistency test (metric membership ↔ dataclass fields)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Feature/samples for fao
…nd guide Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sk register, hardened protocol - Standardize all 17 ADR headers to base_docs format (Status/Date/Deciders/Consulted/Informed) - Remove silicon-based agents from Deciders across all ADRs - Convert table-format headers (030-042) to YAML-style format - Replace ADR template with decision-focused base_docs template - Add sections 9-12 (Incorrect Usage, Test Alignment, Evolution, Known Deviations) to 4 existing CICs - Create MetricCatalog CIC documenting genome registry and resolver - Create ADR-023 (Technical Risk Register) with tier/trigger/source format - Add hardened protocol for numerical evaluation contributors - Add physical architecture standard with critical bundling assessment - Add INSTANTIATION_CHECKLIST.md and validate_docs.sh - Update ADR and CIC READMEs with governance structure Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ics with full test coverage Add 4 new threshold-dependent metric functions to the evaluation framework: - Brier_sample: binary classification metric for ensemble predictions - Brier_point: binary classification metric for point probability predictions - QS_sample: quantile score (pinball loss) for ensemble predictions - QS_point: quantile score (pinball loss) for point predictions All metrics registered in MetricCatalog with genome declarations, added to METRIC_MEMBERSHIP and legacy dispatch dicts, with BASE_PROFILE defaults (threshold=1.0, quantile=0.99). Test coverage: 22 new tests (8 golden-value, 9 beige, 5 red) including a finding that Brier's comparison-based binarization swallows NaN rather than propagating it (documented in red tests). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
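The quantile score the new QS metrics compute follows the standard pinball-loss formula. A minimal sketch with a hand-checked golden value — the single-quantile signature is an assumption, not necessarily the library's exact API:

```python
def pinball_loss(y_true, y_pred_q, quantile):
    """Standard pinball (quantile) loss for a single predicted quantile.
    A sketch for illustration, not the library's QS_point implementation."""
    losses = [max(quantile * (y - p), (quantile - 1) * (y - p))
              for y, p in zip(y_true, y_pred_q)]
    return sum(losses) / len(losses)

# Under-prediction of the 0.99 quantile is penalised at weight q:
# 0.99 * (10 - 8) = 1.98
print(pinball_loss([10.0], [8.0], 0.99))   # 1.98
# Over-prediction is penalised at weight (1 - q): 0.01 * 2 = 0.02
print(pinball_loss([10.0], [12.0], 0.99))  # 0.02
```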
…ariant - Remove deprecation_msgs.py (dead code: raise_legacy_scale_msg never called) - Remove legacy PointEvaluationMetrics and SampleEvaluationMetrics (unused, replaced by 2×2 typed dataclasses) - Add y_pred.ndim != 2 validation to EvaluationFrame._validate() (closes C-03) - Add tests: test_y_pred_1d_raises, test_y_pred_3d_raises - Fix lint: remove unused variable in TestQuantileScoreBeige Risk register: C-03 and C-09 closed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…name note - F1: Add probability-range note to Brier_point docstring (y_pred should be [0,1]) - F2: Fix Brier_sample docstring: "on regression targets" → "binarized at a threshold" - F3: Add NaN-swallowing note to both Brier docstrings (NumPy comparison semantics) - F5: Document Brier → Brier_sample breaking rename in MetricCatalog CIC Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…efense-in-depth, extreme values Close 7 gaps identified by test-review audit: Step 1 [Critical/Green]: 15 golden-value tests for MSE, MSLE, RMSLE, EMD, Pearson, MTD, MCR, Coverage, MIS, CRPS, twCRPS, QIS in TestGoldenValues. Step 2 [High/Beige]: Classification evaluation tests — Brier_sample, Brier_point, AP+Brier_point combined, classification sample with profile resolution. Step 3 [High/Red]: NaN/Inf defense-in-depth integration tests proving EvaluationFrame rejects corrupted data before Brier's NaN-swallowing executes. Step 4 [Medium/Beige]: Multi-target evaluation test (regression + classification in same config, evaluated via separate EvaluationFrames). Step 5 [Medium/Green]: Stateless execution test — evaluate() twice produces identical results. Step 6 [Medium/Red]: Extreme-value tests near float64 limits for MSE, CRPS, Brier, Coverage. Step 7 [Low/Green]: Migrate 14 module-level tests from DataFrame fixtures to raw NumPy arrays. Remove pandas import from test_metric_calculators.py. Test count: 266 → 291 (+25 new tests). Lint clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Manager - Remove filter_step_wise_evaluation() — defined but never called (-30 lines) - Remove aggregate_month_wise_evaluation() — defined but never called (-83 lines) - Remove unused BaseEvaluationMetrics import (was only used by aggregate) - Remove vestigial self.is_sample assignment (set but never read) - Retain self.actual/self.predictions (still tested by test_documentation_contracts.py reflective test) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dependency BREAKING CHANGE: EvaluationManager and PandasAdapter have been removed. Use NativeEvaluator with EvaluationFrame directly. Adapters belong in the calling repository (e.g. views-pipeline-core's EvaluationAdapter). Source deletions: - views_evaluation/evaluation/evaluation_manager.py (607 lines) - views_evaluation/adapters/pandas.py (150 lines) - Legacy dispatch dicts and calculate_ap alias from native_metric_calculators.py Test deletions (10 files, ~1800 lines): - test_evaluation_manager.py, test_evaluation_schemas.py - test_parity_green.py, test_parity_beige.py, test_parity_red.py - test_parity_adapter_transfer.py, test_data_contract.py - test_documentation_contracts.py, test_metric_correctness.py - conftest.py (legacy fixtures) Test migrations: - test_adversarial_inputs.py: removed legacy TestAdversarialInputs class, kept TestAdversarialNativeInputs (9 tests) - test_metric_calculators.py: replaced dispatch dict assertions with METRIC_MEMBERSHIP assertions; removed pandas import - test_metric_catalog.py: removed dispatch dict sync test (single source of truth now) Config: - Removed pandas from pyproject.toml runtime dependencies - Flipped legacy_compatibility default to False in NativeEvaluator.evaluate() Documentation: - Deleted CICs/PandasAdapter.md - Updated README quick-start to native-only API - Updated physical architecture standard (removed PHASE-3-DELETE entries) - Updated ADR-042 (dispatch dicts note) Preconditions confirmed: - views-pipeline-core has EvaluationAdapter (mirrored PandasAdapter) - Shadow parity verified and scaffolding removed (commit 84a997b) - All model repos handle own inverse transformations (r2darts2, stepshifter, baseline, hydranet verified) Result: 228 tests passing, 0 lint errors. Pure Math Engine achieved. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…xamples - Update examples/using_native_api.py and evaluate_native_prototype.py to use EvaluationFrame directly (removed PandasAdapter imports) - Delete examples/quickstart.ipynb (entirely EvaluationManager-based) - Update integration_guide.md: remove legacy API section, update architecture diagram, update code example to native-only path - Update CIC Known Deviations: remove resolved C-01 references from NativeEvaluator.md and MetricCatalog.md - Update risk register: close C-01, C-04, C-06, C-08 (3 open concerns remain) - Update README: remove EvaluationManager from component table Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address all findings from review-base-docs audit: M1: ADR-011 — update 3 EvaluationManager references to "Pipeline Core (external)" M2: ADR-040 — rename PandasAdapter section, update to EvaluationFrame construction M3: evaluation_concepts.md — "EvaluationManager assesses" → "evaluation framework assesses" L1: EvaluationFrame CIC — 3 PandasAdapter references → "external adapters" L1: EvaluationReport CIC — remove EvaluationManager from consumer list L2: logging standard — remove EvaluationManager from orchestration example L3: checklist — "(PHASE-3-DELETE)" → "(removed in Phase 3)" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ion, add empty-config red test
- Ignorance: hand-computed golden value with known bin distribution (log2(8/3))
- AP: oracle test using sklearn.metrics.average_precision_score
- Fix NumPy deprecation: float(np.quantile(..., axis=1)) → .item() in QIS test
- Empty config red test: documents C-02 gap — NativeEvaluator({}) accepted at
init but fails at evaluate() time
231 tests passing, 0 warnings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
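One way to read the hand-computed Ignorance golden value above: if the observed outcome's bin holds 3 of 8 ensemble samples, then p = 3/8 and the score is -log2(3/8) = log2(8/3). A quick stdlib check — the 3-of-8 bin count is inferred from the stated value, not taken from the test itself:

```python
import math

# Ignorance score: negative log2 of the probability mass assigned to
# the bin containing the observed outcome (assumed 3 of 8 samples)
p = 3 / 8
ignorance = -math.log2(p)
print(ignorance)  # ≈ 1.415, i.e. log2(8/3)
```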
- NativeEvaluator CIC: replace deleted parity test refs with adversarial test refs - NativeEvaluator CIC: drop EvaluationManager comparison in Known Deviations - ADR-012: "and PandasAdapter" → "and external adapters" - ADR-014: "in EvaluationManager or Adapters" → "in EvaluationFrame constructor or NativeEvaluator" - ADR-021: replace PandasAdapter with EvaluationReport in example list - Update Last reviewed dates on 3 CICs to 2026-04-02 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump version 0.4.0 → 0.5.0 for Phase 3 breaking changes (closes C-11) - Add pandas as optional dependency: `pip install views_evaluation[dataframe]` for to_dataframe() support (closes C-12) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat!: threshold metrics, Phase 3 purge, and governance adoption
…ntext, step sentinel
Address C-16, C-17, C-18 identified by risk register review with TDD:
- C-16: Wrap metric function calls in _calculate_metrics() with try/except
that re-raises as ValueError naming the metric, task, and pred_type
- C-17: Replace hardcoded max_allowed_step=999 with float('inf') so steps
>= 1000 are not silently dropped
- C-18: Add bounds validation in resolve_metric_params() for alpha, quantile,
lower_quantile, upper_quantile — all must be in (0, 1). Cross-validation
for QIS lower_quantile < upper_quantile
Also: update CICs (MetricCatalog, NativeEvaluator) and ADRs (011, 014) with
Known Deviations sections documenting C-02 and C-05. Close C-14 (stale
editable install metadata). Upgrade C-02 from Tier 3 to Tier 2.
9 new tests, 240 total passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
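The C-18 bounds validation described above can be sketched roughly as follows; the function and key names mirror the commit message but the body is an assumption, not the actual `resolve_metric_params()` code:

```python
def validate_quantile_params(params: dict) -> None:
    """Sketch of the C-18 bounds check: quantile-like hyperparameters
    must lie in the open interval (0, 1), and QIS bounds must be ordered."""
    for key in ("alpha", "quantile", "lower_quantile", "upper_quantile"):
        if key in params and not (0.0 < params[key] < 1.0):
            raise ValueError(f"{key}={params[key]} must lie in the open interval (0, 1)")
    # Cross-validation for QIS: lower_quantile < upper_quantile
    lo, hi = params.get("lower_quantile"), params.get("upper_quantile")
    if lo is not None and hi is not None and lo >= hi:
        raise ValueError(f"lower_quantile ({lo}) must be < upper_quantile ({hi})")
```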
…s-schema consistency, C-10 - Add object-dtype rejection to EvaluationFrame._validate() (ADR-011 Pure NumPy contract) - Remove 22 lines of dead pandas/object-dtype branches from _guard_shapes (closes C-10) - Add 5 new tests: object-dtype rejection (2), malformed report dict (1), NaN metric detectability (1), cross-schema MSE consistency (1) - 245 tests passing, risk register: 3 open concerns remain Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
test_evaluation_report.py imported pandas at module level, causing a collection error in CI where pandas is not installed (optional dependency via [dataframe] extra). Use pytest.importorskip to skip gracefully. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: close risk register concerns C-16, C-17, C-18
Replace Brier_sample/Brier_point with three task-explicit variants: - Brier_cls_point: classification point (y_pred is probability) - Brier_cls_sample: classification sample (average MC Dropout probabilities) - Brier_rgs_sample: regression sample (binarise count samples at threshold) Brier_rgs_point intentionally omitted — regression point estimates are not probabilities. The _cls_/_rgs_ infix makes the task context self-documenting. Critical fix: Brier_cls_sample uses mean(y_pred) instead of mean(y_pred > threshold), which was broken for probability samples (all probabilities > 0 → p_hat ≈ 1.0). Profile defaults: classification Brier threshold=0.0 (PDF: event y > 0), regression Brier threshold=1.0 (binarise at 1 fatality). Catalog size 24 → 25. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
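The critical fix is easiest to see numerically. A minimal sketch of the broken-vs-fixed aggregation with illustrative MC-Dropout samples — not the library's implementation:

```python
import numpy as np

# MC-Dropout probability samples: 3 units x 4 samples each (illustrative)
samples = np.array([[0.1, 0.2, 0.15, 0.05],
                    [0.7, 0.8, 0.75, 0.9],
                    [0.4, 0.5, 0.45, 0.6]])
y_true = np.array([0, 1, 0])  # binary event: any fatality

# Broken pre-fix behaviour: with threshold=0.0, every probability sample
# exceeds the threshold, so p_hat collapses to 1.0 for every unit
p_hat_broken = (samples > 0.0).mean(axis=1)   # [1., 1., 1.]

# Fixed Brier_cls_sample behaviour: average the probabilities themselves
p_hat_fixed = samples.mean(axis=1)            # ≈ [0.125, 0.7875, 0.4875]

brier = float(np.mean((p_hat_fixed - y_true) ** 2))
```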
All three Brier variants now default to threshold=0.0 in the base profile, matching the Pre-Release Note 05 definition: Brier evaluates the binary event "any fatality occurred" (y > 0). On integer-valued UCDP data, y > 0 and y >= 1 are equivalent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: explicit Brier score variants for 2×2 evaluation matrix
Pull request overview
Release v0.5.0 introduces the new EvaluationFrame-based “native” evaluation architecture, a MetricCatalog + named profiles for metric hyperparameters, and updated reporting/export APIs, while making pandas optional.
Changes:
- Added `EvaluationFrame` + `NativeEvaluator` + `EvaluationReport` as the core native evaluation path, with schema regrouping and catalog-driven metric dispatch.
- Introduced `MetricCatalog` (genome + Chain-of-Responsibility param resolver) and named evaluation profiles (`base`, `hydranet_ucdp`).
- Updated packaging/docs/tests for the new architecture (pandas optional extra, new/updated guides, extensive tests, removed outdated notebook/ADRs).
Reviewed changes
Copilot reviewed 88 out of 90 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| views_evaluation/profiles/hydranet_ucdp.py | Adds HydraNet/UCDP profile overrides on top of base metric hyperparameters. |
| views_evaluation/profiles/base.py | Defines the baseline, system-wide metric hyperparameters used by the catalog resolver. |
| views_evaluation/profiles/__init__.py | Registers named evaluation profiles for config selection. |
| views_evaluation/evaluation/native_evaluator.py | Implements schema regrouping (month/sequence/step) and metric dispatch using catalog + profiles. |
| views_evaluation/evaluation/metrics.py | Refactors legacy metric dataclasses into a 2×2 task/prediction-type matrix; lazy pandas import for DataFrame export. |
| views_evaluation/evaluation/metric_catalog.py | Adds metric registry, membership, and hyperparameter resolution + validation logic. |
| views_evaluation/evaluation/evaluation_report.py | Adds structured report container with schema access + dict/dataframe export. |
| views_evaluation/evaluation/evaluation_frame.py | Adds validated, pure-NumPy container + grouping/selection utilities. |
| views_evaluation/evaluation/config_schema.py | Adds TypedDict documenting expected config keys for NativeEvaluator. |
| views_evaluation/adapters/__init__.py | Keeps adapters package placeholder for future framework bridges. |
| views_evaluation/__init__.py | Exposes the new public API surface (EvaluationFrame, NativeEvaluator, EvaluationReport, catalog utilities, profiles). |
| tests/test_evaluation_report.py | Adds direct unit coverage for EvaluationReport APIs and edge/failure modes. |
| tests/test_adversarial_inputs.py | Adds adversarial tests ensuring fail-loud behavior at the EvaluationFrame boundary and evaluator dispatch. |
| reports/technical_debt_backlog.md | Adds/updates technical debt tracking and status notes. |
| reports/stepshifter_full_imp_report.md | Adds forensic analysis of upstream evaluation interface expectations. |
| reports/proposal_manifest_driven_evaluation.md | Adds proposal document for manifest-driven orchestration architecture. |
| reports/post_mortems/post_mortem_report.md | Adds post-mortem documenting documentation/code verification work. |
| reports/post_mortems/2026-02-23_evaluation_ontology_liberation_post_mortem.md | Adds post-mortem capturing ontology/design decisions and migration learnings. |
| reports/post_mortem_multi_target_investigation.md | Adds post-mortem on multi-target support limits and contract hardening. |
| reports/phase_4_plan.md | Adds operational readiness plan (benchmarking/logging/memory profiling). |
| reports/phase_2_adversarial_testing_report.md | Adds adversarial testing findings and recommendations. |
| reports/documentation_discrepancy_report.md | Adds documentation discrepancy summary and follow-up recommendations. |
| reports/2026-02-25_evaluation_frame_refactor/10_orchestrator_migration_plan.md | Documents migration plan for moving alignment/adaptation upstream. |
| reports/2026-02-25_evaluation_frame_refactor/09_post_refactor_status.md | Documents post-refactor status and next steps. |
| reports/2026-02-25_evaluation_frame_refactor/07_implementation_plan.md | Documents phased implementation plan for EvaluationFrame migration. |
| reports/2026-02-25_evaluation_frame_refactor/06_investigation_summary.md | Documents investigation findings and recommendations for the new boundary. |
| reports/2026-02-25_evaluation_frame_refactor/05_probabilistic_scaling_benchmark.md | Adds benchmark results motivating dense NumPy representation. |
| reports/2026-02-25_evaluation_frame_refactor/04_parity_investigation_log.md | Adds parity investigation log and identified legacy behaviors/bugs. |
| reports/2026-02-25_evaluation_frame_refactor/03_evaluation_frame_contract.md | Adds EvaluationFrame contract specification document. |
| reports/2026-02-25_evaluation_frame_refactor/02_current_alignment_semantics.md | Documents legacy alignment/regrouping semantics to preserve/replace. |
| reports/2026-02-25_evaluation_frame_refactor/01_investigation_plan.md | Adds initial investigation plan for the refactor. |
| pyproject.toml | Bumps version to 0.5.0; makes pandas optional via [dataframe] extra; moves dev deps into dev group. |
| LICENSE | Adds MIT license text and copyright. |
| examples/using_native_api.py | Adds example showing how to use the new native API. |
| examples/quickstart.ipynb | Removes outdated notebook using the legacy API. |
| examples/evaluate_native_prototype.py | Adds prototype/demo script for grouping semantics on EvaluationFrame. |
| examples/benchmark_probabilistic_scaling.py | Adds benchmarking script comparing legacy vs native representations. |
| documentation/validate_docs.sh | Adds a doc consistency validation script for governance artifacts. |
| documentation/standards/physical_architecture_standard.md | Adds/updates physical architecture and layering/file-structure standard. |
| documentation/standards/logging_and_observability_standard.md | Adds/updates logging/observability standard and scope guidance. |
| documentation/integration_guide.md | Adds/updates integration guidance for the native API and data contract. |
| documentation/INSTANTIATION_CHECKLIST.md | Adds adoption checklist for governance artifacts/standards. |
| documentation/evaluation_concepts.md | Adds conceptual guide (schemas/parallelogram, partitions/sets). |
| documentation/contributor_protocols/silicon_based_agents.md | Adds protocol governing AI-assisted changes and safety constraints. |
| documentation/contributor_protocols/hardened_protocol_template.md | Adds hardened contributor protocol specific to numerical evaluation work. |
| documentation/contributor_protocols/carbon_based_agents.md | Adds protocol defining human contributor responsibilities. |
| documentation/CICs/README.md | Adds index for active Class Intent Contracts. |
| documentation/CICs/NativeEvaluator.md | Adds intent contract for NativeEvaluator responsibilities/failure modes. |
| documentation/CICs/MetricCatalog.md | Adds intent contract for MetricCatalog responsibilities/failure modes. |
| documentation/CICs/EvaluationReport.md | Adds intent contract for EvaluationReport responsibilities/failure modes. |
| documentation/CICs/EvaluationFrame.md | Adds intent contract for EvaluationFrame responsibilities/failure modes. |
| documentation/CICs/cic_template.md | Adds CIC template for future intent contracts. |
| documentation/ADRs/README.md | Reworks ADR index/numbering scheme and contribution guidance. |
| documentation/ADRs/adr_template.md | Replaces ADR template with expanded structure and guidance. |
| documentation/ADRs/042_metric_catalog.md | Adds/updates ADR for MetricCatalog + named profiles decision. |
| documentation/ADRs/041_evaluation_output_schema.md | Adds/updates ADR for output schema direction and responsibilities. |
| documentation/ADRs/040_evaluation_input_schema.md | Adds/updates ADR for input schema (native path) and identifier requirements. |
| documentation/ADRs/032_metric_calculation_schemas.md | Adds/updates ADR describing month/step/sequence evaluation schemas. |
| documentation/ADRs/031_evaluation_metrics.md | Adds/updates ADR describing metric set and implementation status notes. |
| documentation/ADRs/030_evaluation_strategy.md | Adds/updates ADR for rolling-origin evaluation strategy. |
| documentation/ADRs/023_technical_risk_register.md | Adds ADR formalizing the technical risk register artifact. |
| documentation/ADRs/022_evolution_and_stability.md | Adds deferred ADR placeholder for evolution/stability rules. |
| documentation/ADRs/021_intent_contracts_for_classes.md | Adds ADR requiring intent contracts for non-trivial classes. |
| documentation/ADRs/020_multi_perspective_testing.md | Adds ADR for Green/Beige/Red testing taxonomy and parity mandate. |
| documentation/ADRs/014_boundary_contracts_and_validation.md | Adds ADR for boundary contracts + config validation expectations (and deviations). |
| documentation/ADRs/013_observability_and_explicit_failure.md | Adds ADR for fail-loud + persistent observability. |
| documentation/ADRs/012_authority_over_inference.md | Adds ADR prohibiting semantic inference/sniffing across boundaries. |
| documentation/ADRs/011_topology_and_dependency_rules.md | Adds ADR for strict layering and dependency rules. |
| documentation/ADRs/010_ontology_of_evaluation.md | Adds ADR defining evaluation ontology and forbidden concepts. |
| documentation/ADRs/005_evaluation_output_schema.md | Removes obsolete ADR superseded by new numbering/structure. |
| documentation/ADRs/004_evaluation_input_schema.md | Removes obsolete ADR superseded by new numbering/structure. |
| documentation/ADRs/003_metric_calculation.md | Removes obsolete ADR superseded by new numbering/structure. |
| documentation/ADRs/002_evaluation_strategy.md | Removes obsolete ADR superseded by new numbering/structure. |
| documentation/ADRs/001_silicon_based_agent_protocol.md | Adds ADR governing silicon-based agent usage and constraints. |
| documentation/ADRs/001_evaluation_metrics.md | Removes obsolete ADR superseded by new numbering/structure. |
| documentation/ADRs/000_use_of_adrs.md | Adds ADR formalizing ADR usage in this repo. |
| .gitignore | Updates ignore patterns (notably adds reports/). |
```python
"""
Config dict for NativeEvaluator.

All keys are optional (total=False) to match the existing .get() patterns.
Downstream validators (EvaluationManager._validate_config) enforce
required-key semantics at runtime.
"""
```
The EvaluationConfig docstring still references EvaluationManager._validate_config as the runtime validator, but this PR’s architecture removes EvaluationManager. Update the docstring to reflect the current reality (e.g., validation happens in NativeEvaluator.evaluate() / at the orchestration boundary, or is a known gap tracked by the risk register).
```python
def _validate(y_true: np.ndarray, y_pred: np.ndarray, identifiers: Dict[str, np.ndarray]):
    n_rows = len(y_true)
    if y_pred.shape[0] != n_rows:
        raise ValueError(f"y_pred rows ({y_pred.shape[0]}) mismatch y_true ({n_rows})")
```
EvaluationFrame._validate() doesn’t enforce that y_true is 1D. A 2D array like shape (N, 1) would currently pass length checks and can cause subtle broadcasting/metric issues later. Consider adding an explicit y_true.ndim == 1 check (and a clear error message) to enforce the (N,) contract.
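A sketch of the suggested guard — the placement inside `EvaluationFrame._validate()` and the message wording are assumptions:

```python
import numpy as np

def check_y_true_1d(y_true: np.ndarray) -> None:
    """Enforce the (N,) contract on y_true, per the review comment above."""
    if y_true.ndim != 1:
        raise ValueError(
            f"y_true must be 1-D with shape (N,); got shape {y_true.shape}. "
            "A (N, 1) column passes length checks but broadcasts against "
            "(N, S) predictions and silently corrupts metrics."
        )
```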
```python
def _resolve_task_and_metrics(self, ef: EvaluationFrame):
    target = ef.metadata.get('target')
    # Determine task from config
    if target in self.config.get("regression_targets", []):
        task = "regression"
    elif target in self.config.get("classification_targets", []):
        task = "classification"
    else:
        raise ValueError(f"Target {target} not found in config")
```
If ef.metadata lacks a 'target' key, target becomes None and the resulting error (Target None not found in config) is hard to diagnose. Consider explicitly validating that target is present/non-empty and raising a clearer error that mentions the required metadata key and shows available targets from the config.
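A sketch of the clearer error the comment asks for — the standalone function name and the target name in the example are illustrative, not the library's code:

```python
def resolve_task(metadata: dict, config: dict) -> str:
    """Resolve the task for a frame's target, failing loudly when the
    'target' metadata key is missing or empty (per the review comment)."""
    target = metadata.get("target")
    if not target:
        raise ValueError(
            "EvaluationFrame.metadata must contain a non-empty 'target' key; "
            f"configured targets: regression={config.get('regression_targets', [])}, "
            f"classification={config.get('classification_targets', [])}"
        )
    if target in config.get("regression_targets", []):
        return "regression"
    if target in config.get("classification_targets", []):
        return "classification"
    raise ValueError(
        f"Target '{target}' not in regression_targets or classification_targets"
    )
```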
```python
def to_dataframe(self, schema: str):
    """
    Converts a specific schema's results into a Pandas DataFrame.
    If schema='raw', returns the dictionary of mapped metrics dataclasses.
    """
    if schema == "raw":
        warnings.warn(
            "to_dataframe(schema='raw') is deprecated. Use to_dict()['schemas'] instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self._results
```
to_dataframe()’s docstring says schema='raw' returns “the dictionary of mapped metrics dataclasses”, but the implementation returns the raw internal _results dict (nested dicts of floats). Please update the docstring to match behavior (or change the behavior if the mapped-dataclass dict is what you intended).
DO NOT MERGE until the `views-pipeline-core` development branch is also ready to merge to main. This release removes `EvaluationManager`, which pipeline-core's main branch still imports. Both repos must be released together.

Merge order:
1. `views-pipeline-core` development → main (contains the `NativeEvaluator` migration)
2. `views-evaluation` development → main

Summary
67 commits spanning the full EvaluationFrame refactor, metric catalog adoption, and Phase 3 purge.
Breaking changes
- `EvaluationManager` and `PandasAdapter` deleted — replaced by `NativeEvaluator` + `EvaluationReport`
- pandas is now an optional dependency (`[dataframe]` extra)
- `Brier_sample`/`Brier_point` replaced by `Brier_cls_point`, `Brier_cls_sample`, `Brier_rgs_sample`

New features
- `EvaluationFrame`: pure-NumPy validated container (Level 0 core)
- `NativeEvaluator`: stateless evaluation engine with 3-schema support
- `MetricCatalog` (ADR-042): genome registry + Chain of Responsibility resolver
- Named evaluation profiles (`base`, `hydranet_ucdp`)
- `EvaluationReport` with `to_dict()` and `to_dataframe()` export

Governance
Tests
Test plan
🤖 Generated with Claude Code