
release: v0.5.0 — EvaluationFrame architecture, metric catalog, Brier variants#19

Open
Polichinel wants to merge 67 commits into main from development

Conversation

@Polichinel
Collaborator

⚠️ COORDINATION REQUIRED

DO NOT MERGE until views-pipeline-core development branch is also ready to merge to main. This release removes EvaluationManager which pipeline-core's main branch still imports. Both repos must be released together.

Merge order:

  1. Merge views-pipeline-core development → main (contains NativeEvaluator migration)
  2. Merge this PR (views-evaluation development → main)
  3. Verify cross-repo integration

Summary

67 commits spanning the full EvaluationFrame refactor, metric catalog adoption, and Phase 3 purge.

Breaking changes

  • EvaluationManager and PandasAdapter deleted — replaced by NativeEvaluator + EvaluationReport
  • pandas is now an optional dependency ([dataframe] extra)
  • Brier_sample/Brier_point replaced by Brier_cls_point, Brier_cls_sample, Brier_rgs_sample
  • Version: 0.4.0 → 0.5.0

New features

  • EvaluationFrame: pure-NumPy validated container (Level 0 core)
  • NativeEvaluator: stateless evaluation engine with 3-schema support
  • MetricCatalog (ADR-042): genome registry + Chain of Responsibility resolver
  • Named evaluation profiles (base, hydranet_ucdp)
  • 25 metrics in catalog (17 implemented), including 3 Brier variants, 2 QS variants, MCR, MIS, QIS, Coverage, Ignorance
  • EvaluationReport with to_dict() and to_dataframe() export
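The MetricCatalog bullet describes a Chain-of-Responsibility resolver layered over named profiles. The following is a minimal sketch of that resolution order (call-site overrides win over the named profile, which wins over base defaults); the names `BASE_PROFILE`, `HYDRANET_UCDP_PROFILE`, and `resolve_metric_params` mirror terms used in this PR, but the real signatures in views-evaluation may differ.

```python
from typing import Any, Dict, Optional

# Hypothetical base defaults and profile overrides, keyed by metric name.
BASE_PROFILE: Dict[str, Dict[str, Any]] = {
    "QS_sample": {"quantile": 0.99},
    "Brier_rgs_sample": {"threshold": 1.0},
}

HYDRANET_UCDP_PROFILE: Dict[str, Dict[str, Any]] = {
    # A named profile only lists the keys it overrides.
    "Brier_rgs_sample": {"threshold": 0.0},
}

def resolve_metric_params(
    metric: str,
    profile: Optional[Dict[str, Dict[str, Any]]] = None,
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Walk the chain: base defaults -> named profile -> explicit overrides."""
    params: Dict[str, Any] = dict(BASE_PROFILE.get(metric, {}))
    if profile is not None:
        params.update(profile.get(metric, {}))
    if overrides:
        params.update(overrides)
    return params

print(resolve_metric_params("Brier_rgs_sample", HYDRANET_UCDP_PROFILE))
# -> {'threshold': 0.0}
```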

Governance

  • ADR suite renumbered (000–042)
  • CICs for EvaluationFrame, NativeEvaluator, MetricCatalog, EvaluationReport, PandasAdapter
  • Technical risk register: 3 open (C-02, C-05, C-13), 15 closed

Tests

  • 249 tests passing (up from ~77 pre-refactor)
  • Red/Beige/Green taxonomy (ADR-020)
  • Golden-value tests for all implemented metrics

Test plan

  • 249 tests passing locally
  • CI passes (pandas importorskip fix in place)
  • Cross-repo integration verified with updated views-pipeline-core

🤖 Generated with Claude Code

Polichinel and others added 30 commits January 23, 2026 11:00
Adds a new test suite in tests/test_documentation_contracts.py to
verify the contracts and claims made in the project's documentation.
These tests treat the documentation as hypotheses and verify them against
the actual behavior of the EvaluationManager.

Key findings from the tests:
- The EvaluationManager implicitly converts raw float point predictions
  to single-element numpy arrays, which contradicts the documentation's
  claim that this would cause an error.

The documentation has been updated to reflect this behavior:
- eval_lib_imp.md is updated to clarify the implicit conversion and
  change the 'Mandatory Reconciliation Step' to 'Recommended'.
- stepshifter_full_imp_report.md is updated with a final conclusion
  clarifying the EvaluationManager's actual behavior.

Also organizes the analysis reports into a new reports/ directory.
Adds a new document outlining the comprehensive plan for Phase 4:
Non-Functional & Operational Readiness testing. This includes detailed
sections on Performance & Scalability Benchmarking, Logging and
Observability Verification, Memory Profiling, and Concurrency/Parallelism
Safety as a future consideration. This plan aims to ensure the library's
suitability for critical infrastructure environments.
Adds a robust test suite covering adversarial inputs and metric correctness,
and generates a technical debt backlog document.

Phase 2 (Adversarial & Edge-Case Testing) findings:
- The  is not robust to non-finite numbers (, ),
  crashing with  from 's validation.
- It crashes with  on empty  lists (from ).
- It crashes with  on empty  DataFrames.
- It crashes with  on non-overlapping indices (from ).
- This highlights a lack of internal input validation and graceful error handling.

Phase 3 (Data-Centric & Metric-Specific Validation) findings:
- Verified numerical correctness of  with golden datasets.
- Confirmed  metric correctly uses  kwarg.
- Verified  for both point and uncertainty predictions against .

A  document has been created, detailing these
fragilities and recommending future improvements for robustness.
Moved  fixture to  for shared access.
Updates the  (VIEWS Evaluation Technical Integration Guide)
to incorporate critical findings from adversarial testing (Phase 2),
providing a clearer picture of the library's behavior and limitations.
This includes:
- A new section (3.5) detailing 'Robustness Limitations & Input Validation Responsibility',
  highlighting the library's fragility to non-finite numbers and malformed
  structural data, and emphasizing consumer responsibility for pre-validation.
- Enhanced Section 3.4 on 'Data-State Coherency' to clarify that the
   applies transformations without validating mathematical
  appropriateness.
- A cross-reference to  for a comprehensive
  list of known issues.

Updates the  (Forensic Analysis of
views-r2darts2 Evaluation Interface) with minor contextual notes:
- A clarification in Section 4 acknowledging that  has since
  been updated.
- A clarification in Section 5, Point 2, regarding 'Point Prediction Format Ambiguity',
  reflecting that  implicitly converts raw floats, making
  strict consumer-side reconciliation less critical for runtime.
Addressed linting errors in  and .
- : Replaced / with / for boolean comparisons.
- : Removed unused variable assignments for , , , , , , and .

These changes ensure adherence to linting standards within the test suite.
Removed  from  as it was an unused import,
identified by the ruff linter.
Applied automated  changes to files outside the  directory after confirming all tests pass.

- : Removed unused  import and fixed f-string formatting.
- : Removed unused  and  typing imports.
- : Removed unused , , and  typing imports.

These minor changes ensure code quality and adherence to linting standards throughout the project.
This commit introduces comprehensive documentation and a rigorous test suite to clarify and verify the core concepts of the views-evaluation library.

- Adds `documentation/evaluation_concepts.md` to clearly explain the differences between partitions, sets, and the three evaluation schemas (time-series-wise, step-wise, and month-wise).
- Adds `documentation/integration_guide.md`, a step-by-step guide for developers on how to format their data and integrate a new model with the library.
- Adds `tests/test_evaluation_schemas.py`, a permanent and rigorous test suite that programmatically verifies the grouping logic of the three evaluation schemas against the documentation.
- Fixes test pollution issues discovered during development by isolating mocks within the new test suite, ensuring the stability of the entire test run.
Adds a prominent note to ADR-001 to clarify that several documented metrics are not yet implemented in the code. This makes the discrepancy clear to developers and aligns the documentation with the current state of the project.
…n tests

- Hardens `EvaluationManager.validate_predictions` to strictly enforce the "exactly one column" contract, preventing crashes from duplicate or extra columns.
- Adds `tests/test_data_contract.py` to verify the single-target and single-column requirements.
- Updates `documentation/integration_guide.md` with a "Common Pitfalls" section to clarify MultiIndex and column usage.
- Updates `reports/technical_debt_backlog.md` to reflect resolved validation issues.
- Includes recent verification reports and drafts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… (v0.4.0)

- EvaluationManager now dispatches on {regression,classification} x {point,uncertainty}
- Task type declared explicitly in config; prediction type detected from data shape
- Config schema: regression_targets, regression_point_metrics,
  regression_uncertainty_metrics, classification_targets,
  classification_point_metrics, classification_uncertainty_metrics
- Legacy config keys (targets, metrics) accepted with loud deprecation warning
- _normalise_config() and _validate_config() enforce fail-loud-fail-fast contract
- calculate_ap() no longer applies internal threshold; expects pre-binarised actuals
- AP moved to CLASSIFICATION_POINT_METRIC_FUNCTIONS only
- CRPS moved to uncertainty dicts only (regression and classification)
- Four new metric dataclasses mirror the four dispatch dicts
- transform_data() crash on unknown prefix replaced with logger.warning + identity
- EvaluationManager.__init__ no longer accepts metrics_list (breaking change)
- 70 tests passing, ruff clean

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Renames all 'uncertainty' metrics and classes to 'sample' (e.g., regression_sample_metrics)
- Updates EvaluationManager to use the new terminology while maintaining legacy aliases
- Adds a prominent Migration Notice and Configuration Schema to README.md
- Updates all tests and documentation to align with the new ontology
- Adds MIT License
…fication-suite

Feature/documentation verification suite
- Defined EvaluationFrame contract and pure-numpy logic.
- Implemented PandasAdapter for backward-compatible alignment.
- Created Parity Test Campaign (Green/Beige/Red teams).
- Documented performance scaling (14x speedup for sample metrics).
- Identified and documented legacy bugs (step-wise truncation).
Polichinel and others added 25 commits March 13, 2026 12:13
Add Magnitude Calibration Ratio (MCR_point, MCR_sample) metric with full
catalog/dispatch/dataclass integration. Add hydranet_ucdp evaluation profile.
Fix EvaluationManager._validate_config to accept sample-only models. Add
beige+red tests for twCRPS, QIS, MIS, and MCR per ADR-020. Remove dead code
from NativeEvaluator (unused metrics_map and dispatch dict imports). Sync
Brier/Jeffreys fields into RegressionSampleEvaluationMetrics dataclass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… cleanup

- Fix EvaluationReport CIC (constructor signature, to_json→to_dict, FM1 guard)
- Fix integration guide broken API example (adapt→from_dataframes)
- Remap all stale ADR references (old flat 001-008 → grouped 010-042)
- Update PandasAdapter CIC to reflect deprecated status and silent-skip behavior
- Add config keys, properties, and ADR-042 to EvaluationFrame/NativeEvaluator CICs
- Remove stale CIC README entries (ModelRunner, VolumeHandler, BoundaryValidator)
- Overhaul README: add Quick Start, 2×2 metrics tables, 3-layer architecture,
  evaluation profiles, updated project structure
- Add EvaluationConfig TypedDict (config_schema.py), deprecate to_dataframe('raw')
- Remove Brier/Jeffreys from regression sample dispatch, remove 17 unused aliases
- Add structural consistency test (metric membership ↔ dataclass fields)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nd guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sk register, hardened protocol

- Standardize all 17 ADR headers to base_docs format (Status/Date/Deciders/Consulted/Informed)
- Remove silicon-based agents from Deciders across all ADRs
- Convert table-format headers (030-042) to YAML-style format
- Replace ADR template with decision-focused base_docs template
- Add sections 9-12 (Incorrect Usage, Test Alignment, Evolution, Known Deviations) to 4 existing CICs
- Create MetricCatalog CIC documenting genome registry and resolver
- Create ADR-023 (Technical Risk Register) with tier/trigger/source format
- Add hardened protocol for numerical evaluation contributors
- Add physical architecture standard with critical bundling assessment
- Add INSTANTIATION_CHECKLIST.md and validate_docs.sh
- Update ADR and CIC READMEs with governance structure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ics with full test coverage

Add 4 new threshold-dependent metric functions to the evaluation framework:
- Brier_sample: binary classification metric for ensemble predictions
- Brier_point: binary classification metric for point probability predictions
- QS_sample: quantile score (pinball loss) for ensemble predictions
- QS_point: quantile score (pinball loss) for point predictions

All metrics registered in MetricCatalog with genome declarations, added to
METRIC_MEMBERSHIP and legacy dispatch dicts, with BASE_PROFILE defaults
(threshold=1.0, quantile=0.99).

Test coverage: 22 new tests (8 golden-value, 9 beige, 5 red) including a
finding that Brier's comparison-based binarization swallows NaN rather
than propagating it (documented in red tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
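The QS metrics above are described as pinball loss with a default quantile of 0.99. A self-contained sketch of the textbook pinball loss follows; the catalog's actual `QS_point`/`QS_sample` signatures may differ.

```python
import numpy as np

def quantile_score(y_true: np.ndarray, y_pred_q: np.ndarray, quantile: float = 0.99) -> float:
    """Pinball loss for a single predicted quantile.

    Standard formula: tau * (y - q) when y >= q, else (1 - tau) * (q - y).
    """
    diff = y_true - y_pred_q
    return float(np.mean(np.where(diff >= 0, quantile * diff, (quantile - 1.0) * diff)))
```

At tau = 0.5 this reduces to half the mean absolute error, a handy sanity check.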
…ariant

- Remove deprecation_msgs.py (dead code: raise_legacy_scale_msg never called)
- Remove legacy PointEvaluationMetrics and SampleEvaluationMetrics (unused,
  replaced by 2×2 typed dataclasses)
- Add y_pred.ndim != 2 validation to EvaluationFrame._validate() (closes C-03)
- Add tests: test_y_pred_1d_raises, test_y_pred_3d_raises
- Fix lint: remove unused variable in TestQuantileScoreBeige

Risk register: C-03 and C-09 closed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…name note

- F1: Add probability-range note to Brier_point docstring (y_pred should be [0,1])
- F2: Fix Brier_sample docstring: "on regression targets" → "binarized at a threshold"
- F3: Add NaN-swallowing note to both Brier docstrings (NumPy comparison semantics)
- F5: Document Brier → Brier_sample breaking rename in MetricCatalog CIC

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
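The NaN-swallowing behavior noted in F3 comes straight from NumPy's comparison semantics, and can be demonstrated in two lines:

```python
import numpy as np

# NaN compares False against any threshold, so comparison-based
# binarisation silently maps NaN to the "no event" class instead of
# propagating it -- the behavior the red tests document.
samples = np.array([0.0, 2.0, np.nan])
binarised = samples > 1.0  # third element becomes False, not NaN
```

This is why the defense-in-depth tests later in this PR verify that corrupted data is rejected at the EvaluationFrame boundary, before any binarisation runs.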
…efense-in-depth, extreme values

Close 7 gaps identified by test-review audit:

Step 1 [Critical/Green]: 15 golden-value tests for MSE, MSLE, RMSLE, EMD,
  Pearson, MTD, MCR, Coverage, MIS, CRPS, twCRPS, QIS in TestGoldenValues.
Step 2 [High/Beige]: Classification evaluation tests — Brier_sample, Brier_point,
  AP+Brier_point combined, classification sample with profile resolution.
Step 3 [High/Red]: NaN/Inf defense-in-depth integration tests proving
  EvaluationFrame rejects corrupted data before Brier's NaN-swallowing executes.
Step 4 [Medium/Beige]: Multi-target evaluation test (regression + classification
  in same config, evaluated via separate EvaluationFrames).
Step 5 [Medium/Green]: Stateless execution test — evaluate() twice produces
  identical results.
Step 6 [Medium/Red]: Extreme-value tests near float64 limits for MSE, CRPS,
  Brier, Coverage.
Step 7 [Low/Green]: Migrate 14 module-level tests from DataFrame fixtures to
  raw NumPy arrays. Remove pandas import from test_metric_calculators.py.

Test count: 266 → 291 (+25 new tests). Lint clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
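Coverage and MIS, exercised by the golden-value tests above, follow the textbook interval-score definitions (Gneiting and Raftery); a sketch under the assumption that the catalog uses the standard (1 - alpha) parameterisation:

```python
import numpy as np

def mean_interval_score(y: np.ndarray, lower: np.ndarray, upper: np.ndarray,
                        alpha: float = 0.1) -> float:
    """Standard (1 - alpha) interval score, averaged over observations:
    interval width plus a 2/alpha penalty per unit of miss on either side."""
    width = upper - lower
    below = (2.0 / alpha) * (lower - y) * (y < lower)
    above = (2.0 / alpha) * (y - upper) * (y > upper)
    return float(np.mean(width + below + above))

def coverage(y: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> float:
    """Fraction of observations falling inside the interval."""
    return float(np.mean((y >= lower) & (y <= upper)))
```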
…Manager

- Remove filter_step_wise_evaluation() — defined but never called (-30 lines)
- Remove aggregate_month_wise_evaluation() — defined but never called (-83 lines)
- Remove unused BaseEvaluationMetrics import (was only used by aggregate)
- Remove vestigial self.is_sample assignment (set but never read)
- Retain self.actual/self.predictions (still tested by
  test_documentation_contracts.py reflective test)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dependency

BREAKING CHANGE: EvaluationManager and PandasAdapter have been removed.
Use NativeEvaluator with EvaluationFrame directly. Adapters belong in
the calling repository (e.g. views-pipeline-core's EvaluationAdapter).

Source deletions:
- views_evaluation/evaluation/evaluation_manager.py (607 lines)
- views_evaluation/adapters/pandas.py (150 lines)
- Legacy dispatch dicts and calculate_ap alias from native_metric_calculators.py

Test deletions (10 files, ~1800 lines):
- test_evaluation_manager.py, test_evaluation_schemas.py
- test_parity_green.py, test_parity_beige.py, test_parity_red.py
- test_parity_adapter_transfer.py, test_data_contract.py
- test_documentation_contracts.py, test_metric_correctness.py
- conftest.py (legacy fixtures)

Test migrations:
- test_adversarial_inputs.py: removed legacy TestAdversarialInputs class,
  kept TestAdversarialNativeInputs (9 tests)
- test_metric_calculators.py: replaced dispatch dict assertions with
  METRIC_MEMBERSHIP assertions; removed pandas import
- test_metric_catalog.py: removed dispatch dict sync test (single source
  of truth now)

Config:
- Removed pandas from pyproject.toml runtime dependencies
- Flipped legacy_compatibility default to False in NativeEvaluator.evaluate()

Documentation:
- Deleted CICs/PandasAdapter.md
- Updated README quick-start to native-only API
- Updated physical architecture standard (removed PHASE-3-DELETE entries)
- Updated ADR-042 (dispatch dicts note)

Preconditions confirmed:
- views-pipeline-core has EvaluationAdapter (mirrored PandasAdapter)
- Shadow parity verified and scaffolding removed (commit 84a997b)
- All model repos handle own inverse transformations (r2darts2, stepshifter,
  baseline, hydranet verified)

Result: 228 tests passing, 0 lint errors. Pure Math Engine achieved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…xamples

- Update examples/using_native_api.py and evaluate_native_prototype.py to
  use EvaluationFrame directly (removed PandasAdapter imports)
- Delete examples/quickstart.ipynb (entirely EvaluationManager-based)
- Update integration_guide.md: remove legacy API section, update architecture
  diagram, update code example to native-only path
- Update CIC Known Deviations: remove resolved C-01 references from
  NativeEvaluator.md and MetricCatalog.md
- Update risk register: close C-01, C-04, C-06, C-08 (3 open concerns remain)
- Update README: remove EvaluationManager from component table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address all findings from review-base-docs audit:

M1: ADR-011 — update 3 EvaluationManager references to "Pipeline Core (external)"
M2: ADR-040 — rename PandasAdapter section, update to EvaluationFrame construction
M3: evaluation_concepts.md — "EvaluationManager assesses" → "evaluation framework assesses"
L1: EvaluationFrame CIC — 3 PandasAdapter references → "external adapters"
L1: EvaluationReport CIC — remove EvaluationManager from consumer list
L2: logging standard — remove EvaluationManager from orchestration example
L3: checklist — "(PHASE-3-DELETE)" → "(removed in Phase 3)"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ion, add empty-config red test

- Ignorance: hand-computed golden value with known bin distribution (log2(8/3))
- AP: oracle test using sklearn.metrics.average_precision_score
- Fix NumPy deprecation: float(np.quantile(..., axis=1)) → .item() in QIS test
- Empty config red test: documents C-02 gap — NativeEvaluator({}) accepted at
  init but fails at evaluate() time

231 tests passing, 0 warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- NativeEvaluator CIC: replace deleted parity test refs with adversarial test refs
- NativeEvaluator CIC: drop EvaluationManager comparison in Known Deviations
- ADR-012: "and PandasAdapter" → "and external adapters"
- ADR-014: "in EvaluationManager or Adapters" → "in EvaluationFrame constructor or NativeEvaluator"
- ADR-021: replace PandasAdapter with EvaluationReport in example list
- Update Last reviewed dates on 3 CICs to 2026-04-02

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump version 0.4.0 → 0.5.0 for Phase 3 breaking changes (closes C-11)
- Add pandas as optional dependency: `pip install views_evaluation[dataframe]`
  for to_dataframe() support (closes C-12)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
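Making pandas an optional extra implies a lazy import guarded by a helpful error. A sketch of that pattern, not the actual EvaluationReport method, which may be shaped differently:

```python
def to_dataframe(report_rows):
    """Export rows as a DataFrame, importing pandas lazily.

    Sketch of the optional-dependency pattern: the import happens at call
    time, and a missing pandas produces an actionable error naming the extra.
    """
    try:
        import pandas as pd
    except ImportError as exc:
        raise ImportError(
            "pandas is required for to_dataframe(); "
            "install it with: pip install views_evaluation[dataframe]"
        ) from exc
    return pd.DataFrame(report_rows)
```

Callers that never touch the DataFrame path pay no import cost and need no pandas install.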
feat!: threshold metrics, Phase 3 purge, and governance adoption
…ntext, step sentinel

Address C-16, C-17, C-18 identified by risk register review with TDD:

- C-16: Wrap metric function calls in _calculate_metrics() with try/except
  that re-raises as ValueError naming the metric, task, and pred_type
- C-17: Replace hardcoded max_allowed_step=999 with float('inf') so steps
  >= 1000 are not silently dropped
- C-18: Add bounds validation in resolve_metric_params() for alpha, quantile,
  lower_quantile, upper_quantile — all must be in (0, 1). Cross-validation
  for QIS lower_quantile < upper_quantile

Also: update CICs (MetricCatalog, NativeEvaluator) and ADRs (011, 014) with
Known Deviations sections documenting C-02 and C-05. Close C-14 (stale
editable install metadata). Upgrade C-02 from Tier 3 to Tier 2.

9 new tests, 240 total passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s-schema consistency, C-10

- Add object-dtype rejection to EvaluationFrame._validate() (ADR-011 Pure NumPy contract)
- Remove 22 lines of dead pandas/object-dtype branches from _guard_shapes (closes C-10)
- Add 5 new tests: object-dtype rejection (2), malformed report dict (1),
  NaN metric detectability (1), cross-schema MSE consistency (1)
- 245 tests passing, risk register: 3 open concerns remain

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
test_evaluation_report.py imported pandas at module level, causing
a collection error in CI where pandas is not installed (optional
dependency via [dataframe] extra). Use pytest.importorskip to skip
gracefully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: close risk register concerns C-16, C-17, C-18
Replace Brier_sample/Brier_point with three task-explicit variants:
- Brier_cls_point: classification point (y_pred is probability)
- Brier_cls_sample: classification sample (average MC Dropout probabilities)
- Brier_rgs_sample: regression sample (binarise count samples at threshold)

Brier_rgs_point intentionally omitted — regression point estimates are not
probabilities. The _cls_/_rgs_ infix makes the task context self-documenting.

Critical fix: Brier_cls_sample uses mean(y_pred) instead of mean(y_pred > threshold),
which was broken for probability samples (all probabilities > 0 → p_hat ≈ 1.0).

Profile defaults: classification Brier threshold=0.0 (PDF: event y > 0),
regression Brier threshold=1.0 (binarise at 1 fatality). Catalog size 24 → 25.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All three Brier variants now default to threshold=0.0 in the base profile,
matching the Pre-Release Note 05 definition: Brier evaluates the binary
event "any fatality occurred" (y > 0). On integer-valued UCDP data,
y > 0 and y >= 1 are equivalent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
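The three Brier variants described above can be sketched in pure NumPy. Assumed shapes, `y_true` as `(N,)` and samples as `(N, S)`, and the function names mirror the PR text, but the catalog's real signatures may differ:

```python
import numpy as np

def brier_cls_point(y_true, p_pred, threshold=0.0):
    """Classification point: y_pred is already a probability."""
    o = (y_true > threshold).astype(float)
    return float(np.mean((p_pred - o) ** 2))

def brier_cls_sample(y_true, p_samples, threshold=0.0):
    """Classification sample: average the probability samples directly.
    Binarising probabilities here was the bug fixed in this release
    (all probabilities > 0 would collapse to p_hat near 1.0)."""
    p_hat = p_samples.mean(axis=1)
    o = (y_true > threshold).astype(float)
    return float(np.mean((p_hat - o) ** 2))

def brier_rgs_sample(y_true, count_samples, threshold=0.0):
    """Regression sample: binarise count samples at the threshold;
    the exceedance frequency is the forecast probability."""
    p_hat = (count_samples > threshold).mean(axis=1)
    o = (y_true > threshold).astype(float)
    return float(np.mean((p_hat - o) ** 2))
```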
feat: explicit Brier score variants for 2×2 evaluation matrix

Copilot AI left a comment


Pull request overview

Release v0.5.0 introduces the new EvaluationFrame-based “native” evaluation architecture, a MetricCatalog + named profiles for metric hyperparameters, and updated reporting/export APIs, while making pandas optional.

Changes:

  • Added EvaluationFrame + NativeEvaluator + EvaluationReport as the core native evaluation path with schema regrouping and catalog-driven metric dispatch.
  • Introduced MetricCatalog (genome + Chain-of-Responsibility param resolver) and named evaluation profiles (base, hydranet_ucdp).
  • Updated packaging/docs/tests for the new architecture (pandas optional extra, new/updated guides, extensive tests, removed outdated notebook/ADRs).

Reviewed changes

Copilot reviewed 88 out of 90 changed files in this pull request and generated 4 comments.

Per-file summary (path followed by description):
views_evaluation/profiles/hydranet_ucdp.py Adds HydraNet/UCDP profile overrides on top of base metric hyperparameters.
views_evaluation/profiles/base.py Defines the baseline, system-wide metric hyperparameters used by the catalog resolver.
views_evaluation/profiles/__init__.py Registers named evaluation profiles for config selection.
views_evaluation/evaluation/native_evaluator.py Implements schema regrouping (month/sequence/step) and metric dispatch using catalog + profiles.
views_evaluation/evaluation/metrics.py Refactors legacy metric dataclasses into a 2×2 task/prediction-type matrix; lazy pandas import for DataFrame export.
views_evaluation/evaluation/metric_catalog.py Adds metric registry, membership, and hyperparameter resolution + validation logic.
views_evaluation/evaluation/evaluation_report.py Adds structured report container with schema access + dict/dataframe export.
views_evaluation/evaluation/evaluation_frame.py Adds validated, pure-NumPy container + grouping/selection utilities.
views_evaluation/evaluation/config_schema.py Adds TypedDict documenting expected config keys for NativeEvaluator.
views_evaluation/adapters/__init__.py Keeps adapters package placeholder for future framework bridges.
views_evaluation/__init__.py Exposes the new public API surface (EvaluationFrame, NativeEvaluator, EvaluationReport, catalog utilities, profiles).
tests/test_evaluation_report.py Adds direct unit coverage for EvaluationReport APIs and edge/failure modes.
tests/test_adversarial_inputs.py Adds adversarial tests ensuring fail-loud behavior at the EvaluationFrame boundary and evaluator dispatch.
reports/technical_debt_backlog.md Adds/updates technical debt tracking and status notes.
reports/stepshifter_full_imp_report.md Adds forensic analysis of upstream evaluation interface expectations.
reports/proposal_manifest_driven_evaluation.md Adds proposal document for manifest-driven orchestration architecture.
reports/post_mortems/post_mortem_report.md Adds post-mortem documenting documentation/code verification work.
reports/post_mortems/2026-02-23_evaluation_ontology_liberation_post_mortem.md Adds post-mortem capturing ontology/design decisions and migration learnings.
reports/post_mortem_multi_target_investigation.md Adds post-mortem on multi-target support limits and contract hardening.
reports/phase_4_plan.md Adds operational readiness plan (benchmarking/logging/memory profiling).
reports/phase_2_adversarial_testing_report.md Adds adversarial testing findings and recommendations.
reports/documentation_discrepancy_report.md Adds documentation discrepancy summary and follow-up recommendations.
reports/2026-02-25_evaluation_frame_refactor/10_orchestrator_migration_plan.md Documents migration plan for moving alignment/adaptation upstream.
reports/2026-02-25_evaluation_frame_refactor/09_post_refactor_status.md Documents post-refactor status and next steps.
reports/2026-02-25_evaluation_frame_refactor/07_implementation_plan.md Documents phased implementation plan for EvaluationFrame migration.
reports/2026-02-25_evaluation_frame_refactor/06_investigation_summary.md Documents investigation findings and recommendations for the new boundary.
reports/2026-02-25_evaluation_frame_refactor/05_probabilistic_scaling_benchmark.md Adds benchmark results motivating dense NumPy representation.
reports/2026-02-25_evaluation_frame_refactor/04_parity_investigation_log.md Adds parity investigation log and identified legacy behaviors/bugs.
reports/2026-02-25_evaluation_frame_refactor/03_evaluation_frame_contract.md Adds EvaluationFrame contract specification document.
reports/2026-02-25_evaluation_frame_refactor/02_current_alignment_semantics.md Documents legacy alignment/regrouping semantics to preserve/replace.
reports/2026-02-25_evaluation_frame_refactor/01_investigation_plan.md Adds initial investigation plan for the refactor.
pyproject.toml Bumps version to 0.5.0; makes pandas optional via [dataframe] extra; moves dev deps into dev group.
LICENSE Adds MIT license text and copyright.
examples/using_native_api.py Adds example showing how to use the new native API.
examples/quickstart.ipynb Removes outdated notebook using the legacy API.
examples/evaluate_native_prototype.py Adds prototype/demo script for grouping semantics on EvaluationFrame.
examples/benchmark_probabilistic_scaling.py Adds benchmarking script comparing legacy vs native representations.
documentation/validate_docs.sh Adds a doc consistency validation script for governance artifacts.
documentation/standards/physical_architecture_standard.md Adds/updates physical architecture and layering/file-structure standard.
documentation/standards/logging_and_observability_standard.md Adds/updates logging/observability standard and scope guidance.
documentation/integration_guide.md Adds/updates integration guidance for the native API and data contract.
documentation/INSTANTIATION_CHECKLIST.md Adds adoption checklist for governance artifacts/standards.
documentation/evaluation_concepts.md Adds conceptual guide (schemas/parallelogram, partitions/sets).
documentation/contributor_protocols/silicon_based_agents.md Adds protocol governing AI-assisted changes and safety constraints.
documentation/contributor_protocols/hardened_protocol_template.md Adds hardened contributor protocol specific to numerical evaluation work.
documentation/contributor_protocols/carbon_based_agents.md Adds protocol defining human contributor responsibilities.
documentation/CICs/README.md Adds index for active Class Intent Contracts.
documentation/CICs/NativeEvaluator.md Adds intent contract for NativeEvaluator responsibilities/failure modes.
documentation/CICs/MetricCatalog.md Adds intent contract for MetricCatalog responsibilities/failure modes.
documentation/CICs/EvaluationReport.md Adds intent contract for EvaluationReport responsibilities/failure modes.
documentation/CICs/EvaluationFrame.md Adds intent contract for EvaluationFrame responsibilities/failure modes.
documentation/CICs/cic_template.md Adds CIC template for future intent contracts.
documentation/ADRs/README.md Reworks ADR index/numbering scheme and contribution guidance.
documentation/ADRs/adr_template.md Replaces ADR template with expanded structure and guidance.
documentation/ADRs/042_metric_catalog.md Adds/updates ADR for MetricCatalog + named profiles decision.
documentation/ADRs/041_evaluation_output_schema.md Adds/updates ADR for output schema direction and responsibilities.
documentation/ADRs/040_evaluation_input_schema.md Adds/updates ADR for input schema (native path) and identifier requirements.
documentation/ADRs/032_metric_calculation_schemas.md Adds/updates ADR describing month/step/sequence evaluation schemas.
documentation/ADRs/031_evaluation_metrics.md Adds/updates ADR describing metric set and implementation status notes.
documentation/ADRs/030_evaluation_strategy.md Adds/updates ADR for rolling-origin evaluation strategy.
documentation/ADRs/023_technical_risk_register.md Adds ADR formalizing the technical risk register artifact.
documentation/ADRs/022_evolution_and_stability.md Adds deferred ADR placeholder for evolution/stability rules.
documentation/ADRs/021_intent_contracts_for_classes.md Adds ADR requiring intent contracts for non-trivial classes.
documentation/ADRs/020_multi_perspective_testing.md Adds ADR for Green/Beige/Red testing taxonomy and parity mandate.
documentation/ADRs/014_boundary_contracts_and_validation.md Adds ADR for boundary contracts + config validation expectations (and deviations).
documentation/ADRs/013_observability_and_explicit_failure.md Adds ADR for fail-loud + persistent observability.
documentation/ADRs/012_authority_over_inference.md Adds ADR prohibiting semantic inference/sniffing across boundaries.
documentation/ADRs/011_topology_and_dependency_rules.md Adds ADR for strict layering and dependency rules.
documentation/ADRs/010_ontology_of_evaluation.md Adds ADR defining evaluation ontology and forbidden concepts.
documentation/ADRs/005_evaluation_output_schema.md Removes obsolete ADR superseded by new numbering/structure.
documentation/ADRs/004_evaluation_input_schema.md Removes obsolete ADR superseded by new numbering/structure.
documentation/ADRs/003_metric_calculation.md Removes obsolete ADR superseded by new numbering/structure.
documentation/ADRs/002_evaluation_strategy.md Removes obsolete ADR superseded by new numbering/structure.
documentation/ADRs/001_silicon_based_agent_protocol.md Adds ADR governing silicon-based agent usage and constraints.
documentation/ADRs/001_evaluation_metrics.md Removes obsolete ADR superseded by new numbering/structure.
documentation/ADRs/000_use_of_adrs.md Adds ADR formalizing ADR usage in this repo.
.gitignore Updates ignore patterns (notably adds reports/).


Comment on lines +14 to +18
Config dict for NativeEvaluator.

All keys are optional (total=False) to match the existing .get() patterns.
Downstream validators (EvaluationManager._validate_config) enforce
required-key semantics at runtime.

Copilot AI Apr 9, 2026

The EvaluationConfig docstring still references EvaluationManager._validate_config as the runtime validator, but this PR’s architecture removes EvaluationManager. Update the docstring to reflect the current reality (e.g., validation happens in NativeEvaluator.evaluate() / at the orchestration boundary, or is a known gap tracked by the risk register).

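A minimal sketch of what the corrected docstring could look like, assuming validation now happens at the NativeEvaluator boundary as the comment suggests. The key names `regression_targets`/`classification_targets` are taken from the snippet below in `_resolve_task_and_metrics`; the example target `"ln_ged_sb"` is a hypothetical placeholder:

```python
from typing import List, TypedDict


class EvaluationConfig(TypedDict, total=False):
    """Config dict for NativeEvaluator.

    All keys are optional (total=False) to match the existing .get()
    patterns. Required-key semantics are enforced at runtime at the
    orchestration boundary (NativeEvaluator.evaluate()); any remaining
    validation gap is tracked in the technical risk register.
    """

    regression_targets: List[str]
    classification_targets: List[str]


# TypedDict with total=False: missing keys simply fall back to .get() defaults.
cfg: EvaluationConfig = {"regression_targets": ["ln_ged_sb"]}
print(cfg.get("classification_targets", []))  # → []
```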
Comment on lines +26 to +29
def _validate(y_true: np.ndarray, y_pred: np.ndarray, identifiers: Dict[str, np.ndarray]):
n_rows = len(y_true)
if y_pred.shape[0] != n_rows:
raise ValueError(f"y_pred rows ({y_pred.shape[0]}) mismatch y_true ({n_rows})")

Copilot AI Apr 9, 2026

EvaluationFrame._validate() doesn’t enforce that y_true is 1D. A 2D array like shape (N, 1) would currently pass length checks and can cause subtle broadcasting/metric issues later. Consider adding an explicit y_true.ndim == 1 check (and a clear error message) to enforce the (N,) contract.

Comment on lines +41 to +50
def _resolve_task_and_metrics(self, ef: EvaluationFrame):
target = ef.metadata.get('target')
# Determine task from config
if target in self.config.get("regression_targets", []):
task = "regression"
elif target in self.config.get("classification_targets", []):
task = "classification"
else:
raise ValueError(f"Target {target} not found in config")


Copilot AI Apr 9, 2026

If ef.metadata lacks a 'target' key, target becomes None and the resulting error (Target None not found in config) is hard to diagnose. Consider explicitly validating that target is present/non-empty and raising a clearer error that mentions the required metadata key and shows available targets from the config.

Comment on lines +63 to +74
def to_dataframe(self, schema: str):
"""
Converts a specific schema's results into a Pandas DataFrame.
If schema='raw', returns the dictionary of mapped metrics dataclasses.
"""
if schema == "raw":
warnings.warn(
"to_dataframe(schema='raw') is deprecated. Use to_dict()['schemas'] instead.",
DeprecationWarning,
stacklevel=2,
)
return self._results

Copilot AI Apr 9, 2026

to_dataframe()’s docstring says schema='raw' returns “the dictionary of mapped metrics dataclasses”, but the implementation returns the raw internal _results dict (nested dicts of floats). Please update the docstring to match behavior (or change the behavior if the mapped-dataclass dict is what you intended).
