Merged
- Defined EvaluationFrame contract and pure-numpy logic.
- Implemented PandasAdapter for backward-compatible alignment.
- Created Parity Test Campaign (Green/Beige/Red teams).
- Documented performance scaling (14x speedup for sample metrics).
- Identified and documented legacy bugs (step-wise truncation).
…, and PandasAdapter
- Finalized NativeEvaluator and EvaluationFrame with optimized grouping.
- Vectorized all metric calculators for significant performance gains.
- Isolated PandasAdapter and ensured 'Fail-Loud' early validation.
- Maintained 100% backward compatibility (all 77 tests pass).
- Established Foundation ADR suite (000-041), CICs, and Protocols.
- Physically separated core math from framework-specific logic (ADR-011).
- Implemented EvaluationReport as framework-agnostic result container (ADR-010).
- Completed native metric kernel with vectorized Ignorance Score.
- Exposed modern native API in __init__.py and provided usage example.
- Verified 100% parity across all 77 tests (ADR-020).
- Established Class Intent Contract for EvaluationReport.
- Physically separated core math from all external frameworks (ADR-011).
- Deleted legacy metric_calculators.py; replaced with pure-numpy kernels.
- Implemented robust shape-guards in math kernels to prevent broadcasting traps.
- Refactored NativeEvaluator to return pure-dictionary results (ADR-010).
- Centralized all legacy-compatible mapping inside EvaluationReport.
- Verified 100% parity across all 77 tests after ontological cleanup.
- Standardized metric signatures to handle legacy 'target' arguments.
…y pass
Source fixes:
- native_metric_calculators.py: move _guard_shapes dimension validation out of
the `if y_pred.dtype == object:` branch so it runs unconditionally for all
NumPy float arrays (was dead code for the primary use path)
- native_metric_calculators.py: change unimplemented metric stubs (SD, pEMDiv,
Variogram, Brier, Jeffreys) from NotImplementedError to ValueError with a
clear user-facing message, consistent with ADR-013 fail-loud contract
- evaluation_frame.py: enforce required identifier keys {time, unit, origin,
step} in _validate(); raises ValueError listing missing keys
- native_evaluator.py: fix step filtering — pre-initialise only the explicitly
declared steps (config['steps']) not range(1, max+1); sparse configs like
[1,3,6,12] now produce exactly those four result slots
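The step-filtering fix can be sketched as follows — this is a hypothetical reduction of the change, not the actual NativeEvaluator code: result slots are pre-initialised from `config['steps']` alone, so a sparse declaration yields exactly those slots and no phantom intermediates.

```python
def init_step_results(config: dict) -> dict:
    """Pre-initialise result slots only for explicitly declared steps,
    not range(1, max_step + 1)."""
    steps = config["steps"]  # e.g. [1, 3, 6, 12] — sparse configs are fine
    if not steps:
        raise ValueError("config['steps'] must declare at least one step")
    return {step: {} for step in steps}

results = init_step_results({"steps": [1, 3, 6, 12]})
assert sorted(results) == [1, 3, 6, 12]  # exactly four result slots
assert 2 not in results                  # no slot for undeclared steps
```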
Tests (152 passing, ruff clean):
- tests/test_evaluation_frame.py: 20 direct unit tests (GREEN/BEIGE/RED) for
EvaluationFrame construction, grouping, subsetting, and all validation paths
- tests/test_native_evaluator.py: 19 direct unit tests for NativeEvaluator
schemas, step filtering (sequential and sparse), legacy_compatibility flag,
classification, and failure modes
- tests/test_evaluation_report.py: 17 direct unit tests for EvaluationReport
to_dict, get_schema_results (all 4 task/pred_type combos), to_dataframe
- tests/test_adversarial_inputs.py: add TestAdversarialNativeInputs (7 tests)
targeting EvaluationFrame+NativeEvaluator directly, surviving Phase 3 deletion
- tests/test_metric_calculators.py: update test_not_implemented_metrics to
match ValueError (was NotImplementedError)
Documentation:
- integration_guide.md: complete rewrite — architecture diagram, native API
as primary entry point (PandasAdapter+NativeEvaluator+EvaluationReport),
legacy EvaluationManager section with deprecation notice, identifier glossary
defining origin/step/time/unit, transform behaviour clarification
- CICs/NativeEvaluator.md: fix broken §8 example (was results['month'][1],
now report.to_dataframe('month')); fix §6 (KeyError→ValueError); add
legacy_compatibility and exact step filtering to §3
- CICs/EvaluationFrame.md: add origin+step to §8 example; update §6 failure
modes to reflect actual ValueError behaviour
- ADRs/040_evaluation_input_schema.md: remove stale sniffing language; add
Native Path Invariants table; mark Accepted
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prepares the codebase for the eventual Phase 3 deletion of EvaluationManager, PandasAdapter, and the pandas dependency, once upstream parity is confirmed. No behaviour is changed and all 152 tests continue to pass.
Changes:
- __init__.py: split exports into permanent (EvaluationFrame, NativeEvaluator, EvaluationReport) and temporary (EvaluationManager, PandasAdapter) sections with clear comments
- evaluation_manager.py: add module-level PHASE-3-DELETE docstring; emit DeprecationWarning on every instantiation
- adapters/pandas.py: add module-level PHASE-3-DELETE docstring; emit DeprecationWarning on every from_dataframes() call
- metrics.py: make `import pandas` lazy (moved inside evaluation_dict_to_dataframe()); the permanent core module is now loadable without pandas unless to_dataframe() is actually called
- pyproject.toml: move pytest to [tool.poetry.group.dev.dependencies]; add comment noting pandas will become optional in Phase 3
- tests/ (10 files): add PHASE-3-DELETE docstring to every test file that covers the legacy EvaluationManager/PandasAdapter path
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
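The two deprecation mechanics above — a DeprecationWarning on instantiation and a lazy pandas import — follow a standard Python pattern. A minimal sketch, assuming simplified class and function bodies (the real modules carry more logic):

```python
import warnings

class EvaluationManager:
    """PHASE-3-DELETE: legacy pandas-based path, kept only until native
    parity is confirmed upstream."""

    def __init__(self, config: dict) -> None:
        # Warn on every instantiation; stacklevel=2 points at the caller.
        warnings.warn(
            "EvaluationManager is deprecated; use NativeEvaluator instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        self.config = config

def evaluation_dict_to_dataframe(results: dict):
    # Lazy import: the module stays loadable without pandas installed;
    # pandas is only required if this function is actually called.
    import pandas as pd
    return pd.DataFrame.from_dict(results, orient="index")
```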
Introduce MetricCatalog (ADR-042) following the views-r2darts2 catalog pattern:
- MetricSpec genome registry declares required hyperparameters per metric
- Named evaluation profiles provide values (Chain of Responsibility)
- resolve_metric_params: model overrides → profile → fail loud
- No default values in function signatures — enforces explicit configuration
- NativeEvaluator now uses catalog for dispatch and hyperparameter resolution
Also includes prior work from this branch:
- Pure-numpy CRPS replacing properscoring (7 parity tests, 1e-10 tolerance)
- twCRPS (threshold-weighted CRPS) using chaining representation
- QIS (Quantile Interval Score) for asymmetric quantile levels
- properscoring moved to dev-only dependency
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Magnitude Calibration Ratio (MCR_point, MCR_sample) metric with full catalog/dispatch/dataclass integration.
- Add hydranet_ucdp evaluation profile.
- Fix EvaluationManager._validate_config to accept sample-only models.
- Add beige+red tests for twCRPS, QIS, MIS, and MCR per ADR-020.
- Remove dead code from NativeEvaluator (unused metrics_map and dispatch dict imports).
- Sync Brier/Jeffreys fields into RegressionSampleEvaluationMetrics dataclass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… cleanup
- Fix EvaluationReport CIC (constructor signature, to_json→to_dict, FM1 guard)
- Fix integration guide broken API example (adapt→from_dataframes)
- Remap all stale ADR references (old flat 001-008 → grouped 010-042)
- Update PandasAdapter CIC to reflect deprecated status and silent-skip behavior
- Add config keys, properties, and ADR-042 to EvaluationFrame/NativeEvaluator CICs
- Remove stale CIC README entries (ModelRunner, VolumeHandler, BoundaryValidator)
- Overhaul README: add Quick Start, 2×2 metrics tables, 3-layer architecture,
evaluation profiles, updated project structure
- Add EvaluationConfig TypedDict (config_schema.py), deprecate to_dataframe('raw')
- Remove Brier/Jeffreys from regression sample dispatch, remove 17 unused aliases
- Add structural consistency test (metric membership ↔ dataclass fields)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code review found 3 issues:
- Lines 217 to 218 in 26179d0
- views-evaluation/examples/evaluate_native_prototype.py, lines 2 to 4 in 26179d0
- views-evaluation/documentation/integration_guide.md, lines 117 to 119 in 26179d0