Merged
- Defined EvaluationFrame contract and pure-numpy logic.
- Implemented PandasAdapter for backward-compatible alignment.
- Created Parity Test Campaign (Green/Beige/Red teams).
- Documented performance scaling (14x speedup for sample metrics).
- Identified and documented legacy bugs (step-wise truncation).
…, and PandasAdapter
- Finalized NativeEvaluator and EvaluationFrame with optimized grouping.
- Vectorized all metric calculators for significant performance gains.
- Isolated PandasAdapter and ensured 'Fail-Loud' early validation.
- Maintained 100% backward compatibility (all 77 tests pass).
- Established Foundation ADR suite (000-041), CICs, and Protocols.
- Physically separated core math from framework-specific logic (ADR-011).
- Implemented EvaluationReport as framework-agnostic result container (ADR-010).
- Completed native metric kernel with vectorized Ignorance Score.
- Exposed modern native API in __init__.py and provided usage example.
- Verified 100% parity across all 77 tests (ADR-020).
- Established Class Intent Contract for EvaluationReport.
- Physically separated core math from all external frameworks (ADR-011).
- Deleted legacy metric_calculators.py; replaced with pure-numpy kernels.
- Implemented robust shape-guards in math kernels to prevent broadcasting traps.
- Refactored NativeEvaluator to return pure-dictionary results (ADR-010).
- Centralized all legacy-compatible mapping inside EvaluationReport.
- Verified 100% parity across all 77 tests after ontological cleanup.
- Standardized metric signatures to handle legacy 'target' arguments.
…y pass
Source fixes:
- native_metric_calculators.py: move _guard_shapes dimension validation out of
the `if y_pred.dtype == object:` branch so it runs unconditionally for all
NumPy float arrays (was dead code for the primary use path)
- native_metric_calculators.py: change unimplemented metric stubs (SD, pEMDiv,
Variogram, Brier, Jeffreys) from NotImplementedError to ValueError with a
clear user-facing message, consistent with ADR-013 fail-loud contract
- evaluation_frame.py: enforce required identifier keys {time, unit, origin,
step} in _validate(); raises ValueError listing missing keys
- native_evaluator.py: fix step filtering — pre-initialise only the explicitly
declared steps (config['steps']) not range(1, max+1); sparse configs like
[1,3,6,12] now produce exactly those four result slots
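The step-filtering fix can be sketched as follows — this is a hypothetical reduction of the change, not the actual NativeEvaluator code: result slots are pre-initialised from `config['steps']` alone, so a sparse declaration yields exactly those slots and no phantom intermediates.

```python
def init_step_results(config: dict) -> dict:
    """Pre-initialise result slots only for explicitly declared steps,
    not range(1, max_step + 1)."""
    steps = config["steps"]  # e.g. [1, 3, 6, 12] — sparse configs are fine
    if not steps:
        raise ValueError("config['steps'] must declare at least one step")
    return {step: {} for step in steps}

results = init_step_results({"steps": [1, 3, 6, 12]})
assert sorted(results) == [1, 3, 6, 12]  # exactly four result slots
assert 2 not in results                  # no slot for undeclared steps
```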
Tests (152 passing, ruff clean):
- tests/test_evaluation_frame.py: 20 direct unit tests (GREEN/BEIGE/RED) for
EvaluationFrame construction, grouping, subsetting, and all validation paths
- tests/test_native_evaluator.py: 19 direct unit tests for NativeEvaluator
schemas, step filtering (sequential and sparse), legacy_compatibility flag,
classification, and failure modes
- tests/test_evaluation_report.py: 17 direct unit tests for EvaluationReport
to_dict, get_schema_results (all 4 task/pred_type combos), to_dataframe
- tests/test_adversarial_inputs.py: add TestAdversarialNativeInputs (7 tests)
targeting EvaluationFrame+NativeEvaluator directly, surviving Phase 3 deletion
- tests/test_metric_calculators.py: update test_not_implemented_metrics to
match ValueError (was NotImplementedError)
Documentation:
- integration_guide.md: complete rewrite — architecture diagram, native API
as primary entry point (PandasAdapter+NativeEvaluator+EvaluationReport),
legacy EvaluationManager section with deprecation notice, identifier glossary
defining origin/step/time/unit, transform behaviour clarification
- CICs/NativeEvaluator.md: fix broken §8 example (was results['month'][1],
now report.to_dataframe('month')); fix §6 (KeyError→ValueError); add
legacy_compatibility and exact step filtering to §3
- CICs/EvaluationFrame.md: add origin+step to §8 example; update §6 failure
modes to reflect actual ValueError behaviour
- ADRs/040_evaluation_input_schema.md: remove stale sniffing language; add
Native Path Invariants table; mark Accepted
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prepares the codebase for the eventual Phase 3 deletion of EvaluationManager, PandasAdapter, and the pandas dependency, once upstream parity is confirmed. No behaviour is changed and all 152 tests continue to pass.
Changes:
- __init__.py: split exports into permanent (EvaluationFrame, NativeEvaluator, EvaluationReport) and temporary (EvaluationManager, PandasAdapter) sections with clear comments
- evaluation_manager.py: add module-level PHASE-3-DELETE docstring; emit DeprecationWarning on every instantiation
- adapters/pandas.py: add module-level PHASE-3-DELETE docstring; emit DeprecationWarning on every from_dataframes() call
- metrics.py: make `import pandas` lazy (moved inside evaluation_dict_to_dataframe()); the permanent core module is now loadable without pandas unless to_dataframe() is actually called
- pyproject.toml: move pytest to [tool.poetry.group.dev.dependencies]; add comment noting pandas will become optional in Phase 3
- tests/ (10 files): add PHASE-3-DELETE docstring to every test file that covers the legacy EvaluationManager/PandasAdapter path
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
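The two deprecation mechanics above — a DeprecationWarning on instantiation and a lazy pandas import — follow a standard Python pattern. A minimal sketch, assuming simplified class and function bodies (the real modules carry more logic):

```python
import warnings

class EvaluationManager:
    """PHASE-3-DELETE: legacy pandas-based path, kept only until native
    parity is confirmed upstream."""

    def __init__(self, config: dict) -> None:
        # Warn on every instantiation; stacklevel=2 points at the caller.
        warnings.warn(
            "EvaluationManager is deprecated; use NativeEvaluator instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        self.config = config

def evaluation_dict_to_dataframe(results: dict):
    # Lazy import: the module stays loadable without pandas installed;
    # pandas is only required if this function is actually called.
    import pandas as pd
    return pd.DataFrame.from_dict(results, orient="index")
```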
Introduce MetricCatalog (ADR-042) following the views-r2darts2 catalog pattern:
- MetricSpec genome registry declares required hyperparameters per metric
- Named evaluation profiles provide values (Chain of Responsibility)
- resolve_metric_params: model overrides → profile → fail loud
- No default values in function signatures — enforces explicit configuration
- NativeEvaluator now uses catalog for dispatch and hyperparameter resolution
Also includes prior work from this branch:
- Pure-numpy CRPS replacing properscoring (7 parity tests, 1e-10 tolerance)
- twCRPS (threshold-weighted CRPS) using chaining representation
- QIS (Quantile Interval Score) for asymmetric quantile levels
- properscoring moved to dev-only dependency
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Magnitude Calibration Ratio (MCR_point, MCR_sample) metric with full catalog/dispatch/dataclass integration.
- Add hydranet_ucdp evaluation profile.
- Fix EvaluationManager._validate_config to accept sample-only models.
- Add beige+red tests for twCRPS, QIS, MIS, and MCR per ADR-020.
- Remove dead code from NativeEvaluator (unused metrics_map and dispatch dict imports).
- Sync Brier/Jeffreys fields into RegressionSampleEvaluationMetrics dataclass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… cleanup
- Fix EvaluationReport CIC (constructor signature, to_json→to_dict, FM1 guard)
- Fix integration guide broken API example (adapt→from_dataframes)
- Remap all stale ADR references (old flat 001-008 → grouped 010-042)
- Update PandasAdapter CIC to reflect deprecated status and silent-skip behavior
- Add config keys, properties, and ADR-042 to EvaluationFrame/NativeEvaluator CICs
- Remove stale CIC README entries (ModelRunner, VolumeHandler, BoundaryValidator)
- Overhaul README: add Quick Start, 2×2 metrics tables, 3-layer architecture,
evaluation profiles, updated project structure
- Add EvaluationConfig TypedDict (config_schema.py), deprecate to_dataframe('raw')
- Remove Brier/Jeffreys from regression sample dispatch, remove 17 unused aliases
- Add structural consistency test (metric membership ↔ dataclass fields)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code review found 3 issues:
- Lines 217 to 218 in 26179d0
- views-evaluation/examples/evaluate_native_prototype.py, lines 2 to 4 in 26179d0
- views-evaluation/documentation/integration_guide.md, lines 117 to 119 in 26179d0