Feature/samples for fao #15

Merged
Polichinel merged 18 commits into development from feature/samples_for_fao on Mar 14, 2026
Conversation

@Polichinel
Collaborator

No description provided.

Polichinel and others added 18 commits on February 25, 2026 at 15:28
- Defined EvaluationFrame contract and pure-numpy logic.
- Implemented PandasAdapter for backward-compatible alignment.
- Created Parity Test Campaign (Green/Beige/Red teams).
- Documented performance scaling (14x speedup for sample metrics).
- Identified and documented legacy bugs (step-wise truncation).
- Finalized NativeEvaluator and EvaluationFrame with optimized grouping.
- Vectorized all metric calculators for significant performance gains.
- Isolated PandasAdapter and ensured 'Fail-Loud' early validation.
- Maintained 100% backward compatibility (all 77 tests pass).
- Established Foundation ADR suite (000-041), CICs, and Protocols.
- Physically separated core math from framework-specific logic (ADR-011).
- Implemented EvaluationReport for framework-agnostic result container (ADR-010).
- Completed native metric kernel with vectorized Ignorance Score.
- Exposed modern native API in __init__.py and provided usage example.
- Verified 100% parity across all 77 tests (ADR-020).
- Established Class Intent Contract for EvaluationReport.
- Physically separated core math from all external frameworks (ADR-011).
- Deleted legacy metric_calculators.py; replaced with pure-numpy kernels.
- Implemented robust shape-guards in math kernels to prevent broadcasting traps.
- Refactored NativeEvaluator to return pure-dictionary results (ADR-010).
- Centralized all legacy-compatible mapping inside EvaluationReport.
- Verified 100% parity across all 77 tests after ontological cleanup.
- Standardized metric signatures to handle legacy 'target' arguments.
…y pass

Source fixes:
- native_metric_calculators.py: move _guard_shapes dimension validation out of
  the `if y_pred.dtype == object:` branch so it runs unconditionally for all
  NumPy float arrays (was dead code for the primary use path)
- native_metric_calculators.py: change unimplemented metric stubs (SD, pEMDiv,
  Variogram, Brier, Jeffreys) from NotImplementedError to ValueError with a
  clear user-facing message, consistent with ADR-013 fail-loud contract
- evaluation_frame.py: enforce required identifier keys {time, unit, origin,
  step} in _validate(); raises ValueError listing missing keys
- native_evaluator.py: fix step filtering — pre-initialise only the explicitly
  declared steps (config['steps']) not range(1, max+1); sparse configs like
  [1,3,6,12] now produce exactly those four result slots
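The step-filtering fix above amounts to initialising result slots from the declared list rather than a dense range. A minimal sketch (hypothetical helper name, not the evaluator's real internals):

```python
def init_result_slots(config: dict) -> dict:
    """Pre-initialise result slots for exactly the declared steps."""
    # Before the fix: range(1, max(config['steps']) + 1) created a slot
    # for every intermediate step, so a sparse config like [1, 3, 6, 12]
    # produced twelve slots.  After: exactly the declared four.
    return {step: {} for step in config["steps"]}
```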

Tests (152 passing, ruff clean):
- tests/test_evaluation_frame.py: 20 direct unit tests (GREEN/BEIGE/RED) for
  EvaluationFrame construction, grouping, subsetting, and all validation paths
- tests/test_native_evaluator.py: 19 direct unit tests for NativeEvaluator
  schemas, step filtering (sequential and sparse), legacy_compatibility flag,
  classification, and failure modes
- tests/test_evaluation_report.py: 17 direct unit tests for EvaluationReport
  to_dict, get_schema_results (all 4 task/pred_type combos), to_dataframe
- tests/test_adversarial_inputs.py: add TestAdversarialNativeInputs (7 tests)
  targeting EvaluationFrame+NativeEvaluator directly, surviving Phase 3 deletion
- tests/test_metric_calculators.py: update test_not_implemented_metrics to
  match ValueError (was NotImplementedError)

Documentation:
- integration_guide.md: complete rewrite — architecture diagram, native API
  as primary entry point (PandasAdapter+NativeEvaluator+EvaluationReport),
  legacy EvaluationManager section with deprecation notice, identifier glossary
  defining origin/step/time/unit, transform behaviour clarification
- CICs/NativeEvaluator.md: fix broken §8 example (was results['month'][1],
  now report.to_dataframe('month')); fix §6 (KeyError→ValueError); add
  legacy_compatibility and exact step filtering to §3
- CICs/EvaluationFrame.md: add origin+step to §8 example; update §6 failure
  modes to reflect actual ValueError behaviour
- ADRs/040_evaluation_input_schema.md: remove stale sniffing language; add
  Native Path Invariants table; mark Accepted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prepares the codebase for the eventual Phase 3 deletion of EvaluationManager,
PandasAdapter, and the pandas dependency, once upstream parity is confirmed.
No behaviour is changed and all 152 tests continue to pass.

Changes:
- __init__.py: split exports into permanent (EvaluationFrame, NativeEvaluator,
  EvaluationReport) and temporary (EvaluationManager, PandasAdapter) sections
  with clear comments
- evaluation_manager.py: add module-level PHASE-3-DELETE docstring; emit
  DeprecationWarning on every instantiation
- adapters/pandas.py: add module-level PHASE-3-DELETE docstring; emit
  DeprecationWarning on every from_dataframes() call
- metrics.py: make `import pandas` lazy (moved inside
  evaluation_dict_to_dataframe()); permanent core module now loadable without
  pandas unless to_dataframe() is actually called
- pyproject.toml: move pytest to [tool.poetry.group.dev.dependencies]; add
  comment noting pandas will become optional in Phase 3
- tests/ (10 files): add PHASE-3-DELETE docstring to every test file that
  covers the legacy EvaluationManager/PandasAdapter path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
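The lazy-import change to `metrics.py` follows a standard pattern: defer the `import pandas` into the one function that needs it, so the module itself imports cleanly without pandas installed. A sketch under the assumption that the function builds a DataFrame from a nested dict (the real body may differ):

```python
def evaluation_dict_to_dataframe(results: dict):
    """Convert a nested evaluation-results dict to a DataFrame."""
    # Lazy import: the permanent core module stays loadable without
    # pandas; the dependency cost is only paid if a caller actually
    # asks for a DataFrame.
    import pandas as pd
    return pd.DataFrame.from_dict(results, orient="index")
```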
Introduce MetricCatalog (ADR-042) following the views-r2darts2 catalog pattern:
- MetricSpec genome registry declares required hyperparameters per metric
- Named evaluation profiles provide values (Chain of Responsibility)
- resolve_metric_params: model overrides → profile → fail loud
- No default values in function signatures — enforces explicit configuration
- NativeEvaluator now uses catalog for dispatch and hyperparameter resolution
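The resolution chain described above (model overrides, then profile, then fail loud) can be sketched as a small function. The signature and parameter names here are illustrative assumptions, not the actual ADR-042 API:

```python
def resolve_metric_params(metric: str, registry: dict,
                          overrides: dict, profile: dict) -> dict:
    """Resolve hyperparameters: model overrides -> profile -> fail loud."""
    resolved = {}
    for param in registry[metric]:           # required params per metric
        if param in overrides:
            resolved[param] = overrides[param]   # highest priority
        elif param in profile:
            resolved[param] = profile[param]     # named evaluation profile
        else:
            # No default values anywhere in the chain: a missing
            # hyperparameter is a configuration error, not a fallback.
            raise ValueError(
                f"metric '{metric}' requires '{param}' but neither the "
                f"model overrides nor the profile provide it"
            )
    return resolved
```

Keeping defaults out of function signatures means every run is explicitly configured, which is the point of the fail-loud contract.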

Also includes prior work from this branch:
- Pure-numpy CRPS replacing properscoring (7 parity tests, 1e-10 tolerance)
- twCRPS (threshold-weighted CRPS) using chaining representation
- QIS (Quantile Interval Score) for asymmetric quantile levels
- properscoring moved to dev-only dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
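For reference, a pure-numpy ensemble CRPS can be written in the energy form, CRPS = E|X - y| - 0.5 E|X - X'|. This is one common estimator and not necessarily the exact formulation used on this branch:

```python
import numpy as np

def crps_ensemble(y: float, samples: np.ndarray) -> float:
    """Empirical CRPS for an ensemble, energy form:
    CRPS = E|X - y| - 0.5 * E|X - X'|."""
    term1 = np.mean(np.abs(samples - y))
    # Pairwise absolute differences via broadcasting: (m, 1) - (1, m).
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term1 - term2)
```

A degenerate ensemble (all members equal to c) reduces to the absolute error |c - y|, which makes a convenient sanity check in parity tests.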
Add Magnitude Calibration Ratio (MCR_point, MCR_sample) metric with full
catalog/dispatch/dataclass integration. Add hydranet_ucdp evaluation profile.
Fix EvaluationManager._validate_config to accept sample-only models. Add
beige+red tests for twCRPS, QIS, MIS, and MCR per ADR-020. Remove dead code
from NativeEvaluator (unused metrics_map and dispatch dict imports). Sync
Brier/Jeffreys fields into RegressionSampleEvaluationMetrics dataclass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… cleanup

- Fix EvaluationReport CIC (constructor signature, to_json→to_dict, FM1 guard)
- Fix integration guide broken API example (adapt→from_dataframes)
- Remap all stale ADR references (old flat 001-008 → grouped 010-042)
- Update PandasAdapter CIC to reflect deprecated status and silent-skip behavior
- Add config keys, properties, and ADR-042 to EvaluationFrame/NativeEvaluator CICs
- Remove stale CIC README entries (ModelRunner, VolumeHandler, BoundaryValidator)
- Overhaul README: add Quick Start, 2×2 metrics tables, 3-layer architecture,
  evaluation profiles, updated project structure
- Add EvaluationConfig TypedDict (config_schema.py), deprecate to_dataframe('raw')
- Remove Brier/Jeffreys from regression sample dispatch, remove 17 unused aliases
- Add structural consistency test (metric membership ↔ dataclass fields)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
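The `EvaluationConfig` TypedDict mentioned above might look roughly like this. Only `steps` and `legacy_compatibility` are named elsewhere in this PR; the `metrics` key is a hypothetical stand-in for the remaining fields in `config_schema.py`:

```python
from typing import TypedDict

class EvaluationConfig(TypedDict):
    # 'steps' and 'legacy_compatibility' appear in the PR; 'metrics' is
    # an illustrative assumption about the real schema's other keys.
    steps: list[int]
    metrics: list[str]
    legacy_compatibility: bool
```

A TypedDict gives static checkers a schema for the plain dicts the evaluator already passes around, without changing runtime behaviour.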
@Polichinel
Collaborator Author

Code review

Found 3 issues:

  1. .gitignore has a corrupted line: *.log.*reports/ is a single token instead of two separate lines (*.log.* and reports/). The original file lacked a trailing newline, so appending reports/ merged it with the previous line. Neither pattern works correctly — *.log.* is broken and reports/ is not ignored.

views-evaluation/.gitignore

Lines 217 to 218 in 26179d0

*.log
*.log.*reports/

  2. examples/evaluate_native_prototype.py imports from a path that does not exist: from views_evaluation.evaluation.adapters import PandasAdapter. The actual module is at views_evaluation.adapters.pandas, not views_evaluation.evaluation.adapters. This will raise ModuleNotFoundError at runtime.

import pandas as pd
from views_evaluation.evaluation.adapters import PandasAdapter
from views_evaluation.evaluation.evaluation_frame import EvaluationFrame

  3. documentation/integration_guide.md line 118 has an incorrect import path: from views_evaluation.evaluation.adapters.pandas import PandasAdapter. The correct path is from views_evaluation.adapters.pandas import PandasAdapter (adapters is directly under views_evaluation/, not under views_evaluation/evaluation/). Users copy-pasting this example will get ModuleNotFoundError.

import pandas as pd
from views_evaluation.evaluation.adapters.pandas import PandasAdapter
from views_evaluation.evaluation.native_evaluator import NativeEvaluator

🤖 Generated with Claude Code


Polichinel merged commit 19dd0c2 into development on Mar 14, 2026. 4 checks passed.
Polichinel deleted the feature/samples_for_fao branch on March 14, 2026 at 17:01.