Feature/documentation verification suite by Polichinel · Pull Request #14 · views-platform/views-evaluation

Polichinel · 2026-02-24T14:17:27Z

Strategic Terminology & Configuration Ontology Update

This PR implements a strategic shift in the library’s terminology and configuration ontology.

The primary goal is to move from the conceptually broad term “uncertainty” to the technically precise term “sample”, reflecting that the views-evaluation engine evaluates draws/samples from predictive distributions.

Key Changes

1. Terminology Migration (Ontology)

Renamed all uncertainty-related classes and dictionaries to sample
- Example:
  - RegressionUncertaintyEvaluationMetrics → RegressionSampleEvaluationMetrics
Updated internal logic to explicitly distinguish between:
- Regression vs Classification tasks
- Point vs Sample prediction types
Backward Compatibility
- Maintained legacy aliases
- Implemented deprecated fallback logic for all renamed objects
- Ensures zero immediate breakage for downstream repositories

2. Standardized Configuration Schema

The evaluate() method now follows a strictly typed configuration ontology.

A normalization layer has been added to EvaluationManager that:

Translates legacy keys → canonical keys
Emits a DeprecationWarning when legacy keys are used

Configuration Key Migration

Legacy Key	New Canonical Key
`targets`	`regression_targets`
`metrics`	`regression_point_metrics`
`regression_uncertainty_metrics`	`regression_sample_metrics`
`classification_uncertainty_metrics`	`classification_sample_metrics`

3. Documentation & Transparency

README Migration Notice
- Added a prominent “ATTENTION” section at the top of the README
- Guides developers through the transition
Configuration Schema Table
- Added a clear schema specification to the README
- Enables downstream users to validate their config dictionaries
Integration Guide
- Fully updated:
  - documentation/integration_guide.md
  - Relevant ADRs
- Reflects the new API and terminology

4. Legal

Added a standard MIT License

Verification Results

Test Suite
- ✅ 70 / 70 tests passing (pytest)
Legacy Support
- Manually verified that legacy-style configs (e.g., targets, metrics)
- Produce correct results
- Emit the intended DeprecationWarning
Documentation
- Verified all documentation examples and doclinks are functional

Impact on Downstream Repositories

Existing implementations in:

views-pipeline-core
views-models

will continue to function without breakage but will emit DeprecationWarning logs.

It is recommended to migrate evaluation configurations to the new canonical keys during the next maintenance cycle.

Adds a new test suite in tests/test_documentation_contracts.py to verify the contracts and claims made in the project's documentation. These tests treat the documentation as hypotheses and verify them against the actual behavior of the EvaluationManager. Key findings from the tests: - The EvaluationManager implicitly converts raw float point predictions to single-element numpy arrays, which contradicts the documentation's claim that this would cause an error. The documentation has been updated to reflect this behavior: - eval_lib_imp.md is updated to clarify the implicit conversion and change the 'Mandatory Reconciliation Step' to 'Recommended'. - stepshifter_full_imp_report.md is updated with a final conclusion clarifying the EvaluationManager's actual behavior. Also organizes the analysis reports into a new reports/ directory.

Adds a new document outlining the comprehensive plan for Phase 4: Non-Functional & Operational Readiness testing. This includes detailed sections on Performance & Scalability Benchmarking, Logging and Observability Verification, Memory Profiling, and Concurrency/Parallelism Safety as a future consideration. This plan aims to ensure the library's suitability for critical infrastructure environments.

Adds a robust test suite covering adversarial inputs and metric correctness, and generates a technical debt backlog document. Phase 2 (Adversarial & Edge-Case Testing) findings: - The is not robust to non-finite numbers (, ), crashing with from 's validation. - It crashes with on empty lists (from ). - It crashes with on empty DataFrames. - It crashes with on non-overlapping indices (from ). - This highlights a lack of internal input validation and graceful error handling. Phase 3 (Data-Centric & Metric-Specific Validation) findings: - Verified numerical correctness of with golden datasets. - Confirmed metric correctly uses kwarg. - Verified for both point and uncertainty predictions against . A document has been created, detailing these fragilities and recommending future improvements for robustness. Moved fixture to for shared access.

Updates the (VIEWS Evaluation Technical Integration Guide) to incorporate critical findings from adversarial testing (Phase 2), providing a clearer picture of the library's behavior and limitations. This includes: - A new section (3.5) detailing 'Robustness Limitations & Input Validation Responsibility', highlighting the library's fragility to non-finite numbers and malformed structural data, and emphasizing consumer responsibility for pre-validation. - Enhanced Section 3.4 on 'Data-State Coherency' to clarify that the applies transformations without validating mathematical appropriateness. - A cross-reference to for a comprehensive list of known issues. Updates the (Forensic Analysis of views-r2darts2 Evaluation Interface) with minor contextual notes: - A clarification in Section 4 acknowledging that has since been updated. - A clarification in Section 5, Point 2, regarding 'Point Prediction Format Ambiguity', reflecting that implicitly converts raw floats, making strict consumer-side reconciliation less critical for runtime.

Addressed linting errors in and . - : Replaced / with / for boolean comparisons. - : Removed unused variable assignments for , , , , , , and . These changes ensure adherence to linting standards within the test suite.

Removed from as it was an unused import, identified by the ruff linter.

Applied automated changes to files outside the directory after confirming all tests pass. - : Removed unused import and fixed f-string formatting. - : Removed unused and typing imports. - : Removed unused , , and typing imports. These minor changes ensure code quality and adherence to linting standards throughout the project.

This commit introduces comprehensive documentation and a rigorous test suite to clarify and verify the core concepts of the views-evaluation library. - Adds `documentation/evaluation_concepts.md` to clearly explain the differences between partitions, sets, and the three evaluation schemas (time-series-wise, step-wise, and month-wise). - Adds `documentation/integration_guide.md`, a step-by-step guide for developers on how to format their data and integrate a new model with the library. - Adds `tests/test_evaluation_schemas.py`, a permanent and rigorous test suite that programmatically verifies the grouping logic of the three evaluation schemas against the documentation. - Fixes test pollution issues discovered during development by isolating mocks within the new test suite, ensuring the stability of the entire test run.

Adds a prominent note to ADR-001 to clarify that several documented metrics are not yet implemented in the code. This makes the discrepancy clear to developers and aligns the documentation with the current state of the project.

…n tests - Hardens `EvaluationManager.validate_predictions` to strictly enforce the "exactly one column" contract, preventing crashes from duplicate or extra columns. - Adds `tests/test_data_contract.py` to verify the single-target and single-column requirements. - Updates `documentation/integration_guide.md` with a "Common Pitfalls" section to clarify MultiIndex and column usage. - Updates `reports/technical_debt_backlog.md` to reflect resolved validation issues. - Includes recent verification reports and drafts.

…ob and _raw

… evaluation proposal

…ntation-verification-suite

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… (v0.4.0) - EvaluationManager now dispatches on {regression,classification} x {point,uncertainty} - Task type declared explicitly in config; prediction type detected from data shape - Config schema: regression_targets, regression_point_metrics, regression_uncertainty_metrics, classification_targets, classification_point_metrics, classification_uncertainty_metrics - Legacy config keys (targets, metrics) accepted with loud deprecation warning - _normalise_config() and _validate_config() enforce fail-loud-fail-fast contract - calculate_ap() no longer applies internal threshold; expects pre-binarised actuals - AP moved to CLASSIFICATION_POINT_METRIC_FUNCTIONS only - CRPS moved to uncertainty dicts only (regression and classification) - Four new metric dataclasses mirror the four dispatch dicts - transform_data() crash on unknown prefix replaced with logger.warning + identity - EvaluationManager.__init__ no longer accepts metrics_list (breaking change) - 70 tests passing, ruff clean Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rtem

- Renames all 'uncertainty' metrics and classes to 'sample' (e.g., regression_sample_metrics) - Updates EvaluationManager to use the new terminology while maintaining legacy aliases - Adds a prominent Migration Notice and Configuration Schema to README.md - Updates all tests and documentation to align with the new ontology - Adds MIT License

Polichinel and others added 20 commits January 23, 2026 11:00

Fix: Linting issues in test files

4b29e81

Addressed linting errors in and . - : Replaced / with / for boolean comparisons. - : Removed unused variable assignments for , , , , , , and . These changes ensure adherence to linting standards within the test suite.

Fix: Remove unused import in tests/test_metric_calculators.py

50bfe9c

Removed from as it was an unused import, identified by the ruff linter.

docs(ADR-001): Mark unimplemented metrics

0409129

Adds a prominent note to ADR-001 to clarify that several documented metrics are not yet implemented in the code. This makes the discrepancy clear to developers and aligns the documentation with the current state of the project.

docs(reports): Add post-mortem on multi-target investigation

8dc478b

small patch to allow for Hydranet to pass pred_taget with surffix _pr…

c7e9697

…ob and _raw

refactor(evaluation): remove hydranet patches and add manifest-driven…

8632fe9

… evaluation proposal

Merge remote-tracking branch 'origin/development' into feature/docume…

98ef932

…ntation-verification-suite

fix(linting): remove unused variable assignment flagged by ruff (F841)

5967466

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(post-mortem): add evaluation ontology liberation session post-mo…

9b631c0

…rtem

docs: update copyright holders in LICENSE

2433433

docs: add Håvard Hegre to copyright holders

ddb0542

Polichinel merged commit fcbe9e4 into development Feb 24, 2026
4 checks passed

Polichinel deleted the feature/documentation-verification-suite branch February 24, 2026 14:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/documentation verification suite#14

Feature/documentation verification suite#14
Polichinel merged 20 commits intodevelopmentfrom
feature/documentation-verification-suite

Polichinel commented Feb 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Polichinel commented Feb 24, 2026

Strategic Terminology & Configuration Ontology Update

Key Changes

1. Terminology Migration (Ontology)

2. Standardized Configuration Schema

Configuration Key Migration

3. Documentation & Transparency

4. Legal

Verification Results

Impact on Downstream Repositories

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant