Feature/documentation verification suite#14
Merged
Polichinel merged 20 commits intodevelopmentfrom Feb 24, 2026
Merged
Conversation
Adds a new test suite in tests/test_documentation_contracts.py to verify the contracts and claims made in the project's documentation. These tests treat the documentation as hypotheses and verify them against the actual behavior of the EvaluationManager. Key findings from the tests: - The EvaluationManager implicitly converts raw float point predictions to single-element numpy arrays, which contradicts the documentation's claim that this would cause an error. The documentation has been updated to reflect this behavior: - eval_lib_imp.md is updated to clarify the implicit conversion and change the 'Mandatory Reconciliation Step' to 'Recommended'. - stepshifter_full_imp_report.md is updated with a final conclusion clarifying the EvaluationManager's actual behavior. Also organizes the analysis reports into a new reports/ directory.
Adds a new document outlining the comprehensive plan for Phase 4: Non-Functional & Operational Readiness testing. This includes detailed sections on Performance & Scalability Benchmarking, Logging and Observability Verification, Memory Profiling, and Concurrency/Parallelism Safety as a future consideration. This plan aims to ensure the library's suitability for critical infrastructure environments.
Adds a robust test suite covering adversarial inputs and metric correctness, and generates a technical debt backlog document. Phase 2 (Adversarial & Edge-Case Testing) findings: - The is not robust to non-finite numbers (, ), crashing with from 's validation. - It crashes with on empty lists (from ). - It crashes with on empty DataFrames. - It crashes with on non-overlapping indices (from ). - This highlights a lack of internal input validation and graceful error handling. Phase 3 (Data-Centric & Metric-Specific Validation) findings: - Verified numerical correctness of with golden datasets. - Confirmed metric correctly uses kwarg. - Verified for both point and uncertainty predictions against . A document has been created, detailing these fragilities and recommending future improvements for robustness. Moved fixture to for shared access.
Updates the (VIEWS Evaluation Technical Integration Guide) to incorporate critical findings from adversarial testing (Phase 2), providing a clearer picture of the library's behavior and limitations. This includes: - A new section (3.5) detailing 'Robustness Limitations & Input Validation Responsibility', highlighting the library's fragility to non-finite numbers and malformed structural data, and emphasizing consumer responsibility for pre-validation. - Enhanced Section 3.4 on 'Data-State Coherency' to clarify that the applies transformations without validating mathematical appropriateness. - A cross-reference to for a comprehensive list of known issues. Updates the (Forensic Analysis of views-r2darts2 Evaluation Interface) with minor contextual notes: - A clarification in Section 4 acknowledging that has since been updated. - A clarification in Section 5, Point 2, regarding 'Point Prediction Format Ambiguity', reflecting that implicitly converts raw floats, making strict consumer-side reconciliation less critical for runtime.
Addressed linting errors in and . - : Replaced / with / for boolean comparisons. - : Removed unused variable assignments for , , , , , , and . These changes ensure adherence to linting standards within the test suite.
Removed from as it was an unused import, identified by the ruff linter.
Applied automated changes to files outside the directory after confirming all tests pass. - : Removed unused import and fixed f-string formatting. - : Removed unused and typing imports. - : Removed unused , , and typing imports. These minor changes ensure code quality and adherence to linting standards throughout the project.
This commit introduces comprehensive documentation and a rigorous test suite to clarify and verify the core concepts of the views-evaluation library. - Adds `documentation/evaluation_concepts.md` to clearly explain the differences between partitions, sets, and the three evaluation schemas (time-series-wise, step-wise, and month-wise). - Adds `documentation/integration_guide.md`, a step-by-step guide for developers on how to format their data and integrate a new model with the library. - Adds `tests/test_evaluation_schemas.py`, a permanent and rigorous test suite that programmatically verifies the grouping logic of the three evaluation schemas against the documentation. - Fixes test pollution issues discovered during development by isolating mocks within the new test suite, ensuring the stability of the entire test run.
Adds a prominent note to ADR-001 to clarify that several documented metrics are not yet implemented in the code. This makes the discrepancy clear to developers and aligns the documentation with the current state of the project.
…n tests - Hardens `EvaluationManager.validate_predictions` to strictly enforce the "exactly one column" contract, preventing crashes from duplicate or extra columns. - Adds `tests/test_data_contract.py` to verify the single-target and single-column requirements. - Updates `documentation/integration_guide.md` with a "Common Pitfalls" section to clarify MultiIndex and column usage. - Updates `reports/technical_debt_backlog.md` to reflect resolved validation issues. - Includes recent verification reports and drafts.
… evaluation proposal
…ntation-verification-suite
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… (v0.4.0)
- EvaluationManager now dispatches on {regression,classification} x {point,uncertainty}
- Task type declared explicitly in config; prediction type detected from data shape
- Config schema: regression_targets, regression_point_metrics,
regression_uncertainty_metrics, classification_targets,
classification_point_metrics, classification_uncertainty_metrics
- Legacy config keys (targets, metrics) accepted with loud deprecation warning
- _normalise_config() and _validate_config() enforce fail-loud-fail-fast contract
- calculate_ap() no longer applies internal threshold; expects pre-binarised actuals
- AP moved to CLASSIFICATION_POINT_METRIC_FUNCTIONS only
- CRPS moved to uncertainty dicts only (regression and classification)
- Four new metric dataclasses mirror the four dispatch dicts
- transform_data() crash on unknown prefix replaced with logger.warning + identity
- EvaluationManager.__init__ no longer accepts metrics_list (breaking change)
- 70 tests passing, ruff clean
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Renames all 'uncertainty' metrics and classes to 'sample' (e.g., regression_sample_metrics) - Updates EvaluationManager to use the new terminology while maintaining legacy aliases - Adds a prominent Migration Notice and Configuration Schema to README.md - Updates all tests and documentation to align with the new ontology - Adds MIT License
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Strategic Terminology & Configuration Ontology Update
This PR implements a strategic shift in the library’s terminology and configuration ontology.
The primary goal is to move from the conceptually broad term “uncertainty” to the technically precise term “sample”, reflecting that the
views-evaluationengine evaluates draws/samples from predictive distributions.Key Changes
1. Terminology Migration (Ontology)
Renamed all
uncertainty-related classes and dictionaries tosampleExample:
RegressionUncertaintyEvaluationMetrics→RegressionSampleEvaluationMetricsUpdated internal logic to explicitly distinguish between:
Backward Compatibility
2. Standardized Configuration Schema
The
evaluate()method now follows a strictly typed configuration ontology.A normalization layer has been added to
EvaluationManagerthat:DeprecationWarningwhen legacy keys are usedConfiguration Key Migration
targetsregression_targetsmetricsregression_point_metricsregression_uncertainty_metricsregression_sample_metricsclassification_uncertainty_metricsclassification_sample_metrics3. Documentation & Transparency
README Migration Notice
Configuration Schema Table
Integration Guide
Fully updated:
documentation/integration_guide.mdReflects the new API and terminology
4. Legal
Verification Results
Test Suite
pytest)Legacy Support
targets,metrics)DeprecationWarningDocumentation
Impact on Downstream Repositories
Existing implementations in:
views-pipeline-coreviews-modelswill continue to function without breakage but will emit
DeprecationWarninglogs.It is recommended to migrate evaluation configurations to the new canonical keys during the next maintenance cycle.