
Feature/documentation verification suite#14

Merged
Polichinel merged 20 commits into development from
feature/documentation-verification-suite
Feb 24, 2026


Strategic Terminology & Configuration Ontology Update

This PR implements a strategic shift in the library’s terminology and configuration ontology.

The primary goal is to move from the conceptually broad term “uncertainty” to the technically precise term “sample”, reflecting that the views-evaluation engine evaluates draws/samples from predictive distributions.


Key Changes

1. Terminology Migration (Ontology)

  • Renamed all uncertainty-related classes and dictionaries to sample

    • Example:

      • RegressionUncertaintyEvaluationMetrics → RegressionSampleEvaluationMetrics
  • Updated internal logic to explicitly distinguish between:

    • Regression vs Classification tasks
    • Point vs Sample prediction types
  • Backward Compatibility

    • Maintained legacy aliases
    • Implemented deprecated fallback logic for all renamed objects
    • Ensures zero immediate breakage for downstream repositories
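One common way to keep a legacy alias alive while steering users to the new name is a thin subclass that warns on construction. The sketch below is illustrative only: the `crps` field is a placeholder, and the library's actual alias mechanism may differ.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegressionSampleEvaluationMetrics:
    """Canonical name after the rename (the field below is a placeholder)."""

    crps: Optional[float] = None


class RegressionUncertaintyEvaluationMetrics(RegressionSampleEvaluationMetrics):
    """Legacy alias: behaves like the new class but warns on construction."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "RegressionUncertaintyEvaluationMetrics is deprecated; "
            "use RegressionSampleEvaluationMetrics instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this class
        )
        super().__init__(*args, **kwargs)
```

Because the alias subclasses the new class, existing `isinstance` checks against the canonical name keep working for objects built through the legacy name.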

2. Standardized Configuration Schema

The evaluate() method now follows a strictly typed configuration ontology.

A normalization layer has been added to EvaluationManager that:

  • Translates legacy keys → canonical keys
  • Emits a DeprecationWarning when legacy keys are used

Configuration Key Migration

| Legacy Key | New Canonical Key |
| --- | --- |
| targets | regression_targets |
| metrics | regression_point_metrics |
| regression_uncertainty_metrics | regression_sample_metrics |
| classification_uncertainty_metrics | classification_sample_metrics |
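A normalization layer of this kind could look roughly like the following sketch. The function name `normalise_config` and the choice to raise when a legacy key and its canonical key are both present are assumptions made for illustration, not the library's confirmed behavior; only the key mapping is taken from the migration table.

```python
import warnings

# Legacy -> canonical key mapping, copied from the migration table.
LEGACY_TO_CANONICAL = {
    "targets": "regression_targets",
    "metrics": "regression_point_metrics",
    "regression_uncertainty_metrics": "regression_sample_metrics",
    "classification_uncertainty_metrics": "classification_sample_metrics",
}


def normalise_config(config: dict) -> dict:
    """Return a copy of `config` with legacy keys renamed to canonical ones."""
    normalised = {}
    for key, value in config.items():
        canonical = LEGACY_TO_CANONICAL.get(key, key)
        if canonical != key:
            warnings.warn(
                f"Config key '{key}' is deprecated; use '{canonical}'.",
                DeprecationWarning,
                stacklevel=2,
            )
        # Assumption: a legacy key and its canonical twin together are an error.
        if canonical in normalised:
            raise ValueError(f"Config sets '{canonical}' more than once.")
        normalised[canonical] = value
    return normalised
```

Keys that are not in the mapping pass through unchanged, so a config that already uses canonical keys is returned as an equal copy with no warning.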

3. Documentation & Transparency

  • README Migration Notice

    • Added a prominent “ATTENTION” section at the top of the README
    • Guides developers through the transition
  • Configuration Schema Table

    • Added a clear schema specification to the README
    • Enables downstream users to validate their config dictionaries
  • Integration Guide

    • Fully updated:

      • documentation/integration_guide.md
      • Relevant ADRs
    • Reflects the new API and terminology


4. Legal

  • Added a standard MIT License

Verification Results

  • Test Suite

    • ✅ 70 / 70 tests passing (pytest)
  • Legacy Support

    • Manually verified that legacy-style configs (e.g., targets, metrics) produce correct results and emit the intended DeprecationWarning
  • Documentation

    • Verified that all documentation examples and documentation links are functional

Impact on Downstream Repositories

Existing implementations in:

  • views-pipeline-core
  • views-models

will continue to function without breakage but will emit a DeprecationWarning whenever legacy keys are used.

It is recommended to migrate evaluation configurations to the new canonical keys during the next maintenance cycle.
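During migration, a downstream repository may want its own test run to fail on any remaining legacy usage. A minimal sketch using only the standard `warnings` module follows; `legacy_call` is a hypothetical stand-in for any call that still passes legacy keys.

```python
import warnings


def legacy_call():
    """Hypothetical stand-in for a call that still uses legacy config keys."""
    warnings.warn("Config key 'targets' is deprecated.", DeprecationWarning, stacklevel=2)
    return "ok"


def fails_on_deprecation(func) -> bool:
    """Return True if calling `func` triggers a DeprecationWarning."""
    with warnings.catch_warnings():
        # Escalate DeprecationWarning to an exception inside this block only.
        warnings.simplefilter("error", DeprecationWarning)
        try:
            func()
        except DeprecationWarning:
            return True
    return False
```

With pytest, a similar effect is available for a whole run through the `-W error::DeprecationWarning` command-line option.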

Polichinel and others added 20 commits January 23, 2026 11:00
Adds a new test suite in tests/test_documentation_contracts.py to
verify the contracts and claims made in the project's documentation.
These tests treat the documentation as hypotheses and verify them against
the actual behavior of the EvaluationManager.

Key findings from the tests:
- The EvaluationManager implicitly converts raw float point predictions
  to single-element numpy arrays, which contradicts the documentation's
  claim that this would cause an error.

The documentation has been updated to reflect this behavior:
- eval_lib_imp.md is updated to clarify the implicit conversion and
  change the 'Mandatory Reconciliation Step' to 'Recommended'.
- stepshifter_full_imp_report.md is updated with a final conclusion
  clarifying the EvaluationManager's actual behavior.

Also organizes the analysis reports into a new reports/ directory.
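The implicit conversion described in this commit can be pictured with the sketch below. It is not the EvaluationManager's actual code; `as_prediction_array` is a hypothetical helper that reproduces the observable behavior (a raw float becomes a single-element NumPy array).

```python
import numpy as np


def as_prediction_array(value):
    """Coerce a scalar or sequence prediction to a 1-D float array."""
    # np.atleast_1d turns a 0-d array (from a bare float) into shape (1,).
    return np.atleast_1d(np.asarray(value, dtype=float))
```

A bare float such as `3.2` yields an array of shape `(1,)`, while a list of three draws keeps shape `(3,)`, so both formats can flow through the same downstream code.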
Adds a new document outlining the comprehensive plan for Phase 4:
Non-Functional & Operational Readiness testing. This includes detailed
sections on Performance & Scalability Benchmarking, Logging and
Observability Verification, Memory Profiling, and Concurrency/Parallelism
Safety as a future consideration. This plan aims to ensure the library's
suitability for critical infrastructure environments.

Adds a robust test suite covering adversarial inputs and metric correctness,
and generates a technical debt backlog document.

Phase 2 (Adversarial & Edge-Case Testing) findings:
- The  is not robust to non-finite numbers (, ),
  crashing with  from 's validation.
- It crashes with  on empty  lists (from ).
- It crashes with  on empty  DataFrames.
- It crashes with  on non-overlapping indices (from ).
- This highlights a lack of internal input validation and graceful error handling.

Phase 3 (Data-Centric & Metric-Specific Validation) findings:
- Verified numerical correctness of  with golden datasets.
- Confirmed  metric correctly uses  kwarg.
- Verified  for both point and uncertainty predictions against .

A  document has been created, detailing these
fragilities and recommending future improvements for robustness.
Moved  fixture to  for shared access.

Updates the  (VIEWS Evaluation Technical Integration Guide)
to incorporate critical findings from adversarial testing (Phase 2),
providing a clearer picture of the library's behavior and limitations.
This includes:
- A new section (3.5) detailing 'Robustness Limitations & Input Validation Responsibility',
  highlighting the library's fragility to non-finite numbers and malformed
  structural data, and emphasizing consumer responsibility for pre-validation.
- Enhanced Section 3.4 on 'Data-State Coherency' to clarify that the
   applies transformations without validating mathematical
  appropriateness.
- A cross-reference to  for a comprehensive
  list of known issues.
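Since pre-validation is left to the consumer, a caller might run a check like the following before handing values to the library. This is a standard-library sketch under that assumption; the helper names are hypothetical and not part of views-evaluation.

```python
import math
from typing import Iterable, List


def find_non_finite(values: Iterable[float]) -> List[int]:
    """Return the positions of NaN or +/-inf entries in `values`."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


def require_finite(values: Iterable[float], name: str = "predictions") -> None:
    """Raise ValueError if `values` is empty or contains non-finite entries."""
    values = list(values)
    if not values:
        raise ValueError(f"{name} is empty")
    bad = find_non_finite(values)
    if bad:
        raise ValueError(f"{name} has non-finite values at positions {bad}")
```

Failing early with a message that names the offending positions is usually easier to debug than an exception raised deep inside a metric function.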

Updates the  (Forensic Analysis of
views-r2darts2 Evaluation Interface) with minor contextual notes:
- A clarification in Section 4 acknowledging that  has since
  been updated.
- A clarification in Section 5, Point 2, regarding 'Point Prediction Format Ambiguity',
  reflecting that  implicitly converts raw floats, making
  strict consumer-side reconciliation less critical for runtime.

Addressed linting errors in  and .
- : Replaced / with / for boolean comparisons.
- : Removed unused variable assignments for , , , , , , and .

These changes ensure adherence to linting standards within the test suite.

Removed  from  as it was an unused import,
identified by the ruff linter.

Applied automated  changes to files outside the  directory after confirming all tests pass.

- : Removed unused  import and fixed f-string formatting.
- : Removed unused  and  typing imports.
- : Removed unused , , and  typing imports.

These minor changes ensure code quality and adherence to linting standards throughout the project.

This commit introduces comprehensive documentation and a rigorous test suite to clarify and verify the core concepts of the views-evaluation library.

- Adds `documentation/evaluation_concepts.md` to clearly explain the differences between partitions, sets, and the three evaluation schemas (time-series-wise, step-wise, and month-wise).
- Adds `documentation/integration_guide.md`, a step-by-step guide for developers on how to format their data and integrate a new model with the library.
- Adds `tests/test_evaluation_schemas.py`, a permanent and rigorous test suite that programmatically verifies the grouping logic of the three evaluation schemas against the documentation.
- Fixes test pollution issues discovered during development by isolating mocks within the new test suite, ensuring the stability of the entire test run.

Adds a prominent note to ADR-001 to clarify that several documented metrics are not yet implemented in the code. This makes the discrepancy clear to developers and aligns the documentation with the current state of the project.

…n tests

- Hardens `EvaluationManager.validate_predictions` to strictly enforce the "exactly one column" contract, preventing crashes from duplicate or extra columns.
- Adds `tests/test_data_contract.py` to verify the single-target and single-column requirements.
- Updates `documentation/integration_guide.md` with a "Common Pitfalls" section to clarify MultiIndex and column usage.
- Updates `reports/technical_debt_backlog.md` to reflect resolved validation issues.
- Includes recent verification reports and drafts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
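The "exactly one column" contract can be illustrated with a small check over column labels. The sketch below is not the body of `EvaluationManager.validate_predictions`; `check_single_column` is a hypothetical helper that a caller could feed with, for example, `list(df.columns)`.

```python
from typing import Hashable, Sequence


def check_single_column(columns: Sequence[Hashable]) -> Hashable:
    """Return the only column label, or raise if there is not exactly one."""
    if len(columns) != 1:
        raise ValueError(
            f"Expected exactly one prediction column, got {len(columns)}: {list(columns)}"
        )
    return columns[0]
```

Counting labels rather than distinct names means that duplicate columns are rejected as well as extra ones.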
… (v0.4.0)

- EvaluationManager now dispatches on {regression,classification} x {point,uncertainty}
- Task type declared explicitly in config; prediction type detected from data shape
- Config schema: regression_targets, regression_point_metrics,
  regression_uncertainty_metrics, classification_targets,
  classification_point_metrics, classification_uncertainty_metrics
- Legacy config keys (targets, metrics) accepted with loud deprecation warning
- _normalise_config() and _validate_config() enforce fail-loud-fail-fast contract
- calculate_ap() no longer applies internal threshold; expects pre-binarised actuals
- AP moved to CLASSIFICATION_POINT_METRIC_FUNCTIONS only
- CRPS moved to uncertainty dicts only (regression and classification)
- Four new metric dataclasses mirror the four dispatch dicts
- transform_data() crash on unknown prefix replaced with logger.warning + identity
- EvaluationManager.__init__ no longer accepts metrics_list (breaking change)
- 70 tests passing, ruff clean

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
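The two-axis dispatch (task type declared in the config, prediction type inferred from the data) might be sketched as follows. The shape rule used here, one value per prediction means "point" and more than one means "sample", is an assumption for illustration, and the key names follow the canonical "sample" terminology adopted later in this PR.

```python
from typing import Sequence


def detect_prediction_type(predictions: Sequence[Sequence[float]]) -> str:
    """Classify predictions as 'point' (one value each) or 'sample' (several)."""
    if not predictions:
        raise ValueError("predictions is empty")
    lengths = {len(p) for p in predictions}
    if lengths == {1}:
        return "point"
    if min(lengths) > 1:
        return "sample"
    # Reached when some entries are empty or point and sample rows are mixed.
    raise ValueError(f"Mixed or empty prediction lengths: {sorted(lengths)}")


def metrics_key(task: str, predictions: Sequence[Sequence[float]]) -> str:
    """Build the config key for the task x prediction-type combination."""
    if task not in ("regression", "classification"):
        raise ValueError(f"Unknown task type: {task!r}")
    return f"{task}_{detect_prediction_type(predictions)}_metrics"
```

Rejecting mixed shapes outright, rather than guessing, is in keeping with the fail-loud-fail-fast contract mentioned in the commit message.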
- Renames all 'uncertainty' metrics and classes to 'sample' (e.g., regression_sample_metrics)
- Updates EvaluationManager to use the new terminology while maintaining legacy aliases
- Adds a prominent Migration Notice and Configuration Schema to README.md
- Updates all tests and documentation to align with the new ontology
- Adds MIT License
@Polichinel Polichinel merged commit fcbe9e4 into development Feb 24, 2026
4 checks passed
@Polichinel Polichinel deleted the feature/documentation-verification-suite branch February 24, 2026 14:28