Merged

Commits
20 commits
7e4bb0f
feat: Add contract tests and update documentation
Polichinel Jan 23, 2026
3e478ea
docs: Add detailed Phase 4 testing plan
Polichinel Jan 26, 2026
44e6f85
feat: Implement Phases 2 & 3 testing, document tech debt
Polichinel Jan 26, 2026
ce8cc45
docs: Update eval_lib_imp and r2darts2 reports with new findings
Polichinel Jan 26, 2026
4b29e81
Fix: Linting issues in test files
Polichinel Jan 26, 2026
50bfe9c
Fix: Remove unused import in tests/test_metric_calculators.py
Polichinel Jan 26, 2026
14ddcaf
Fix: Apply ruff linting fixes outside of tests
Polichinel Jan 26, 2026
ce3ea98
feat(docs, tests): Add evaluation guides and schema verification tests
Polichinel Jan 27, 2026
0409129
docs(ADR-001): Mark unimplemented metrics
Polichinel Jan 27, 2026
514ae26
feat(validation): Harden prediction data contract and add verificatio…
Polichinel Jan 28, 2026
8dc478b
docs(reports): Add post-mortem on multi-target investigation
Polichinel Jan 28, 2026
c7e9697
small patch to allow for Hydranet to pass pred_taget with surffix _pr…
Polichinel Feb 4, 2026
8632fe9
refactor(evaluation): remove hydranet patches and add manifest-driven…
Polichinel Feb 10, 2026
98ef932
Merge remote-tracking branch 'origin/development' into feature/docume…
Polichinel Feb 22, 2026
5967466
fix(linting): remove unused variable assignment flagged by ruff (F841)
Polichinel Feb 22, 2026
19266b9
feat(evaluation): implement 2x2 config-driven evaluation architecture…
Polichinel Feb 23, 2026
9b631c0
docs(post-mortem): add evaluation ontology liberation session post-mo…
Polichinel Feb 23, 2026
4b637bc
refactor: rename uncertainty to sample and update config ontology
Polichinel Feb 24, 2026
2433433
docs: update copyright holders in LICENSE
Polichinel Feb 24, 2026
ddb0542
docs: add Håvard Hegre to copyright holders
Polichinel Feb 24, 2026
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Xiaolong Sun, Borbála Farkas, Dylan Pinheiro, Sonja Häffner, Simon Polichinel von der Maase and Håvard Hegre

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
60 changes: 57 additions & 3 deletions README.md
@@ -11,6 +11,28 @@

> **Part of the [VIEWS Platform](https://github.com/views-platform) ecosystem for large-scale conflict forecasting.**

---

### ⚠️ **ATTENTION: Migration Notice (v0.4.0+)**

The evaluation ontology has been updated to be more explicit and task-specific. If your pipeline broke after updating, please update your configuration dictionary. The library now distinguishes between **regression** vs **classification** tasks, and **point** vs **sample** predictions.

**Key Changes:**
* `targets` is now **`regression_targets`** or **`classification_targets`**.
* `metrics` is now **`regression_point_metrics`**.
* All **`uncertainty`** keys have been renamed to **`sample`** (reflecting that we evaluate draws/samples from a distribution).

| Legacy Key | New Canonical Key |
|:--- |:--- |
| `targets` | `regression_targets` |
| `metrics` | `regression_point_metrics` |
| `regression_uncertainty_metrics` | `regression_sample_metrics` |
| `classification_uncertainty_metrics` | `classification_sample_metrics` |

*Note: Legacy keys still work but will trigger a `DeprecationWarning`.*
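The renames in the table above can be applied mechanically. Below is a hypothetical helper sketching that migration; the names `LEGACY_TO_CANONICAL` and `migrate_config` are illustrative, not part of the library (legacy keys still work and merely trigger a `DeprecationWarning`).

```python
# Hypothetical migration helper for the v0.4.0 key renames.
# The mapping mirrors the table above; everything else passes through.
LEGACY_TO_CANONICAL = {
    "targets": "regression_targets",
    "metrics": "regression_point_metrics",
    "regression_uncertainty_metrics": "regression_sample_metrics",
    "classification_uncertainty_metrics": "classification_sample_metrics",
}

def migrate_config(config: dict) -> dict:
    """Return a copy of config with legacy keys renamed to canonical ones."""
    return {LEGACY_TO_CANONICAL.get(key, key): value for key, value in config.items()}

legacy = {"targets": ["ged_sb_best"], "metrics": ["RMSLE"], "steps": [1, 3]}
print(migrate_config(legacy))
# → {'regression_targets': ['ged_sb_best'], 'regression_point_metrics': ['RMSLE'], 'steps': [1, 3]}
```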

---

## 📚 **Table of Contents**

1. [Overview](#overview)
@@ -50,7 +72,7 @@ VIEWS Evaluation ensures **forecasting accuracy and model robustness** as the **
---

## ✨ **Features**
* **Comprehensive Evaluation Framework**: The `EvaluationManager` class provides structured methods to evaluate time series predictions based on **point** and **sample** metrics.
* **Multiple Evaluation Schemas**:
* **Step-wise evaluation**: groups and evaluates predictions by the respective steps from all models.
* **Time-series-wise evaluation**: evaluates predictions for each time-series.
@@ -79,8 +101,40 @@ VIEWS Evaluation ensures **forecasting accuracy and model robustness** as the **
| Brier Score | `Brier` | Accuracy of probabilistic predictions | ❌ | ✅ |
| Jeffreys Divergence | `Jeffreys` | Symmetric measure of distribution difference | ❌ | ✅ |

> **Note:** Metrics marked with ✅ in "Supports Distributions" can be used for sample evaluation with ensemble/sample-based predictions.

---

### 📝 **Configuration Schema**

The `EvaluationManager.evaluate()` method expects a configuration dictionary with the following keys:

| Key | Type | Description |
|:--- |:--- |:--- |
| `steps` | `List[int]` | List of forecast steps to evaluate (e.g., `[1, 3, 6, 12]`). |
| `regression_targets` | `List[str]` | List of continuous targets (e.g., `['ged_sb_best']`). |
| `regression_point_metrics` | `List[str]` | Metrics to compute for regression point predictions. |
| `regression_sample_metrics` | `List[str]` | Metrics to compute for regression sample predictions (e.g., `['CRPS']`). |
| `classification_targets` | `List[str]` | List of binary targets (e.g., `['by_sb_best']`). |
| `classification_point_metrics` | `List[str]` | Metrics to compute for classification probability scores. |
| `classification_sample_metrics` | `List[str]` | Metrics to compute for classification sample predictions. |

#### **Example Configuration:**

```python
config = {
"steps": [1, 3, 6, 12],
"regression_targets": ["lr_ged_sb_best"],
"regression_point_metrics": ["MSE", "RMSLE", "Pearson"],
"regression_sample_metrics": ["CRPS", "MIS", "Coverage"],
"classification_targets": ["by_ged_sb_best"],
"classification_point_metrics": ["AP"],
}
```

---

* **Data Integrity Checks**: Ensures that input DataFrames conform to the expected structures for point and sample evaluation before metrics are computed.
* **Automatic Index Matching**: Aligns actual and predicted values based on MultiIndex structures.
* **Planned Enhancements**:
* **Expanding metric calculations** beyond RMSLE, CRPS, and AP.
6 changes: 5 additions & 1 deletion documentation/ADRs/001_evaluation_metrics.md
@@ -14,6 +14,10 @@ In the context of the VIEWS pipeline, it is necessary to evaluate the models usi


## Decision
> **Note:** This ADR reflects the architectural goal. As of Jan 2026, several metrics are defined in the ADR but not yet implemented in the code.
> - **Not Implemented:** `Sinkhorn Distance (SD)`, `pEMDiv`, `Variogram`, `Brier Score`, `Jeffreys Divergence`.
> This discrepancy should be resolved in a future development cycle.

Below are the evaluation metrics that will be used to assess the performance of models in the VIEWS pipeline:

| Metric | Abbreviation | Task | Notes |
@@ -45,7 +49,7 @@ The selected metrics are designed to address the unique characteristics of confl
Relying solely on traditional error metrics such as MSE (MSLE) can result in poor performance on relevant tasks like identifying onsets of conflict.

Using a mix of probabilistic and point-based metrics will allow us to:
- Better capture the range of possible outcomes and assess predictive uncertainty using samples drawn from the forecast distribution.
- Focus evaluation on onsets of conflict, which are often the most critical and hardest to predict.
- Ensure consistency and calibration across different spatial and temporal resolutions, from grid-level to country-level predictions.

2 changes: 1 addition & 1 deletion documentation/ADRs/002_evaluation_strategy.md
@@ -95,7 +95,7 @@ For further technical details:

- The number of sequences (k) can be tuned depending on evaluation budget or forecast range.

- Consider future support for probabilistic or sample-based forecasts in the same rolling evaluation framework.



2 changes: 1 addition & 1 deletion documentation/ADRs/004_evaluation_input_schema.md
@@ -26,7 +26,7 @@ Both the actual and prediction DataFrames must use a multi-index of `(month_id,

The number of prediction DataFrames is flexible. However, the standard practice is to evaluate **12 sequences**. When more than two predictions are provided, the evaluation will behave similarly to a **rolling origin evaluation** with a **fixed holdout size of 1**. For further reference, see the [ADR 002](https://github.com/views-platform/views-evaluation/blob/main/documentation/ADRs/002_evaluation_strategy.md) on rolling origin methodology.

The class automatically determines the evaluation type (point or sample) and aligns `month_id` values between the actuals and each prediction. By default, the evaluation is performed **month-wise**, **step-wise**, **time-series-wise** (more information in [ADR 003](https://github.com/views-platform/views-evaluation/blob/main/documentation/ADRs/003_metric_calculation.md))
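As a minimal sketch of the expected input shape, the snippet below builds actuals and predictions on a shared `(month_id, entity)` MultiIndex; the second index level name (`priogrid_gid`), the column names, and the values are all illustrative, and the join stands in for the alignment the class performs automatically.

```python
import pandas as pd

# Two months x two hypothetical grid cells, indexed (month_id, priogrid_gid).
index = pd.MultiIndex.from_product(
    [[121, 122], [1001, 1002]], names=["month_id", "priogrid_gid"]
)
actuals = pd.DataFrame({"ged_sb_best": [0, 5, 1, 0]}, index=index)
predictions = pd.DataFrame({"pred_ged_sb_best": [0.2, 4.1, 0.9, 0.3]}, index=index)

# Index alignment between actuals and predictions is then a trivial join.
aligned = actuals.join(predictions)
assert list(aligned.index.names) == ["month_id", "priogrid_gid"]
assert len(aligned) == 4
```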


## Consequences
66 changes: 66 additions & 0 deletions documentation/evaluation_concepts.md
@@ -0,0 +1,66 @@
# Core Concepts in VIEWS Evaluation

This document explains the core concepts behind the `views-evaluation` framework, clarifying how data is organized and how model performance is measured.

## 1. Data Organization: Partitions and Sets

The framework uses a two-level data separation strategy to ensure robust and realistic model assessment.

### Level 1: Partitions (The "When")

Partitions are large, distinct, non-overlapping blocks of historical time. They separate the model lifecycle into distinct stages.

- **Calibration Partition:** The oldest block of data, used for initial research and development, feature engineering, and experimental training.
- **Validation Partition:** A more recent block of "clean" historical data the model has not seen during development. It is used for the final, fair, out-of-sample benchmarking of a finalized model. This is where performance metrics for academic papers are generated.
- **Forecasting Partition:** The most recent data, used to generate live, operational forecasts. It has no ground-truth outcomes to test against yet.

**Analogy:** Think of Partitions as different books in a history series (e.g., *Vol. 1: The Early Years*, *Vol. 2: The Middle Era*).

### Level 2: Sets (The "How")

Within the Calibration and Validation partitions, data is further divided into `train` and `test` sets.

- **Train Set:** The portion of a partition's data used to train a model.
- **Test Set:** The remaining portion of that partition's data used to evaluate the model's performance.

**Analogy:** Within each book (Partition), you use some chapters to study (the `train set`) and the remaining chapters for a quiz (the `test set`).

---

## 2. The Predictive Parallelogram

The standard offline evaluation process uses a rolling-origin strategy. A model is trained and used to predict a 36-month sequence. The training window is then rolled forward one month, and the process repeats. When stacked, these 12 overlapping forecast sequences form a **predictive parallelogram**.

This parallelogram is the fundamental data structure that is analyzed by the three evaluation schemas.
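The rolling-origin construction above can be sketched in a few lines. The starting `month_id` is arbitrary here; the point is that 12 origins, each one month apart and each covering 36 months, overlap by 35 months and stack into the parallelogram.

```python
# Illustrative sketch of the predictive parallelogram: 12 rolling-origin
# forecast sequences of 36 consecutive months each.
SEQUENCES = 12
HORIZON = 36
first_origin = 100  # hypothetical month_id of the first forecast origin

parallelogram = {
    run: list(range(first_origin + run, first_origin + run + HORIZON))
    for run in range(SEQUENCES)
}

assert len(parallelogram) == 12
assert all(len(months) == 36 for months in parallelogram.values())

# Consecutive runs share all but one month; stacking them forms the parallelogram.
overlap = set(parallelogram[0]) & set(parallelogram[1])
print(len(overlap))  # → 35
```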

## 3. The Three Evaluation Schemas

The `EvaluationManager` assesses the predictive parallelogram by "slicing" it in three different ways. Each schema groups the data differently to answer a unique question about model performance.

### Schema 1: Time-series-wise Evaluation

- **Grouping Method:** Groups predictions by **forecast run**. Each of the 12 forecast sequences is evaluated as a single, complete unit. This is a "vertical slice" of the parallelogram.
- **Question Answered:** "How good was the model's entire 36-month forecast, on average, when it was issued from a specific start time?"
- **Analogy:** Getting a single, overall grade for an entire essay.

### Schema 2: Step-wise Evaluation

- **Grouping Method:** Groups predictions by **forecast horizon** (or lead time). All "1-month-ahead" predictions are grouped, all "2-months-ahead" are grouped, and so on. This corresponds to the "diagonals" of the parallelogram.
- **Question Answered:** "How does the model's accuracy change as it predicts further into the future?" This is the most critical evaluation schema in the VIEWS framework.
- **Analogy:** Grading the quality of all the *introduction paragraphs* from a batch of essays, then all the *body paragraphs*, then all the *conclusions* separately.

### Schema 3: Month-wise Evaluation

- **Grouping Method:** Groups all predictions that target the **same calendar month**, regardless of when the forecast was issued. This is a "horizontal slice" of the parallelogram.
- **Question Answered:** "How well did the system predict the events of March 2022, using all forecasts that targeted that specific month?"
- **Analogy:** Grading every student's answer to "Question #5" on a test.
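The three schemas are just three ways of grouping the same predictions. The toy sketch below makes that concrete with a shrunken parallelogram (3 runs, 3 steps); the record fields and the `group_by` helper are illustrative, not the library's API.

```python
from collections import defaultdict

# Each record is one prediction: which run issued it, its step (horizon),
# and the calendar month it targets.
predictions = [
    {"run": run, "step": step, "target_month": 100 + run + step}
    for run in range(3)        # 3 forecast runs for brevity
    for step in range(1, 4)    # 3 steps per run
]

def group_by(records, key):
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return dict(groups)

time_series_wise = group_by(predictions, "run")           # vertical slices
step_wise = group_by(predictions, "step")                 # diagonals
month_wise = group_by(predictions, "target_month")        # horizontal slices

assert len(time_series_wise) == 3
assert len(step_wise) == 3
# Month 103 is targeted by run 0 at step 3, run 1 at step 2, and run 2 at step 1:
assert len(month_wise[103]) == 3
```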

---

### Summary Table

| Evaluation Schema | Groups Predictions By... | Question It Answers | Analogy |
| ------------------- | ------------------------ | -------------------------------------------------------- | ----------------------------------------- |
| **Time-series-wise**| Forecast Run | "How good was an entire 36-month forecast?" | Grading a whole essay. |
| **Step-wise** | Forecast Horizon (Step) | "How good is the model at predicting 6 months out?" | Grading all introductions separately. |
| **Month-wise** | Target Calendar Month | "How well did we predict the events of a specific month?" | Grading all answers to one test question. |