Merged

Commits
20 commits
7e4bb0f
feat: Add contract tests and update documentation
Polichinel Jan 23, 2026
3e478ea
docs: Add detailed Phase 4 testing plan
Polichinel Jan 26, 2026
44e6f85
feat: Implement Phases 2 & 3 testing, document tech debt
Polichinel Jan 26, 2026
ce8cc45
docs: Update eval_lib_imp and r2darts2 reports with new findings
Polichinel Jan 26, 2026
4b29e81
Fix: Linting issues in test files
Polichinel Jan 26, 2026
50bfe9c
Fix: Remove unused import in tests/test_metric_calculators.py
Polichinel Jan 26, 2026
14ddcaf
Fix: Apply ruff linting fixes outside of tests
Polichinel Jan 26, 2026
ce3ea98
feat(docs, tests): Add evaluation guides and schema verification tests
Polichinel Jan 27, 2026
0409129
docs(ADR-001): Mark unimplemented metrics
Polichinel Jan 27, 2026
514ae26
feat(validation): Harden prediction data contract and add verificatio…
Polichinel Jan 28, 2026
8dc478b
docs(reports): Add post-mortem on multi-target investigation
Polichinel Jan 28, 2026
c7e9697
small patch to allow for Hydranet to pass pred_taget with surffix _pr…
Polichinel Feb 4, 2026
8632fe9
refactor(evaluation): remove hydranet patches and add manifest-driven…
Polichinel Feb 10, 2026
98ef932
Merge remote-tracking branch 'origin/development' into feature/docume…
Polichinel Feb 22, 2026
5967466
fix(linting): remove unused variable assignment flagged by ruff (F841)
Polichinel Feb 22, 2026
19266b9
feat(evaluation): implement 2x2 config-driven evaluation architecture…
Polichinel Feb 23, 2026
9b631c0
docs(post-mortem): add evaluation ontology liberation session post-mo…
Polichinel Feb 23, 2026
4b637bc
refactor: rename uncertainty to sample and update config ontology
Polichinel Feb 24, 2026
2433433
docs: update copyright holders in LICENSE
Polichinel Feb 24, 2026
ddb0542
docs: add Håvard Hegre to copyright holders
Polichinel Feb 24, 2026
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Xiaolong Sun, Borbála Farkas, Dylan Pinheiro, Sonja Häffner, Simon Polichinel von der Maase and Håvard Hegre

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
60 changes: 57 additions & 3 deletions README.md
@@ -11,6 +11,28 @@

> **Part of the [VIEWS Platform](https://github.com/views-platform) ecosystem for large-scale conflict forecasting.**

---

### ⚠️ **ATTENTION: Migration Notice (v0.4.0+)**

The evaluation ontology has been updated to be more explicit and task-specific. If your pipeline broke after updating, please update your configuration dictionary. The library now distinguishes between **regression** vs **classification** tasks, and **point** vs **sample** predictions.

**Key Changes:**
* `targets` is now **`regression_targets`** or **`classification_targets`**.
* `metrics` is now **`regression_point_metrics`**.
* All **`uncertainty`** keys have been renamed to **`sample`** (reflecting that we evaluate draws/samples from a distribution).

| Legacy Key | New Canonical Key |
|:--- |:--- |
| `targets` | `regression_targets` |
| `metrics` | `regression_point_metrics` |
| `regression_uncertainty_metrics` | `regression_sample_metrics` |
| `classification_uncertainty_metrics` | `classification_sample_metrics` |

*Note: Legacy keys still work but will trigger a `DeprecationWarning`.*
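The renames in the table above can be applied mechanically. Below is a hypothetical helper sketching that migration; the names `LEGACY_TO_CANONICAL` and `migrate_config` are illustrative, not part of the library (legacy keys still work and merely trigger a `DeprecationWarning`).

```python
# Hypothetical migration helper for the v0.4.0 key renames.
# The mapping mirrors the table above; everything else passes through.
LEGACY_TO_CANONICAL = {
    "targets": "regression_targets",
    "metrics": "regression_point_metrics",
    "regression_uncertainty_metrics": "regression_sample_metrics",
    "classification_uncertainty_metrics": "classification_sample_metrics",
}

def migrate_config(config: dict) -> dict:
    """Return a copy of config with legacy keys renamed to canonical ones."""
    return {LEGACY_TO_CANONICAL.get(key, key): value for key, value in config.items()}

legacy = {"targets": ["ged_sb_best"], "metrics": ["RMSLE"], "steps": [1, 3]}
print(migrate_config(legacy))
# → {'regression_targets': ['ged_sb_best'], 'regression_point_metrics': ['RMSLE'], 'steps': [1, 3]}
```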

---

## 📚 **Table of Contents**

1. [Overview](#overview)
@@ -50,7 +72,7 @@ VIEWS Evaluation ensures **forecasting accuracy and model robustness** as the **
---

## ✨ **Features**
* **Comprehensive Evaluation Framework**: The `EvaluationManager` class provides structured methods to evaluate time series predictions based on **point** and **sample** metrics.
* **Multiple Evaluation Schemas**:
* **Step-wise evaluation**: groups and evaluates predictions by the respective steps from all models.
* **Time-series-wise evaluation**: evaluates predictions for each time-series.
@@ -79,8 +101,40 @@ VIEWS Evaluation ensures **forecasting accuracy and model robustness** as the **
| Brier Score | `Brier` | Accuracy of probabilistic predictions | ❌ | ✅ |
| Jeffreys Divergence | `Jeffreys` | Symmetric measure of distribution difference | ❌ | ✅ |

> **Note:** Metrics marked with ✅ in "Supports Distributions" can be used for sample evaluation with ensemble/sample-based predictions.

---

### 📝 **Configuration Schema**

The `EvaluationManager.evaluate()` method expects a configuration dictionary with the following keys:

| Key | Type | Description |
|:--- |:--- |:--- |
| `steps` | `List[int]` | List of forecast steps to evaluate (e.g., `[1, 3, 6, 12]`). |
| `regression_targets` | `List[str]` | List of continuous targets (e.g., `['ged_sb_best']`). |
| `regression_point_metrics` | `List[str]` | Metrics to compute for regression point predictions. |
| `regression_sample_metrics` | `List[str]` | Metrics to compute for regression sample predictions (e.g., `['CRPS']`). |
| `classification_targets` | `List[str]` | List of binary targets (e.g., `['by_sb_best']`). |
| `classification_point_metrics` | `List[str]` | Metrics to compute for classification probability scores. |
| `classification_sample_metrics` | `List[str]` | Metrics to compute for classification sample predictions. |

#### **Example Configuration:**

```python
config = {
"steps": [1, 3, 6, 12],
"regression_targets": ["lr_ged_sb_best"],
"regression_point_metrics": ["MSE", "RMSLE", "Pearson"],
"regression_sample_metrics": ["CRPS", "MIS", "Coverage"],
"classification_targets": ["by_ged_sb_best"],
"classification_point_metrics": ["AP"],
}
```

---

* **Data Integrity Checks**: Ensures that input DataFrames conform to the expected structures for point and sample evaluation before metrics are computed.
* **Automatic Index Matching**: Aligns actual and predicted values based on MultiIndex structures.
* **Planned Enhancements**:
* **Expanding metric calculations** beyond RMSLE, CRPS, and AP.
6 changes: 5 additions & 1 deletion documentation/ADRs/001_evaluation_metrics.md
@@ -14,6 +14,10 @@ In the context of the VIEWS pipeline, it is necessary to evaluate the models usi


## Decision
> **Note:** This ADR reflects the architectural goal. As of Jan 2026, several metrics are defined in the ADR but not yet implemented in the code.
> - **Not Implemented:** `Sinkhorn Distance (SD)`, `pEMDiv`, `Variogram`, `Brier Score`, `Jeffreys Divergence`.
> This discrepancy should be resolved in a future development cycle.

Below are the evaluation metrics that will be used to assess the performance of models in the VIEWS pipeline:

| Metric | Abbreviation | Task | Notes |
@@ -45,7 +49,7 @@ The selected metrics are designed to address the unique characteristics of confl
Relying solely on traditional error metrics such as MSE (MSLE) can result in poor performance on relevant tasks like identifying onsets of conflict.

Using a mix of probabilistic and point-based metrics will allow us to:
- Better capture the range of possible outcomes and assess predictive uncertainty using samples drawn from the forecast distribution.
- Focus evaluation on onsets of conflict, which are often the most critical and hardest to predict.
- Ensure consistency and calibration across different spatial and temporal resolutions, from grid-level to country-level predictions.

2 changes: 1 addition & 1 deletion documentation/ADRs/002_evaluation_strategy.md
@@ -95,7 +95,7 @@ For further technical details:

- The number of sequences (k) can be tuned depending on evaluation budget or forecast range.

- Consider future support for probabilistic or sample-based forecasts in the same rolling evaluation framework.



2 changes: 1 addition & 1 deletion documentation/ADRs/004_evaluation_input_schema.md
@@ -26,7 +26,7 @@ Both the actual and prediction DataFrames must use a multi-index of `(month_id,

The number of prediction DataFrames is flexible. However, the standard practice is to evaluate **12 sequences**. When more than two predictions are provided, the evaluation will behave similarly to a **rolling origin evaluation** with a **fixed holdout size of 1**. For further reference, see the [ADR 002](https://github.com/views-platform/views-evaluation/blob/main/documentation/ADRs/002_evaluation_strategy.md) on rolling origin methodology.

The class automatically determines the evaluation type (point or sample) and aligns `month_id` values between the actuals and each prediction. By default, the evaluation is performed **month-wise**, **step-wise**, **time-series-wise** (more information in [ADR 003](https://github.com/views-platform/views-evaluation/blob/main/documentation/ADRs/003_metric_calculation.md))
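As a minimal sketch of the expected input shape, the snippet below builds actuals and predictions on a shared `(month_id, entity)` MultiIndex; the second index level name (`priogrid_gid`), the column names, and the values are all illustrative, and the join stands in for the alignment the class performs automatically.

```python
import pandas as pd

# Two months x two hypothetical grid cells, indexed (month_id, priogrid_gid).
index = pd.MultiIndex.from_product(
    [[121, 122], [1001, 1002]], names=["month_id", "priogrid_gid"]
)
actuals = pd.DataFrame({"ged_sb_best": [0, 5, 1, 0]}, index=index)
predictions = pd.DataFrame({"pred_ged_sb_best": [0.2, 4.1, 0.9, 0.3]}, index=index)

# Index alignment between actuals and predictions is then a trivial join.
aligned = actuals.join(predictions)
assert list(aligned.index.names) == ["month_id", "priogrid_gid"]
assert len(aligned) == 4
```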


## Consequences
66 changes: 66 additions & 0 deletions documentation/evaluation_concepts.md
@@ -0,0 +1,66 @@
# Core Concepts in VIEWS Evaluation

This document explains the core concepts behind the `views-evaluation` framework, clarifying how data is organized and how model performance is measured.

## 1. Data Organization: Partitions and Sets

The framework uses a two-level data separation strategy to ensure robust and realistic model assessment.

### Level 1: Partitions (The "When")

Partitions are large, distinct, non-overlapping blocks of historical time. They separate the model lifecycle into distinct stages.

- **Calibration Partition:** The oldest block of data, used for initial research and development, feature engineering, and experimental training.
- **Validation Partition:** A more recent block of "clean" historical data the model has not seen during development. It is used for the final, fair, out-of-sample benchmarking of a finalized model. This is where performance metrics for academic papers are generated.
- **Forecasting Partition:** The most recent data, used to generate live, operational forecasts. It has no ground-truth outcomes to test against yet.

**Analogy:** Think of Partitions as different books in a history series (e.g., *Vol. 1: The Early Years*, *Vol. 2: The Middle Era*).

### Level 2: Sets (The "How")

Within the Calibration and Validation partitions, data is further divided into `train` and `test` sets.

- **Train Set:** The portion of a partition's data used to train a model.
- **Test Set:** The remaining portion of that partition's data used to evaluate the model's performance.

**Analogy:** Within each book (Partition), you use some chapters to study (the `train set`) and the remaining chapters for a quiz (the `test set`).

---

## 2. The Predictive Parallelogram

The standard offline evaluation process uses a rolling-origin strategy. A model is trained and used to predict a 36-month sequence. The training window is then rolled forward one month, and the process repeats. When stacked, these 12 overlapping forecast sequences form a **predictive parallelogram**.

This parallelogram is the fundamental data structure that is analyzed by the three evaluation schemas.
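The rolling-origin construction above can be sketched in a few lines. The starting `month_id` is arbitrary here; the point is that 12 origins, each one month apart and each covering 36 months, overlap by 35 months and stack into the parallelogram.

```python
# Illustrative sketch of the predictive parallelogram: 12 rolling-origin
# forecast sequences of 36 consecutive months each.
SEQUENCES = 12
HORIZON = 36
first_origin = 100  # hypothetical month_id of the first forecast origin

parallelogram = {
    run: list(range(first_origin + run, first_origin + run + HORIZON))
    for run in range(SEQUENCES)
}

assert len(parallelogram) == 12
assert all(len(months) == 36 for months in parallelogram.values())

# Consecutive runs share all but one month; stacking them forms the parallelogram.
overlap = set(parallelogram[0]) & set(parallelogram[1])
print(len(overlap))  # → 35
```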

## 3. The Three Evaluation Schemas

The `EvaluationManager` assesses the predictive parallelogram by "slicing" it in three different ways. Each schema groups the data differently to answer a unique question about model performance.

### Schema 1: Time-series-wise Evaluation

- **Grouping Method:** Groups predictions by **forecast run**. Each of the 12 forecast sequences is evaluated as a single, complete unit. This is a "vertical slice" of the parallelogram.
- **Question Answered:** "How good was the model's entire 36-month forecast, on average, when it was issued from a specific start time?"
- **Analogy:** Getting a single, overall grade for an entire essay.

### Schema 2: Step-wise Evaluation

- **Grouping Method:** Groups predictions by **forecast horizon** (or lead time). All "1-month-ahead" predictions are grouped, all "2-months-ahead" are grouped, and so on. This corresponds to the "diagonals" of the parallelogram.
- **Question Answered:** "How does the model's accuracy change as it predicts further into the future?" This is the most critical evaluation schema in the VIEWS framework.
- **Analogy:** Grading the quality of all the *introduction paragraphs* from a batch of essays, then all the *body paragraphs*, then all the *conclusions* separately.

### Schema 3: Month-wise Evaluation

- **Grouping Method:** Groups all predictions that target the **same calendar month**, regardless of when the forecast was issued. This is a "horizontal slice" of the parallelogram.
- **Question Answered:** "How well did the system predict the events of March 2022, using all forecasts that targeted that specific month?"
- **Analogy:** Grading every student's answer to "Question #5" on a test.
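The three schemas are just three ways of grouping the same predictions. The toy sketch below makes that concrete with a shrunken parallelogram (3 runs, 3 steps); the record fields and the `group_by` helper are illustrative, not the library's API.

```python
from collections import defaultdict

# Each record is one prediction: which run issued it, its step (horizon),
# and the calendar month it targets.
predictions = [
    {"run": run, "step": step, "target_month": 100 + run + step}
    for run in range(3)        # 3 forecast runs for brevity
    for step in range(1, 4)    # 3 steps per run
]

def group_by(records, key):
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return dict(groups)

time_series_wise = group_by(predictions, "run")           # vertical slices
step_wise = group_by(predictions, "step")                 # diagonals
month_wise = group_by(predictions, "target_month")        # horizontal slices

assert len(time_series_wise) == 3
assert len(step_wise) == 3
# Month 103 is targeted by run 0 at step 3, run 1 at step 2, and run 2 at step 1:
assert len(month_wise[103]) == 3
```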

---

### Summary Table

| Evaluation Schema | Groups Predictions By... | Question It Answers | Analogy |
| ------------------- | ------------------------ | -------------------------------------------------------- | ----------------------------------------- |
| **Time-series-wise**| Forecast Run | "How good was an entire 36-month forecast?" | Grading a whole essay. |
| **Step-wise** | Forecast Horizon (Step) | "How good is the model at predicting 6 months out?" | Grading all introductions separately. |
| **Month-wise** | Target Calendar Month | "How well did we predict the events of a specific month?" | Grading all answers to one test question. |