Changes from all commits
67 commits
7e4bb0f
feat: Add contract tests and update documentation
Polichinel Jan 23, 2026
3e478ea
docs: Add detailed Phase 4 testing plan
Polichinel Jan 26, 2026
44e6f85
feat: Implement Phases 2 & 3 testing, document tech debt
Polichinel Jan 26, 2026
ce8cc45
docs: Update eval_lib_imp and r2darts2 reports with new findings
Polichinel Jan 26, 2026
4b29e81
Fix: Linting issues in test files
Polichinel Jan 26, 2026
50bfe9c
Fix: Remove unused import in tests/test_metric_calculators.py
Polichinel Jan 26, 2026
14ddcaf
Fix: Apply ruff linting fixes outside of tests
Polichinel Jan 26, 2026
ce3ea98
feat(docs, tests): Add evaluation guides and schema verification tests
Polichinel Jan 27, 2026
0409129
docs(ADR-001): Mark unimplemented metrics
Polichinel Jan 27, 2026
514ae26
feat(validation): Harden prediction data contract and add verificatio…
Polichinel Jan 28, 2026
8dc478b
docs(reports): Add post-mortem on multi-target investigation
Polichinel Jan 28, 2026
ade69a4
mtd
Jan 31, 2026
27f5a7b
rem refs
Jan 31, 2026
e8c65f7
tests
smellycloud Feb 3, 2026
44629bb
Merge remote-tracking branch 'origin/main' into mean_tweedie_deviance
smellycloud Feb 3, 2026
7d77a22
Merge pull request #13 from views-platform/mean_tweedie_deviance
smellycloud Feb 3, 2026
c7e9697
small patch to allow for Hydranet to pass pred_taget with surffix _pr…
Polichinel Feb 4, 2026
8632fe9
refactor(evaluation): remove hydranet patches and add manifest-driven…
Polichinel Feb 10, 2026
98ef932
Merge remote-tracking branch 'origin/development' into feature/docume…
Polichinel Feb 22, 2026
5967466
fix(linting): remove unused variable assignment flagged by ruff (F841)
Polichinel Feb 22, 2026
19266b9
feat(evaluation): implement 2x2 config-driven evaluation architecture…
Polichinel Feb 23, 2026
9b631c0
docs(post-mortem): add evaluation ontology liberation session post-mo…
Polichinel Feb 23, 2026
4b637bc
refactor: rename uncertainty to sample and update config ontology
Polichinel Feb 24, 2026
2433433
docs: update copyright holders in LICENSE
Polichinel Feb 24, 2026
ddb0542
docs: add Håvard Hegre to copyright holders
Polichinel Feb 24, 2026
fcbe9e4
Merge pull request #14 from views-platform/feature/documentation-veri…
Polichinel Feb 24, 2026
cfb13ac
docs: add investigation plan and initial alignment semantics analysis
Polichinel Feb 25, 2026
0faabda
feat: complete investigation into canonical EvaluationFrame boundary
Polichinel Feb 25, 2026
8af7556
docs: add implementation plan for EvaluationFrame migration
Polichinel Feb 25, 2026
a13c15f
docs: reorganize investigation reports into numbered directory
Polichinel Feb 26, 2026
5b04d87
docs: renumber ADR suite into foundational hierarchy (000-041)
Polichinel Feb 26, 2026
b09658d
docs: add Class Intent Contracts for EvaluationFrame, NativeEvaluator…
Polichinel Feb 26, 2026
f2b1931
feat: complete verified EvaluationFrame migration
Polichinel Feb 26, 2026
ecd8402
feat: complete EvaluationFrame refactor with clean boundary contracts
Polichinel Feb 26, 2026
a256e72
feat: complete architectural cleanup and 1000% functional verification
Polichinel Feb 26, 2026
ace95b5
docs: finalize post-refactor status report
Polichinel Feb 26, 2026
12090dd
fix(linting): resolve linting issues identified by ruff
Polichinel Feb 26, 2026
4ad460b
feat: implement dual-entry support and shadow verification for orches…
Polichinel Feb 26, 2026
193b06c
docs: clarify defensive bridge in migration plan
Polichinel Feb 27, 2026
917511c
fix+test+docs: native path hardening, step filtering, and doc accurac…
Polichinel Feb 27, 2026
3f1fcdc
chore: demarcate permanent vs temporary code for Phase 3 readiness
Polichinel Feb 27, 2026
cfee82e
feat: add metric catalog, named profiles, and pure-numpy CRPS/twCRPS/QIS
Polichinel Mar 11, 2026
5ad4067
feat: add MCR metric, hydranet_ucdp profile, and tech debt cleanup
Polichinel Mar 13, 2026
26179d0
docs+chore: documentation remediation, README overhaul, and tech debt…
Polichinel Mar 13, 2026
19dd0c2
Merge pull request #15 from views-platform/feature/samples_for_fao
Polichinel Mar 14, 2026
c4f3642
fix: correct corrupted .gitignore and wrong import paths in example a…
Polichinel Mar 14, 2026
3a5b51f
docs: adopt base_docs governance — ADR headers, CIC sections 9-12, ri…
Polichinel Mar 31, 2026
7f0120c
feat: add Brier (sample/point) and Quantile Score (sample/point) metr…
Polichinel Mar 31, 2026
d5bc253
chore: tech debt cleanup — remove dead code, enforce y_pred shape inv…
Polichinel Mar 31, 2026
74bd585
docs: address review findings — Brier docstring warnings, breaking re…
Polichinel Mar 31, 2026
38471f1
test: address all test review gaps — golden values, classification, d…
Polichinel Mar 31, 2026
ed7cea7
chore: remove dead helper methods and vestigial state from Evaluation…
Polichinel Mar 31, 2026
fe6dd4c
feat!: Phase 3 — remove EvaluationManager, PandasAdapter, and pandas …
Polichinel Apr 1, 2026
2c961af
chore: post-Phase 3 cleanup — fix stale references, update docs and e…
Polichinel Apr 1, 2026
9ed37ce
docs: clean 7 stale references to deleted classes from governance docs
Polichinel Apr 1, 2026
236b1c8
fix: remove unused variable in example to pass ruff lint
Polichinel Apr 1, 2026
c26264f
test: add golden-value tests for Ignorance and AP, fix NumPy deprecat…
Polichinel Apr 2, 2026
2e7ab2c
docs: address 6 review-base-docs remediations — stale refs, CIC dates
Polichinel Apr 2, 2026
aba663c
chore: bump version to 0.5.0, declare pandas as optional dependency
Polichinel Apr 2, 2026
1b0b549
Merge pull request #16 from views-platform/feature/thresholds00
Polichinel Apr 2, 2026
2578532
fix: close 3 risk register concerns — bounds validation, exception co…
Polichinel Apr 4, 2026
904880b
test: close test gaps and remove dead code — object-dtype guard, cros…
Polichinel Apr 4, 2026
ce26dc1
fix: skip report tests when pandas is not installed (CI fix)
Polichinel Apr 4, 2026
cf32c76
Merge pull request #17 from views-platform/debug/cleanup03042026
Polichinel Apr 4, 2026
1c970fe
feat: explicit Brier score variants for the 2×2 evaluation matrix
Polichinel Apr 9, 2026
8e08110
fix: set Brier threshold defaults to 0.0 (hurdle event y > 0)
Polichinel Apr 9, 2026
2534075
Merge pull request #18 from views-platform/feature/brier_variants
Polichinel Apr 9, 2026
3 changes: 2 additions & 1 deletion .gitignore
@@ -215,4 +215,5 @@ cython_debug/

# logs
*.log
*.log.*
reports/
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Xiaolong Sun, Borbála Farkas, Dylan Pinheiro, Sonja Häffner, Simon Polichinel von der Maase and Håvard Hegre

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
255 changes: 209 additions & 46 deletions README.md
@@ -11,35 +11,93 @@

> **Part of the [VIEWS Platform](https://github.com/views-platform) ecosystem for large-scale conflict forecasting.**

---

### ⚠️ **ATTENTION: Migration Notice (v0.4.0+)**

The evaluation ontology has been updated to be more explicit and task-specific. If your pipeline broke after updating, please update your configuration dictionary. The library now distinguishes between **regression** vs **classification** tasks, and **point** vs **sample** predictions.

**Key Changes:**
* `targets` is now **`regression_targets`** or **`classification_targets`**.
* `metrics` is now **`regression_point_metrics`**.
* All **`uncertainty`** keys have been renamed to **`sample`** (reflecting that we evaluate draws/samples from a distribution).

| Legacy Key | New Canonical Key |
|:--- |:--- |
| `targets` | `regression_targets` |
| `metrics` | `regression_point_metrics` |
| `regression_uncertainty_metrics` | `regression_sample_metrics` |
| `classification_uncertainty_metrics` | `classification_sample_metrics` |

*Note: Legacy keys still work but will trigger a `DeprecationWarning`.*
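
The renames in the table above can be sketched as a one-off migration helper. This is illustrative code only, not part of the library — `LEGACY_TO_CANONICAL` and `migrate_config` are hypothetical names:

```python
# Hypothetical migration helper illustrating the key renames; not library code.
LEGACY_TO_CANONICAL = {
    "targets": "regression_targets",
    "metrics": "regression_point_metrics",
    "regression_uncertainty_metrics": "regression_sample_metrics",
    "classification_uncertainty_metrics": "classification_sample_metrics",
}

def migrate_config(config: dict) -> dict:
    """Return a copy of `config` with legacy keys renamed to canonical ones."""
    return {LEGACY_TO_CANONICAL.get(key, key): value for key, value in config.items()}

legacy = {"targets": ["ged_sb_best"], "metrics": ["MSE"], "steps": [1, 3, 6]}
canonical = migrate_config(legacy)
# Unrecognized keys such as "steps" pass through unchanged.
```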

---

## 📚 **Table of Contents**

1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Role in the VIEWS Pipeline](#role-in-the-views-pipeline)
4. [Features](#features)
5. [Installation](#installation)
6. [Architecture](#architecture)
7. [Project Structure](#project-structure)
8. [Contributing](#contributing)
9. [License](#license)
10. [Acknowledgements](#acknowledgements)

---

## 🧠 **Overview**

The **VIEWS Evaluation** repository provides a standardized framework for **assessing time-series forecasting models** used in the **VIEWS conflict prediction pipeline**. It ensures consistent, robust, and interpretable evaluations through **metrics tailored to conflict-related data**, which often exhibit **right-skewness and zero-inflation**.

The library is built on a **three-layer architecture** with a framework-agnostic NumPy core, ensuring that all mathematical evaluation logic is independent of Pandas or any other data-frame library.

---

## 🚀 **Quick Start**

```python
from views_evaluation import EvaluationFrame, NativeEvaluator
import numpy as np

# 1. Construct EvaluationFrame with NumPy arrays
ef = EvaluationFrame(
    y_true=y_true_array,
    y_pred=y_pred_array,  # shape (N, S) where S >= 1
    identifiers={'time': times, 'unit': units, 'origin': origins, 'step': steps},
    metadata={'target': 'ged_sb_best'},
)

# 2. Configure and evaluate
config = {
    "steps": [1, 2, 3, 4, 5, 6],
    "regression_targets": ["ged_sb_best"],
    "regression_point_metrics": ["MSE", "RMSLE", "Pearson"],
}
evaluator = NativeEvaluator(config)
report = evaluator.evaluate(ef)

# 3. Access results
report.to_dataframe("step")          # pd.DataFrame
report.to_dict()                     # nested dict
report.get_schema_results("month")   # typed metrics dataclass
```

> For the full walkthrough including input formatting and sample evaluation, see [`documentation/integration_guide.md`](documentation/integration_guide.md).
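
The `(N, S)` shape convention above can be illustrated with plain NumPy. The data below is synthetic and the array names are ours, not the library's:

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 12, 100                           # 12 observations, 100 forecast draws each

y_true = rng.poisson(2.0, size=N)        # observed counts, shape (N,)
y_pred = rng.poisson(2.0, size=(N, S))   # sample predictions, shape (N, S)

# A point prediction is the degenerate S == 1 case of the same layout.
y_point = y_pred.mean(axis=1, keepdims=True)
```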

---

## 🌍 **Role in the VIEWS Pipeline**

VIEWS Evaluation ensures **forecasting accuracy and model robustness** as the **official evaluation component** of the VIEWS ecosystem.

### **Pipeline Integration:**
1. **Model Predictions** →
2. **EvaluationFrame** (validated NumPy container) →
3. **NativeEvaluator** (metrics computation) →
4. **EvaluationReport** (structured results)

### **Integration with Other Repositories:**
- **[views-pipeline-core](https://github.com/views-platform/views-pipeline-core):** Supplies preprocessed data for evaluation.
@@ -50,19 +108,101 @@
---

## ✨ **Features**
* **Comprehensive Evaluation Framework**: The `NativeEvaluator` provides structured, stateless evaluation of time series predictions across a 2×2 matrix of **regression/classification** tasks and **point/sample** prediction types.
* **Multiple Evaluation Schemas**:
  * **Step-wise evaluation**: groups predictions by forecast step across all models and evaluates each step.
  * **Time-series-wise evaluation**: evaluates the predictions of each individual time series.
  * **Month-wise evaluation**: groups and evaluates predictions at the monthly level.
* **Support for Multiple Metrics** (see table below for details)

### **Available Metrics**

Metrics are organized by the 2×2 evaluation matrix: **task** (regression / classification) × **prediction type** (point / sample).

#### Regression Point Metrics

| Metric | Key | Description | Status |
|--------|-----|-------------|:------:|
| Mean Squared Error | `MSE` | Average of squared differences | ✅ |
| Mean Squared Log Error | `MSLE` | MSE computed on log-transformed values | ✅ |
| Root Mean Squared Log Error | `RMSLE` | Square root of MSLE | ✅ |
| Earth Mover's Distance | `EMD` | Wasserstein distance between distributions | ✅ |
| Pearson Correlation | `Pearson` | Linear correlation between predictions and actuals | ✅ |
| Mean Tweedie Deviance | `MTD` | Tweedie deviance (configurable power), ideal for zero-inflated data | ✅ |
| Mean Prediction | `y_hat_bar` | Average of all predicted values (diagnostic) | ✅ |
| Magnitude Calibration Ratio | `MCR_point` | Ratio of predicted to actual magnitude | ✅ |
| Sinkhorn Distance | `SD` | Regularized optimal transport distance | ❌ |
| pseudo-Earth Mover Divergence | `pEMDiv` | Efficient EMD approximation | ❌ |
| Variogram | `Variogram` | Spatial/temporal correlation structure score | ❌ |
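
As a reading aid, MSLE/RMSLE can be sketched in a few lines of NumPy. This is our sketch of the textbook formula, not the library's implementation:

```python
import numpy as np

def rmsle(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared log error: sqrt(mean((log1p(pred) - log1p(true))^2)).

    log1p keeps the zeros that dominate conflict-fatality data well defined.
    """
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

rmsle(np.array([0.0, 1.0, 3.0]), np.array([0.0, 1.0, 3.0]))  # 0.0 for a perfect forecast
```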

#### Regression Sample Metrics

| Metric | Key | Description | Status |
|--------|-----|-------------|:------:|
| Continuous Ranked Probability Score | `CRPS` | Calibration and sharpness of probabilistic forecasts | ✅ |
| Threshold-Weighted CRPS | `twCRPS` | CRPS emphasizing values above a threshold | ✅ |
| Mean Interval Score | `MIS` | Prediction interval width and coverage | ✅ |
| Quantile Interval Score | `QIS` | Interval score at specified quantiles | ✅ |
| Coverage | `Coverage` | Proportion of actuals within prediction intervals | ✅ |
| Ignorance Score | `Ignorance` | Logarithmic scoring rule for probabilistic predictions | ✅ |
| Mean Prediction | `y_hat_bar` | Average of all predicted values (diagnostic) | ✅ |
| Magnitude Calibration Ratio | `MCR_sample` | Ratio of predicted to actual magnitude | ✅ |

#### Classification Point Metrics

| Metric | Key | Description | Status |
|--------|-----|-------------|:------:|
| Average Precision | `AP` | Area under precision-recall curve | ✅ |

#### Classification Sample Metrics

| Metric | Key | Description | Status |
|--------|-----|-------------|:------:|
| Continuous Ranked Probability Score | `CRPS` | Calibration and sharpness | ✅ |
| Threshold-Weighted CRPS | `twCRPS` | CRPS emphasizing values above a threshold | ✅ |
| Brier Score | `Brier` | Accuracy of probabilistic binary predictions | ❌ |
| Jeffreys Divergence | `Jeffreys` | Symmetric measure of distribution difference | ❌ |

> **Note:** Metrics marked ❌ are defined in the catalog but not yet implemented — requesting them raises a clear `ValueError`.
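
For intuition, a sample-based CRPS estimate can be computed from the standard identity CRPS = E|X − y| − ½·E|X − X′|, where X, X′ are independent forecast draws. The sketch below uses that estimator on plain NumPy arrays; it is illustrative, not the library's code:

```python
import numpy as np

def crps_sample(y: float, samples: np.ndarray) -> float:
    """Empirical CRPS for one observation y given forecast draws `samples`, shape (S,)."""
    term_obs = np.mean(np.abs(samples - y))                              # E|X - y|
    term_spread = np.mean(np.abs(samples[:, None] - samples[None, :]))   # E|X - X'|
    return float(term_obs - 0.5 * term_spread)

crps_sample(1.0, np.array([0.0, 2.0]))  # 0.5
```

A degenerate forecast whose draws all equal the observation scores 0, the best possible value.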

---

### 📝 **Configuration Schema**

The `NativeEvaluator` accepts a configuration dictionary (`EvaluationConfig` TypedDict) with the following keys:

| Key | Type | Description |
|:--- |:--- |:--- |
| `steps` | `List[int]` | List of forecast steps to evaluate (e.g., `[1, 3, 6, 12]`). |
| `regression_targets` | `List[str]` | List of continuous targets (e.g., `['ged_sb_best']`). |
| `regression_point_metrics` | `List[str]` | Metrics to compute for regression point predictions. |
| `regression_sample_metrics` | `List[str]` | Metrics to compute for regression sample predictions (e.g., `['CRPS']`). |
| `classification_targets` | `List[str]` | List of binary targets (e.g., `['by_sb_best']`). |
| `classification_point_metrics` | `List[str]` | Metrics to compute for classification probability scores. |
| `classification_sample_metrics` | `List[str]` | Metrics to compute for classification sample predictions. |
| `evaluation_profile` | `str` | Named hyperparameter profile (default: `"base"`). See `views_evaluation/profiles/`. |
| `metric_hyperparameters` | `Dict[str, Dict]` | Per-metric overrides that take precedence over the profile. |

#### **Example Configuration:**

```python
config = {
    "steps": [1, 3, 6, 12],
    "regression_targets": ["ged_sb_best"],
    "regression_point_metrics": ["MSE", "RMSLE", "Pearson"],
    "regression_sample_metrics": ["CRPS", "twCRPS", "MIS", "Coverage"],
    "evaluation_profile": "base",  # or "hydranet_ucdp"
    "metric_hyperparameters": {
        "twCRPS": {"threshold": 10.0},  # override profile default
    },
}
```

---

* **Data Integrity Checks**: Validates input arrays for shape consistency, NaN/infinity, and required identifiers.
* **Framework-Agnostic Core**: All evaluation operates on pure NumPy arrays via `EvaluationFrame`.
* **Metric Catalog & Profiles**: Hyperparameters are managed through named evaluation profiles with a Chain of Responsibility resolver (model overrides → profile → fail loud).
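
The Chain of Responsibility described above can be sketched as follows. The function name and dict layout are hypothetical; the library's actual resolver lives in the metric catalog:

```python
def resolve_hyperparameter(metric: str, param: str, overrides: dict, profile: dict):
    """Resolve one metric hyperparameter: model overrides -> profile -> fail loud."""
    if param in overrides.get(metric, {}):   # 1. model-level override wins
        return overrides[metric][param]
    if param in profile.get(metric, {}):     # 2. fall back to the named profile
        return profile[metric][param]
    raise KeyError(                          # 3. fail loud, never default silently
        f"No value for {metric}.{param!r}: add it to the profile "
        f"or to 'metric_hyperparameters'."
    )

profile = {"twCRPS": {"threshold": 25.0}}
overrides = {"twCRPS": {"threshold": 10.0}}
resolve_hyperparameter("twCRPS", "threshold", overrides, profile)  # 10.0, override wins
```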

---

Expand All @@ -77,37 +217,60 @@ pip install views_evaluation
```

---
## 🏗 **Architecture**

The library follows a strict three-layer architecture (ADR-011):

```
Level 0 — Pure Core (NumPy + SciPy only, zero framework imports)
EvaluationFrame Canonical data container (y_true, y_pred, identifiers)
NativeEvaluator Stateless evaluation engine (month/sequence/step schemas)
MetricCatalog Genome registry mapping metrics → functions + required params
Profiles Named hyperparameter sets (base, hydranet_ucdp, ...)

Level 1 — Bridge / Adapter
EvaluationFrame Validated NumPy data container
EvaluationReport Results container with DataFrame/dict export

Level 2 — Legacy Orchestrator
MetricCatalog Genome registry and parameter resolver
```

**Key design decisions:**
- **ADR-011**: No Pandas/Polars imports in Level 0 — math is framework-agnostic.
- **ADR-013**: Fail-loud — all structural failures raise exceptions with actionable messages, never silently degrade.
- **ADR-042**: Metric catalog — each metric declares its required hyperparameters ("genome"); values are resolved via Chain of Responsibility.
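
A minimal sketch of what ADR-042's "genome" idea implies — each catalog entry declares the hyperparameters its metric requires, so validation becomes generic. The dataclass and names below are ours, not the library's:

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class MetricGenome:
    """One catalog entry: a metric declares the hyperparameters it requires."""
    key: str
    func: Callable
    required_params: Tuple[str, ...] = ()  # the metric's "genome"

# Illustrative catalog: MSE needs no hyperparameters, twCRPS declares 'threshold'.
CATALOG = {
    "MSE": MetricGenome("MSE", func=lambda y, p: float(((y - p) ** 2).mean())),
    "twCRPS": MetricGenome("twCRPS", func=lambda y, p, threshold: ...,
                           required_params=("threshold",)),
}

# Generic validation: compare the declared genome against the resolved parameters.
resolved = {"threshold": 10.0}
missing = [p for p in CATALOG["twCRPS"].required_params if p not in resolved]
assert missing == []  # the config supplies everything twCRPS declares
```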

---

## 🗂 **Project Structure**

```plaintext
views-evaluation/
├── views_evaluation/
│   ├── __init__.py                      # Public API exports
│   ├── adapters/
│   │   └── __init__.py                  # Reserved for future framework bridges
│   ├── evaluation/
│   │   ├── config_schema.py             # EvaluationConfig TypedDict
│   │   ├── evaluation_frame.py          # Core data container
│   │   ├── evaluation_manager.py        # Legacy orchestrator (deprecated)
│   │   ├── evaluation_report.py         # Results container
│   │   ├── metric_catalog.py            # ADR-042 registry + resolver
│   │   ├── metrics.py                   # Typed metric dataclasses
│   │   ├── native_evaluator.py          # Core evaluation engine
│   │   └── native_metric_calculators.py # Metric implementations
│   └── profiles/
│       ├── base.py                      # Standard hyperparameter defaults
│       └── hydranet_ucdp.py             # Domain-specific profile
├── tests/                               # 242 tests (Green/Beige/Red)
├── documentation/
│   ├── ADRs/                            # 17 Architecture Decision Records
│   ├── CICs/                            # Class Intent Contracts
│   ├── integration_guide.md             # Full API walkthrough
│   └── evaluation_concepts.md           # Domain concepts
├── pyproject.toml
└── README.md
```

---
37 changes: 37 additions & 0 deletions documentation/ADRs/000_use_of_adrs.md
@@ -0,0 +1,37 @@
# ADR-000: Use of Architecture Decision Records (ADRs)

**Status:** Accepted
**Date:** 2026-02-25
**Deciders:** Project maintainers
**Consulted:** —
**Informed:** All contributors

---

## Context

The Views Evaluation repository sits at the intersection of evolving research (new metrics, probabilistic scaling) and production stability (Pipeline Core integration).

Significant decisions in such systems are often made under uncertainty and revisited later, leading to regressions or duplicated debate. Without a shared record of *why* decisions were made, we risk:
- Accidental reversals of critical design choices (e.g., re-introducing Pandas into the math core).
- Losing institutional memory as contributors and agents change.

## Decision

We will use **Architecture Decision Records (ADRs)** to document all significant technical, architectural, and conceptual decisions.

- ADRs are stored in the repository under `documentation/ADRs/`.
- ADRs are numbered sequentially and represent a decision, not just a discussion.
- ADRs and code must agree; code that violates an ADR is considered an architectural defect.
- If a decision changes, it is **superseded** by a new ADR, never erased.

## Consequences

### Positive
- Clearer decision-making and fewer repeated debates.
- Easier onboarding for both carbon-based and silicon-based contributors.
- Better long-term coherence through the "Pure Math Engine" refactor.

### Negative
- Small upfront cost in writing and discipline to maintain.
- Forces explicitness where ambiguity may feel easier.