Merged
18 commits
cfb13ac
docs: add investigation plan and initial alignment semantics analysis
Polichinel Feb 25, 2026
0faabda
feat: complete investigation into canonical EvaluationFrame boundary
Polichinel Feb 25, 2026
8af7556
docs: add implementation plan for EvaluationFrame migration
Polichinel Feb 25, 2026
a13c15f
docs: reorganize investigation reports into numbered directory
Polichinel Feb 26, 2026
5b04d87
docs: renumber ADR suite into foundational hierarchy (000-041)
Polichinel Feb 26, 2026
b09658d
docs: add Class Intent Contracts for EvaluationFrame, NativeEvaluator…
Polichinel Feb 26, 2026
f2b1931
feat: complete verified EvaluationFrame migration
Polichinel Feb 26, 2026
ecd8402
feat: complete EvaluationFrame refactor with clean boundary contracts
Polichinel Feb 26, 2026
a256e72
feat: complete architectural cleanup and 1000% functional verification
Polichinel Feb 26, 2026
ace95b5
docs: finalize post-refactor status report
Polichinel Feb 26, 2026
12090dd
fix(linting): resolve linting issues identified by ruff
Polichinel Feb 26, 2026
4ad460b
feat: implement dual-entry support and shadow verification for orches…
Polichinel Feb 26, 2026
193b06c
docs: clarify defensive bridge in migration plan
Polichinel Feb 27, 2026
917511c
fix+test+docs: native path hardening, step filtering, and doc accurac…
Polichinel Feb 27, 2026
3f1fcdc
chore: demarcate permanent vs temporary code for Phase 3 readiness
Polichinel Feb 27, 2026
cfee82e
feat: add metric catalog, named profiles, and pure-numpy CRPS/twCRPS/QIS
Polichinel Mar 11, 2026
5ad4067
feat: add MCR metric, hydranet_ucdp profile, and tech debt cleanup
Polichinel Mar 13, 2026
26179d0
docs+chore: documentation remediation, README overhaul, and tech debt…
Polichinel Mar 13, 2026
2 changes: 1 addition & 1 deletion .gitignore
@@ -215,4 +215,4 @@ cython_debug/

# logs
*.log
*.log.*
reports/
219 changes: 150 additions & 69 deletions README.md
@@ -35,33 +35,63 @@ The evaluation ontology has been updated to be more explicit and task-specific.

## 📚 **Table of Contents**

1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Role in the VIEWS Pipeline](#role-in-the-views-pipeline)
4. [Features](#features)
5. [Installation](#installation)
6. [Architecture](#architecture)
7. [Project Structure](#project-structure)
8. [Contributing](#contributing)
9. [License](#license)
10. [Acknowledgements](#acknowledgements)

---

## 🧠 **Overview**

The **VIEWS Evaluation** repository provides a standardized framework for **assessing time-series forecasting models** used in the **VIEWS conflict prediction pipeline**. It ensures consistent, robust, and interpretable evaluations through **metrics tailored to conflict-related data**, which often exhibit **right-skewness and zero-inflation**.

The library is built on a **three-layer architecture** with a framework-agnostic NumPy core, ensuring that all mathematical evaluation logic is independent of Pandas or any other data-frame library.

---

## 🚀 **Quick Start**

```python
from views_evaluation import PandasAdapter, NativeEvaluator

# 1. Convert DataFrames → EvaluationFrame
ef = PandasAdapter.from_dataframes(actual=actuals, predictions=predictions_list, target="ged_sb_best")

# 2. Configure and evaluate
config = {
    "steps": [1, 2, 3, 4, 5, 6],
    "regression_targets": ["ged_sb_best"],
    "regression_point_metrics": ["MSE", "RMSLE", "Pearson"],
}
evaluator = NativeEvaluator(config)
report = evaluator.evaluate(ef)

# 3. Access results
report.to_dataframe("step") # pd.DataFrame
report.to_dict() # nested dict
report.get_schema_results("month") # typed metrics dataclass
```

> For the full walkthrough including input formatting and sample evaluation, see [`documentation/integration_guide.md`](documentation/integration_guide.md).

---

## 🌍 **Role in the VIEWS Pipeline**

VIEWS Evaluation ensures **forecasting accuracy and model robustness** as the **official evaluation component** of the VIEWS ecosystem.

### **Pipeline Integration:**
1. **Model Predictions** →
2. **PandasAdapter** (DataFrame → EvaluationFrame) →
3. **NativeEvaluator** (metrics computation) →
4. **EvaluationReport** (structured results)

### **Integration with Other Repositories:**
- **[views-pipeline-core](https://github.com/views-platform/views-pipeline-core):** Supplies preprocessed data for evaluation.
@@ -72,7 +72,7 @@
---

## ✨ **Features**
* **Comprehensive Evaluation Framework**: The `NativeEvaluator` provides structured, stateless evaluation of time series predictions across a 2×2 matrix of **regression/classification** tasks and **point/sample** prediction types.
* **Multiple Evaluation Schemas**:
* **Step-wise evaluation**: groups and evaluates predictions by the respective steps from all models.
* **Time-series-wise evaluation**: evaluates predictions for each time-series.
@@ -81,33 +81,59 @@

### **Available Metrics**

Metrics are organized by the 2×2 evaluation matrix: **task** (regression / classification) × **prediction type** (point / sample).

#### Regression Point Metrics

| Metric | Key | Description | Status |
|--------|-----|-------------|:------:|
| Mean Squared Error | `MSE` | Average of squared differences | ✅ |
| Mean Squared Log Error | `MSLE` | MSE computed on log-transformed values | ✅ |
| Root Mean Squared Log Error | `RMSLE` | Square root of MSLE | ✅ |
| Earth Mover's Distance | `EMD` | Wasserstein distance between distributions | ✅ |
| Pearson Correlation | `Pearson` | Linear correlation between predictions and actuals | ✅ |
| Mean Tweedie Deviance | `MTD` | Tweedie deviance (configurable power), ideal for zero-inflated data | ✅ |
| Mean Prediction | `y_hat_bar` | Average of all predicted values (diagnostic) | ✅ |
| Magnitude Calibration Ratio | `MCR_point` | Ratio of predicted to actual magnitude | ✅ |
| Sinkhorn Distance | `SD` | Regularized optimal transport distance | ❌ |
| pseudo-Earth Mover Divergence | `pEMDiv` | Efficient EMD approximation | ❌ |
| Variogram | `Variogram` | Spatial/temporal correlation structure score | ❌ |
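The Mean Tweedie Deviance row above notes a configurable power parameter; for intuition, here is a minimal NumPy sketch of the textbook deviance for `1 < power < 2` (the compound Poisson-gamma regime that tolerates exact zeros in the actuals). This is an illustration of the scoring rule, not the library's internal implementation.

```python
import numpy as np

def mean_tweedie_deviance(y_true, y_pred, power=1.5):
    """Mean Tweedie deviance for 1 < power < 2.

    Suited to zero-inflated, right-skewed counts: y_true may contain
    zeros, but y_pred must be strictly positive.
    """
    y = np.asarray(y_true, dtype=float)
    mu = np.asarray(y_pred, dtype=float)
    p = power
    dev = 2.0 * (
        np.power(y, 2 - p) / ((1 - p) * (2 - p))  # data-only term (0 when y = 0)
        - y * np.power(mu, 1 - p) / (1 - p)       # cross term
        + np.power(mu, 2 - p) / (2 - p)           # prediction-only term
    )
    return float(np.mean(dev))
```

A perfect forecast has zero deviance, and over-predicting zeros is penalized smoothly rather than infinitely (unlike a pure Poisson deviance).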

#### Regression Sample Metrics

| Metric | Key | Description | Status |
|--------|-----|-------------|:------:|
| Continuous Ranked Probability Score | `CRPS` | Calibration and sharpness of probabilistic forecasts | ✅ |
| Threshold-Weighted CRPS | `twCRPS` | CRPS emphasizing values above a threshold | ✅ |
| Mean Interval Score | `MIS` | Prediction interval width and coverage | ✅ |
| Quantile Interval Score | `QIS` | Interval score at specified quantiles | ✅ |
| Coverage | `Coverage` | Proportion of actuals within prediction intervals | ✅ |
| Ignorance Score | `Ignorance` | Logarithmic scoring rule for probabilistic predictions | ✅ |
| Mean Prediction | `y_hat_bar` | Average of all predicted values (diagnostic) | ✅ |
| Magnitude Calibration Ratio | `MCR_sample` | Ratio of predicted to actual magnitude | ✅ |
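To make the CRPS/twCRPS rows concrete, here is a pure-NumPy sketch of both scores in their energy form for ensemble (sample-based) forecasts. This illustrates the scoring rules only — it is not the library's implementation, and the chaining function `v(x) = max(x, t)` used for twCRPS is an assumption about how the threshold weighting works here.

```python
import numpy as np

def crps_ensemble(y_true: float, samples) -> float:
    """Ensemble CRPS via the energy form: E|X - y| - 0.5 * E|X - X'|."""
    s = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(s - y_true))
    term2 = 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))
    return float(term1 - term2)

def twcrps_ensemble(y_true: float, samples, threshold: float) -> float:
    """Threshold-weighted CRPS via the chaining function v(x) = max(x, t):
    values below the threshold collapse, so only exceedances are scored."""
    s = np.asarray(samples, dtype=float)
    vy = max(float(y_true), threshold)
    vs = np.maximum(s, threshold)
    term1 = np.mean(np.abs(vs - vy))
    term2 = 0.5 * np.mean(np.abs(vs[:, None] - vs[None, :]))
    return float(term1 - term2)
```

A degenerate ensemble (all members equal) reduces CRPS to absolute error, which is a handy sanity check.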

#### Classification Point Metrics

| Metric | Key | Description | Status |
|--------|-----|-------------|:------:|
| Average Precision | `AP` | Area under precision-recall curve | ✅ |

#### Classification Sample Metrics

| Metric | Key | Description | Status |
|--------|-----|-------------|:------:|
| Continuous Ranked Probability Score | `CRPS` | Calibration and sharpness | ✅ |
| Threshold-Weighted CRPS | `twCRPS` | CRPS emphasizing values above a threshold | ✅ |
| Brier Score | `Brier` | Accuracy of probabilistic binary predictions | ❌ |
| Jeffreys Divergence | `Jeffreys` | Symmetric measure of distribution difference | ❌ |

> **Note:** Metrics marked ❌ are defined in the catalog but not yet implemented — requesting them raises a clear `ValueError`.

---

### 📝 **Configuration Schema**

The `NativeEvaluator` accepts a configuration dictionary (`EvaluationConfig` TypedDict) with the following keys:

| Key | Type | Description |
|:--- |:--- |:--- |
@@ -118,27 +118,29 @@
| `classification_targets` | `List[str]` | List of binary targets (e.g., `['by_sb_best']`). |
| `classification_point_metrics` | `List[str]` | Metrics to compute for classification probability scores. |
| `classification_sample_metrics` | `List[str]` | Metrics to compute for classification sample predictions. |
| `evaluation_profile` | `str` | Named hyperparameter profile (default: `"base"`). See `views_evaluation/profiles/`. |
| `metric_hyperparameters` | `Dict[str, Dict]` | Per-metric overrides that take precedence over the profile. |

#### **Example Configuration:**

```python
config = {
    "steps": [1, 3, 6, 12],
    "regression_targets": ["ged_sb_best"],
    "regression_point_metrics": ["MSE", "RMSLE", "Pearson"],
    "regression_sample_metrics": ["CRPS", "twCRPS", "MIS", "Coverage"],
    "evaluation_profile": "base",  # or "hydranet_ucdp"
    "metric_hyperparameters": {
        "twCRPS": {"threshold": 10.0},  # override profile default
    },
}
```

---

* **Data Integrity Checks**: Validates input arrays for shape consistency, NaN/infinity, and required identifiers.
* **Automatic Index Matching**: `PandasAdapter` aligns actual and predicted values based on MultiIndex structures.
* **Metric Catalog & Profiles**: Hyperparameters are managed through named evaluation profiles with a Chain of Responsibility resolver (model overrides → profile → fail loud).
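The automatic index matching above can be pictured with a small pandas sketch. The index names (`month_id`, `priogrid_gid`), column names, and inner-join behavior are illustrative assumptions, not the adapter's actual code — the point is that actuals and predictions are aligned on the intersection of their MultiIndex cells.

```python
import pandas as pd

# Actuals cover three (month, unit) cells; predictions cover only two.
actual = pd.DataFrame(
    {"ged_sb_best": [5.0, 0.0, 2.0]},
    index=pd.MultiIndex.from_tuples(
        [(501, 1), (501, 2), (502, 1)], names=["month_id", "priogrid_gid"]
    ),
)
pred = pd.DataFrame(
    {"pred_ged_sb_best": [4.0, 1.0]},
    index=pd.MultiIndex.from_tuples(
        [(501, 1), (502, 1)], names=["month_id", "priogrid_gid"]
    ),
)

# Inner join on the MultiIndex keeps only cells present in both frames.
aligned = actual.join(pred, how="inner")
```

After alignment, `aligned` holds matched `(y_true, y_pred)` pairs — the cell `(501, 2)` with no prediction is dropped rather than silently filled.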

---

@@ -153,37 +153,60 @@ pip install views_evaluation
```

---
## 🏗 **Architecture**

The library follows a strict three-layer architecture (ADR-011):

```
Level 0 — Pure Core (NumPy + SciPy only, zero framework imports)
    EvaluationFrame    Canonical data container (y_true, y_pred, identifiers)
    NativeEvaluator    Stateless evaluation engine (month/sequence/step schemas)
    MetricCatalog      Genome registry mapping metrics → functions + required params
    Profiles           Named hyperparameter sets (base, hydranet_ucdp, ...)

Level 1 — Bridge / Adapter
    PandasAdapter      DataFrame → EvaluationFrame conversion (PHASE-3-DELETE)
    EvaluationReport   Results container with DataFrame/dict export

Level 2 — Legacy Orchestrator
    EvaluationManager  Deprecated wrapper; delegates to Level 0
```

**Key design decisions:**
- **ADR-011**: No Pandas/Polars imports in Level 0 — math is framework-agnostic.
- **ADR-013**: Fail-loud — all structural failures raise exceptions with actionable messages, never silently degrade.
- **ADR-042**: Metric catalog — each metric declares its required hyperparameters ("genome"); values are resolved via Chain of Responsibility.
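The Chain of Responsibility resolution named in ADR-042 can be sketched as follows. The function name and dictionary shapes are hypothetical — this only illustrates the precedence order (model overrides → profile → fail loud per ADR-013).

```python
def resolve_hyperparameter(metric, param, model_overrides, profile):
    """Resolve one metric hyperparameter by precedence.

    1. model-level overrides win,
    2. otherwise fall back to the named profile,
    3. otherwise fail loud with an actionable error (ADR-013).
    """
    if param in model_overrides.get(metric, {}):
        return model_overrides[metric][param]
    if param in profile.get(metric, {}):
        return profile[metric][param]
    raise ValueError(
        f"No value configured for {metric}.{param}; "
        "add it to the profile or pass an explicit override."
    )

profile = {"twCRPS": {"threshold": 5.0}}
overrides = {"twCRPS": {"threshold": 10.0}}

resolved = resolve_hyperparameter("twCRPS", "threshold", overrides, profile)
```

Note that a missing value is never silently defaulted — the resolver raises instead, matching the fail-loud contract.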

---

## 🗂 **Project Structure**

```plaintext
views-evaluation/
├── views_evaluation/
│   ├── __init__.py                      # Public API exports
│   ├── adapters/
│   │   └── pandas.py                    # PandasAdapter (PHASE-3-DELETE)
│   ├── evaluation/
│   │   ├── config_schema.py             # EvaluationConfig TypedDict
│   │   ├── evaluation_frame.py          # Core data container
│   │   ├── evaluation_manager.py        # Legacy orchestrator (deprecated)
│   │   ├── evaluation_report.py         # Results container
│   │   ├── metric_catalog.py            # ADR-042 registry + resolver
│   │   ├── metrics.py                   # Typed metric dataclasses
│   │   ├── native_evaluator.py          # Core evaluation engine
│   │   └── native_metric_calculators.py # Metric implementations
│   └── profiles/
│       ├── base.py                      # Standard hyperparameter defaults
│       └── hydranet_ucdp.py             # Domain-specific profile
├── tests/                               # 242 tests (Green/Beige/Red)
├── documentation/
│   ├── ADRs/                            # 17 Architecture Decision Records
│   ├── CICs/                            # Class Intent Contracts
│   ├── integration_guide.md             # Full API walkthrough
│   └── evaluation_concepts.md           # Domain concepts
├── pyproject.toml
└── README.md
```

---
35 changes: 35 additions & 0 deletions documentation/ADRs/000_use_of_adrs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# ADR-000: Use of Architecture Decision Records (ADRs)

**Status:** Accepted
**Date:** 2026-02-25
**Deciders:** Project maintainers, Gemini CLI

---

## Context

The VIEWS Evaluation repository sits at the intersection of evolving research (new metrics, probabilistic scaling) and production stability (Pipeline Core integration).

Significant decisions in such systems are often made under uncertainty and revisited later, leading to regressions or duplicated debate. Without a shared record of *why* decisions were made, we risk:
- Accidental reversals of critical design choices (e.g., re-introducing Pandas into the math core).
- Losing institutional memory as contributors and agents change.

## Decision

We will use **Architecture Decision Records (ADRs)** to document all significant technical, architectural, and conceptual decisions.

- ADRs are stored in the repository under `documentation/ADRs/`.
- ADRs are numbered sequentially and represent a decision, not just a discussion.
- ADRs and code must agree; code that violates an ADR is considered an architectural defect.
- If a decision changes, it is **superseded** by a new ADR, never erased.

## Consequences

### Positive
- Clearer decision-making and fewer repeated debates.
- Easier onboarding for both carbon-based and silicon-based contributors.
- Better long-term coherence through the "Pure Math Engine" refactor.

### Negative
- Small upfront cost in writing and discipline to maintain.
- Forces explicitness where ambiguity may feel easier.