Merged
18 commits
cfb13ac
docs: add investigation plan and initial alignment semantics analysis
Polichinel Feb 25, 2026
0faabda
feat: complete investigation into canonical EvaluationFrame boundary
Polichinel Feb 25, 2026
8af7556
docs: add implementation plan for EvaluationFrame migration
Polichinel Feb 25, 2026
a13c15f
docs: reorganize investigation reports into numbered directory
Polichinel Feb 26, 2026
5b04d87
docs: renumber ADR suite into foundational hierarchy (000-041)
Polichinel Feb 26, 2026
b09658d
docs: add Class Intent Contracts for EvaluationFrame, NativeEvaluator…
Polichinel Feb 26, 2026
f2b1931
feat: complete verified EvaluationFrame migration
Polichinel Feb 26, 2026
ecd8402
feat: complete EvaluationFrame refactor with clean boundary contracts
Polichinel Feb 26, 2026
a256e72
feat: complete architectural cleanup and 1000% functional verification
Polichinel Feb 26, 2026
ace95b5
docs: finalize post-refactor status report
Polichinel Feb 26, 2026
12090dd
fix(linting): resolve linting issues identified by ruff
Polichinel Feb 26, 2026
4ad460b
feat: implement dual-entry support and shadow verification for orches…
Polichinel Feb 26, 2026
193b06c
docs: clarify defensive bridge in migration plan
Polichinel Feb 27, 2026
917511c
fix+test+docs: native path hardening, step filtering, and doc accurac…
Polichinel Feb 27, 2026
3f1fcdc
chore: demarcate permanent vs temporary code for Phase 3 readiness
Polichinel Feb 27, 2026
cfee82e
feat: add metric catalog, named profiles, and pure-numpy CRPS/twCRPS/QIS
Polichinel Mar 11, 2026
5ad4067
feat: add MCR metric, hydranet_ucdp profile, and tech debt cleanup
Polichinel Mar 13, 2026
26179d0
docs+chore: documentation remediation, README overhaul, and tech debt…
Polichinel Mar 13, 2026
2 changes: 1 addition & 1 deletion .gitignore
@@ -215,4 +215,4 @@ cython_debug/

# logs
*.log
*.log.*
reports/
219 changes: 150 additions & 69 deletions README.md
@@ -35,33 +35,63 @@ The evaluation ontology has been updated to be more explicit and task-specific.

## 📚 **Table of Contents**

1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Role in the VIEWS Pipeline](#role-in-the-views-pipeline)
4. [Features](#features)
5. [Installation](#installation)
6. [Architecture](#architecture)
7. [Project Structure](#project-structure)
8. [Contributing](#contributing)
9. [License](#license)
10. [Acknowledgements](#acknowledgements)

---

## 🧠 **Overview**

The **VIEWS Evaluation** repository provides a standardized framework for **assessing time-series forecasting models** used in the **VIEWS conflict prediction pipeline**. It ensures consistent, robust, and interpretable evaluations through **metrics tailored to conflict-related data**, which often exhibit **right-skewness and zero-inflation**.

The library is built on a **three-layer architecture** with a framework-agnostic NumPy core, ensuring that all mathematical evaluation logic is independent of Pandas or any other data-frame library.

---

## 🚀 **Quick Start**

```python
from views_evaluation import PandasAdapter, NativeEvaluator

# 1. Convert DataFrames → EvaluationFrame
ef = PandasAdapter.from_dataframes(actual=actuals, predictions=predictions_list, target="ged_sb_best")

# 2. Configure and evaluate
config = {
    "steps": [1, 2, 3, 4, 5, 6],
    "regression_targets": ["ged_sb_best"],
    "regression_point_metrics": ["MSE", "RMSLE", "Pearson"],
}
evaluator = NativeEvaluator(config)
report = evaluator.evaluate(ef)

# 3. Access results
report.to_dataframe("step") # pd.DataFrame
report.to_dict() # nested dict
report.get_schema_results("month") # typed metrics dataclass
```

> For the full walkthrough including input formatting and sample evaluation, see [`documentation/integration_guide.md`](documentation/integration_guide.md).

---

## 🌍 **Role in the VIEWS Pipeline**

VIEWS Evaluation ensures **forecasting accuracy and model robustness** as the **official evaluation component** of the VIEWS ecosystem.

### **Pipeline Integration:**
1. **Model Predictions** →
2. **PandasAdapter** (DataFrame → EvaluationFrame) →
3. **NativeEvaluator** (metrics computation) →
4. **EvaluationReport** (structured results)

### **Integration with Other Repositories:**
- **[views-pipeline-core](https://github.com/views-platform/views-pipeline-core):** Supplies preprocessed data for evaluation.
@@ -72,7 +72,7 @@
---

## ✨ **Features**
* **Comprehensive Evaluation Framework**: The `NativeEvaluator` provides structured, stateless evaluation of time series predictions across a 2×2 matrix of **regression/classification** tasks and **point/sample** prediction types.
* **Multiple Evaluation Schemas**:
* **Step-wise evaluation**: groups and evaluates predictions by the respective steps from all models.
* **Time-series-wise evaluation**: evaluates predictions for each time-series.
@@ -81,33 +81,59 @@

### **Available Metrics**

Metrics are organized by the 2×2 evaluation matrix: **task** (regression / classification) × **prediction type** (point / sample).

#### Regression Point Metrics

| Metric | Key | Description | Status |
|--------|-----|-------------|:------:|
| Mean Squared Error | `MSE` | Average of squared differences | ✅ |
| Mean Squared Log Error | `MSLE` | MSE computed on log-transformed values | ✅ |
| Root Mean Squared Log Error | `RMSLE` | Square root of MSLE | ✅ |
| Earth Mover's Distance | `EMD` | Wasserstein distance between distributions | ✅ |
| Pearson Correlation | `Pearson` | Linear correlation between predictions and actuals | ✅ |
| Mean Tweedie Deviance | `MTD` | Tweedie deviance (configurable power), ideal for zero-inflated data | ✅ |
| Mean Prediction | `y_hat_bar` | Average of all predicted values (diagnostic) | ✅ |
| Magnitude Calibration Ratio | `MCR_point` | Ratio of predicted to actual magnitude | ✅ |
| Sinkhorn Distance | `SD` | Regularized optimal transport distance | ❌ |
| pseudo-Earth Mover Divergence | `pEMDiv` | Efficient EMD approximation | ❌ |
| Variogram | `Variogram` | Spatial/temporal correlation structure score | ❌ |
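The Mean Tweedie Deviance row above notes a configurable power parameter; for intuition, here is a minimal NumPy sketch of the textbook deviance for `1 < power < 2` (the compound Poisson-gamma regime that tolerates exact zeros in the actuals). This is an illustration of the scoring rule, not the library's internal implementation.

```python
import numpy as np

def mean_tweedie_deviance(y_true, y_pred, power=1.5):
    """Mean Tweedie deviance for 1 < power < 2.

    Suited to zero-inflated, right-skewed counts: y_true may contain
    zeros, but y_pred must be strictly positive.
    """
    y = np.asarray(y_true, dtype=float)
    mu = np.asarray(y_pred, dtype=float)
    p = power
    dev = 2.0 * (
        np.power(y, 2 - p) / ((1 - p) * (2 - p))  # data-only term (0 when y = 0)
        - y * np.power(mu, 1 - p) / (1 - p)       # cross term
        + np.power(mu, 2 - p) / (2 - p)           # prediction-only term
    )
    return float(np.mean(dev))
```

A perfect forecast has zero deviance, and over-predicting zeros is penalized smoothly rather than infinitely (unlike a pure Poisson deviance).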

#### Regression Sample Metrics

| Metric | Key | Description | Status |
|--------|-----|-------------|:------:|
| Continuous Ranked Probability Score | `CRPS` | Calibration and sharpness of probabilistic forecasts | ✅ |
| Threshold-Weighted CRPS | `twCRPS` | CRPS emphasizing values above a threshold | ✅ |
| Mean Interval Score | `MIS` | Prediction interval width and coverage | ✅ |
| Quantile Interval Score | `QIS` | Interval score at specified quantiles | ✅ |
| Coverage | `Coverage` | Proportion of actuals within prediction intervals | ✅ |
| Ignorance Score | `Ignorance` | Logarithmic scoring rule for probabilistic predictions | ✅ |
| Mean Prediction | `y_hat_bar` | Average of all predicted values (diagnostic) | ✅ |
| Magnitude Calibration Ratio | `MCR_sample` | Ratio of predicted to actual magnitude | ✅ |
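To make the CRPS/twCRPS rows concrete, here is a pure-NumPy sketch of both scores in their energy form for ensemble (sample-based) forecasts. This illustrates the scoring rules only — it is not the library's implementation, and the chaining function `v(x) = max(x, t)` used for twCRPS is an assumption about how the threshold weighting works here.

```python
import numpy as np

def crps_ensemble(y_true: float, samples) -> float:
    """Ensemble CRPS via the energy form: E|X - y| - 0.5 * E|X - X'|."""
    s = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(s - y_true))
    term2 = 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))
    return float(term1 - term2)

def twcrps_ensemble(y_true: float, samples, threshold: float) -> float:
    """Threshold-weighted CRPS via the chaining function v(x) = max(x, t):
    values below the threshold collapse, so only exceedances are scored."""
    s = np.asarray(samples, dtype=float)
    vy = max(float(y_true), threshold)
    vs = np.maximum(s, threshold)
    term1 = np.mean(np.abs(vs - vy))
    term2 = 0.5 * np.mean(np.abs(vs[:, None] - vs[None, :]))
    return float(term1 - term2)
```

A degenerate ensemble (all members equal) reduces CRPS to absolute error, which is a handy sanity check.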

#### Classification Point Metrics

| Metric | Key | Description | Status |
|--------|-----|-------------|:------:|
| Average Precision | `AP` | Area under precision-recall curve | ✅ |

#### Classification Sample Metrics

| Metric | Key | Description | Status |
|--------|-----|-------------|:------:|
| Continuous Ranked Probability Score | `CRPS` | Calibration and sharpness | ✅ |
| Threshold-Weighted CRPS | `twCRPS` | CRPS emphasizing values above a threshold | ✅ |
| Brier Score | `Brier` | Accuracy of probabilistic binary predictions | ❌ |
| Jeffreys Divergence | `Jeffreys` | Symmetric measure of distribution difference | ❌ |

> **Note:** Metrics marked ❌ are defined in the catalog but not yet implemented — requesting them raises a clear `ValueError`.

---

### 📝 **Configuration Schema**

The `NativeEvaluator` accepts a configuration dictionary (`EvaluationConfig` TypedDict) with the following keys:

| Key | Type | Description |
|:--- |:--- |:--- |
@@ -118,27 +118,29 @@
| `classification_targets` | `List[str]` | List of binary targets (e.g., `['by_sb_best']`). |
| `classification_point_metrics` | `List[str]` | Metrics to compute for classification probability scores. |
| `classification_sample_metrics` | `List[str]` | Metrics to compute for classification sample predictions. |
| `evaluation_profile` | `str` | Named hyperparameter profile (default: `"base"`). See `views_evaluation/profiles/`. |
| `metric_hyperparameters` | `Dict[str, Dict]` | Per-metric overrides that take precedence over the profile. |

#### **Example Configuration:**

```python
config = {
    "steps": [1, 3, 6, 12],
    "regression_targets": ["ged_sb_best"],
    "regression_point_metrics": ["MSE", "RMSLE", "Pearson"],
    "regression_sample_metrics": ["CRPS", "twCRPS", "MIS", "Coverage"],
    "evaluation_profile": "base",  # or "hydranet_ucdp"
    "metric_hyperparameters": {
        "twCRPS": {"threshold": 10.0},  # override profile default
    },
}
```

---

* **Data Integrity Checks**: Validates input arrays for shape consistency, NaN/infinity, and required identifiers.
* **Automatic Index Matching**: `PandasAdapter` aligns actual and predicted values based on MultiIndex structures.
* **Metric Catalog & Profiles**: Hyperparameters are managed through named evaluation profiles with a Chain of Responsibility resolver (model overrides → profile → fail loud).
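The automatic index matching above can be pictured with a small pandas sketch. The index names (`month_id`, `priogrid_gid`), column names, and inner-join behavior are illustrative assumptions, not the adapter's actual code — the point is that actuals and predictions are aligned on the intersection of their MultiIndex cells.

```python
import pandas as pd

# Actuals cover three (month, unit) cells; predictions cover only two.
actual = pd.DataFrame(
    {"ged_sb_best": [5.0, 0.0, 2.0]},
    index=pd.MultiIndex.from_tuples(
        [(501, 1), (501, 2), (502, 1)], names=["month_id", "priogrid_gid"]
    ),
)
pred = pd.DataFrame(
    {"pred_ged_sb_best": [4.0, 1.0]},
    index=pd.MultiIndex.from_tuples(
        [(501, 1), (502, 1)], names=["month_id", "priogrid_gid"]
    ),
)

# Inner join on the MultiIndex keeps only cells present in both frames.
aligned = actual.join(pred, how="inner")
```

After alignment, `aligned` holds matched `(y_true, y_pred)` pairs — the cell `(501, 2)` with no prediction is dropped rather than silently filled.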

---

@@ -153,37 +153,60 @@ pip install views_evaluation
```

---
## 🏗 **Architecture**

The library follows a strict three-layer architecture (ADR-011):

```
Level 0 — Pure Core (NumPy + SciPy only, zero framework imports)
    EvaluationFrame    Canonical data container (y_true, y_pred, identifiers)
    NativeEvaluator    Stateless evaluation engine (month/sequence/step schemas)
    MetricCatalog      Genome registry mapping metrics → functions + required params
    Profiles           Named hyperparameter sets (base, hydranet_ucdp, ...)

Level 1 — Bridge / Adapter
    PandasAdapter      DataFrame → EvaluationFrame conversion (PHASE-3-DELETE)
    EvaluationReport   Results container with DataFrame/dict export

Level 2 — Legacy Orchestrator
    EvaluationManager  Deprecated wrapper; delegates to Level 0
```

**Key design decisions:**
- **ADR-011**: No Pandas/Polars imports in Level 0 — math is framework-agnostic.
- **ADR-013**: Fail-loud — all structural failures raise exceptions with actionable messages, never silently degrade.
- **ADR-042**: Metric catalog — each metric declares its required hyperparameters ("genome"); values are resolved via Chain of Responsibility.
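The Chain of Responsibility resolution named in ADR-042 can be sketched as follows. The function name and dictionary shapes are hypothetical — this only illustrates the precedence order (model overrides → profile → fail loud per ADR-013).

```python
def resolve_hyperparameter(metric, param, model_overrides, profile):
    """Resolve one metric hyperparameter by precedence.

    1. model-level overrides win,
    2. otherwise fall back to the named profile,
    3. otherwise fail loud with an actionable error (ADR-013).
    """
    if param in model_overrides.get(metric, {}):
        return model_overrides[metric][param]
    if param in profile.get(metric, {}):
        return profile[metric][param]
    raise ValueError(
        f"No value configured for {metric}.{param}; "
        "add it to the profile or pass an explicit override."
    )

profile = {"twCRPS": {"threshold": 5.0}}
overrides = {"twCRPS": {"threshold": 10.0}}

resolved = resolve_hyperparameter("twCRPS", "threshold", overrides, profile)
```

Note that a missing value is never silently defaulted — the resolver raises instead, matching the fail-loud contract.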

---

## 🗂 **Project Structure**

```plaintext
views-evaluation/
├── views_evaluation/
│   ├── __init__.py                      # Public API exports
│   ├── adapters/
│   │   └── pandas.py                    # PandasAdapter (PHASE-3-DELETE)
│   ├── evaluation/
│   │   ├── config_schema.py             # EvaluationConfig TypedDict
│   │   ├── evaluation_frame.py          # Core data container
│   │   ├── evaluation_manager.py        # Legacy orchestrator (deprecated)
│   │   ├── evaluation_report.py         # Results container
│   │   ├── metric_catalog.py            # ADR-042 registry + resolver
│   │   ├── metrics.py                   # Typed metric dataclasses
│   │   ├── native_evaluator.py          # Core evaluation engine
│   │   └── native_metric_calculators.py # Metric implementations
│   └── profiles/
│       ├── base.py                      # Standard hyperparameter defaults
│       └── hydranet_ucdp.py             # Domain-specific profile
├── tests/                               # 242 tests (Green/Beige/Red)
├── documentation/
│   ├── ADRs/                            # 17 Architecture Decision Records
│   ├── CICs/                            # Class Intent Contracts
│   ├── integration_guide.md             # Full API walkthrough
│   └── evaluation_concepts.md           # Domain concepts
├── pyproject.toml
└── README.md
```

---
35 changes: 35 additions & 0 deletions documentation/ADRs/000_use_of_adrs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# ADR-000: Use of Architecture Decision Records (ADRs)

**Status:** Accepted
**Date:** 2026-02-25
**Deciders:** Project maintainers, Gemini CLI

---

## Context

The VIEWS Evaluation repository sits at the intersection of evolving research (new metrics, probabilistic scaling) and production stability (Pipeline Core integration).

Significant decisions in such systems are often made under uncertainty and revisited later, leading to regressions or duplicated debate. Without a shared record of *why* decisions were made, we risk:
- Accidental reversals of critical design choices (e.g., re-introducing Pandas into the math core).
- Losing institutional memory as contributors and agents change.

## Decision

We will use **Architecture Decision Records (ADRs)** to document all significant technical, architectural, and conceptual decisions.

- ADRs are stored in the repository under `documentation/ADRs/`.
- ADRs are numbered sequentially and represent a decision, not just a discussion.
- ADRs and code must agree; code that violates an ADR is considered an architectural defect.
- If a decision changes, it is **superseded** by a new ADR, never erased.

## Consequences

### Positive
- Clearer decision-making and fewer repeated debates.
- Easier onboarding for both carbon-based and silicon-based contributors.
- Better long-term coherence through the "Pure Math Engine" refactor.

### Negative
- Small upfront cost in writing and discipline to maintain.
- Forces explicitness where ambiguity may feel easier.