4 changes: 4 additions & 0 deletions documentation/ADRs/011_topology_and_dependency_rules.md
@@ -53,3 +53,7 @@ If a dependency feels “convenient but wrong,” it probably is.
### Negative
- Requires a "middle-man" (Adapter) to convert DataFrames to `EvaluationFrame`.
- Small amount of boilerplate for simple scripts.

### Known Deviations

- **sklearn/scipy in Level 0:** `native_metric_calculators.py` imports `sklearn.metrics` (AP, MTD) and `scipy.stats` (EMD, Pearson) at module level. These 4 of ~25 metrics violate the "no external imports except numpy" claim. The ADR permits `scipy`; `sklearn` is a pragmatic deviation pending pure-NumPy replacements or migration to a Level 1 module. Tracked as risk register C-05 (Tier 3).
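
The pure-NumPy replacement path mentioned above can be sketched for the Pearson case; `pearson_native` is a hypothetical name, and the degenerate-case behavior (NaN on constant input, as the evaluator's own tests document for Pearson) is an assumption of this sketch, not a confirmed contract:

```python
import numpy as np

def pearson_native(y_true, y_pred):
    """Pure-NumPy Pearson correlation (no scipy.stats dependency).

    Returns NaN for constant input, matching the degenerate-case
    behavior the evaluator's tests document for Pearson.
    """
    yt = np.asarray(y_true, dtype=float).ravel()
    yp = np.asarray(y_pred, dtype=float).ravel()
    yt_c = yt - yt.mean()
    yp_c = yp - yp.mean()
    denom = np.sqrt(np.sum(yt_c ** 2) * np.sum(yp_c ** 2))
    if denom == 0.0:
        return float("nan")  # constant data: correlation is undefined
    return float(np.sum(yt_c * yp_c) / denom)
```

A replacement like this would let Pearson stay in Level 0 without the `scipy.stats` import; the sklearn-backed metrics (AP, MTD) are harder to replace and may be better candidates for migration to Level 1.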
4 changes: 4 additions & 0 deletions documentation/ADRs/014_boundary_contracts_and_validation.md
@@ -46,3 +46,7 @@ Validation failures must be logged and raised explicitly (ADR-013). Warnings are
### Negative
- Requires explicit schemas or validation logic.
- Demands greater up-front clarity in configuration.

### Known Deviations

- **NativeEvaluator defers config validation:** `NativeEvaluator.__init__` only validates the profile name. Missing or malformed config keys (`steps`, target lists, metric lists) are not caught until `evaluate()` is called, producing cryptic errors deep in the call stack. This violates Section 2 ("validate at entry, before execution begins"). Tracked as risk register C-02 (Tier 2, High).
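
A minimal sketch of the construction-time validation C-02 calls for; the key names (`steps`, `metrics`) are illustrative assumptions about the config shape, not the confirmed schema — the point is the pattern of validating at entry (Section 2), before execution begins:

```python
def validate_config_shape(config: dict) -> None:
    """Fail fast at __init__ time instead of deep inside evaluate().

    Key names below are illustrative; the real evaluator config may
    differ. The point is the pattern: validate at entry (ADR-014 §2).
    """
    if not isinstance(config, dict) or not config:
        raise ValueError("Evaluator config must be a non-empty dict")
    missing = [k for k in ("steps", "metrics") if k not in config]
    if missing:
        raise ValueError(f"Evaluator config missing keys: {missing}")
    if not isinstance(config["steps"], (list, tuple)) or not config["steps"]:
        raise ValueError("'steps' must be a non-empty sequence of ints")
    if not all(isinstance(s, int) and s >= 1 for s in config["steps"]):
        raise ValueError("'steps' entries must be integers >= 1")
```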
10 changes: 7 additions & 3 deletions documentation/CICs/MetricCatalog.md
@@ -2,7 +2,7 @@

**Status:** Active
**Owner:** Evaluation Core
**Last reviewed:** 2026-03-31
**Last reviewed:** 2026-04-04
**Related ADRs:** ADR-042 (Metric Catalog), ADR-012 (Authority), ADR-013 (Observability)

---
@@ -60,6 +60,8 @@ A genome registry and Chain of Responsibility resolver for evaluation metric hyp
- `ValueError` if a resolved parameter is `None`.
- `ValueError` if overrides contain unknown parameters not in the genome.
- `ValueError` if overrides are provided for a metric with empty genome.
- `ValueError` if a probability/proportion parameter (`alpha`, `quantile`, `lower_quantile`, `upper_quantile`) is not in the open interval (0, 1).
- `ValueError` if `lower_quantile >= upper_quantile` for metrics requiring both (e.g. QIS).

All failures are immediate and explicit. No warnings, no fallbacks, no silent degradation.

@@ -109,7 +111,8 @@ params = resolve_metric_params("MSE", {}, BASE_PROFILE)
- **Green:** `tests/test_metric_catalog.py` — registry snapshot integrity, resolver happy path, genome completeness checks.
- **Beige:** `tests/test_metric_catalog.py` — partial overrides, profile-only resolution, edge case param values.
- **Red:** `tests/test_metric_catalog.py` — unknown metrics, unimplemented metrics, missing params, None values, unknown overrides.
- **Correctness:** `tests/test_metric_correctness.py` — golden-value tests (5 tests; coverage gap noted).
- **Red (bounds):** `tests/test_metric_catalog.py::TestResolveMetricParamsBoundsRed` — 7 tests for out-of-range alpha/quantile and crossed QIS quantiles.
- **Correctness:** `tests/test_metric_calculators.py::TestGoldenValues` — 17 golden-value tests for all implemented metrics.

---

@@ -118,13 +121,14 @@ params = resolve_metric_params("MSE", {}, BASE_PROFILE)
- New metrics are added by: (1) implementing the function in `native_metric_calculators.py`, (2) adding a `MetricSpec` to `METRIC_CATALOG`, (3) adding to `METRIC_MEMBERSHIP`, (4) adding genome values to relevant profiles, (5) adding a field to the typed metrics dataclass in `metrics.py`.
- The legacy dispatch dicts were removed in Phase 3. `METRIC_MEMBERSHIP` is the single source of truth.
- Profile structure is stable; new profiles are added by creating a new file in `profiles/`.
- Bounds validation added for probability/proportion parameters (2026-04-04, C-18): `alpha`, `quantile`, `lower_quantile`, `upper_quantile` must be in (0, 1). Cross-parameter validation for QIS quantile ordering.
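
The five-step recipe above can be sketched with simplified stand-ins; only the `function` field of `MetricSpec` is confirmed by the code elsewhere in this PR, so the `genome` field name and the membership structure here are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

@dataclass(frozen=True)
class MetricSpecSketch:
    """Simplified stand-in for the real MetricSpec dataclass."""
    function: Callable
    genome: Tuple[str, ...] = ()  # hyperparameter names; empty if none

def calculate_mae_native(y_true, y_pred):
    # Step 1: implement the metric function (pure Python/NumPy style)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

CATALOG_SKETCH: Dict[str, MetricSpecSketch] = {}
MEMBERSHIP_SKETCH: Dict[str, Set[str]] = {"regression_point": set()}

# Steps 2-3: register the spec in the catalog and the membership map.
# (Steps 4-5 — profile genome values and the typed dataclass field —
# live in the profile files and metrics.py, not sketched here.)
CATALOG_SKETCH["MAE"] = MetricSpecSketch(function=calculate_mae_native)
MEMBERSHIP_SKETCH["regression_point"].add("MAE")
```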

---

## 12. Known Deviations

- **No profile completeness validation:** There is no mechanism to verify that a profile provides values for all metrics with non-empty genomes. A profile missing a metric's params will only fail at evaluation time, not at profile registration.
- **Weak golden-value coverage:** Only 5 tests in `test_metric_correctness.py` verify metric functions against independently computed known answers. Most metrics lack this verification (see risk register C-07).
- **Golden-value coverage complete:** 17 tests in `tests/test_metric_calculators.py::TestGoldenValues` plus 8 Brier/QS golden-value tests cover all implemented metrics (C-07 closed 2026-04-02).
- **Breaking rename:** The legacy `Brier` metric (unimplemented placeholder) was replaced by `Brier_sample` and `Brier_point` (implemented). The field in `ClassificationSampleEvaluationMetrics` was renamed from `Brier` to `Brier_sample`. External consumers accessing `.Brier` on classification sample results must update to `.Brier_sample`.

---
4 changes: 3 additions & 1 deletion documentation/CICs/NativeEvaluator.md
@@ -2,7 +2,7 @@

**Status:** Active
**Owner:** Evaluation Core
**Last reviewed:** 2026-04-02
**Last reviewed:** 2026-04-04
**Related ADRs:** ADR-010 (Ontology), ADR-011 (Topology), ADR-032 (Schemas), ADR-042 (Metric Catalog)

---
@@ -105,6 +105,8 @@ schema = report.get_schema_results('time_series') # dict → typed metrics data
- `legacy_compatibility` default was flipped to `False` in Phase 3. The flag is retained for callers that need truncation behavior.
- Config validation may be added to `__init__` to catch structural config errors at construction time rather than at evaluation time (currently a known gap — risk register C-02).
- The `EvaluationReport` return type is stable; the internal `_calculate_metrics` dispatch may evolve as the `MetricCatalog` grows.
- Exception wrapping added to `_calculate_metrics()` (2026-04-04, C-16): metric function exceptions are now caught and re-raised as `ValueError` naming the metric, task, and pred_type. Test: `test_metric_function_error_includes_metric_name`.
- Step sentinel changed from hardcoded `999` to `float('inf')` (2026-04-04, C-17): steps >= 1000 are no longer silently dropped. Test: `test_step_values_above_999_not_silently_dropped`.
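
The sentinel change can be illustrated with a minimal filter; the comparison shape is a sketch of the bug class (C-17), not the evaluator's actual loop:

```python
steps = [1, 2, 999, 1000, 1001]

# Old behavior: a hardcoded cap of 999 silently drops steps >= 1000
old_cap = 999
kept_old = [s for s in steps if s <= old_cap]

# New behavior: float('inf') never filters out a real step value
new_cap = float("inf")
kept_new = [s for s in steps if s <= new_cap]
```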

---

20 changes: 20 additions & 0 deletions tests/test_evaluation_frame.py
@@ -214,6 +214,26 @@ def test_select_indices_with_full_index_range(self):

class TestEvaluationFrameRed:

def test_object_dtype_y_true_raises(self):
"""Object-dtype y_true must be rejected — Pure NumPy contract (ADR-011)."""
n = 2
with pytest.raises(ValueError, match="numeric"):
EvaluationFrame(
y_true=np.array([[1, 2], [3, 4]], dtype=object),
y_pred=np.ones((n, 1)),
identifiers=_make_identifiers(n),
)

def test_object_dtype_y_pred_raises(self):
"""Object-dtype y_pred must be rejected — Pure NumPy contract (ADR-011)."""
n = 2
with pytest.raises(ValueError, match="numeric"):
EvaluationFrame(
y_true=np.ones(n),
y_pred=np.array([[1, 2], [3, 4]], dtype=object),
identifiers=_make_identifiers(n),
)

def test_y_pred_row_mismatch_raises(self):
with pytest.raises(ValueError, match="mismatch"):
EvaluationFrame(np.ones(5), np.ones((4, 1)), _make_identifiers(5))
17 changes: 14 additions & 3 deletions tests/test_evaluation_report.py
@@ -6,11 +6,11 @@
BEIGE — empty schema, multiple metrics per group, raw schema passthrough
RED — unknown schema key, invalid task/pred_type combination
"""
import pandas as pd
import pytest
pd = pytest.importorskip("pandas")

from views_evaluation.evaluation.evaluation_report import EvaluationReport
from views_evaluation.evaluation.metrics import (
from views_evaluation.evaluation.evaluation_report import EvaluationReport # noqa: E402
from views_evaluation.evaluation.metrics import ( # noqa: E402
RegressionPointEvaluationMetrics,
RegressionSampleEvaluationMetrics,
ClassificationPointEvaluationMetrics,
@@ -198,6 +198,17 @@ def test_all_four_task_pred_type_combinations_resolve_correctly(self):

class TestEvaluationReportRed:

def test_non_dict_schema_value_fails_at_access(self):
"""Malformed result dict: schema value is a string, not a dict.

Construction succeeds (no deep validation), but get_schema_results
fails when it tries to iterate the non-dict value.
"""
results = {'month': 'not_a_dict', 'time_series': {}, 'step': {}}
report = EvaluationReport('t', 'regression', 'point', results)
with pytest.raises(AttributeError):
report.get_schema_results('month')

def test_get_schema_results_unknown_schema_raises_key_error(self):
report = EvaluationReport('t', 'regression', 'point', {})
with pytest.raises(KeyError, match="nonexistent"):
42 changes: 42 additions & 0 deletions tests/test_metric_catalog.py
@@ -119,6 +119,48 @@ def test_overrides_on_empty_genome_raises(self):
resolve_metric_params("MSE", {"spurious": 1.0}, BASE_PROFILE)


# ---------------------------------------------------------------------------
# Red: bounds validation for hyperparameters (C-18)
# ---------------------------------------------------------------------------

class TestResolveMetricParamsBoundsRed:
"""Bounds validation for metric hyperparameters."""

def test_alpha_above_one_raises(self):
"""Coverage alpha must be in (0, 1)."""
with pytest.raises(ValueError, match="alpha"):
resolve_metric_params("Coverage", {"alpha": 1.5}, BASE_PROFILE)

def test_alpha_zero_raises(self):
"""Coverage alpha=0 would cause division by zero in MIS."""
with pytest.raises(ValueError, match="alpha"):
resolve_metric_params("MIS", {"alpha": 0.0}, BASE_PROFILE)

def test_alpha_negative_raises(self):
with pytest.raises(ValueError, match="alpha"):
resolve_metric_params("Coverage", {"alpha": -0.1}, BASE_PROFILE)

def test_quantile_above_one_raises(self):
"""QS quantile must be in (0, 1)."""
with pytest.raises(ValueError, match="quantile"):
resolve_metric_params("QS_sample", {"quantile": 1.0}, BASE_PROFILE)

def test_quantile_zero_raises(self):
with pytest.raises(ValueError, match="quantile"):
resolve_metric_params("QS_sample", {"quantile": 0.0}, BASE_PROFILE)

def test_quantile_negative_raises(self):
with pytest.raises(ValueError, match="quantile"):
resolve_metric_params("QS_point", {"quantile": -0.5}, BASE_PROFILE)

def test_qis_lower_quantile_above_upper_raises(self):
"""QIS lower_quantile must be < upper_quantile."""
with pytest.raises(ValueError, match="quantile"):
resolve_metric_params(
"QIS", {"lower_quantile": 0.9, "upper_quantile": 0.1}, BASE_PROFILE
)


# ---------------------------------------------------------------------------
# Beige: structural integrity
# ---------------------------------------------------------------------------
95 changes: 95 additions & 0 deletions tests/test_native_evaluator.py
@@ -347,6 +347,83 @@ def test_evaluate_twice_produces_identical_results(self):
report2 = evaluator.evaluate(ef)
assert report1.to_dict() == report2.to_dict()

def test_step_values_above_999_not_silently_dropped(self):
"""Steps >= 1000 must not be silently dropped by a hardcoded sentinel (C-17)."""
n = 4
ef = EvaluationFrame(
y_true=np.zeros(n),
y_pred=np.zeros((n, 1)),
identifiers={
'time': np.array([2000, 2000, 2001, 2001]),
'unit': np.array([1, 2, 1, 2]),
'origin': np.zeros(n, dtype=int),
'step': np.array([1000, 1000, 1001, 1001]),
},
metadata={'target': 'test_target'},
)
config = _regression_point_config(steps=[1000, 1001])
report = NativeEvaluator(config).evaluate(ef, legacy_compatibility=False)
step_results = report.to_dict()['schemas']['step']
# Metrics must be computed (non-empty dict), not just pre-initialized
assert 'MSE' in step_results.get('step1000', {}), \
"Step 1000 metrics were silently dropped by sentinel"
assert 'MSE' in step_results.get('step1001', {}), \
"Step 1001 metrics were silently dropped by sentinel"

def test_nan_metric_result_is_finite_checkable(self):
"""Metric results that are NaN (e.g., Pearson on constant data) must be
detectable via np.isfinite. This documents that NaN can appear in results
when data is degenerate, and callers should check."""
n = 4
ef = EvaluationFrame(
y_true=np.array([1.0, 1.0, 1.0, 1.0]), # constant → Pearson = NaN
y_pred=np.array([[1.0], [1.0], [1.0], [1.0]]),
identifiers={
'time': np.array([100, 100, 101, 101]),
'unit': np.array([1, 2, 1, 2]),
'origin': np.zeros(n, dtype=int),
'step': np.array([1, 1, 2, 2]),
},
metadata={'target': 'test_target'},
)
config = _regression_point_config(steps=[1, 2], metrics=['Pearson'])
report = NativeEvaluator(config).evaluate(ef)
month_results = report.to_dict()['schemas']['month']
pearson_val = month_results['month100']['Pearson']
assert np.isnan(pearson_val), "Pearson on constant data should be NaN"
# Callers can detect this with np.isfinite
assert not np.isfinite(pearson_val)

def test_cross_schema_consistency_mse_values(self):
"""MSE computed via month-wise on a single-month window must equal
step-wise MSE for the same data slice."""
# Single origin, single step, single month → all schemas see same data
n = 4
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([[1.5], [2.5], [3.5], [4.5]])
ef = EvaluationFrame(
y_true=y_true,
y_pred=y_pred,
identifiers={
'time': np.array([100, 100, 100, 100]),
'unit': np.array([1, 2, 3, 4]),
'origin': np.zeros(n, dtype=int),
'step': np.ones(n, dtype=int),
},
metadata={'target': 'test_target'},
)
config = _regression_point_config(steps=[1], metrics=['MSE'])
report = NativeEvaluator(config).evaluate(ef)
schemas = report.to_dict()['schemas']
mse_month = schemas['month']['month100']['MSE']
mse_step = schemas['step']['step01']['MSE']
mse_ts = schemas['time_series']['ts00']['MSE']
# All three schemas see the same 4 observations → same MSE
assert mse_month == pytest.approx(mse_step, abs=1e-12)
assert mse_month == pytest.approx(mse_ts, abs=1e-12)
# And the value is correct: mean((0.5)^2) = 0.25
assert mse_month == pytest.approx(0.25, abs=1e-12)

def test_sample_predictions_produce_point_pred_type_false(self):
n = 4
ef = EvaluationFrame(
@@ -418,6 +495,24 @@ def test_empty_config_accepted_at_init_fails_at_evaluate(self):
evaluator = NativeEvaluator({}) # does NOT raise — C-02
with pytest.raises((ValueError, KeyError)):
evaluator.evaluate(ef)

def test_metric_function_error_includes_metric_name(self):
"""When a metric function raises, the error message must name the metric (C-16)."""
import dataclasses
from unittest.mock import patch, MagicMock
from views_evaluation.evaluation.metric_catalog import METRIC_CATALOG

ef = _make_parallelogram_ef(n_origins=1, n_steps=2, n_units=2)
config = _regression_point_config(steps=[1, 2], metrics=['MSE'])

# Inject a failure into MSE's function
original_spec = METRIC_CATALOG['MSE']
broken_fn = MagicMock(side_effect=RuntimeError("sklearn internal error"))
broken_spec = dataclasses.replace(original_spec, function=broken_fn)
with patch.dict(METRIC_CATALOG, {'MSE': broken_spec}):
with pytest.raises(ValueError, match="MSE"):
NativeEvaluator(config).evaluate(ef)

def test_classification_metric_on_regression_target_raises(self):
"""AP is only valid for classification; using it with regression_targets must fail."""
ef = _make_parallelogram_ef(n_origins=1, n_steps=2, n_units=2)
6 changes: 6 additions & 0 deletions views_evaluation/evaluation/evaluation_frame.py
@@ -28,6 +28,12 @@ def _validate(y_true: np.ndarray, y_pred: np.ndarray, identifiers: Dict[str, np.
if y_pred.shape[0] != n_rows:
raise ValueError(f"y_pred rows ({y_pred.shape[0]}) mismatch y_true ({n_rows})")

# ADR-011: Pure NumPy contract — reject object-dtype arrays
if y_true.dtype == object:
raise ValueError("y_true must be numeric (got dtype=object)")
if y_pred.dtype == object:
raise ValueError("y_pred must be numeric (got dtype=object)")

# Rectangular sample validation: y_pred must be a dense 2D array
if y_pred.ndim != 2:
raise ValueError(
20 changes: 20 additions & 0 deletions views_evaluation/evaluation/metric_catalog.py
@@ -41,6 +41,9 @@
calculate_jeffreys_native,
)

# Parameters that must be in the open interval (0, 1)
_UNIT_INTERVAL_EXCLUSIVE = {"alpha", "quantile", "lower_quantile", "upper_quantile"}


@dataclass(frozen=True)
class MetricSpec:
@@ -192,4 +195,21 @@ def resolve_metric_params(
f"All hyperparameters must be explicitly set."
)

# Bounds validation for probability/proportion parameters
for param, value in resolved.items():
if param in _UNIT_INTERVAL_EXCLUSIVE:
if not (0 < value < 1):
raise ValueError(
f"Metric '{metric_name}' parameter '{param}' must be in (0, 1), "
f"got {value}."
)

# Cross-parameter validation
if "lower_quantile" in resolved and "upper_quantile" in resolved:
if resolved["lower_quantile"] >= resolved["upper_quantile"]:
raise ValueError(
f"Metric '{metric_name}': lower_quantile ({resolved['lower_quantile']}) "
f"must be less than upper_quantile ({resolved['upper_quantile']})."
)

return resolved
9 changes: 7 additions & 2 deletions views_evaluation/evaluation/native_evaluator.py
@@ -70,7 +70,12 @@ def _calculate_metrics(self, ef: EvaluationFrame, metrics_list: List[str],
spec = METRIC_CATALOG[m]
overrides = self.metric_overrides.get(m, {})
resolved = resolve_metric_params(m, overrides, self.profile)
results[m] = spec.function(ef.y_true, ef.y_pred, **resolved)
try:
results[m] = spec.function(ef.y_true, ef.y_pred, **resolved)
except Exception as e:
raise ValueError(
f"Metric '{m}' failed for ({task}, {pred_type}): {e}"
) from e
return results

def evaluate(self, ef: EvaluationFrame, legacy_compatibility: bool = False) -> EvaluationReport:
@@ -102,7 +107,7 @@ def evaluate(self, ef: EvaluationFrame, legacy_compatibility: bool = False) -> E
step_results = {f"step{str(s).zfill(2)}": {} for s in config_steps}

# LEGACY PARITY: Truncate steps to the shortest sequence length if in compat mode
max_allowed_step = 999
max_allowed_step = float('inf')
if legacy_compatibility:
origin_indices = ef.get_group_indices('origin')
seq_lengths = []