From 2578532394efc6733f6d23d3c5640c15ac8343ee Mon Sep 17 00:00:00 2001 From: Polichinl Date: Sat, 4 Apr 2026 02:07:27 +0200 Subject: [PATCH 1/3] =?UTF-8?q?fix:=20close=203=20risk=20register=20concer?= =?UTF-8?q?ns=20=E2=80=94=20bounds=20validation,=20exception=20context,=20?= =?UTF-8?q?step=20sentinel?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Address C-16, C-17, C-18 identified by risk register review with TDD: - C-16: Wrap metric function calls in _calculate_metrics() with try/except that re-raises as ValueError naming the metric, task, and pred_type - C-17: Replace hardcoded max_allowed_step=999 with float('inf') so steps >= 1000 are not silently dropped - C-18: Add bounds validation in resolve_metric_params() for alpha, quantile, lower_quantile, upper_quantile — all must be in (0, 1). Cross-validation for QIS lower_quantile < upper_quantile Also: update CICs (MetricCatalog, NativeEvaluator) and ADRs (011, 014) with Known Deviations sections documenting C-02 and C-05. Close C-14 (stale editable install metadata). Upgrade C-02 from Tier 3 to Tier 2. 9 new tests, 240 total passing. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- .../ADRs/011_topology_and_dependency_rules.md | 4 ++ .../014_boundary_contracts_and_validation.md | 4 ++ documentation/CICs/MetricCatalog.md | 10 +++-- documentation/CICs/NativeEvaluator.md | 4 +- tests/test_metric_catalog.py | 42 +++++++++++++++++++ tests/test_native_evaluator.py | 41 ++++++++++++++++++ views_evaluation/evaluation/metric_catalog.py | 20 +++++++++ .../evaluation/native_evaluator.py | 9 +++- 8 files changed, 128 insertions(+), 6 deletions(-) diff --git a/documentation/ADRs/011_topology_and_dependency_rules.md b/documentation/ADRs/011_topology_and_dependency_rules.md index b482fc7..92622c2 100644 --- a/documentation/ADRs/011_topology_and_dependency_rules.md +++ b/documentation/ADRs/011_topology_and_dependency_rules.md @@ -53,3 +53,7 @@ If a dependency feels “convenient but wrong,” it probably is. ### Negative - Requires a "middle-man" (Adapter) to convert DataFrames to `EvaluationFrame`. - Small amount of boilerplate for simple scripts. + +### Known Deviations + +- **sklearn/scipy in Level 0:** `native_metric_calculators.py` imports `sklearn.metrics` (AP, MTD) and `scipy.stats` (EMD, Pearson) at module level. These 4 of ~25 metrics violate the "no external imports except numpy" claim. The ADR permits `scipy`; `sklearn` is a pragmatic deviation pending pure-NumPy replacements or migration to a Level 1 module. Tracked as risk register C-05 (Tier 3). diff --git a/documentation/ADRs/014_boundary_contracts_and_validation.md b/documentation/ADRs/014_boundary_contracts_and_validation.md index dc46108..f8b1b30 100644 --- a/documentation/ADRs/014_boundary_contracts_and_validation.md +++ b/documentation/ADRs/014_boundary_contracts_and_validation.md @@ -46,3 +46,7 @@ Validation failures must be logged and raised explicitly (ADR-013). Warnings are ### Negative - Requires explicit schemas or validation logic. - Increases up-front configuration clarity requirements. 
+ +### Known Deviations + +- **NativeEvaluator defers config validation:** `NativeEvaluator.__init__` only validates the profile name. Missing or malformed config keys (`steps`, target lists, metric lists) are not caught until `evaluate()` is called, producing cryptic errors deep in the call stack. This violates Section 2 ("validate at entry, before execution begins"). Tracked as risk register C-02 (Tier 2, High). diff --git a/documentation/CICs/MetricCatalog.md b/documentation/CICs/MetricCatalog.md index c4040c1..25b4d6c 100644 --- a/documentation/CICs/MetricCatalog.md +++ b/documentation/CICs/MetricCatalog.md @@ -2,7 +2,7 @@ **Status:** Active **Owner:** Evaluation Core -**Last reviewed:** 2026-03-31 +**Last reviewed:** 2026-04-04 **Related ADRs:** ADR-042 (Metric Catalog), ADR-012 (Authority), ADR-013 (Observability) --- @@ -60,6 +60,8 @@ A genome registry and Chain of Responsibility resolver for evaluation metric hyp - `ValueError` if a resolved parameter is `None`. - `ValueError` if overrides contain unknown parameters not in the genome. - `ValueError` if overrides are provided for a metric with empty genome. +- `ValueError` if a probability/proportion parameter (`alpha`, `quantile`, `lower_quantile`, `upper_quantile`) is not in the open interval (0, 1). +- `ValueError` if `lower_quantile >= upper_quantile` for metrics requiring both (e.g. QIS). All failures are immediate and explicit. No warnings, no fallbacks, no silent degradation. @@ -109,7 +111,8 @@ params = resolve_metric_params("MSE", {}, BASE_PROFILE) - **Green:** `tests/test_metric_catalog.py` — registry snapshot integrity, resolver happy path, genome completeness checks. - **Beige:** `tests/test_metric_catalog.py` — partial overrides, profile-only resolution, edge case param values. - **Red:** `tests/test_metric_catalog.py` — unknown metrics, unimplemented metrics, missing params, None values, unknown overrides. 
-- **Correctness:** `tests/test_metric_correctness.py` — golden-value tests (5 tests; coverage gap noted). +- **Red (bounds):** `tests/test_metric_catalog.py::TestResolveMetricParamsBoundsRed` — 7 tests for out-of-range alpha/quantile and crossed QIS quantiles. +- **Correctness:** `tests/test_metric_calculators.py::TestGoldenValues` — 17 golden-value tests for all implemented metrics. --- @@ -118,13 +121,14 @@ params = resolve_metric_params("MSE", {}, BASE_PROFILE) - New metrics are added by: (1) implementing the function in `native_metric_calculators.py`, (2) adding a `MetricSpec` to `METRIC_CATALOG`, (3) adding to `METRIC_MEMBERSHIP`, (4) adding genome values to relevant profiles, (5) adding a field to the typed metrics dataclass in `metrics.py`. - The legacy dispatch dicts were removed in Phase 3. `METRIC_MEMBERSHIP` is the single source of truth. - Profile structure is stable; new profiles are added by creating a new file in `profiles/`. +- Bounds validation added for probability/proportion parameters (2026-04-04, C-18): `alpha`, `quantile`, `lower_quantile`, `upper_quantile` must be in (0, 1). Cross-parameter validation for QIS quantile ordering. --- ## 12. Known Deviations - **No profile completeness validation:** There is no mechanism to verify that a profile provides values for all metrics with non-empty genomes. A profile missing a metric's params will only fail at evaluation time, not at profile registration. -- **Weak golden-value coverage:** Only 5 tests in `test_metric_correctness.py` verify metric functions against independently computed known answers. Most metrics lack this verification (see risk register C-07). +- **Golden-value coverage complete:** 17 tests in `tests/test_metric_calculators.py::TestGoldenValues` plus 8 Brier/QS golden-value tests cover all implemented metrics (C-07 closed 2026-04-02). - **Breaking rename:** The legacy `Brier` metric (unimplemented placeholder) was replaced by `Brier_sample` and `Brier_point` (implemented). 
The field in `ClassificationSampleEvaluationMetrics` was renamed from `Brier` to `Brier_sample`. External consumers accessing `.Brier` on classification sample results must update to `.Brier_sample`. --- diff --git a/documentation/CICs/NativeEvaluator.md b/documentation/CICs/NativeEvaluator.md index e69011d..fed3dc6 100644 --- a/documentation/CICs/NativeEvaluator.md +++ b/documentation/CICs/NativeEvaluator.md @@ -2,7 +2,7 @@ **Status:** Active **Owner:** Evaluation Core -**Last reviewed:** 2026-04-02 +**Last reviewed:** 2026-04-04 **Related ADRs:** ADR-010 (Ontology), ADR-011 (Topology), ADR-032 (Schemas), ADR-042 (Metric Catalog) --- @@ -105,6 +105,8 @@ schema = report.get_schema_results('time_series') # dict → typed metrics data - `legacy_compatibility` default was flipped to `False` in Phase 3. The flag is retained for callers that need truncation behavior. - Config validation may be added to `__init__` to catch structural config errors at construction time rather than at evaluation time (currently a known gap — risk register C-02). - The `EvaluationReport` return type is stable; the internal `_calculate_metrics` dispatch may evolve as the `MetricCatalog` grows. +- Exception wrapping added to `_calculate_metrics()` (2026-04-04, C-16): metric function exceptions are now caught and re-raised as `ValueError` naming the metric, task, and pred_type. Test: `test_metric_function_error_includes_metric_name`. +- Step sentinel changed from hardcoded `999` to `float('inf')` (2026-04-04, C-17): steps >= 1000 are no longer silently dropped. Test: `test_step_values_above_999_not_silently_dropped`. 
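[editor note] The C-16 exception-wrapping pattern described above can be sketched in isolation. This is a hypothetical standalone version — the real implementation lives in `NativeEvaluator._calculate_metrics()`, and the function and argument names here are illustrative only:

```python
# Sketch of the C-16 pattern: any exception raised by a metric function is
# re-raised as a ValueError that names the metric, task, and pred_type,
# with the original exception chained via `from e`.

def calculate_metrics(metric_fns, y_true, y_pred, task, pred_type):
    results = {}
    for name, fn in metric_fns.items():
        try:
            results[name] = fn(y_true, y_pred)
        except Exception as e:
            # Re-raise with context so the failing metric is identifiable
            # without digging through the call stack.
            raise ValueError(
                f"Metric '{name}' failed for ({task}, {pred_type}): {e}"
            ) from e
    return results
```

Because the original exception is chained with `from e`, its traceback remains reachable via `__cause__`, so the underlying sklearn/scipy error is preserved rather than swallowed.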
--- diff --git a/tests/test_metric_catalog.py b/tests/test_metric_catalog.py index e621ce0..e1bec4e 100644 --- a/tests/test_metric_catalog.py +++ b/tests/test_metric_catalog.py @@ -119,6 +119,48 @@ def test_overrides_on_empty_genome_raises(self): resolve_metric_params("MSE", {"spurious": 1.0}, BASE_PROFILE) +# --------------------------------------------------------------------------- +# Red: bounds validation for hyperparameters (C-18) +# --------------------------------------------------------------------------- + +class TestResolveMetricParamsBoundsRed: + """Bounds validation for metric hyperparameters.""" + + def test_alpha_above_one_raises(self): + """Coverage alpha must be in (0, 1).""" + with pytest.raises(ValueError, match="alpha"): + resolve_metric_params("Coverage", {"alpha": 1.5}, BASE_PROFILE) + + def test_alpha_zero_raises(self): + """MIS alpha=0 would cause division by zero.""" + with pytest.raises(ValueError, match="alpha"): + resolve_metric_params("MIS", {"alpha": 0.0}, BASE_PROFILE) + + def test_alpha_negative_raises(self): + with pytest.raises(ValueError, match="alpha"): + resolve_metric_params("Coverage", {"alpha": -0.1}, BASE_PROFILE) + + def test_quantile_above_one_raises(self): + """QS quantile must be in (0, 1).""" + with pytest.raises(ValueError, match="quantile"): + resolve_metric_params("QS_sample", {"quantile": 1.0}, BASE_PROFILE) + + def test_quantile_zero_raises(self): + with pytest.raises(ValueError, match="quantile"): + resolve_metric_params("QS_sample", {"quantile": 0.0}, BASE_PROFILE) + + def test_quantile_negative_raises(self): + with pytest.raises(ValueError, match="quantile"): + resolve_metric_params("QS_point", {"quantile": -0.5}, BASE_PROFILE) + + def test_qis_lower_quantile_above_upper_raises(self): + """QIS lower_quantile must be < upper_quantile.""" + with pytest.raises(ValueError, match="quantile"): + resolve_metric_params( + "QIS", {"lower_quantile": 0.9, "upper_quantile": 0.1}, BASE_PROFILE + ) + + # 
--------------------------------------------------------------------------- # Beige: structural integrity # --------------------------------------------------------------------------- diff --git a/tests/test_native_evaluator.py b/tests/test_native_evaluator.py index a0fa14f..269db92 100644 --- a/tests/test_native_evaluator.py +++ b/tests/test_native_evaluator.py @@ -347,6 +347,29 @@ def test_evaluate_twice_produces_identical_results(self): report2 = evaluator.evaluate(ef) assert report1.to_dict() == report2.to_dict() + def test_step_values_above_999_not_silently_dropped(self): + """Steps >= 1000 must not be silently dropped by a hardcoded sentinel (C-17).""" + n = 4 + ef = EvaluationFrame( + y_true=np.zeros(n), + y_pred=np.zeros((n, 1)), + identifiers={ + 'time': np.array([2000, 2000, 2001, 2001]), + 'unit': np.array([1, 2, 1, 2]), + 'origin': np.zeros(n, dtype=int), + 'step': np.array([1000, 1000, 1001, 1001]), + }, + metadata={'target': 'test_target'}, + ) + config = _regression_point_config(steps=[1000, 1001]) + report = NativeEvaluator(config).evaluate(ef, legacy_compatibility=False) + step_results = report.to_dict()['schemas']['step'] + # Metrics must be computed (non-empty dict), not just pre-initialized + assert 'MSE' in step_results.get('step1000', {}), \ + "Step 1000 metrics were silently dropped by sentinel" + assert 'MSE' in step_results.get('step1001', {}), \ + "Step 1001 metrics were silently dropped by sentinel" + def test_sample_predictions_produce_point_pred_type_false(self): n = 4 ef = EvaluationFrame( @@ -418,6 +441,24 @@ def test_empty_config_accepted_at_init_fails_at_evaluate(self): evaluator = NativeEvaluator({}) # does NOT raise — C-02 with pytest.raises((ValueError, KeyError)): evaluator.evaluate(ef) + + def test_metric_function_error_includes_metric_name(self): + """When a metric function raises, the error message must name the metric (C-16).""" + import dataclasses + from unittest.mock import patch, MagicMock + from 
views_evaluation.evaluation.metric_catalog import METRIC_CATALOG + + ef = _make_parallelogram_ef(n_origins=1, n_steps=2, n_units=2) + config = _regression_point_config(steps=[1, 2], metrics=['MSE']) + + # Inject a failure into MSE's function + original_spec = METRIC_CATALOG['MSE'] + broken_fn = MagicMock(side_effect=RuntimeError("sklearn internal error")) + broken_spec = dataclasses.replace(original_spec, function=broken_fn) + with patch.dict(METRIC_CATALOG, {'MSE': broken_spec}): + with pytest.raises(ValueError, match="MSE"): + NativeEvaluator(config).evaluate(ef) + def test_classification_metric_on_regression_target_raises(self): """AP is only valid for classification; using it with regression_targets must fail.""" ef = _make_parallelogram_ef(n_origins=1, n_steps=2, n_units=2) diff --git a/views_evaluation/evaluation/metric_catalog.py b/views_evaluation/evaluation/metric_catalog.py index 7eae282..ed6f50a 100644 --- a/views_evaluation/evaluation/metric_catalog.py +++ b/views_evaluation/evaluation/metric_catalog.py @@ -41,6 +41,9 @@ calculate_jeffreys_native, ) +# Parameters that must be in the open interval (0, 1) +_UNIT_INTERVAL_EXCLUSIVE = {"alpha", "quantile", "lower_quantile", "upper_quantile"} + @dataclass(frozen=True) class MetricSpec: @@ -192,4 +195,21 @@ def resolve_metric_params( f"All hyperparameters must be explicitly set." ) + # Bounds validation for probability/proportion parameters + for param, value in resolved.items(): + if param in _UNIT_INTERVAL_EXCLUSIVE: + if not (0 < value < 1): + raise ValueError( + f"Metric '{metric_name}' parameter '{param}' must be in (0, 1), " + f"got {value}." + ) + + # Cross-parameter validation + if "lower_quantile" in resolved and "upper_quantile" in resolved: + if resolved["lower_quantile"] >= resolved["upper_quantile"]: + raise ValueError( + f"Metric '{metric_name}': lower_quantile ({resolved['lower_quantile']}) " + f"must be less than upper_quantile ({resolved['upper_quantile']})." 
+ ) + return resolved diff --git a/views_evaluation/evaluation/native_evaluator.py b/views_evaluation/evaluation/native_evaluator.py index bfb9a82..1241cfe 100644 --- a/views_evaluation/evaluation/native_evaluator.py +++ b/views_evaluation/evaluation/native_evaluator.py @@ -70,7 +70,12 @@ def _calculate_metrics(self, ef: EvaluationFrame, metrics_list: List[str], spec = METRIC_CATALOG[m] overrides = self.metric_overrides.get(m, {}) resolved = resolve_metric_params(m, overrides, self.profile) - results[m] = spec.function(ef.y_true, ef.y_pred, **resolved) + try: + results[m] = spec.function(ef.y_true, ef.y_pred, **resolved) + except Exception as e: + raise ValueError( + f"Metric '{m}' failed for ({task}, {pred_type}): {e}" + ) from e return results def evaluate(self, ef: EvaluationFrame, legacy_compatibility: bool = False) -> EvaluationReport: @@ -102,7 +107,7 @@ def evaluate(self, ef: EvaluationFrame, legacy_compatibility: bool = False) -> E step_results = {f"step{str(s).zfill(2)}": {} for s in config_steps} # LEGACY PARITY: Truncate steps to the shortest sequence length if in compat mode - max_allowed_step = 999 + max_allowed_step = float('inf') if legacy_compatibility: origin_indices = ef.get_group_indices('origin') seq_lengths = [] From 904880b02d7a97e77e255a339b296a521d33615b Mon Sep 17 00:00:00 2001 From: Polichinl Date: Sat, 4 Apr 2026 02:26:17 +0200 Subject: [PATCH 2/3] =?UTF-8?q?test:=20close=20test=20gaps=20and=20remove?= =?UTF-8?q?=20dead=20code=20=E2=80=94=20object-dtype=20guard,=20cross-sche?= =?UTF-8?q?ma=20consistency,=20C-10?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add object-dtype rejection to EvaluationFrame._validate() (ADR-011 Pure NumPy contract) - Remove 22 lines of dead pandas/object-dtype branches from _guard_shapes (closes C-10) - Add 5 new tests: object-dtype rejection (2), malformed report dict (1), NaN metric detectability (1), cross-schema MSE consistency (1) - 245 tests passing, risk 
register: 3 open concerns remain Co-Authored-By: Claude Opus 4.6 (1M context) --- tests/test_evaluation_frame.py | 20 +++++++ tests/test_evaluation_report.py | 11 ++++ tests/test_native_evaluator.py | 54 +++++++++++++++++++ .../evaluation/evaluation_frame.py | 6 +++ .../evaluation/native_metric_calculators.py | 29 ++-------- 5 files changed, 95 insertions(+), 25 deletions(-) diff --git a/tests/test_evaluation_frame.py b/tests/test_evaluation_frame.py index 5f9724c..90c6cef 100644 --- a/tests/test_evaluation_frame.py +++ b/tests/test_evaluation_frame.py @@ -214,6 +214,26 @@ def test_select_indices_with_full_index_range(self): class TestEvaluationFrameRed: + def test_object_dtype_y_true_raises(self): + """Object-dtype y_true must be rejected — Pure NumPy contract (ADR-011).""" + n = 2 + with pytest.raises(ValueError, match="numeric"): + EvaluationFrame( + y_true=np.array([[1, 2], [3, 4]], dtype=object), + y_pred=np.ones((n, 1)), + identifiers=_make_identifiers(n), + ) + + def test_object_dtype_y_pred_raises(self): + """Object-dtype y_pred must be rejected — Pure NumPy contract (ADR-011).""" + n = 2 + with pytest.raises(ValueError, match="numeric"): + EvaluationFrame( + y_true=np.ones(n), + y_pred=np.array([[1, 2], [3, 4]], dtype=object), + identifiers=_make_identifiers(n), + ) + def test_y_pred_row_mismatch_raises(self): with pytest.raises(ValueError, match="mismatch"): EvaluationFrame(np.ones(5), np.ones((4, 1)), _make_identifiers(5)) diff --git a/tests/test_evaluation_report.py b/tests/test_evaluation_report.py index 5bb4cb0..f4c4220 100644 --- a/tests/test_evaluation_report.py +++ b/tests/test_evaluation_report.py @@ -198,6 +198,17 @@ def test_all_four_task_pred_type_combinations_resolve_correctly(self): class TestEvaluationReportRed: + def test_non_dict_schema_value_fails_at_access(self): + """Malformed result dict: schema value is a string, not a dict. 
+ + Construction succeeds (no deep validation), but get_schema_results + fails when it tries to iterate the non-dict value. + """ + results = {'month': 'not_a_dict', 'time_series': {}, 'step': {}} + report = EvaluationReport('t', 'regression', 'point', results) + with pytest.raises(AttributeError): + report.get_schema_results('month') + def test_get_schema_results_unknown_schema_raises_key_error(self): report = EvaluationReport('t', 'regression', 'point', {}) with pytest.raises(KeyError, match="nonexistent"): diff --git a/tests/test_native_evaluator.py b/tests/test_native_evaluator.py index 269db92..b8fb3e8 100644 --- a/tests/test_native_evaluator.py +++ b/tests/test_native_evaluator.py @@ -370,6 +370,60 @@ def test_step_values_above_999_not_silently_dropped(self): assert 'MSE' in step_results.get('step1001', {}), \ "Step 1001 metrics were silently dropped by sentinel" + def test_nan_metric_result_is_finite_checkable(self): + """Metric results that are NaN (e.g., Pearson on constant data) must be + detectable via np.isfinite. 
This documents that NaN can appear in results + when data is degenerate, and callers should check.""" + n = 4 + ef = EvaluationFrame( + y_true=np.array([1.0, 1.0, 1.0, 1.0]), # constant → Pearson = NaN + y_pred=np.array([[1.0], [1.0], [1.0], [1.0]]), + identifiers={ + 'time': np.array([100, 100, 101, 101]), + 'unit': np.array([1, 2, 1, 2]), + 'origin': np.zeros(n, dtype=int), + 'step': np.array([1, 1, 2, 2]), + }, + metadata={'target': 'test_target'}, + ) + config = _regression_point_config(steps=[1, 2], metrics=['Pearson']) + report = NativeEvaluator(config).evaluate(ef) + month_results = report.to_dict()['schemas']['month'] + pearson_val = month_results['month100']['Pearson'] + assert np.isnan(pearson_val), "Pearson on constant data should be NaN" + # Callers can detect this with np.isfinite + assert not np.isfinite(pearson_val) + + def test_cross_schema_consistency_mse_values(self): + """MSE computed via month-wise on a single-month window must equal + step-wise MSE for the same data slice.""" + # Single origin, single step, single month → all schemas see same data + n = 4 + y_true = np.array([1.0, 2.0, 3.0, 4.0]) + y_pred = np.array([[1.5], [2.5], [3.5], [4.5]]) + ef = EvaluationFrame( + y_true=y_true, + y_pred=y_pred, + identifiers={ + 'time': np.array([100, 100, 100, 100]), + 'unit': np.array([1, 2, 3, 4]), + 'origin': np.zeros(n, dtype=int), + 'step': np.ones(n, dtype=int), + }, + metadata={'target': 'test_target'}, + ) + config = _regression_point_config(steps=[1], metrics=['MSE']) + report = NativeEvaluator(config).evaluate(ef) + schemas = report.to_dict()['schemas'] + mse_month = schemas['month']['month100']['MSE'] + mse_step = schemas['step']['step01']['MSE'] + mse_ts = schemas['time_series']['ts00']['MSE'] + # All three schemas see the same 4 observations → same MSE + assert mse_month == pytest.approx(mse_step, abs=1e-12) + assert mse_month == pytest.approx(mse_ts, abs=1e-12) + # And the value is correct: mean((0.5)^2) = 0.25 + assert mse_month == 
pytest.approx(0.25, abs=1e-12) + def test_sample_predictions_produce_point_pred_type_false(self): n = 4 ef = EvaluationFrame( diff --git a/views_evaluation/evaluation/evaluation_frame.py b/views_evaluation/evaluation/evaluation_frame.py index 90f955e..0a8a5c6 100644 --- a/views_evaluation/evaluation/evaluation_frame.py +++ b/views_evaluation/evaluation/evaluation_frame.py @@ -28,6 +28,12 @@ def _validate(y_true: np.ndarray, y_pred: np.ndarray, identifiers: Dict[str, np. if y_pred.shape[0] != n_rows: raise ValueError(f"y_pred rows ({y_pred.shape[0]}) mismatch y_true ({n_rows})") + # ADR-011: Pure NumPy contract — reject object-dtype arrays + if y_true.dtype == object: + raise ValueError("y_true must be numeric (got dtype=object)") + if y_pred.dtype == object: + raise ValueError("y_pred must be numeric (got dtype=object)") + # Rectangular sample validation: y_pred must be a dense 2D array if y_pred.ndim != 2: raise ValueError( diff --git a/views_evaluation/evaluation/native_metric_calculators.py b/views_evaluation/evaluation/native_metric_calculators.py index f04ea8a..6c80707 100644 --- a/views_evaluation/evaluation/native_metric_calculators.py +++ b/views_evaluation/evaluation/native_metric_calculators.py @@ -6,33 +6,12 @@ from scipy.stats import wasserstein_distance, pearsonr def _guard_shapes(y_true: np.ndarray, y_pred: np.ndarray): - """Internal guard to prevent broadcasting accidents. Handles conversion from legacy pandas. + """Internal guard to prevent broadcasting accidents. - Defense-in-depth: runs even when called via NativeEvaluator, which - validates data at construction through EvaluationFrame._validate() first. + Assumes numeric NumPy arrays (guaranteed by EvaluationFrame._validate()). + Validates shapes and normalises dimensions for metric functions. 
""" - if hasattr(y_true, "values"): - # Extract values from Series/DataFrame - y_true = y_true.values - if hasattr(y_pred, "values"): - y_pred = y_pred.values - - # Handle lists-in-cells (legacy structure) - def ensure_array(x): - if isinstance(x, (list, np.ndarray)): - if len(x) > 0 and isinstance(x[0], (list, np.ndarray)): - return np.array([ensure_array(i) for i in x]) - return np.array(x) - return np.array([x]) - - if y_true.dtype == object: - y_true = np.array([x[0] if isinstance(x, (list, np.ndarray)) else x for x in y_true]) - if y_pred.dtype == object: - y_pred = np.array([ensure_array(x).flatten() for x in y_pred]) - - # Shape validation — unconditional, runs for all dtypes (ADR-013) - if y_true.ndim == 2 and y_true.shape[1] == 1: - y_true = y_true.flatten() + # Shape validation (ADR-013) if y_true.ndim != 1: raise ValueError(f"y_true must be 1D, got shape {y_true.shape}") From ce26dc1e1996a805bfa4826bbbd61fbc4a8e1f5e Mon Sep 17 00:00:00 2001 From: Polichinl Date: Sat, 4 Apr 2026 02:29:28 +0200 Subject: [PATCH 3/3] fix: skip report tests when pandas is not installed (CI fix) test_evaluation_report.py imported pandas at module level, causing a collection error in CI where pandas is not installed (optional dependency via [dataframe] extra). Use pytest.importorskip to skip gracefully. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- tests/test_evaluation_report.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tests/test_evaluation_report.py b/tests/test_evaluation_report.py index f4c4220..8548bd1 100644 --- a/tests/test_evaluation_report.py +++ b/tests/test_evaluation_report.py @@ -6,11 +6,11 @@ BEIGE — empty schema, multiple metrics per group, raw schema passthrough RED — unknown schema key, invalid task/pred_type combination """ -import pandas as pd import pytest +pd = pytest.importorskip("pandas") -from views_evaluation.evaluation.evaluation_report import EvaluationReport -from views_evaluation.evaluation.metrics import ( +from views_evaluation.evaluation.evaluation_report import EvaluationReport # noqa: E402 +from views_evaluation.evaluation.metrics import ( # noqa: E402 RegressionPointEvaluationMetrics, RegressionSampleEvaluationMetrics, ClassificationPointEvaluationMetrics,
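[editor note] The object-dtype guard added in PATCH 2/3 can be sketched in isolation. This is a hypothetical free function — in the patch itself the check lives inside `EvaluationFrame._validate()`:

```python
import numpy as np


def validate_numeric(y_true: np.ndarray, y_pred: np.ndarray) -> None:
    # ADR-011 Pure NumPy contract: reject object-dtype arrays at the
    # boundary so downstream metric code can assume numeric input.
    if y_true.dtype == object:
        raise ValueError("y_true must be numeric (got dtype=object)")
    if y_pred.dtype == object:
        raise ValueError("y_pred must be numeric (got dtype=object)")
```

Failing loudly here is what lets `_guard_shapes` drop its pandas/object-dtype branches (C-10): degenerate inputs are rejected once, at construction, instead of being coerced deep inside each metric function.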