From 2578532394efc6733f6d23d3c5640c15ac8343ee Mon Sep 17 00:00:00 2001 From: Polichinl Date: Sat, 4 Apr 2026 02:07:27 +0200 Subject: [PATCH 1/3] =?UTF-8?q?fix:=20close=203=20risk=20register=20concer?= =?UTF-8?q?ns=20=E2=80=94=20bounds=20validation,=20exception=20context,=20?= =?UTF-8?q?step=20sentinel?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Address C-16, C-17, C-18 identified by risk register review with TDD: - C-16: Wrap metric function calls in _calculate_metrics() with try/except that re-raises as ValueError naming the metric, task, and pred_type - C-17: Replace hardcoded max_allowed_step=999 with float('inf') so steps >= 1000 are not silently dropped - C-18: Add bounds validation in resolve_metric_params() for alpha, quantile, lower_quantile, upper_quantile — all must be in (0, 1). Cross-validation for QIS lower_quantile < upper_quantile Also: update CICs (MetricCatalog, NativeEvaluator) and ADRs (011, 014) with Known Deviations sections documenting C-02 and C-05. Close C-14 (stale editable install metadata). Upgrade C-02 from Tier 3 to Tier 2. 9 new tests, 240 total passing. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- .../ADRs/011_topology_and_dependency_rules.md | 4 ++ .../014_boundary_contracts_and_validation.md | 4 ++ documentation/CICs/MetricCatalog.md | 10 +++-- documentation/CICs/NativeEvaluator.md | 4 +- tests/test_metric_catalog.py | 42 +++++++++++++++++++ tests/test_native_evaluator.py | 41 ++++++++++++++++++ views_evaluation/evaluation/metric_catalog.py | 20 +++++++++ .../evaluation/native_evaluator.py | 9 +++- 8 files changed, 128 insertions(+), 6 deletions(-) diff --git a/documentation/ADRs/011_topology_and_dependency_rules.md b/documentation/ADRs/011_topology_and_dependency_rules.md index b482fc7..92622c2 100644 --- a/documentation/ADRs/011_topology_and_dependency_rules.md +++ b/documentation/ADRs/011_topology_and_dependency_rules.md @@ -53,3 +53,7 @@ If a dependency feels “convenient but wrong,” it probably is. ### Negative - Requires a "middle-man" (Adapter) to convert DataFrames to `EvaluationFrame`. - Small amount of boilerplate for simple scripts. + +### Known Deviations + +- **sklearn/scipy in Level 0:** `native_metric_calculators.py` imports `sklearn.metrics` (AP, MTD) and `scipy.stats` (EMD, Pearson) at module level. These 4 of ~25 metrics violate the "no external imports except numpy" claim. The ADR permits `scipy`; `sklearn` is a pragmatic deviation pending pure-NumPy replacements or migration to a Level 1 module. Tracked as risk register C-05 (Tier 3). diff --git a/documentation/ADRs/014_boundary_contracts_and_validation.md b/documentation/ADRs/014_boundary_contracts_and_validation.md index dc46108..f8b1b30 100644 --- a/documentation/ADRs/014_boundary_contracts_and_validation.md +++ b/documentation/ADRs/014_boundary_contracts_and_validation.md @@ -46,3 +46,7 @@ Validation failures must be logged and raised explicitly (ADR-013). Warnings are ### Negative - Requires explicit schemas or validation logic. - Increases up-front configuration clarity requirements. 
+ +### Known Deviations + +- **NativeEvaluator defers config validation:** `NativeEvaluator.__init__` only validates the profile name. Missing or malformed config keys (`steps`, target lists, metric lists) are not caught until `evaluate()` is called, producing cryptic errors deep in the call stack. This violates Section 2 ("validate at entry, before execution begins"). Tracked as risk register C-02 (Tier 2, High). diff --git a/documentation/CICs/MetricCatalog.md b/documentation/CICs/MetricCatalog.md index c4040c1..25b4d6c 100644 --- a/documentation/CICs/MetricCatalog.md +++ b/documentation/CICs/MetricCatalog.md @@ -2,7 +2,7 @@ **Status:** Active **Owner:** Evaluation Core -**Last reviewed:** 2026-03-31 +**Last reviewed:** 2026-04-04 **Related ADRs:** ADR-042 (Metric Catalog), ADR-012 (Authority), ADR-013 (Observability) --- @@ -60,6 +60,8 @@ A genome registry and Chain of Responsibility resolver for evaluation metric hyp - `ValueError` if a resolved parameter is `None`. - `ValueError` if overrides contain unknown parameters not in the genome. - `ValueError` if overrides are provided for a metric with empty genome. +- `ValueError` if a probability/proportion parameter (`alpha`, `quantile`, `lower_quantile`, `upper_quantile`) is not in the open interval (0, 1). +- `ValueError` if `lower_quantile >= upper_quantile` for metrics requiring both (e.g. QIS). All failures are immediate and explicit. No warnings, no fallbacks, no silent degradation. @@ -109,7 +111,8 @@ params = resolve_metric_params("MSE", {}, BASE_PROFILE) - **Green:** `tests/test_metric_catalog.py` — registry snapshot integrity, resolver happy path, genome completeness checks. - **Beige:** `tests/test_metric_catalog.py` — partial overrides, profile-only resolution, edge case param values. - **Red:** `tests/test_metric_catalog.py` — unknown metrics, unimplemented metrics, missing params, None values, unknown overrides. 
-- **Correctness:** `tests/test_metric_correctness.py` — golden-value tests (5 tests; coverage gap noted). +- **Red (bounds):** `tests/test_metric_catalog.py::TestResolveMetricParamsBoundsRed` — 7 tests for out-of-range alpha/quantile and crossed QIS quantiles. +- **Correctness:** `tests/test_metric_calculators.py::TestGoldenValues` — 17 golden-value tests for all implemented metrics. --- @@ -118,13 +121,14 @@ params = resolve_metric_params("MSE", {}, BASE_PROFILE) - New metrics are added by: (1) implementing the function in `native_metric_calculators.py`, (2) adding a `MetricSpec` to `METRIC_CATALOG`, (3) adding to `METRIC_MEMBERSHIP`, (4) adding genome values to relevant profiles, (5) adding a field to the typed metrics dataclass in `metrics.py`. - The legacy dispatch dicts were removed in Phase 3. `METRIC_MEMBERSHIP` is the single source of truth. - Profile structure is stable; new profiles are added by creating a new file in `profiles/`. +- Bounds validation added for probability/proportion parameters (2026-04-04, C-18): `alpha`, `quantile`, `lower_quantile`, `upper_quantile` must be in (0, 1). Cross-parameter validation for QIS quantile ordering. --- ## 12. Known Deviations - **No profile completeness validation:** There is no mechanism to verify that a profile provides values for all metrics with non-empty genomes. A profile missing a metric's params will only fail at evaluation time, not at profile registration. -- **Weak golden-value coverage:** Only 5 tests in `test_metric_correctness.py` verify metric functions against independently computed known answers. Most metrics lack this verification (see risk register C-07). +- **Golden-value coverage complete:** 17 tests in `tests/test_metric_calculators.py::TestGoldenValues` plus 8 Brier/QS golden-value tests cover all implemented metrics (C-07 closed 2026-04-02). - **Breaking rename:** The legacy `Brier` metric (unimplemented placeholder) was replaced by `Brier_sample` and `Brier_point` (implemented). 
The field in `ClassificationSampleEvaluationMetrics` was renamed from `Brier` to `Brier_sample`. External consumers accessing `.Brier` on classification sample results must update to `.Brier_sample`. --- diff --git a/documentation/CICs/NativeEvaluator.md b/documentation/CICs/NativeEvaluator.md index e69011d..fed3dc6 100644 --- a/documentation/CICs/NativeEvaluator.md +++ b/documentation/CICs/NativeEvaluator.md @@ -2,7 +2,7 @@ **Status:** Active **Owner:** Evaluation Core -**Last reviewed:** 2026-04-02 +**Last reviewed:** 2026-04-04 **Related ADRs:** ADR-010 (Ontology), ADR-011 (Topology), ADR-032 (Schemas), ADR-042 (Metric Catalog) --- @@ -105,6 +105,8 @@ schema = report.get_schema_results('time_series') # dict → typed metrics data - `legacy_compatibility` default was flipped to `False` in Phase 3. The flag is retained for callers that need truncation behavior. - Config validation may be added to `__init__` to catch structural config errors at construction time rather than at evaluation time (currently a known gap — risk register C-02). - The `EvaluationReport` return type is stable; the internal `_calculate_metrics` dispatch may evolve as the `MetricCatalog` grows. +- Exception wrapping added to `_calculate_metrics()` (2026-04-04, C-16): metric function exceptions are now caught and re-raised as `ValueError` naming the metric, task, and pred_type. Test: `test_metric_function_error_includes_metric_name`. +- Step sentinel changed from hardcoded `999` to `float('inf')` (2026-04-04, C-17): steps >= 1000 are no longer silently dropped. Test: `test_step_values_above_999_not_silently_dropped`. 
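[editor note] The C-16 exception-wrapping pattern described above can be sketched in isolation. This is a hypothetical standalone version — the real implementation lives in `NativeEvaluator._calculate_metrics()`, and the function and argument names here are illustrative only:

```python
# Sketch of the C-16 pattern: any exception raised by a metric function is
# re-raised as a ValueError that names the metric, task, and pred_type,
# with the original exception chained via `from e`.

def calculate_metrics(metric_fns, y_true, y_pred, task, pred_type):
    results = {}
    for name, fn in metric_fns.items():
        try:
            results[name] = fn(y_true, y_pred)
        except Exception as e:
            # Re-raise with context so the failing metric is identifiable
            # without digging through the call stack.
            raise ValueError(
                f"Metric '{name}' failed for ({task}, {pred_type}): {e}"
            ) from e
    return results
```

Because the original exception is chained with `from e`, its traceback remains reachable via `__cause__`, so the underlying sklearn/scipy error is preserved rather than swallowed.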
--- diff --git a/tests/test_metric_catalog.py b/tests/test_metric_catalog.py index e621ce0..e1bec4e 100644 --- a/tests/test_metric_catalog.py +++ b/tests/test_metric_catalog.py @@ -119,6 +119,48 @@ def test_overrides_on_empty_genome_raises(self): resolve_metric_params("MSE", {"spurious": 1.0}, BASE_PROFILE) +# --------------------------------------------------------------------------- +# Red: bounds validation for hyperparameters (C-18) +# --------------------------------------------------------------------------- + +class TestResolveMetricParamsBoundsRed: + """Bounds validation for metric hyperparameters.""" + + def test_alpha_above_one_raises(self): + """Coverage alpha must be in (0, 1).""" + with pytest.raises(ValueError, match="alpha"): + resolve_metric_params("Coverage", {"alpha": 1.5}, BASE_PROFILE) + + def test_alpha_zero_raises(self): + """MIS alpha=0 would cause division by zero.""" + with pytest.raises(ValueError, match="alpha"): + resolve_metric_params("MIS", {"alpha": 0.0}, BASE_PROFILE) + + def test_alpha_negative_raises(self): + with pytest.raises(ValueError, match="alpha"): + resolve_metric_params("Coverage", {"alpha": -0.1}, BASE_PROFILE) + + def test_quantile_above_one_raises(self): + """QS quantile must be in (0, 1).""" + with pytest.raises(ValueError, match="quantile"): + resolve_metric_params("QS_sample", {"quantile": 1.0}, BASE_PROFILE) + + def test_quantile_zero_raises(self): + with pytest.raises(ValueError, match="quantile"): + resolve_metric_params("QS_sample", {"quantile": 0.0}, BASE_PROFILE) + + def test_quantile_negative_raises(self): + with pytest.raises(ValueError, match="quantile"): + resolve_metric_params("QS_point", {"quantile": -0.5}, BASE_PROFILE) + + def test_qis_lower_quantile_above_upper_raises(self): + """QIS lower_quantile must be < upper_quantile.""" + with pytest.raises(ValueError, match="quantile"): + resolve_metric_params( + "QIS", {"lower_quantile": 0.9, "upper_quantile": 0.1}, BASE_PROFILE + ) + + # 
--------------------------------------------------------------------------- # Beige: structural integrity # --------------------------------------------------------------------------- diff --git a/tests/test_native_evaluator.py b/tests/test_native_evaluator.py index a0fa14f..269db92 100644 --- a/tests/test_native_evaluator.py +++ b/tests/test_native_evaluator.py @@ -347,6 +347,29 @@ def test_evaluate_twice_produces_identical_results(self): report2 = evaluator.evaluate(ef) assert report1.to_dict() == report2.to_dict() + def test_step_values_above_999_not_silently_dropped(self): + """Steps >= 1000 must not be silently dropped by a hardcoded sentinel (C-17).""" + n = 4 + ef = EvaluationFrame( + y_true=np.zeros(n), + y_pred=np.zeros((n, 1)), + identifiers={ + 'time': np.array([2000, 2000, 2001, 2001]), + 'unit': np.array([1, 2, 1, 2]), + 'origin': np.zeros(n, dtype=int), + 'step': np.array([1000, 1000, 1001, 1001]), + }, + metadata={'target': 'test_target'}, + ) + config = _regression_point_config(steps=[1000, 1001]) + report = NativeEvaluator(config).evaluate(ef, legacy_compatibility=False) + step_results = report.to_dict()['schemas']['step'] + # Metrics must be computed (non-empty dict), not just pre-initialized + assert 'MSE' in step_results.get('step1000', {}), \ + "Step 1000 metrics were silently dropped by sentinel" + assert 'MSE' in step_results.get('step1001', {}), \ + "Step 1001 metrics were silently dropped by sentinel" + def test_sample_predictions_produce_point_pred_type_false(self): n = 4 ef = EvaluationFrame( @@ -418,6 +441,24 @@ def test_empty_config_accepted_at_init_fails_at_evaluate(self): evaluator = NativeEvaluator({}) # does NOT raise — C-02 with pytest.raises((ValueError, KeyError)): evaluator.evaluate(ef) + + def test_metric_function_error_includes_metric_name(self): + """When a metric function raises, the error message must name the metric (C-16).""" + import dataclasses + from unittest.mock import patch, MagicMock + from 
views_evaluation.evaluation.metric_catalog import METRIC_CATALOG + + ef = _make_parallelogram_ef(n_origins=1, n_steps=2, n_units=2) + config = _regression_point_config(steps=[1, 2], metrics=['MSE']) + + # Inject a failure into MSE's function + original_spec = METRIC_CATALOG['MSE'] + broken_fn = MagicMock(side_effect=RuntimeError("sklearn internal error")) + broken_spec = dataclasses.replace(original_spec, function=broken_fn) + with patch.dict(METRIC_CATALOG, {'MSE': broken_spec}): + with pytest.raises(ValueError, match="MSE"): + NativeEvaluator(config).evaluate(ef) + def test_classification_metric_on_regression_target_raises(self): """AP is only valid for classification; using it with regression_targets must fail.""" ef = _make_parallelogram_ef(n_origins=1, n_steps=2, n_units=2) diff --git a/views_evaluation/evaluation/metric_catalog.py b/views_evaluation/evaluation/metric_catalog.py index 7eae282..ed6f50a 100644 --- a/views_evaluation/evaluation/metric_catalog.py +++ b/views_evaluation/evaluation/metric_catalog.py @@ -41,6 +41,9 @@ calculate_jeffreys_native, ) +# Parameters that must be in the open interval (0, 1) +_UNIT_INTERVAL_EXCLUSIVE = {"alpha", "quantile", "lower_quantile", "upper_quantile"} + @dataclass(frozen=True) class MetricSpec: @@ -192,4 +195,21 @@ def resolve_metric_params( f"All hyperparameters must be explicitly set." ) + # Bounds validation for probability/proportion parameters + for param, value in resolved.items(): + if param in _UNIT_INTERVAL_EXCLUSIVE: + if not (0 < value < 1): + raise ValueError( + f"Metric '{metric_name}' parameter '{param}' must be in (0, 1), " + f"got {value}." + ) + + # Cross-parameter validation + if "lower_quantile" in resolved and "upper_quantile" in resolved: + if resolved["lower_quantile"] >= resolved["upper_quantile"]: + raise ValueError( + f"Metric '{metric_name}': lower_quantile ({resolved['lower_quantile']}) " + f"must be less than upper_quantile ({resolved['upper_quantile']})." 
+ ) + return resolved diff --git a/views_evaluation/evaluation/native_evaluator.py b/views_evaluation/evaluation/native_evaluator.py index bfb9a82..1241cfe 100644 --- a/views_evaluation/evaluation/native_evaluator.py +++ b/views_evaluation/evaluation/native_evaluator.py @@ -70,7 +70,12 @@ def _calculate_metrics(self, ef: EvaluationFrame, metrics_list: List[str], spec = METRIC_CATALOG[m] overrides = self.metric_overrides.get(m, {}) resolved = resolve_metric_params(m, overrides, self.profile) - results[m] = spec.function(ef.y_true, ef.y_pred, **resolved) + try: + results[m] = spec.function(ef.y_true, ef.y_pred, **resolved) + except Exception as e: + raise ValueError( + f"Metric '{m}' failed for ({task}, {pred_type}): {e}" + ) from e return results def evaluate(self, ef: EvaluationFrame, legacy_compatibility: bool = False) -> EvaluationReport: @@ -102,7 +107,7 @@ def evaluate(self, ef: EvaluationFrame, legacy_compatibility: bool = False) -> E step_results = {f"step{str(s).zfill(2)}": {} for s in config_steps} # LEGACY PARITY: Truncate steps to the shortest sequence length if in compat mode - max_allowed_step = 999 + max_allowed_step = float('inf') if legacy_compatibility: origin_indices = ef.get_group_indices('origin') seq_lengths = [] From 904880b02d7a97e77e255a339b296a521d33615b Mon Sep 17 00:00:00 2001 From: Polichinl Date: Sat, 4 Apr 2026 02:26:17 +0200 Subject: [PATCH 2/3] =?UTF-8?q?test:=20close=20test=20gaps=20and=20remove?= =?UTF-8?q?=20dead=20code=20=E2=80=94=20object-dtype=20guard,=20cross-sche?= =?UTF-8?q?ma=20consistency,=20C-10?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add object-dtype rejection to EvaluationFrame._validate() (ADR-011 Pure NumPy contract) - Remove 22 lines of dead pandas/object-dtype branches from _guard_shapes (closes C-10) - Add 5 new tests: object-dtype rejection (2), malformed report dict (1), NaN metric detectability (1), cross-schema MSE consistency (1) - 245 tests passing, risk 
register: 3 open concerns remain Co-Authored-By: Claude Opus 4.6 (1M context) --- tests/test_evaluation_frame.py | 20 +++++++ tests/test_evaluation_report.py | 11 ++++ tests/test_native_evaluator.py | 54 +++++++++++++++++++ .../evaluation/evaluation_frame.py | 6 +++ .../evaluation/native_metric_calculators.py | 29 ++-------- 5 files changed, 95 insertions(+), 25 deletions(-) diff --git a/tests/test_evaluation_frame.py b/tests/test_evaluation_frame.py index 5f9724c..90c6cef 100644 --- a/tests/test_evaluation_frame.py +++ b/tests/test_evaluation_frame.py @@ -214,6 +214,26 @@ def test_select_indices_with_full_index_range(self): class TestEvaluationFrameRed: + def test_object_dtype_y_true_raises(self): + """Object-dtype y_true must be rejected — Pure NumPy contract (ADR-011).""" + n = 2 + with pytest.raises(ValueError, match="numeric"): + EvaluationFrame( + y_true=np.array([[1, 2], [3, 4]], dtype=object), + y_pred=np.ones((n, 1)), + identifiers=_make_identifiers(n), + ) + + def test_object_dtype_y_pred_raises(self): + """Object-dtype y_pred must be rejected — Pure NumPy contract (ADR-011).""" + n = 2 + with pytest.raises(ValueError, match="numeric"): + EvaluationFrame( + y_true=np.ones(n), + y_pred=np.array([[1, 2], [3, 4]], dtype=object), + identifiers=_make_identifiers(n), + ) + def test_y_pred_row_mismatch_raises(self): with pytest.raises(ValueError, match="mismatch"): EvaluationFrame(np.ones(5), np.ones((4, 1)), _make_identifiers(5)) diff --git a/tests/test_evaluation_report.py b/tests/test_evaluation_report.py index 5bb4cb0..f4c4220 100644 --- a/tests/test_evaluation_report.py +++ b/tests/test_evaluation_report.py @@ -198,6 +198,17 @@ def test_all_four_task_pred_type_combinations_resolve_correctly(self): class TestEvaluationReportRed: + def test_non_dict_schema_value_fails_at_access(self): + """Malformed result dict: schema value is a string, not a dict. 
+ + Construction succeeds (no deep validation), but get_schema_results + fails when it tries to iterate the non-dict value. + """ + results = {'month': 'not_a_dict', 'time_series': {}, 'step': {}} + report = EvaluationReport('t', 'regression', 'point', results) + with pytest.raises(AttributeError): + report.get_schema_results('month') + def test_get_schema_results_unknown_schema_raises_key_error(self): report = EvaluationReport('t', 'regression', 'point', {}) with pytest.raises(KeyError, match="nonexistent"): diff --git a/tests/test_native_evaluator.py b/tests/test_native_evaluator.py index 269db92..b8fb3e8 100644 --- a/tests/test_native_evaluator.py +++ b/tests/test_native_evaluator.py @@ -370,6 +370,60 @@ def test_step_values_above_999_not_silently_dropped(self): assert 'MSE' in step_results.get('step1001', {}), \ "Step 1001 metrics were silently dropped by sentinel" + def test_nan_metric_result_is_finite_checkable(self): + """Metric results that are NaN (e.g., Pearson on constant data) must be + detectable via np.isfinite. 
This documents that NaN can appear in results + when data is degenerate, and callers should check.""" + n = 4 + ef = EvaluationFrame( + y_true=np.array([1.0, 1.0, 1.0, 1.0]), # constant → Pearson = NaN + y_pred=np.array([[1.0], [1.0], [1.0], [1.0]]), + identifiers={ + 'time': np.array([100, 100, 101, 101]), + 'unit': np.array([1, 2, 1, 2]), + 'origin': np.zeros(n, dtype=int), + 'step': np.array([1, 1, 2, 2]), + }, + metadata={'target': 'test_target'}, + ) + config = _regression_point_config(steps=[1, 2], metrics=['Pearson']) + report = NativeEvaluator(config).evaluate(ef) + month_results = report.to_dict()['schemas']['month'] + pearson_val = month_results['month100']['Pearson'] + assert np.isnan(pearson_val), "Pearson on constant data should be NaN" + # Callers can detect this with np.isfinite + assert not np.isfinite(pearson_val) + + def test_cross_schema_consistency_mse_values(self): + """MSE computed via month-wise on a single-month window must equal + step-wise MSE for the same data slice.""" + # Single origin, single step, single month → all schemas see same data + n = 4 + y_true = np.array([1.0, 2.0, 3.0, 4.0]) + y_pred = np.array([[1.5], [2.5], [3.5], [4.5]]) + ef = EvaluationFrame( + y_true=y_true, + y_pred=y_pred, + identifiers={ + 'time': np.array([100, 100, 100, 100]), + 'unit': np.array([1, 2, 3, 4]), + 'origin': np.zeros(n, dtype=int), + 'step': np.ones(n, dtype=int), + }, + metadata={'target': 'test_target'}, + ) + config = _regression_point_config(steps=[1], metrics=['MSE']) + report = NativeEvaluator(config).evaluate(ef) + schemas = report.to_dict()['schemas'] + mse_month = schemas['month']['month100']['MSE'] + mse_step = schemas['step']['step01']['MSE'] + mse_ts = schemas['time_series']['ts00']['MSE'] + # All three schemas see the same 4 observations → same MSE + assert mse_month == pytest.approx(mse_step, abs=1e-12) + assert mse_month == pytest.approx(mse_ts, abs=1e-12) + # And the value is correct: mean((0.5)^2) = 0.25 + assert mse_month == 
pytest.approx(0.25, abs=1e-12) + def test_sample_predictions_produce_point_pred_type_false(self): n = 4 ef = EvaluationFrame( diff --git a/views_evaluation/evaluation/evaluation_frame.py b/views_evaluation/evaluation/evaluation_frame.py index 90f955e..0a8a5c6 100644 --- a/views_evaluation/evaluation/evaluation_frame.py +++ b/views_evaluation/evaluation/evaluation_frame.py @@ -28,6 +28,12 @@ def _validate(y_true: np.ndarray, y_pred: np.ndarray, identifiers: Dict[str, np. if y_pred.shape[0] != n_rows: raise ValueError(f"y_pred rows ({y_pred.shape[0]}) mismatch y_true ({n_rows})") + # ADR-011: Pure NumPy contract — reject object-dtype arrays + if y_true.dtype == object: + raise ValueError("y_true must be numeric (got dtype=object)") + if y_pred.dtype == object: + raise ValueError("y_pred must be numeric (got dtype=object)") + # Rectangular sample validation: y_pred must be a dense 2D array if y_pred.ndim != 2: raise ValueError( diff --git a/views_evaluation/evaluation/native_metric_calculators.py b/views_evaluation/evaluation/native_metric_calculators.py index f04ea8a..6c80707 100644 --- a/views_evaluation/evaluation/native_metric_calculators.py +++ b/views_evaluation/evaluation/native_metric_calculators.py @@ -6,33 +6,12 @@ from scipy.stats import wasserstein_distance, pearsonr def _guard_shapes(y_true: np.ndarray, y_pred: np.ndarray): - """Internal guard to prevent broadcasting accidents. Handles conversion from legacy pandas. + """Internal guard to prevent broadcasting accidents. - Defense-in-depth: runs even when called via NativeEvaluator, which - validates data at construction through EvaluationFrame._validate() first. + Assumes numeric NumPy arrays (guaranteed by EvaluationFrame._validate()). + Validates shapes and normalises dimensions for metric functions. 
""" - if hasattr(y_true, "values"): - # Extract values from Series/DataFrame - y_true = y_true.values - if hasattr(y_pred, "values"): - y_pred = y_pred.values - - # Handle lists-in-cells (legacy structure) - def ensure_array(x): - if isinstance(x, (list, np.ndarray)): - if len(x) > 0 and isinstance(x[0], (list, np.ndarray)): - return np.array([ensure_array(i) for i in x]) - return np.array(x) - return np.array([x]) - - if y_true.dtype == object: - y_true = np.array([x[0] if isinstance(x, (list, np.ndarray)) else x for x in y_true]) - if y_pred.dtype == object: - y_pred = np.array([ensure_array(x).flatten() for x in y_pred]) - - # Shape validation — unconditional, runs for all dtypes (ADR-013) - if y_true.ndim == 2 and y_true.shape[1] == 1: - y_true = y_true.flatten() + # Shape validation (ADR-013) if y_true.ndim != 1: raise ValueError(f"y_true must be 1D, got shape {y_true.shape}") From ce26dc1e1996a805bfa4826bbbd61fbc4a8e1f5e Mon Sep 17 00:00:00 2001 From: Polichinl Date: Sat, 4 Apr 2026 02:29:28 +0200 Subject: [PATCH 3/3] fix: skip report tests when pandas is not installed (CI fix) test_evaluation_report.py imported pandas at module level, causing a collection error in CI where pandas is not installed (optional dependency via [dataframe] extra). Use pytest.importorskip to skip gracefully. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- tests/test_evaluation_report.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tests/test_evaluation_report.py b/tests/test_evaluation_report.py index f4c4220..8548bd1 100644 --- a/tests/test_evaluation_report.py +++ b/tests/test_evaluation_report.py @@ -6,11 +6,11 @@ BEIGE — empty schema, multiple metrics per group, raw schema passthrough RED — unknown schema key, invalid task/pred_type combination """ -import pandas as pd import pytest +pd = pytest.importorskip("pandas") -from views_evaluation.evaluation.evaluation_report import EvaluationReport -from views_evaluation.evaluation.metrics import ( +from views_evaluation.evaluation.evaluation_report import EvaluationReport # noqa: E402 +from views_evaluation.evaluation.metrics import ( # noqa: E402 RegressionPointEvaluationMetrics, RegressionSampleEvaluationMetrics, ClassificationPointEvaluationMetrics,
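[editor note] The object-dtype guard added in PATCH 2/3 can be sketched in isolation. This is a hypothetical free function — in the patch itself the check lives inside `EvaluationFrame._validate()`:

```python
import numpy as np


def validate_numeric(y_true: np.ndarray, y_pred: np.ndarray) -> None:
    # ADR-011 Pure NumPy contract: reject object-dtype arrays at the
    # boundary so downstream metric code can assume numeric input.
    if y_true.dtype == object:
        raise ValueError("y_true must be numeric (got dtype=object)")
    if y_pred.dtype == object:
        raise ValueError("y_pred must be numeric (got dtype=object)")
```

Failing loudly here is what lets `_guard_shapes` drop its pandas/object-dtype branches (C-10): degenerate inputs are rejected once, at construction, instead of being coerced deep inside each metric function.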