Add scientific validity checks to node scoring (train/test split, metric sanity, data leakage detection)

## Summary

AutoRecLab's reviewer can assign high scores (e.g. 92/100, `is_satisfactory=True`) to code that is scientifically invalid: for example, models trained and evaluated on the same data, experiments with no train/test split, or runs that report suspiciously perfect metrics (RMSE near 0). The current scoring system evaluates whether code **executes successfully and meets the listed requirements**, but does not check for elementary scientific validity.

## Current Behavior

In `treesearch/minimal_agent.py` (`score_code`), a node is marked satisfactory whenever:

1. The reviewer LLM marks it as not buggy, **and**
2. All requirements are individually marked as fulfilled.

Neither check inspects whether the experimental design is scientifically sound. A run can:

- Train and evaluate on the same dataset with no split → high "accuracy", satisfactory
- Report RMSE values like 0.0001 (clear data leakage) → still satisfactory if a requirement says "compute RMSE"
- Omit a baseline entirely → satisfactory if no requirement explicitly demands one

## Expected Behavior

Before marking a node satisfactory, AutoRecLab should run a lightweight scientific-validity check on the generated code and execution output. At minimum:

1. **Train/test separation**: verify the code uses an explicit split (e.g. `train_test_split`, k-fold, LensKit's `crossfold`)
2. **Suspicious metrics**: flag metric values that are implausible for the task (e.g. RMSE < 0.01 on a real recommendation dataset, accuracy = 1.0)
3. **Baseline presence**: if the task involves comparison or evaluation, check that at least one comparison point exists (popular-item, random, mean predictor)
4. **Data leakage signals**: flag if the same DataFrame is used for both fitting and scoring without intermediate splitting

When any of these checks fails, the node should either be marked not-satisfactory or flagged with a `scientific_validity_warning` for the user.
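As a rough illustration of checks 1 and 3, the sketch below scans the generated code for split and baseline markers. The function name `static_validity_warnings` and both pattern lists are hypothetical, not part of AutoRecLab, and keyword matching of this kind is only a heuristic: it can miss custom split logic and can be fooled by a comment that happens to mention a baseline.

```python
import re

# Hypothetical marker lists; tune them to the libraries the agent actually generates.
SPLIT_PATTERNS = [
    r"\btrain_test_split\b",
    r"\bKFold\b",
    r"\bcross_val_score\b",
    r"\bcrossfold\b",
]
BASELINE_PATTERNS = [r"\bbaseline", r"\bpopular", r"\brandom\b", r"\bglobal_mean\b"]


def static_validity_warnings(code: str, needs_baseline: bool = True) -> list[str]:
    """Return human-readable warnings for missing split or baseline markers."""
    warnings = []
    # Check 1: some explicit train/test separation should appear in the code.
    if not any(re.search(p, code) for p in SPLIT_PATTERNS):
        warnings.append("no explicit train/test split found")
    # Check 3: only relevant when the task involves comparison or evaluation.
    if needs_baseline and not any(
        re.search(p, code, re.IGNORECASE) for p in BASELINE_PATTERNS
    ):
        warnings.append("no baseline comparison found")
    return warnings
```

An empty list would mean neither check fired; a non-empty list could either force the node to not-satisfactory or be attached as the `scientific_validity_warning`.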

## Why This Matters

This improvement would:

* **Prevent false positives** in the success metric. In our replication of the AutoRecLab preprint experiments, we observed runs scoring 92/100 with `is_satisfactory=True` whose generated code produced catastrophically overfit models.
* **Make the satisfaction signal trustworthy** for downstream automation and reporting
* **Require only a few hours to implement**: either extend the existing reviewer prompt with an explicit "scientific validity" section, or add a dedicated post-scoring check function

## Suggested Implementation

A minimal version is a prompt extension in `score_code`: add to the review instructions an explicit list of validity checks (train/test split present, no data leakage, plausible metric ranges, baseline presence) and require the reviewer to mark the node not-satisfactory if any fail.
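A minimal sketch of such a prompt extension follows. The review prompt used by `score_code` is not shown in this issue, so `extend_review_prompt`, `VALIDITY_CHECKS`, and the section wording are all assumptions made for illustration.

```python
# Hypothetical wording; the actual review prompt in score_code is not shown here.
VALIDITY_CHECKS = (
    "train/test split present",
    "no data leakage",
    "plausible metric ranges",
    "baseline present when the task involves comparison or evaluation",
)


def extend_review_prompt(base_prompt: str) -> str:
    """Append a numbered 'Scientific validity' section to an existing prompt."""
    lines = [
        base_prompt.rstrip(),
        "",
        "## Scientific validity",
        "Check each item and mark the node not-satisfactory if any fails:",
    ]
    lines += [f"{i}. {check}" for i, check in enumerate(VALIDITY_CHECKS, start=1)]
    return "\n".join(lines)
```

Keeping the checks in one tuple means the prompt and any later programmatic check can share a single list of criteria.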

A more robust version is a small dedicated `validate_scientific_correctness(node)` function called after the existing scoring loop. It can pattern-match on the code (look for `.fit(...)` and `.predict(...)` operating on the same variable) and inspect parsed metric values from `node.term_out` against sensible thresholds.
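The dedicated check could look roughly like the sketch below. It assumes the node exposes the generated source as `node.code` (an assumed attribute name) and the execution output as `node.term_out`, either as a string or a list of strings. It uses Python's `ast` module instead of raw pattern matching to find `.fit(...)` and `.predict(...)`/`.score(...)` calls whose first positional argument is the same variable. The `RMSE_FLOOR` value and the `RMSE: <value>` output format are also assumptions.

```python
import ast
import re
from types import SimpleNamespace

RMSE_FLOOR = 0.01  # threshold taken from the issue text; tune per dataset
RMSE_RE = re.compile(
    r"RMSE\s*[:=]\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", re.IGNORECASE
)


def _first_arg_name(call: ast.Call):
    """Return the variable name of the first positional argument, if any."""
    if call.args and isinstance(call.args[0], ast.Name):
        return call.args[0].id
    return None


def validate_scientific_correctness(node) -> list[str]:
    """Return warnings; assumes node.code (str) and node.term_out (str or list)."""
    warnings = []
    fitted, scored = set(), set()
    for n in ast.walk(ast.parse(node.code)):
        if isinstance(n, ast.Call) and isinstance(n.func, ast.Attribute):
            name = _first_arg_name(n)
            if name is None:
                continue
            if n.func.attr == "fit":
                fitted.add(name)
            elif n.func.attr in ("predict", "score"):
                scored.add(name)
    # Leakage signal: the same variable is passed to fit and to predict/score.
    for name in sorted(fitted & scored):
        warnings.append(f"'{name}' is used for both fitting and scoring")
    out = node.term_out if isinstance(node.term_out, str) else "".join(node.term_out)
    # Metric sanity: flag RMSE values below the configured floor.
    for match in RMSE_RE.finditer(out):
        value = float(match.group(1))
        if value < RMSE_FLOOR:
            warnings.append(f"suspiciously low RMSE: {value}")
    return warnings


# Usage with a stand-in node object (SimpleNamespace replaces the real node class).
leaky = SimpleNamespace(
    code="model.fit(X, y)\npreds = model.predict(X)\n",
    term_out="RMSE: 0.0001\n",
)
print(validate_scientific_correctness(leaky))
```

This only catches the simplest leakage pattern. Code that renames or copies the DataFrame between fitting and scoring would pass, so the check is better treated as a warning source than as proof of validity.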

## Context

We observed this consistently while running 16 experiments across three datasets (MovieLens 100K, Amazon Video Games, Last.FM) with the GPT-5 Nano and GPT-5.4 models. The reviewer's structural code review is reliable for distinguishing "buggy" from "running" nodes (as the documentation claims), but it does not catch the scientific issues that ultimately determine whether the experiment is publishable.
