Add scientific validity checks to node scoring (train/test split, metric sanity, data leakage detection) #62

Description

@Artem-Dneprovskii

Summary

AutoRecLab's reviewer can assign high scores (e.g. 92/100, is_satisfactory=True) to code that is scientifically invalid: for example, models trained and evaluated on the same data, runs with no train/test split, or runs that report suspiciously perfect metrics (RMSE near 0). The current scoring system evaluates whether code executes successfully and meets the listed requirements, but does not check for elementary scientific validity.

Current Behavior

In treesearch/minimal_agent.py (score_code), a node is marked satisfactory whenever:

  1. The reviewer LLM marks it as not buggy, and
  2. All requirements are individually marked as fulfilled.

Neither check inspects whether the experimental design is scientifically sound. A run can:

  • Train and evaluate on the same dataset with no split → high "accuracy", satisfactory
  • Report RMSE values like 0.0001 (a strong sign of data leakage) → still satisfactory if a requirement says "compute RMSE"
  • Omit a baseline entirely → satisfactory if no requirement explicitly demands one
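
The gap can be illustrated with a minimal sketch of the rule as described above. This is not the actual score_code implementation; the `Review` class and its field names are hypothetical stand-ins for the reviewer's verdict.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    """Hypothetical stand-in for the reviewer LLM's verdict on one node."""
    is_buggy: bool
    requirements_fulfilled: dict = field(default_factory=dict)


def is_satisfactory(review: Review) -> bool:
    # The rule as described in this issue: the node is not buggy and
    # every listed requirement is individually marked as fulfilled.
    return (not review.is_buggy) and all(review.requirements_fulfilled.values())


# A run that trains and evaluates on the same data still passes, because
# neither condition looks at the experimental design.
leaky_run = Review(is_buggy=False, requirements_fulfilled={"compute RMSE": True})
result = is_satisfactory(leaky_run)
```

Here `result` is `True` even though nothing about the split, the metric range, or a baseline was inspected.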

Expected Behavior

Before marking a node satisfactory, AutoRecLab should run a lightweight scientific-validity check on the generated code and execution output. At minimum:

  1. Train/test separation: verify the code uses an explicit split (e.g. train_test_split, k-fold, LensKit's crossfold)
  2. Suspicious metrics: flag metric values that are implausible (e.g. RMSE < 0.01 on a real recommendation dataset, accuracy = 1.0)
  3. Baseline presence: if the task involves comparison or evaluation, check that at least one comparison point exists (popular-item, random, mean predictor)
  4. Data leakage signals: flag if the same DataFrame is used for both fitting and scoring without intermediate splitting

When any of these checks fails, the node should either be marked not-satisfactory or flagged with a scientific_validity_warning for the user.
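
As a rough sketch of checks 1 and 3, a keyword scan over the generated code may already catch the most obvious cases. The function names and both hint lists below are assumptions made for illustration, not part of AutoRecLab, and a substring match is only a heuristic: it can be fooled by comments or unused imports.

```python
# Assumed, non-exhaustive identifiers that suggest an explicit split.
SPLIT_HINTS = (
    "train_test_split",
    "KFold",
    "crossfold",
    "cross_val_score",
    "cross_validate",
)

# Assumed, non-exhaustive keywords that suggest a comparison point.
# "random" is left out on purpose: it would also match "random_state".
BASELINE_HINTS = (
    "baseline",
    "popular",
    "global_mean",
    "dummyregressor",
    "dummyclassifier",
)


def has_explicit_split(code: str) -> bool:
    """Return True if the code mentions a known splitting utility."""
    return any(hint in code for hint in SPLIT_HINTS)


def has_baseline(code: str) -> bool:
    """Return True if the code mentions something that looks like a baseline."""
    lowered = code.lower()
    return any(hint in lowered for hint in BASELINE_HINTS)


def scientific_validity_warnings(code: str) -> list:
    """Collect human-readable warnings for the two keyword checks."""
    warnings = []
    if not has_explicit_split(code):
        warnings.append("no explicit train/test split found")
    if not has_baseline(code):
        warnings.append("no baseline or comparison point found")
    return warnings
```

An empty list means both keyword checks passed; any entry could be surfaced as the scientific_validity_warning mentioned above.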

Why This Matters

This improvement would:

  • Prevent false positives in the success metric. In our replication of the AutoRecLab preprint experiments, we observed runs scoring 92/100 with is_satisfactory=True whose generated code contained catastrophic overfitting
  • Make the satisfaction signal trustworthy for downstream automation and reporting
  • Require only a few hours to implement: either extend the existing reviewer prompt with an explicit "scientific validity" section, or add a dedicated post-scoring check function

Suggested Implementation

A minimal version is a prompt extension in score_code: add to the review instructions an explicit list of validity checks (train/test split present, no data leakage, plausible metric ranges, baseline presence) and require the reviewer to mark the node not-satisfactory if any fail.
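
One possible shape for that extension is sketched below. The wording of the section and the helper `extend_review_prompt` are hypothetical; the existing reviewer prompt in score_code is not shown in this issue, so it is treated as an opaque string here.

```python
# Hypothetical prompt section listing the four checks proposed in this issue.
SCIENTIFIC_VALIDITY_SECTION = """\
Scientific validity (all checks must pass):
1. Train/test separation: the code uses an explicit split
   (e.g. train_test_split, k-fold, LensKit's crossfold).
2. No data leakage: the same data is never used for both fitting and scoring.
3. Plausible metrics: reported values are not suspiciously perfect
   (e.g. RMSE < 0.01 on a real recommendation dataset, accuracy = 1.0).
4. Baseline presence: if the task involves comparison or evaluation,
   at least one comparison point exists (popular-item, random, mean predictor).
If any check fails, mark the node as not satisfactory and name the failed check.
"""


def extend_review_prompt(base_prompt: str) -> str:
    """Append the validity section to an existing review prompt."""
    return base_prompt.rstrip() + "\n\n" + SCIENTIFIC_VALIDITY_SECTION
```

Keeping the section in a separate constant makes it easy to version or disable without touching the rest of the prompt.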

A more robust version is a small dedicated validate_scientific_correctness(node) function called after the existing scoring loop. It can pattern-match on the code (look for .fit(...) and .predict(...) operating on the same variable) and inspect parsed metric values from node.term_out against sensible thresholds.
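
A minimal sketch of such a function follows, under several assumptions: the node exposes its source as a string attribute named `code` (hypothetical), `node.term_out` is a single string, metrics are printed as `name: value` or `name = value`, and the thresholds are the ones quoted earlier in this issue. It uses Python's `ast` module instead of regular expressions for the fit/predict check.

```python
import ast
import re

EVAL_METHODS = {"predict", "score"}
METRIC_RE = re.compile(
    r"\b(rmse|accuracy)\b\s*[:=]\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)",
    re.IGNORECASE,
)
RMSE_FLOOR = 0.01  # assumed threshold, taken from the example in this issue


def find_fit_eval_overlap(code: str) -> set:
    """Names passed as first positional argument to both .fit() and .predict()/.score()."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return set()  # unparsable code is left to the existing "buggy" check
    fitted, evaluated = set(), set()
    for call in ast.walk(tree):
        if not (isinstance(call, ast.Call) and isinstance(call.func, ast.Attribute)):
            continue
        if not (call.args and isinstance(call.args[0], ast.Name)):
            continue
        name = call.args[0].id
        if call.func.attr == "fit":
            fitted.add(name)
        elif call.func.attr in EVAL_METHODS:
            evaluated.add(name)
    return fitted & evaluated


def validate_scientific_correctness(node) -> list:
    """Return warnings; an empty list means no validity issue was detected."""
    warnings = []
    for name in sorted(find_fit_eval_overlap(node.code)):
        warnings.append(f"'{name}' is used for both fitting and scoring")
    for metric, raw in METRIC_RE.findall(node.term_out):
        value = float(raw)
        if metric.lower() == "rmse" and value < RMSE_FLOOR:
            warnings.append(f"RMSE {value} is below {RMSE_FLOOR}")
        elif metric.lower() == "accuracy" and value >= 1.0:
            warnings.append(f"accuracy {value} is suspiciously perfect")
    return warnings
```

The name-based comparison is deliberately simple: it misses leakage through aliases or keyword arguments, and it can raise a false alarm when a variable is reassigned between the fit and predict calls, so its output is better treated as a warning than as a hard failure.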

Context

We observed this consistently while running 16 experiments across three datasets (MovieLens 100K, Amazon Video Games, Last.FM) with the GPT-5 Nano and GPT-5.4 models. The reviewer's structural code review is reliable for distinguishing "buggy" from "running" nodes (as the documentation claims), but it does not catch the scientific issues that ultimately determine whether the experiment is publishable.
