Summary
AutoRecLab's reviewer can assign high scores (e.g. 92/100, is_satisfactory=True) to code that is scientifically invalid: for example, code that trains and evaluates models on the same data, omits a train/test split, or produces suspiciously perfect metrics (RMSE near 0). The current scoring system evaluates whether code executes successfully and meets the listed requirements, but does not check for elementary scientific validity.
Current Behavior
In treesearch/minimal_agent.py (score_code), a node is marked satisfactory whenever:
- The reviewer LLM marks it as not buggy, and
- All requirements are individually marked as fulfilled.
Neither check inspects whether the experimental design is scientifically sound. A run can:
- Train and evaluate on the same dataset with no split → high "accuracy", satisfactory
- Report RMSE values like 0.0001 (a strong sign of data leakage) → still satisfactory if a requirement says "compute RMSE"
- Omit a baseline entirely → satisfactory if no requirement explicitly demands one
Expected Behavior
Before marking a node satisfactory, AutoRecLab should run a lightweight scientific-validity check on the generated code and execution output. At minimum:
- Train/test separation: verify the code uses an explicit split (e.g. train_test_split, k-fold, LensKit's crossfold)
- Suspicious metrics: flag metric values that are implausible in practice (e.g. RMSE < 0.01 on a real recommendation dataset, accuracy = 1.0)
- Baseline presence: if the task involves comparison or evaluation, check that at least one comparison point exists (popular-item, random, mean predictor)
- Data leakage signals: flag if the same DataFrame is used for both fitting and scoring without intermediate splitting
When any of these checks fail, the node should either be marked not-satisfactory or flagged with a scientific_validity_warning for the user.
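To make the split and baseline checks above concrete, here is a rough sketch of a text-level check on the generated code. The function name find_design_warnings, the needs_baseline flag, and both pattern lists are hypothetical and not part of AutoRecLab. The patterns are coarse keyword heuristics (a comment containing "baseline" would satisfy the check), so treat this as a starting point only.

```python
import re

# Hypothetical pattern lists; extend them for the libraries a project actually uses.
SPLIT_PATTERNS = [
    r"\btrain_test_split\s*\(",  # scikit-learn hold-out split
    r"\bKFold\s*\(",             # scikit-learn k-fold cross-validation
    r"\bcrossfold\b",            # LensKit crossfold module
]
BASELINE_HINTS = [
    r"baseline",
    r"popular",
    r"DummyRegressor",
]


def find_design_warnings(code: str, needs_baseline: bool = False) -> list[str]:
    """Return human-readable warnings about the experimental design in `code`."""
    warnings = []
    if not any(re.search(p, code) for p in SPLIT_PATTERNS):
        warnings.append("no explicit train/test split found")
    if needs_baseline and not any(
        re.search(p, code, re.IGNORECASE) for p in BASELINE_HINTS
    ):
        warnings.append("no baseline comparison found")
    return warnings
```

An empty list means no check fired. A non-empty list could either force the node to not-satisfactory or be stored as the scientific_validity_warning shown to the user.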
Why This Matters
This improvement would:
- Prevent false positives in the success metric. In our replication of the AutoRecLab preprint experiments, we observed runs scoring 92/100 with is_satisfactory=True whose generated code contained catastrophic overfitting.
- Make the satisfaction signal trustworthy for downstream automation and reporting
- Require only a few hours to implement: either extend the existing reviewer prompt with an explicit "scientific validity" section, or add a dedicated post-scoring check function
Suggested Implementation
A minimal version is a prompt extension in score_code: add to the review instructions an explicit list of validity checks (train/test split present, no data leakage, plausible metric ranges, baseline presence) and require the reviewer to mark the node not-satisfactory if any fail.
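A sketch of what the prompt extension could look like. We have not checked how score_code actually assembles its review prompt, so the constant VALIDITY_SECTION, the helper extend_review_prompt, and the idea that the prompt is a single string are all assumptions. The wording of the checks is a suggestion.

```python
# Hypothetical text appended to the reviewer instructions.
VALIDITY_SECTION = """\
Scientific validity checks (apply before deciding whether the node is satisfactory):
1. Train/test separation: the code uses an explicit split (hold-out or cross-validation).
2. No data leakage: the same data is never used for both fitting and scoring.
3. Plausible metrics: values such as RMSE near 0 or accuracy of exactly 1.0 are treated as suspicious.
4. Baseline presence: if the task involves comparison or evaluation, at least one baseline is reported.
If any check fails, mark the node as not satisfactory and name the failed check.
"""


def extend_review_prompt(base_prompt: str) -> str:
    """Append the validity section to an existing review prompt string."""
    return base_prompt.rstrip() + "\n\n" + VALIDITY_SECTION
```

Keeping the section in one constant makes it easy to review and version separately from the rest of the prompt.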
A more robust version is a small dedicated validate_scientific_correctness(node) function called after the existing scoring loop. It can pattern-match on the code (look for .fit(...) and .predict(...) operating on the same variable) and inspect parsed metric values from node.term_out against sensible thresholds.
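A rough sketch of such a function, assuming the node exposes the generated source as node.code and the captured output as node.term_out (a string or a list of strings). The attribute name code, the FakeNode stand-in, the rmse_floor default, and the "RMSE: value" output format are assumptions made for illustration. The fit/predict check only compares the first positional argument by name, so it is a heuristic and can miss leakage or raise false alarms.

```python
import ast
import re
from dataclasses import dataclass

NUMBER = r"([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)"


@dataclass
class FakeNode:
    """Stand-in for the real search-tree node; only the two fields used here."""
    code: str
    term_out: str


def validate_scientific_correctness(node, rmse_floor: float = 0.01) -> list[str]:
    """Return warnings; an empty list means no heuristic fired."""
    warnings = []

    # 1. Same variable passed to .fit(...) and to .predict(...)/.score(...).
    fitted, scored = set(), set()
    for n in ast.walk(ast.parse(node.code)):
        if not (isinstance(n, ast.Call) and isinstance(n.func, ast.Attribute)):
            continue
        if not (n.args and isinstance(n.args[0], ast.Name)):
            continue
        name = n.args[0].id
        if n.func.attr == "fit":
            fitted.add(name)
        elif n.func.attr in ("predict", "score"):
            scored.add(name)
    for name in sorted(fitted & scored):
        warnings.append(f"'{name}' is used for both fitting and scoring")

    # 2. Implausible metric values in the captured output.
    out = node.term_out
    text = out if isinstance(out, str) else "".join(out)
    for m in re.finditer(r"RMSE\s*[:=]\s*" + NUMBER, text, re.IGNORECASE):
        if float(m.group(1)) < rmse_floor:
            warnings.append(f"RMSE {m.group(1)} is below {rmse_floor}")
    for m in re.finditer(r"accuracy\s*[:=]\s*" + NUMBER, text, re.IGNORECASE):
        if float(m.group(1)) == 1.0:
            warnings.append("accuracy is exactly 1.0")
    return warnings
```

Example usage with the stand-in node:

```python
leaky = FakeNode(code="model.fit(X)\npreds = model.predict(X)\n", term_out="RMSE: 0.0001\n")
for w in validate_scientific_correctness(leaky):
    print(w)
```

Using ast rather than a regular expression on the source avoids matching .fit inside strings or comments. The thresholds should probably be configurable per dataset and metric.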
Context
We observed this consistently while running 16 experiments across three datasets (MovieLens 100K, Amazon Video Games, Last.FM) with the GPT-5 Nano and GPT-5.4 models. The reviewer's structural code review is reliable for distinguishing "buggy" from "running" nodes (as the documentation claims), but it does not catch the scientific issues that ultimately determine whether the experiment is publishable.