Clarify how initial skill S0 relates to the reported no-skill baseline #33

@LifeIsSoSolong

Hi SkillOpt team,

Thanks again for releasing the split manifests and keeping the repo updated. While reproducing the experiments, I ran into a point that would be worth clarifying in the paper or the repo documentation.

From the training code, it looks like env.skill_init is loaded as the initial skill S0, and the final test evaluation runs both:

  1. skill_init on the test split as test_eval_baseline, and
  2. the selected best_skill on the test split as test_eval.

However, the current initial.md files are not uniformly empty across benchmarks:

  • searchqa/skills/initial.md is essentially empty: "No learned rules yet."
  • spreadsheetbench/skills/initial.md already contains a spreadsheet workflow, library guidance, and output requirements.
  • livemathematicianbench/skills/initial.md already contains MCQ heuristics.
  • officeqa/skills/initial.md already contains retrieval/evidence/final-answer discipline.

This makes me unsure how to interpret the S0 baseline produced by the training script relative to the baselines reported in the paper.
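The check behind the list above can be sketched as follows. This is a minimal sketch, not code from the repo: it assumes each benchmark keeps its seed skill at <benchmark>/skills/initial.md, treats the quoted placeholder "No learned rules yet." as the marker of an empty skill, and ignores markdown heading lines. The names is_effectively_empty and audit_initial_skills are hypothetical.

```python
from pathlib import Path

# Assumption: this placeholder (quoted from searchqa) marks an "empty" seed skill.
EMPTY_PLACEHOLDERS = {"", "No learned rules yet."}


def is_effectively_empty(text: str) -> bool:
    """Return True if the skill text is blank or only the placeholder line."""
    # Assumption: heading lines (starting with '#') carry no instructions,
    # so they are dropped together with blank lines before comparing.
    body = [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return " ".join(body) in EMPTY_PLACEHOLDERS


def audit_initial_skills(root: Path) -> dict[str, bool]:
    """Map each benchmark directory name to whether its initial.md is effectively empty."""
    return {
        path.parent.parent.name: is_effectively_empty(path.read_text(encoding="utf-8"))
        for path in sorted(root.glob("*/skills/initial.md"))
    }
```

Calling audit_initial_skills(Path(".")) from the directory that contains the benchmark folders returns one entry per benchmark, with True for a seed skill that is effectively empty and False for one that already carries instructions.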

Could you clarify the following?

  1. In Table 1 / Table 5, does the No skill row correspond to an empty skill document, no injected skill document at all, or the benchmark's default system prompt?
  2. Does the test_eval_baseline score produced by scripts/train.py correspond to the paper's No skill result, or should it be interpreted separately as an initial/seed skill S0 result?
  3. For benchmarks where initial.md is non-empty, should that initial skill be considered a human-written skill, an LLM-generated skill, a benchmark/harness instruction scaffold, or a separate seed skill that is not one of the Table 1 baselines?
  4. Why are the initial skills non-empty for some benchmarks instead of starting from an empty skill for all benchmarks?
  5. What is the recommended way to reproduce the paper's No skill baseline from the released code: run eval_only.py with an empty skill file, omit env.skill_init, or use another config/path?

This matters for reproduction because summary.json reports a baseline_test_hard -> best_skill_test_hard improvement. If baseline_test_hard is actually the S0 score rather than the paper's No skill score, that improvement should not be compared directly to the Table 1 no-skill row.
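To make the comparison concrete, here is a minimal sketch of how the reported improvement might be read. It assumes that baseline_test_hard and best_skill_test_hard are top-level numeric fields in summary.json, which is an assumption about the file layout, and it labels the result as relative to S0 only on the premise described above. The function names are hypothetical.

```python
import json
from pathlib import Path


def seed_relative_gain(summary: dict) -> float:
    """Difference between the selected skill's score and the baseline score."""
    # Assumption: both fields are top-level numbers in summary.json.
    return summary["best_skill_test_hard"] - summary["baseline_test_hard"]


def describe(summary_path: Path) -> str:
    """Format the gain with a label that avoids equating it with the No skill row."""
    summary = json.loads(summary_path.read_text(encoding="utf-8"))
    gain = seed_relative_gain(summary)
    # Assumption: baseline_test_hard was produced with skill_init (S0),
    # so the gain is labeled as relative to S0.
    return f"best_skill vs S0 (assumed, not the Table 1 No skill row): {gain:+.4f}"
```

Under that premise, the number is a gain over the seed skill, and a separate run without any skill would be needed before comparing against the Table 1 row.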

Thanks!
