Hi SkillOpt team,
Thanks again for releasing the split manifests and keeping the repo updated. While reproducing the experiments, I ran into a point that would be helpful to clarify in the paper/repo documentation.
From the training code, it looks like env.skill_init is loaded as the initial skill S0, and the final test evaluation runs both:
- skill_init on the test split as test_eval_baseline, and
- the selected best_skill on the test split as test_eval.
However, the current initial.md files are not uniformly empty across benchmarks:
- searchqa/skills/initial.md is essentially empty: "No learned rules yet."
- spreadsheetbench/skills/initial.md already contains a spreadsheet workflow, library guidance, and output requirements.
- livemathematicianbench/skills/initial.md already contains MCQ heuristics.
- officeqa/skills/initial.md already contains retrieval/evidence/final-answer discipline.
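
In case it helps others check their own checkout, here is a minimal sketch of how the empty/non-empty distinction above could be audited. It is not part of the SkillOpt code: the helper names, the repo_root argument, and the rule that a file counts as "effectively empty" when it contains only headings and the placeholder sentence are all my assumptions.

```python
from pathlib import Path

# Assumption: the placeholder sentence quoted from searchqa/skills/initial.md
# is the only text that marks a skill file as "no content yet".
PLACEHOLDER = "No learned rules yet."


def is_effectively_empty(text: str) -> bool:
    """Return True if the text has nothing beyond blank lines, headings, and the placeholder."""
    lines = [line.strip() for line in text.splitlines()]
    content = [
        line for line in lines
        if line and not line.startswith("#") and line != PLACEHOLDER
    ]
    return not content


def audit_initial_skills(repo_root: str, benchmarks: list[str]) -> dict[str, bool]:
    """Map each benchmark name to whether <benchmark>/skills/initial.md is effectively empty.

    Assumption: repo_root is the directory that contains the benchmark folders.
    """
    result = {}
    for name in benchmarks:
        path = Path(repo_root) / name / "skills" / "initial.md"
        result[name] = is_effectively_empty(path.read_text(encoding="utf-8"))
    return result
```

Calling audit_initial_skills with the directory that holds the benchmark folders and the four benchmark names listed above would show at a glance which S0 files carry real instructions.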
This makes me unsure how to interpret the S0 baseline produced by the training script relative to the baselines reported in the paper.
Could you clarify the following?
- In Table 1 / Table 5, does the No skill row correspond to an empty skill document, no injected skill document at all, or the benchmark's default system prompt?
- Does the test_eval_baseline score produced by scripts/train.py correspond to the paper's No skill result, or should it be interpreted separately as an initial/seed skill S0 result?
- For benchmarks where initial.md is non-empty, should that initial skill be considered a human-written skill, an LLM-generated skill, a benchmark/harness instruction scaffold, or a separate seed skill that is not one of the Table 1 baselines?
- Why are the initial skills non-empty for some benchmarks instead of starting from an empty skill for all benchmarks?
- What is the recommended way to reproduce the paper's No skill baseline from the released code: run eval_only.py with an empty skill file, omit env.skill_init, or use another config/path?
This matters for reproduction because summary.json reports a baseline_test_hard -> best_skill_test_hard improvement, but if baseline_test_hard is actually S0 rather than the paper's No skill, it should not be directly compared to the Table 1 no-skill row.
Thanks!