Clarify how initial skill S0 relates to the reported no-skill baseline #33

Hi SkillOpt team,

Thanks again for releasing the split manifests and keeping the repo updated. While reproducing the experiments, I ran into a point that I think should be clarified in the paper or the repo documentation.

From the training code, it looks like `env.skill_init` is loaded as the initial skill `S0`, and the final test evaluation runs both:

1. `skill_init` on the test split as `test_eval_baseline`, and
2. the selected `best_skill` on the test split as `test_eval`.

However, the current `initial.md` files are not uniformly empty across benchmarks:

- `searchqa/skills/initial.md` is essentially empty: "No learned rules yet."
- `spreadsheetbench/skills/initial.md` already contains a spreadsheet workflow, library guidance, and output requirements.
- `livemathematicianbench/skills/initial.md` already contains MCQ heuristics.
- `officeqa/skills/initial.md` already contains retrieval/evidence/final-answer discipline.

This makes me unsure how to interpret the `S0` baseline produced by the training script relative to the baselines reported in the paper.
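The non-uniformity described above can be checked mechanically. Below is a minimal sketch, assuming the benchmark directories sit directly under a common root as `<benchmark>/skills/initial.md` (the root location is an assumption and may differ in the actual repo). The helper names and the 10-word cutoff for "essentially empty" are hypothetical choices for illustration, not part of the released code.

```python
from pathlib import Path

# Hypothetical cutoff: a seed skill with at most this many words counts as
# "essentially empty" (e.g. the placeholder "No learned rules yet.").
EMPTY_WORD_LIMIT = 10


def describe_initial_skill(text: str, empty_word_limit: int = EMPTY_WORD_LIMIT) -> dict:
    """Summarize one initial skill document by its whitespace-separated word count."""
    word_count = len(text.split())
    return {"words": word_count, "essentially_empty": word_count <= empty_word_limit}


def survey_initial_skills(root_dir: str) -> dict:
    """Map each benchmark name to a summary of its skills/initial.md file.

    Assumes the layout <root_dir>/<benchmark>/skills/initial.md.
    """
    report = {}
    for path in sorted(Path(root_dir).glob("*/skills/initial.md")):
        benchmark = path.parent.parent.name
        report[benchmark] = describe_initial_skill(path.read_text(encoding="utf-8"))
    return report


if __name__ == "__main__":
    for name, info in survey_initial_skills(".").items():
        print(f"{name}: {info['words']} words, essentially_empty={info['essentially_empty']}")
```

Any benchmark reported with `essentially_empty=False` is one where the `S0` score likely reflects a seeded skill rather than a skill-free run.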

Could you clarify the following?

1. In Table 1 / Table 5, does the **No skill** row correspond to an empty skill document, no injected skill document at all, or the benchmark's default system prompt?
2. Does the `test_eval_baseline` score produced by `scripts/train.py` correspond to the paper's **No skill** result, or should it be interpreted separately as an **initial/seed skill S0** result?
3. For benchmarks where `initial.md` is non-empty, should that initial skill be considered a human-written skill, an LLM-generated skill, a benchmark/harness instruction scaffold, or a separate seed skill that is not one of the Table 1 baselines?
4. Why are the initial skills non-empty for some benchmarks instead of starting from an empty skill for all benchmarks?
5. What is the recommended way to reproduce the paper's **No skill** baseline from the released code: run `eval_only.py` with an empty skill file, omit `env.skill_init`, or use another config/path?

This matters for reproduction because `summary.json` reports a `baseline_test_hard -> best_skill_test_hard` improvement, but if `baseline_test_hard` is actually `S0` rather than the paper's `No skill`, it should not be directly compared to the Table 1 no-skill row.
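Until this is clarified, one way to avoid the mix-up when reading results is to relabel the summary fields explicitly. The sketch below assumes `summary.json` exposes `baseline_test_hard` and `best_skill_test_hard` as top-level numeric fields. That structure is a guess based on the names above, and the function and output key names are hypothetical.

```python
import json


def relabel_summary(summary: dict) -> dict:
    """Report the test improvement as a gain over S0, not over a no-skill run.

    Assumes `summary` has top-level numeric fields `baseline_test_hard`
    and `best_skill_test_hard` (an assumption about the file's structure).
    """
    s0_score = summary["baseline_test_hard"]
    best_score = summary["best_skill_test_hard"]
    return {
        "s0_test_hard": s0_score,
        "best_skill_test_hard": best_score,
        "gain_over_s0": best_score - s0_score,
    }


def relabel_summary_file(path: str) -> dict:
    """Load a summary.json file from `path` and relabel its baseline fields."""
    with open(path, encoding="utf-8") as handle:
        return relabel_summary(json.load(handle))
```

Under this reading, `gain_over_s0` is only comparable to the paper's improvement over **No skill** for benchmarks whose `initial.md` is effectively empty.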

Thanks!

