Hi, thanks for the great paper.
I noticed a large discrepancy between SkillGrad and SkillOpt on SpreadsheetBench under seemingly similar GPT-5.4 settings. In particular, the "No Skill" baseline is 62.5 in SkillGrad, but only 41.4 in SkillOpt.
Could you clarify whether the benchmark version, evaluation harness, prompting setup, or any filtering/preprocessing differ between the two works?
The baseline gap alone (21.1 points) is larger than many of the reported optimization gains, so understanding the setup differences would be very helpful.
Thanks!
