Clarification on SpreadsheetBench Baseline Discrepancy #35

Hi, thanks for the great paper.

I noticed a large discrepancy between SkillGrad and SkillOpt on SpreadsheetBench under seemingly similar GPT-5.4 settings. In particular, the "No Skill" baseline is 62.5 in SkillGrad but only 41.4 in SkillOpt, a gap of 21.1 points.

Could you clarify whether the benchmark version, evaluation harness, prompting setup, or any filtering/preprocessing differ between the two works?

The baseline gap itself seems larger than many reported optimization gains, so understanding the setup differences would be very helpful.

Thanks!

<img width="2658" height="1695" alt="Image" src="https://github.com/user-attachments/assets/d98fc683-6f1a-42a2-b31e-161da07d3928" />
