Hi WritingBench authors,
Thank you for open-sourcing WritingBench and the evaluation scripts.
We are trying to reproduce the length subset result for Qwen2.5-7B-Instruct, and we would like to confirm whether our current setting is comparable to the leaderboard / paper setting.
What we have done:
- We aligned the generation configuration with the official recommendation as closely as possible:
  - temperature = 0.7
  - top_p = 0.8
  - top_k = 20
  - max_tokens = 8192 (limited by our serving platform)
- We also verified the scoring / aggregation logic carefully against the official `generate_response.py` + `evaluate_benchmark.py` + `calculate_scores.py`.
All of these routes consistently give us a result around:
- Overall ≈ 5.50
- length_R ≈ 5.50
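For concreteness, the generation settings we used can be written out as a plain config dict (a sketch only; the parameter names follow common OpenAI-compatible serving conventions, not any particular WritingBench script):

```python
# Sampling configuration we aligned with the official recommendation.
# This is a sketch: parameter names follow OpenAI-compatible serving
# conventions and may differ slightly on other platforms.
gen_config = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "max_tokens": 8192,  # capped by our serving platform
}

# Sanity check that every knob matches the recommended values.
assert gen_config["temperature"] == 0.7
assert gen_config["top_p"] == 0.8
```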
The only unavoidable difference we found is in the Claude judge call:
- we use claude_sonnet-4-5-20250929
- for Claude Sonnet 4.5, the provider does not allow `temperature` and `top_p` to be specified at the same time, so for evaluation we used the closest compatible setting:
  - top_p = 0.95
  - max_tokens = 2048
  - temperature omitted, letting the provider default apply
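The way we handle that mutual-exclusion constraint can be sketched as a small helper (hypothetical; `build_judge_kwargs` is our own wrapper, not part of the WritingBench scripts):

```python
# Hypothetical helper: builds the judge request kwargs while respecting
# the provider constraint that temperature and top_p cannot both be set
# for Claude Sonnet 4.5.
def build_judge_kwargs(top_p=0.95, max_tokens=2048, temperature=None):
    kwargs = {"max_tokens": max_tokens}
    if temperature is not None and top_p is not None:
        # Our provider rejects requests that set both sampling knobs.
        raise ValueError("temperature and top_p cannot be set together")
    if temperature is not None:
        kwargs["temperature"] = temperature
    elif top_p is not None:
        kwargs["top_p"] = top_p
    return kwargs

# Our evaluation setting: top_p only, temperature left to the default.
judge_kwargs = build_judge_kwargs()
```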
Given that the rest of the pipeline seems aligned, we would like to ask:
- Is the public leaderboard / paper result for `Qwen2.5-7B-Instruct` directly comparable to the current open-source evaluation scripts?
- Was the reported result produced with the released Claude-based evaluation pipeline, or with the WritingBench critic model?
- Could the difference come from a different judge deployment environment or parameter handling?
We would really appreciate any clarification. Thank you very much!