Score variance of different runs #103

@hanningzhang

Dear Authors,

Thank you so much for your benchmark. I am testing my Qwen3.5-35B-A3B with Opus-4.5 as the judge. I ran the benchmark three times, and the mean scores were 65.71%, 69.13%, and 74.26%, respectively. The token cost also varies a lot between runs. Is this level of variance considered normal, and are there any methods to reduce it? I am also curious how you calculate the AVG score on your leaderboard.

Here are the logs of my run:

2026-04-03 19:14:12,124 - INFO - Total tokens used: 1,662,453 (input: 1,625,709, output: 36,744)
2026-04-03 19:14:12,124 - INFO - Total API requests: 115
2026-04-03 19:14:12,124 - INFO - Avg tokens/task: 69,269
2026-04-03 19:14:12,124 - INFO - Mean score: 0.6571
2026-04-03 19:14:12,124 - INFO - Score per 1K tokens: 0.0095 (higher = more efficient)

2026-04-03 20:29:21,516 - INFO - Total tokens used: 1,549,510 (input: 1,514,401, output: 35,109)
2026-04-03 20:29:21,516 - INFO - Total API requests: 100
2026-04-03 20:29:21,516 - INFO - Avg tokens/task: 64,563
2026-04-03 20:29:21,516 - INFO - Mean score: 0.6913
2026-04-03 20:29:21,521 - INFO - Score per 1K tokens: 0.0107 (higher = more efficient)

2026-04-03 21:02:03,418 - INFO - Total tokens used: 1,524,989 (input: 1,497,458, output: 27,531)
2026-04-03 21:02:03,418 - INFO - Total API requests: 104
2026-04-03 21:02:03,418 - INFO - Avg tokens/task: 63,541
2026-04-03 21:02:03,418 - INFO - Mean score: 0.7426
2026-04-03 21:02:03,418 - INFO - Score per 1K tokens: 0.0117 (higher = more efficient)
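
To put a number on the spread, here is a minimal sketch that summarizes the three mean scores from the logs above. The variable names are my own, and this is not necessarily how the leaderboard AVG is computed; it only reports the mean, sample standard deviation, and standard error over these three runs.

```python
import math
import statistics

# Mean scores copied from the three run logs above.
run_scores = [0.6571, 0.6913, 0.7426]

# Average of the per-run mean scores.
mean_score = statistics.mean(run_scores)

# Sample standard deviation (n - 1 denominator), since three runs
# are a small sample of all possible runs.
sample_std = statistics.stdev(run_scores)

# Standard error of the mean: how uncertain the averaged score still is.
std_error = sample_std / math.sqrt(len(run_scores))

print(f"mean over runs: {mean_score:.4f}")
print(f"sample std dev: {sample_std:.4f}")
print(f"standard error: {std_error:.4f}")
```

With these three runs the average is roughly 0.70 with a run-to-run standard deviation of about 0.04, so a single run can easily land several points away from the average.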

Thank you!
