I tested TransNormerLLM-385M on the BoolQ benchmark with lm-evaluation-harness v0.4.0, but my result does not match the one you reported. Beyond BoolQ and the 385M model, I was also unable to reproduce the other benchmarks and models; all of my scores come out significantly lower.
Could I have made a mistake in how I measured the benchmark? Could you please share the script you used? Here is my harness output:
hf (pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 4
|Tasks|Version|Filter|n-shot|Metric|Value | |Stderr|
|-----|-------|------|-----:|------|-----:|---|-----:|
|boolq|Yaml |none | 0|acc |0.4859|± |0.0087|
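For reference, this is roughly how I invoked the harness: a minimal sketch using the v0.4.0 Python API, equivalent to the CLI run that produced the output above, with the arguments reconstructed from the config line shown there.

```python
# Reproduction sketch (lm-evaluation-harness v0.4.0 Python API); equivalent to:
#   lm_eval --model hf \
#     --model_args pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True \
#     --tasks boolq --batch_size 4
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True",
    tasks=["boolq"],
    batch_size=4,  # num_fewshot left unset, i.e. the 0-shot default shown above
)
print(results["results"]["boolq"])  # acc and stderr, as in the table above
```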