I tested TransNormerLLM-385M on the BoolQ benchmark with lm-evaluation-harness v0.4.0, but my result does not match the one you reported. Beyond BoolQ and the 385M model, I was also unable to reproduce the other benchmarks and models; all of my scores come out significantly lower.
Could I have made a mistake in how I measured the benchmark? Could you please share the script you used? Here is my harness output:
hf (pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 4
|Tasks|Version|Filter|n-shot|Metric|Value | |Stderr|
|-----|-------|------|-----:|------|-----:|---|-----:|
|boolq|Yaml |none | 0|acc |0.4859|± |0.0087|
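For reference, this is roughly how I invoked the harness: a minimal sketch using the v0.4.0 Python API, equivalent to the CLI run that produced the output above, with the arguments reconstructed from the config line shown there.

```python
# Reproduction sketch (lm-evaluation-harness v0.4.0 Python API); equivalent to:
#   lm_eval --model hf \
#     --model_args pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True \
#     --tasks boolq --batch_size 4
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True",
    tasks=["boolq"],
    batch_size=4,  # num_fewshot left unset, i.e. the 0-shot default shown above
)
print(results["results"]["boolq"])  # acc and stderr, as in the table above
```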