seems that MoA does not work on MATH and QA with both weak and strong LLMs

I have thoroughly tested MoA (with one layer) on objective benchmarks that are less subjective than MT-Bench, such as GSM8K and HotpotQA.
It seems that MoA no longer helps when the LLMs are 7B-level.
In my setting, the three LLMs in layer one are `mistralai/Mistral-7B-Instruct-v0.1/2/3`, and the aggregator is `meta-llama/Meta-Llama-3.1-8B-Instruct`.
(Before the experiment, I tested each model's ability to solve the problems on its own; the strongest one is Llama-3.1-8B.)
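
To make the setup concrete, here is a minimal sketch of a one-layer MoA pipeline as I understand it. This is not the repository's actual code: `Generate`, `build_aggregation_prompt`, `moa_answer`, and the aggregation prompt wording are all hypothetical names and text chosen for illustration, and each model is assumed to be wrapped as a plain function from prompt string to completion string.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
Generate = Callable[[str], str]


def build_aggregation_prompt(question: str, proposals: List[str]) -> str:
    """Pack the layer-one answers into a single prompt for the aggregator."""
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(proposals))
    # The wording below is an assumption, not the prompt used in the experiment.
    return (
        "You are given several candidate responses to a question. "
        "Some of them may be wrong. Evaluate them critically and give one final answer.\n\n"
        f"Question: {question}\n\n"
        f"Candidate responses:\n{numbered}\n\n"
        "Final answer:"
    )


def moa_answer(
    question: str,
    proposers: List[Generate],
    aggregator: Generate,
    rounds: int = 1,
) -> str:
    """Answer a question with zero or one proposer layer.

    rounds=0: the aggregator answers alone (the single-model baseline).
    rounds=1: the proposers answer first, then the aggregator synthesizes.
    """
    if rounds not in (0, 1):
        raise ValueError("this sketch only models rounds=0 and rounds=1")
    if rounds == 0:
        return aggregator(question)
    proposals = [propose(question) for propose in proposers]
    return aggregator(build_aggregation_prompt(question, proposals))
```

In this sketch the three Mistral models would be passed as `proposers` and Llama-3.1-8B as `aggregator`, so the `rounds=0` and `rounds=1` runs differ only in whether the aggregator sees the proposers' answers.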

However, when I apply MoA, performance decreases. For example, on GSM8K the accuracy drops from 75.1 to 61.3: Llama-3.1 alone achieves 75.1 (rounds=0), while 61.3 comes from rounds=1, where the intermediate layer consists of Mistral-7B v0.1/2/3.
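
That is a drop of 13.8 accuracy points. For anyone who wants to reproduce the comparison, below is a rough sketch of one common way to score GSM8K-style outputs: take the last number in the completion and compare it to the reference. This is an assumption about the scoring, not a description of the exact script behind the numbers above, and `extract_last_number` and `accuracy` are hypothetical helper names.

```python
import re
from typing import List, Optional


def extract_last_number(text: str) -> Optional[str]:
    """Return the last number found in text, without thousands separators."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Percentage of predictions whose last number equals the reference value."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for prediction, reference in zip(predictions, references):
        predicted = extract_last_number(prediction)
        expected = float(reference.replace(",", ""))
        # A completion with no number at all counts as wrong.
        if predicted is not None and float(predicted) == expected:
            correct += 1
    return 100.0 * correct / len(references)
```

Running the same scorer on the `rounds=0` and `rounds=1` outputs keeps the comparison fair, since any extraction error then affects both settings equally.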

This finding also applies to HotpotQA.



Has anyone else observed something similar? Any suggestions on how to use 7B-level LLMs?


seems that MoA does not work on MATH and QA with both weak and strong LLMs #41
