feat: moe-router-bypass-batch-size #2349
base: main
Conversation
@avtc Do you have some rough numbers for VRAM saving in your setup with and without the PR?
@Qubitium Hi, Without batching experts, forward of moe up/gate modules on the With
Looks good! We will work on this PR after a newly |
Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
Qubitium left a comment
Cleanup
…ng object Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
@Qubitium Hi, this feature adds the ability to process MoE weights in chunks in MoeRoutingBypass mode, allowing large MoE models to be quantized with less VRAM and fewer devices.
Currently, for each expert weight, GPTQ creates a Hessian accumulator during the forward pass; for GLM-4.5-Air, for example, this comes to around 17 GB for the up/gate MoE modules of a single layer. Processing expert weights in chunks requires fewer Hessian accumulator matrices to be held at once and thus less VRAM.
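As a rough cross-check of the ~17 GB figure, the sketch below estimates the accumulator footprint under stated assumptions (fp32 accumulators, hidden size 4096, 128 routed experts, one hidden_size x hidden_size Hessian per expert for each of the up and gate projections); these model dimensions are assumptions, not quoted from the PR:

```python
# Back-of-the-envelope estimate of per-layer Hessian accumulator memory.
# Assumptions: fp32 accumulators, hidden_size=4096, 128 routed experts,
# one hidden_size x hidden_size Hessian per expert for up and gate each.
hidden_size = 4096
num_experts = 128
bytes_per_elem = 4  # float32

per_expert_bytes = hidden_size * hidden_size * bytes_per_elem  # one H matrix
per_layer_bytes = per_expert_bytes * num_experts * 2           # up + gate

print(f"per expert: {per_expert_bytes / 1e9:.2f} GB")  # ~0.07 GB
print(f"per layer:  {per_layer_bytes / 1e9:.2f} GB")   # ~17.2 GB
```

With chunking, only one chunk's worth of expert accumulators needs to be resident at a time, so the peak should drop roughly in proportion to the chunk size.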
Example option usage:
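A minimal sketch of how the option might be set, assuming the chunk size is exposed as a `moe_router_bypass_batch_size` field on GPTQModel's `QuantizeConfig` (the option name and placement are assumptions inferred from the PR title, not confirmed here):

```python
# Hypothetical usage sketch -- `moe_router_bypass_batch_size` is an assumed
# field name based on the PR title; check the merged QuantizeConfig for the
# final spelling.
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = [
    "Example calibration text sample 1.",
    "Example calibration text sample 2.",
]

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    # Process MoE expert weights in chunks while the router is bypassed, so
    # only this many Hessian accumulators per module are alive at once.
    moe_router_bypass_batch_size=16,
)

model = GPTQModel.load("zai-org/GLM-4.5-Air", quant_config)
model.quantize(calibration_dataset)
model.save("GLM-4.5-Air-gptq-4bit")
```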