
[ROCm][MoE] Modular MoE: alias fused_out with output to skip finalize copy #940

Open
mgehre-amd wants to merge 1 commit into gfx11 from matthias.moe-modular-alias-fused-out

mgehre-amd commented May 19, 2026

Summary

In the modular MoE pipeline the trailing TopKWeightAndReduceNoOP.apply() unconditionally copies fused_expert_output → output every layer. For experts that already write the post-reduce result to fused_out (e.g. HybridW4A16MoEExperts via moe_unpermute), this copy is pure overhead — one __amd_rocclr_copyBuffer per MoE layer (48 per decode step on Qwen3-Omni-30B-A3B).

Adds an opt-in accepts_output_alias() hook on FusedMoEExpertsModular. When the expert opts in (and there are no shared experts), we pass the output tensor in as the fused_out argument, so the expert writes its result there directly and the extra copy is skipped.
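
The opt-in flow described above can be sketched in plain Python. This is a minimal illustration, not the code in modular_kernel.py: `ToyExperts`, `AliasingExperts`, and `run_layer` are hypothetical names, Python lists stand in for device tensors, and the `shared_experts is None` guard simply follows the wording of this summary.

```python
from typing import List, Optional

Buffer = List[float]  # stand-in for a device tensor (assumption for this sketch)


class ToyExperts:
    """Hypothetical stand-in for an expert implementation."""

    def accepts_output_alias(self) -> bool:
        # Default: the expert does not promise that fused_out may be the
        # final output buffer, so the caller must allocate and copy.
        return False

    def apply(self, hidden: Buffer, fused_out: Buffer) -> None:
        # Toy "expert computation": write the finished result into fused_out.
        for i, x in enumerate(hidden):
            fused_out[i] = 2.0 * x


class AliasingExperts(ToyExperts):
    """Expert that opts in to writing straight into the output buffer."""

    def accepts_output_alias(self) -> bool:
        return True


def run_layer(experts: ToyExperts, hidden: Buffer, output: Buffer,
              shared_experts: Optional[object] = None) -> int:
    """Run one toy MoE layer; return how many finalize copies were needed."""
    alias = experts.accepts_output_alias() and shared_experts is None
    # When aliasing is allowed, hand the final buffer to the expert directly.
    fused_out = output if alias else [0.0] * len(hidden)
    experts.apply(hidden, fused_out)
    if fused_out is output:
        return 0  # result is already in place, nothing to copy
    output[:] = fused_out  # the trailing copy this PR aims to avoid
    return 1
```

With the default expert, `run_layer` performs one copy per layer; with the opted-in expert and no shared experts it performs none, while the values in `output` are the same in both cases.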

Benchmark — cyankiwi/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit

Strix Halo (gfx1151), input-len=256, output-len=64, num-prompts=4, max_concurrency=1.

| Build | TPOT median | Δ vs gfx11 |
|---|---|---|
| gfx11 (this PR reverted) | 12.82 ms | |
| with this PR | 12.68 ms | −1.09% |

copyBuffer per decode step: 48 → 0 (per profile capture).

Tests

No new test — the existing parametric tests/kernels/moe/test_hybrid_w4a16_moe.py exercises the modified FusedMoEKernelModularImpl end-to-end through HybridW4A16MoEExperts (the only expert that opts in). The aliasing branch is taken automatically when shared_experts is None. Existing tolerance vs torch reference holds; 38/38 pass locally.

Test plan

  • pytest tests/kernels/moe/test_hybrid_w4a16_moe.py -v
  • End-to-end Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit run; generated text matches a baseline run

Comment thread vllm/model_executor/layers/fused_moe/modular_kernel.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/modular_kernel.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/modular_kernel.py
Comment thread vllm/model_executor/layers/fused_moe/modular_kernel.py Outdated
mgehre-amd force-pushed the matthias.moe-modular-alias-fused-out branch from 3ec42bc to fe1ef08 (May 19, 2026 13:11)
mgehre-amd force-pushed the matthias.moe-modular-alias-fused-out branch from fe1ef08 to d134c72 (May 28, 2026 11:14)
mgehre-amd requested a review from amd-callumm (May 28, 2026 11:16)
mgehre-amd marked this pull request as ready for review (May 28, 2026 11:16)
mgehre-amd force-pushed the matthias.moe-modular-alias-fused-out branch 2 times, most recently from 1da299c to a373afe (June 8, 2026 22:55)
Adds an opt-in mechanism for MoE expert kernels to write their result
directly into the final output buffer, eliminating the trailing
fused_out->output copy in TopKWeightAndReduceNoOP.apply().

Changes:
- modular_kernel.py: add accepts_output_alias() to FusedMoEExpertsModular
  (default False). When True, FusedMoEKernelModularImpl.apply() passes the
  pre-allocated output tensor as the fused_out buffer so the expert kernel
  writes to it directly. The shared_experts is None guard is omitted: in
  the non-async path _maybe_apply_shared_experts is never called during
  expert kernel execution, and shared expert I/O uses separate tensors from
  output. Fix two bugs introduced during development: self.shared_experts
  (non-existent attribute) -> shared_experts parameter; stale output_alias
  name in rocm_aiter block -> out.

- hybrid_w4a16_moe.py: HybridW4A16MoEExpertsModular.accepts_output_alias()
  returns True. The data-flow through the wvSplitK kernel (gemm1->act->
  gemm2->moe_unpermute) writes its final result to the output parameter
  after all reads of hidden_states are complete, making aliasing safe.

- topk_weight_and_reduce.py: TopKWeightAndReduceNoOP.apply() adds a
  data_ptr() equality check as a fallback for the identity check, covering
  cases where the aliased tensor is wrapped in a different Python object.
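
The difference between an identity check and an address check, as described in the last item above, can be illustrated without a GPU. The sketch below is an assumption-laden stand-in: ctypes arrays play the role of tensors, `ctypes.addressof` plays the role of `data_ptr()`, and `same_storage` and `finalize` are hypothetical helpers, not functions from topk_weight_and_reduce.py.

```python
import ctypes


def same_storage(a, b) -> bool:
    """True if a and b are the same object or start at the same address."""
    return a is b or ctypes.addressof(a) == ctypes.addressof(b)


def finalize(output, fused_out) -> bool:
    """Copy fused_out into output unless they alias; report whether a copy ran."""
    if same_storage(output, fused_out):
        return False  # result is already in the output buffer
    ctypes.memmove(output, fused_out, ctypes.sizeof(output))
    return True


FloatArray4 = ctypes.c_float * 4
backing = bytearray(ctypes.sizeof(FloatArray4))

# Two distinct Python wrappers over the same memory, analogous to an
# aliased tensor that was re-wrapped somewhere along the call path.
output = FloatArray4.from_buffer(backing)
fused_out = FloatArray4.from_buffer(backing)

# A separately allocated buffer that does not alias output.
unrelated = FloatArray4()
```

Here `output is fused_out` is false even though both wrappers share memory, so an identity-only check would still copy; the address comparison catches the alias. Note that matching start addresses alone say nothing about shape or strides, so a real check may need to compare those as well.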

On Qwen3.5-35B-A3B-W4A16 decode (40 MoE layers, 128 steps, gfx1151):
  Memcpy DtoD calls: 7075 -> 1955  (-72%)
  DtoD GPU time:    15.1ms -> 4.2ms (-10.9ms)
Accuracy verified: arc_challenge 25-shot acc_norm 0.78 (no regression).
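
The figures quoted in this PR can be cross-checked with a few lines of arithmetic. The snippet below only recomputes numbers already stated here (DtoD call counts, DtoD time, and the TPOT medians); `relative_change` is a helper introduced for this illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Signed fractional change from before to after."""
    return (after - before) / before


# Decode run from the commit message: 40 MoE layers, 128 steps.
layers, steps = 40, 128
calls_before, calls_after = 7075, 1955

removed_calls = calls_before - calls_after
# One copy removed per MoE layer per decode step would give layers * steps.
expected_removed = layers * steps

calls_pct = relative_change(calls_before, calls_after) * 100
saved_ms = 15.1 - 4.2

# TPOT medians from the PR description.
tpot_pct = relative_change(12.82, 12.68) * 100

print(removed_calls, expected_removed)
print(round(calls_pct), round(saved_ms, 1), round(tpot_pct, 2))
```

The number of removed DtoD calls equals 40 × 128, which is consistent with exactly one copy being eliminated per MoE layer per decode step.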

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
mgehre-amd force-pushed the matthias.moe-modular-alias-fused-out branch from a373afe to 6a565c1 (June 9, 2026 05:58)