[ROCm][MoE] Modular MoE: alias fused_out with output to skip finalize copy#940
Open
mgehre-amd wants to merge 1 commit into
mgehre-amd
commented
May 19, 2026
Adds an opt-in mechanism for MoE expert kernels to write their result directly into the final output buffer, eliminating the trailing fused_out->output copy in TopKWeightAndReduceNoOP.apply().

Changes:
- modular_kernel.py: add accepts_output_alias() to FusedMoEExpertsModular (default False). When True, FusedMoEKernelModularImpl.apply() passes the pre-allocated output tensor as the fused_out buffer so the expert kernel writes to it directly. The shared_experts is None guard is omitted: in the non-async path _maybe_apply_shared_experts is never called during expert kernel execution, and shared expert I/O uses separate tensors from output. Fix two bugs introduced during development: self.shared_experts (non-existent attribute) -> shared_experts parameter; stale output_alias name in rocm_aiter block -> out.
- hybrid_w4a16_moe.py: HybridW4A16MoEExpertsModular.accepts_output_alias() returns True. The data-flow through the wvSplitK kernel (gemm1->act->gemm2->moe_unpermute) writes its final result to the output parameter after all reads of hidden_states are complete, making aliasing safe.
- topk_weight_and_reduce.py: TopKWeightAndReduceNoOP.apply() adds a data_ptr() equality check as a fallback for the identity check, covering cases where the aliased tensor is wrapped in a different Python object.

On Qwen3.5-35B-A3B-W4A16 decode (40 MoE layers, 128 steps, gfx1151):
  Memcpy DtoD calls: 7075 -> 1955 (-72%)
  DtoD GPU time: 15.1ms -> 4.2ms (-10.9ms)

Accuracy verified: arc_challenge 25-shot acc_norm 0.78 (no regression).

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
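The aliasing mechanism described in the commit message can be illustrated with a minimal, standard-library-only sketch. It is not the vLLM code: `FakeTensor`, `Experts`, `AliasingExperts`, `run_moe`, `finalize_noop`, and the `copies` counter are hypothetical names made up for illustration, and only `accepts_output_alias()` and the `data_ptr()` comparison mirror names from the PR. The sketch deliberately wraps the output storage in a second Python object, so the identity check fails and the address comparison is what detects the alias.

```python
from array import array


class FakeTensor:
    """Hypothetical stand-in for a tensor: a thin wrapper over shared storage."""

    def __init__(self, storage):
        self.storage = storage  # array('f'); several wrappers may share one

    def data_ptr(self):
        # Address of the underlying buffer, analogous to a tensor's data_ptr().
        return self.storage.buffer_info()[0]

    def copy_(self, src):
        # Element-wise copy from src into this wrapper's storage.
        self.storage[:] = src.storage


class Experts:
    """Default experts: do not opt in, so they are given a scratch buffer."""

    def accepts_output_alias(self):
        return False

    def run(self, hidden, fused_out):
        # Placeholder "kernel" (assumption): doubles each input element.
        for i, x in enumerate(hidden.storage):
            fused_out.storage[i] = 2.0 * x


class AliasingExperts(Experts):
    """Experts that opt in to writing straight into the output buffer."""

    def accepts_output_alias(self):
        return True


copies = []  # one entry per finalize copy actually performed


def finalize_noop(output, fused_out):
    # Skip the copy if both names refer to the same object or the same memory.
    if fused_out is output or fused_out.data_ptr() == output.data_ptr():
        return output
    output.copy_(fused_out)
    copies.append(1)
    return output


def run_moe(experts, hidden, output):
    if experts.accepts_output_alias():
        # New Python object over the same memory: exercises the address check.
        fused_out = FakeTensor(output.storage)
    else:
        fused_out = FakeTensor(array("f", [0.0] * len(output.storage)))
    experts.run(hidden, fused_out)
    return finalize_noop(output, fused_out)


hidden = FakeTensor(array("f", [1.0, 2.0, 3.0]))
out_default = run_moe(Experts(), hidden, FakeTensor(array("f", [0.0] * 3)))
out_aliased = run_moe(AliasingExperts(), hidden, FakeTensor(array("f", [0.0] * 3)))
```

In this sketch both calls produce the same values, but only the non-aliasing path records a copy. The reported drop in Memcpy DtoD calls is consistent with that picture: 7075 - 1955 = 5120 = 40 × 128, i.e. one copy removed per MoE layer per decode step.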
Summary
In the modular MoE pipeline, the trailing `TopKWeightAndReduceNoOP.apply()` unconditionally copies `fused_expert_output → output` every layer. For experts that already write the post-reduce result to `fused_out` (e.g. `HybridW4A16MoEExperts` via `moe_unpermute`), this copy is pure overhead: one `__amd_rocclr_copyBuffer` per MoE layer (48 per decode step on Qwen3-Omni-30B-A3B).

This PR adds an opt-in `accepts_output_alias()` hook on `FusedMoEExpertsModular`. When the expert opts in (and there are no shared experts), we pass the output tensor in as the `fused_out` argument, so the results are written there directly instead of going through an extra copy.

Benchmark: cyankiwi/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit
Strix Halo (gfx1151), input-len=256, output-len=64, num-prompts=4, max_concurrency=1.
`copyBuffer` per decode step: 48 → 0 (per profile capture).

Tests
No new test: the existing parametric `tests/kernels/moe/test_hybrid_w4a16_moe.py` exercises the modified `FusedMoEKernelModularImpl` end-to-end through `HybridW4A16MoEExperts` (the only expert that opts in). The aliasing branch is taken automatically when `shared_experts is None`. The existing tolerance against the torch reference holds; 38/38 tests pass locally.

Test plan
pytest tests/kernels/moe/test_hybrid_w4a16_moe.py -v