Qwen3.5 MoE prototype by mgehre-amd · Pull Request #933 · ROCm/vllm

mgehre-amd · 2026-05-11T16:49:48Z

On top of #932
Includes #929

Headline table

On Intel/Qwen3.5-35B-A3B-int4-AutoRound

| Workload | matthias.qwen35-perf-merged | + MTP | lm_head int8:g32 | + both |
|---|---|---|---|---|
| Text 128 in / 128 out | 62.9 | 87.2 (+39%) | 72.1 (+15%) | **105.3 (+67%)** |
| Text 128 + image 1024×800 / 128 out | 62.8 | 56.5 (−10%) | **71.7 (+14%)** | 67.3 (+7%) |
| Text 4096 in / 128 out | 61.1 | 57.6 (−6%) | **69.3 (+13%)** | 64.5 (+6%) |

Numbers are decode tok/s (= 1000 / TPOT_ms; excludes prefill / TTFT). Higher is better. Best per row in bold.
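As a quick sanity check on the conversion and the percentages, here is a minimal sketch that recomputes TPOT and the relative change for the "+ both" column from the table values. The dictionary keys and helper names are illustrative only and are not part of the benchmark script.

```python
# Decode rates (tok/s) copied from the headline table: baseline vs. "+ both".
BASELINE_TOK_S = {"text 128/128": 62.9, "text+image/128": 62.8, "text 4096/128": 61.1}
BOTH_TOK_S = {"text 128/128": 105.3, "text+image/128": 67.3, "text 4096/128": 64.5}


def decode_tok_s(tpot_ms: float) -> float:
    """Decode throughput as defined in the PR: 1000 / TPOT_ms."""
    return 1000.0 / tpot_ms


def tpot_ms(tok_s: float) -> float:
    """Inverse of decode_tok_s: time per output token in milliseconds."""
    return 1000.0 / tok_s


def pct_change(new: float, base: float) -> int:
    """Relative change of `new` over `base`, rounded to a whole percent."""
    return round((new / base - 1.0) * 100.0)


for workload, base in BASELINE_TOK_S.items():
    new = BOTH_TOK_S[workload]
    print(f"{workload}: TPOT {tpot_ms(base):.1f} -> {tpot_ms(new):.1f} ms "
          f"({pct_change(new, base):+d}% tok/s)")
```

Because the table reports tok/s rather than TPOT, a +N% throughput gain corresponds to a TPOT reduction of N/(100+N), not N%.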

Accuracy (gsm8k, N=80, greedy, temperature=0)

| Column | flexible-extract | strict-match | Δ flex vs 84.0 | Δ strict vs 82.0 |
|---|---|---|---|---|
| matthias.qwen35-perf-merged | 85.00 | 85.00 | +1.00 pp | +3.00 pp |
| + MTP | 85.00 | 85.00 | +1.00 pp | +3.00 pp |
| lm_head int8:g32 | 81.25 | 81.25 | −2.75 pp | −0.75 pp |
| + both | 82.50 | 82.50 | −1.50 pp | +0.50 pp |
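To put the N=80 deltas in context, the sketch below estimates the sampling noise of a single accuracy figure. It assumes each problem is an independent pass/fail trial (a plain binomial model), which is a simplification and not something the PR states. The function name is illustrative.

```python
import math


def binomial_se_pp(accuracy_pct: float, n: int) -> float:
    """Standard error, in percentage points, of an accuracy measured on n samples.

    Assumes independent pass/fail trials (binomial model).
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


se_n80 = binomial_se_pp(85.0, 80)
print(f"SE at 85% accuracy, N=80: {se_n80:.2f} pp")
print(f"-2.75 pp delta in units of that SE: {2.75 / se_n80:.2f}")
```

Under that assumption, one standard error at N=80 is several percentage points, so the deltas in the table are best read as a coarse regression gate rather than a precise ranking.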

1. Environment

| Component | Version |
|---|---|
| vLLM | branch `matthias.qwen35-perf-merged`, commit `f4bdadbb06` |
| torch | `2.10.0+rocm7.13.0a20260419` |
| transformers | `5.5.3` |
| GPU | AMD Strix Halo (gfx1151) |
| Model | `Intel/Qwen3.5-35B-A3B-int4-AutoRound` |

3. TPOT bench

For each cell of the table:

  --model Intel/Qwen3.5-35B-A3B-int4-AutoRound \
  --trust-remote-code \
  --max-num-seqs 1 \
  --num-prompts 10 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95 \
  --target-gpu-memory-gb 25 \
  --input-len <128|4096> \
  --output-len 128 \
  <column-specific flags from the table above>

For the multimodal row (1024×800 image, 128 text tokens, 128 output): use the same script with random-mm.

4. gsm8k accuracy

For each column:

lm_eval run --model vllm \
  --model_args '{"pretrained":"Intel/Qwen3.5-35B-A3B-int4-AutoRound", \
                 "trust_remote_code":true, \
                 "dtype":"auto", \
                 "gpu_memory_utilization":0.95, \
                 "max_model_len":4096, \
                 "max_num_seqs":1, \
                 "enforce_eager":false \
                 <plus column-specific JSON keys, e.g.: \
                  ,"speculative_config":{"method":"mtp","num_speculative_tokens":2} \
                  ,"dynamic_lm_head_quantization":"int8:g32" \
                 >}' \
  --tasks gsm8k --limit 80 --gen_kwargs 'temperature=0,seed=0' \
  --batch_size 1 --seed 0,1234,1234,1234 --log_samples \
  --output_path /tmp/gsm8k-col-<N>.json

The previous bench measured each kernel call with its own CUDA event pair and a synchronize() afterward. For sub-100us kernels on Strix Halo, the ~50us idle gap between iters lets the iGPU drop clock, inflating per-call time by 5-15% vs what the model run actually sees (the model launches back-to-back on a stream so the GPU never idles). This made the bench under-report bandwidth and overstate "improvements" from heuristic changes that simply pushed kernel time below the DVFS-induced floor.

Switch bench_dynamic to capture iters_per_replay launches (sized so a single replay runs ~target_replay_ms wall) into a CUDA graph and time the replay end-to-end. Adaptive replay count keeps the same target_se_pct convergence behavior. Buffers still rotate via fn(i), so the cache-busting properties of the old loop are preserved.

Validated on bf16 against the in-model profile of Intel/Qwen3.5-35B-A3B-int4-AutoRound (--no-cudagraph, --profile):

  wvSplitK 1x1024x2048    bench old=30.1 us  new=27.0 us  profile=26.8 us
  wvSplitK 1x248320x2048  bench old=4357 us  new=4329 us  profile=4430 us

The bench now matches the model-run time within ~1% on the small shape and within ~2% on the large one.

Tuning: target_se_pct=0.2, max_replays=40, target_replay_ms=20.0, max_time_s=1.0. Wall time on the full 12-shape x 4-batch sweep is ~30s (was ~9s). Repeated runs (with a 60s cooldown between to let the iGPU stay near 60C) agree on 46/48 shapes within 1%; the remaining outliers are thermal noise floor that no measurement setting can remove without locking the GPU clock.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
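The replay sizing and the stop criterion described in this commit can be sketched without a GPU. The code below is not the actual bench_dynamic implementation: `time_one_replay_ms` is a hypothetical callback standing in for "replay the captured CUDA graph once and return the elapsed milliseconds", and `min_replays` is an assumed parameter that the commit message does not mention. The defaults mirror the tuning values quoted above.

```python
import math
import statistics
from typing import Callable, List


def iters_per_replay(est_kernel_us: float, target_replay_ms: float = 20.0) -> int:
    """Number of back-to-back launches so one replay lasts ~target_replay_ms."""
    return max(1, math.ceil(target_replay_ms * 1000.0 / est_kernel_us))


def adaptive_replays(
    time_one_replay_ms: Callable[[], float],  # hypothetical timing callback
    target_se_pct: float = 0.2,
    max_replays: int = 40,
    max_time_s: float = 1.0,
    min_replays: int = 3,  # assumption: need a few samples before testing SE
) -> List[float]:
    """Replay until the standard error of the mean is <= target_se_pct percent
    of the mean, or until the replay-count or wall-time budget runs out."""
    samples: List[float] = []
    total_ms = 0.0
    while len(samples) < max_replays and total_ms < max_time_s * 1000.0:
        elapsed = time_one_replay_ms()
        samples.append(elapsed)
        total_ms += elapsed
        if len(samples) >= min_replays:
            mean = statistics.fmean(samples)
            se = statistics.stdev(samples) / math.sqrt(len(samples))
            if se / mean * 100.0 <= target_se_pct:
                break
    return samples


# A ~27 us kernel needs several hundred launches to fill a 20 ms replay.
print(iters_per_replay(27.0))
```

Per-call time is then the replay time divided by iters_per_replay, so the idle gap between individual launches no longer enters the measurement.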

Wires INC (Intel Neural Compressor / auto-round) quantized models into the same HIP HybridW4A16 MoE path that compressed-tensors w4a16 already uses on ROCm. Auto-round emits its checkpoints in `auto_round:auto_gptq` packing (same on-disk layout as compressed-tensors `pack_quantized`), so the only INC-specific piece is registering the parameters under the GPTQ names (`w*_qweight` / `w*_scales` / `w*_qzeros`) that the standard FusedMoE expert-name mapping resolves; the conversion to ExLlama-shuffled `[E, N, K//8]` and the `HybridW4A16MoEExperts` modular-kernel install are reused from compressed-tensors.

Verified on Strix Halo (gfx1151) with `Intel/Qwen3.5-35B-A3B-int4-AutoRound`: the `_rocm_C::fused_moe_wvSplitK_int4_gemm` kernel now drives the per-token MoE GEMMs on decode; non-MoE INT4 linears were already going through HybridW4A16LinearKernel via `choose_mp_linear_kernel`.

Changes:

- `vllm/platforms/rocm.py`: add `"inc"` to `supported_quantization` (the dispatcher behind it ultimately picks AWQ/GPTQ kernels through `choose_mp_linear_kernel`, so ROCm support is no longer unconditionally rejected at config validation).
- `vllm/model_executor/layers/quantization/inc.py`: in `apply_awq_quant_layer` / `apply_gptq_quant_layer`, when the gate passes (`is_rocm`, 4-bit, sym, group_size>0, FusedMoE, non-marlin), return the new `INCHybridW4A16MoEMethod` instead of falling back to the generic `MoeWNA16Method`.
- `vllm/model_executor/layers/quantization/inc_moe.py` (new): `INCHybridW4A16MoEMethod` registers GPTQ-named params, drops the sym `qzeros` (7-sentinel) before the kernel sees them, aliases to the names the helper expects, and installs the modular kernel. Originals are freed after the repack so weight memory stays at the checkpoint footprint instead of doubling.
- `vllm/model_executor/layers/fused_moe/hybrid_w4a16_moe_helper.py` (new): shared `setup_hybrid_w4a16_moe(method, layer)` extracted from compressed-tensors; called by both backends so the conversion + modular-kernel install lives in one place.
- `vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_wna16.py`: `_process_weights_hybrid_w4a16` now delegates to the shared helper.

Bench (gfx1151, Intel/Qwen3.5-35B-A3B-int4-AutoRound, synthetic-mm 640x480, ISL/OSL=100/128, conc=1, --enforce-eager): decode 25.9 -> 36.4 tok/s (+40%), TPOT 38.7 -> 27.5 ms. Profile confirms `_rocm_C::fused_moe_wvSplitK_int4_gemm` is now on the decode hot path (was Triton MoeWNA16 before).

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
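The "7-sentinel" wording is terse, so here is a small illustration of what such a check could look like. It assumes the common GPTQ convention of storing the zero point minus one, so a symmetric 4-bit zero point of 8 shows up as the nibble 7 and a packed int32 word as 0x77777777. That convention is an assumption here, not something the commit spells out, and the helper names are hypothetical rather than taken from `inc_moe.py`.

```python
from typing import Iterable, List

SYM_INT4_SENTINEL = 7  # assumed: stored value = zero point (8) minus one
NIBBLES_PER_WORD = 8   # eight 4-bit values per packed 32-bit word


def pack_nibbles(values: List[int]) -> int:
    """Pack eight 4-bit values into one 32-bit word, lowest nibble first."""
    assert len(values) == NIBBLES_PER_WORD
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xF) << (4 * i)
    return word


def unpack_nibbles(word: int) -> List[int]:
    """Inverse of pack_nibbles."""
    return [(word >> (4 * i)) & 0xF for i in range(NIBBLES_PER_WORD)]


def is_sym_sentinel(qzeros_words: Iterable[int]) -> bool:
    """True if every packed zero point equals the symmetric sentinel, in which
    case the tensor carries no information and can be dropped."""
    return all(n == SYM_INT4_SENTINEL
               for word in qzeros_words
               for n in unpack_nibbles(word))


print(hex(pack_nibbles([SYM_INT4_SENTINEL] * NIBBLES_PER_WORD)))
```

A check of this kind lets a symmetric-only kernel skip the zero-point tensor safely while still rejecting asymmetric checkpoints.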

This reverts commit f6bcb0b.

The K<1024 (gemm2 down-proj, e.g. Qwen3.5-35B-A3B K=512) branch of the moe_wvSplitK_int4 dispatcher routes to the (W=32, AC=32, U=2) template which compiles to 157 VGPRs/wave. Combined with WG=1024 threads (32 wave32) this caps occupancy at 32 active wave32/CU = 50% of peak on gfx1151 (1536 VGPRs/SIMD). Roofline analysis (notes/.../baseline.md) shows this kernel hits only 52% of LPDDR5X peak BW vs 78% for the sibling K>=1024 variant, consistent with insufficient in-flight memory ops to hide LDS+HBM latency.

Switching this branch to the existing (W=16, AC=16, U=2) template drops VGPRs to 113/wave and WG to 512 threads (16 wave32), letting 3 WGs fit per CU = 48 active wave32/CU = 75% of peak (matching the K>=1024 sibling).

Gated behind VLLM_MOE_WVSPLITK_INT4_TINY_K_LOW_VGPR (default ON). Set =0 to revert to the original (W=32, AC=32) heuristic that was optimal for K=768 on a different model.

Changes:

- csrc/rocm/skinny_gemms_int4.cu: env-gated runtime branch in MOE_WVSPLIT_INT4G_TILE; only affects N=1, K<1024, gfx1x dispatch.
- ideas-research/p2-investigation.md: per-template VGPR table extracted from compiled binary, occupancy math, decision rationale.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
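The occupancy figures in this commit can be reproduced with a few lines of arithmetic. The sketch below takes 1536 VGPRs/SIMD from the commit message and assumes 4 SIMDs per CU and at most 16 wave32 per SIMD. Those two values are assumptions, chosen because they give the 64-wave peak implied by "32 waves = 50%". VGPR allocation granularity and LDS limits are ignored.

```python
VGPRS_PER_SIMD = 1536    # from the commit message
SIMDS_PER_CU = 4         # assumption (gives a 64-wave peak per CU)
MAX_WAVES_PER_SIMD = 16  # assumption (gives a 64-wave peak per CU)
WAVE_SIZE = 32           # wave32


def active_waves_per_cu(vgprs_per_wave: int, wg_threads: int) -> int:
    """Resident wave32 per CU when whole workgroups must fit, VGPR-limited."""
    waves_per_wg = wg_threads // WAVE_SIZE
    # Waves each SIMD can hold given the register budget.
    simd_limit = min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD // vgprs_per_wave)
    # A workgroup's waves are spread evenly over the SIMDs (ceil division).
    wg_waves_per_simd = -(-waves_per_wg // SIMDS_PER_CU)
    resident_wgs = simd_limit // wg_waves_per_simd
    return resident_wgs * waves_per_wg


def occupancy(vgprs_per_wave: int, wg_threads: int) -> float:
    peak = SIMDS_PER_CU * MAX_WAVES_PER_SIMD
    return active_waves_per_cu(vgprs_per_wave, wg_threads) / peak


for name, vgprs, threads in [("W=32, AC=32", 157, 1024), ("W=16, AC=16", 113, 512)]:
    print(f"{name}: {active_waves_per_cu(vgprs, threads)} waves/CU, "
          f"{occupancy(vgprs, threads):.0%} of peak")
```

The smaller workgroup matters as much as the lower register count: at 157 VGPRs a second 32-wave workgroup cannot fit, whereas 16-wave workgroups can be packed three at a time.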

The Qwen MoE shared_expert_gate ReplicatedLinear(hidden_size, 1) produces a 1x1 scalar via hipBLASLt's MT64x96x32 macro tile + SplitK post-pass. Replace the M=N=1 path inside rocm_unquantized_gemm_impl with a fused elementwise-mul + reduction.

Gated by VLLM_DISABLE_TINY_DOT_GEMM (default off = enabled); restricted to ROCm fp16/bf16 dispatched via rocm_unquantized_gemm.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
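For reference, the equivalence this commit relies on is simply that a GEMM with M=N=1 is a dot product. The pure-Python sketch below shows the two formulations side by side. It only illustrates the math and is not the HIP kernel, and the function names are made up for the example.

```python
from typing import List


def gemm(a: List[List[float]], b: List[List[float]]) -> List[List[float]]:
    """Naive general matrix multiply: a is [M, K], b is [K, N]."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def tiny_dot(x: List[float], w: List[float]) -> float:
    """M=N=1 case: elementwise multiply followed by a single reduction."""
    return sum(xi * wi for xi, wi in zip(x, w))


x = [0.5, -1.0, 2.0, 0.25]   # one token's hidden state, K=4
w = [1.0, 0.5, -0.5, 4.0]    # the gate's single output row
print(gemm([x], [[wi] for wi in w])[0][0], tiny_dot(x, w))
```

A tiled GEMM library has to launch a full macro tile plus a SplitK post-pass to produce that one scalar, which is the overhead the fused path avoids.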

…hared_expert

Two-phase epilogue fusion on the bf16 wvSplitK GEMM used by Qwen2MoE shared_expert:

* Phase 1 (FUSED_SILU_MUL): fuse silu·multiply elementwise into the gate_up GEMM epilogue (one less kernel + one less node-floor per shared-expert block × 40 layers).
* Phase 2 (FUSED_GATE_MUL): fold the trailing per-token scalar mul (F.sigmoid(self.expert_gate(x)) * out) into the down GEMM epilogue (4 kernels → 2 kernels per shared-expert block when combined with P1).

Activation: env knob VLLM_BF16_WVSPLITK_FUSED_SILU=1 (default OFF). Bit-identical to unfused on canonical M=2048,K=512,N=1 bf16 micro-tests. gsm8k 84.0/82.0 (vs baseline 82.0/80.0; +2pp strict).

TPOT measured impact across 3 standard workloads:

  text 128/128:    -1.4% (Phase 1 -0.4%, Phase 2 -1.0%)
  mm 1024×800/128: -1.6% (Phase 1 -0.7%, Phase 2 -0.9%)
  text 4096/128:   -1.5% (Phase 1 -0.8%, Phase 2 -0.7%)

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
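A reference-semantics sketch of the two fusion phases, in plain Python for a single token. The `[gate; up]` row layout of the fused gate_up weight and all function names are assumptions made for illustration. The real change lives in the wvSplitK epilogue, not in Python.

```python
import math
from typing import List

Matrix = List[List[float]]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def shared_expert_unfused(x, w_gate_up: Matrix, w_down: Matrix, w_expert_gate):
    gate_up = [dot(row, x) for row in w_gate_up]            # kernel 1: gate_up GEMM
    half = len(gate_up) // 2                                # assumed layout: [gate; up]
    act = [silu(g) * u for g, u in zip(gate_up[:half], gate_up[half:])]  # kernel 2
    out = [dot(row, act) for row in w_down]                 # kernel 3: down GEMM
    scale = sigmoid(dot(w_expert_gate, x))
    return [scale * o for o in out]                         # kernel 4: scalar mul


def shared_expert_fused(x, w_gate_up: Matrix, w_down: Matrix, w_expert_gate):
    half = len(w_gate_up) // 2
    # Phase 1: silu(gate) * up is applied in the gate_up GEMM epilogue.
    act = [silu(dot(w_gate_up[i], x)) * dot(w_gate_up[half + i], x)
           for i in range(half)]
    scale = sigmoid(dot(w_expert_gate, x))
    # Phase 2: the per-token scalar is applied in the down GEMM epilogue.
    return [scale * dot(row, act) for row in w_down]


x = [0.5, -1.0, 2.0]
w_gate_up = [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6], [0.7, -0.8, 0.9], [1.0, 1.1, -1.2]]
w_down = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.6]]
w_expert_gate = [0.2, 0.1, -0.3]
print(shared_expert_fused(x, w_gate_up, w_down, w_expert_gate))
```

Both functions perform the same arithmetic in the same order; only the number of passes over the intermediate tensors differs, which is where the per-layer launch savings come from.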

Adds an optional, default-OFF roctx wrapper around the per-step decode forward pass to unblock single-pass PMC under cudagraph + torch.compile (rocprofv3 --selected-regions). Loads librocprofiler-sdk-roctx.so via ctypes only when VLLM_ROCTX_DECODE_REGION=1 is set; otherwise no library is loaded and the helper is a no-op.

The wrapper skips the first VLLM_ROCTX_WARMUP_STEPS (default 8) calls into _model_forward (warmup + first cudagraph replays) and then emits roctxProfilerResume / roctxRangePushA + Pop / Pause around the next VLLM_ROCTX_CAPTURE_STEPS (default 4) steady-state decode steps. After that, profile-region stays paused for the rest of the run, so PMC only samples the bounded post-warmup region.

Companion launcher: tmp/qwen35-pmc/run-p5-single-pass.sh (not committed) runs rocprofv3 with --pmc <single-pass set> --marker-trace --selected-regions and a kernel-include-regex for the wvSplitK families the campaign cares about.

Changes:

- New file vllm/v1/worker/_roctx_decode_region.py with begin/end helpers
- gpu_model_runner._model_forward calls begin/end via try/finally so an exception still pops the range and pauses the profiler.

Co-authored-by: Claude
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
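The warmup/capture gating described here is a small state machine, sketched below without any roctx dependency. `DecodeRegionGate`, `model_forward` and the `emit` callback are hypothetical names. In the real helper the emitted events would be roctx calls loaded via ctypes, which this sketch deliberately does not attempt. Only the three environment variable names and their defaults come from the commit message.

```python
import os
from typing import Callable


class DecodeRegionGate:
    """Skip `warmup_steps` calls, mark the next `capture_steps`, then stay idle."""

    def __init__(self, enabled: bool, warmup_steps: int = 8,
                 capture_steps: int = 4, emit: Callable[[str], None] = print):
        self.enabled = enabled
        self.warmup_steps = warmup_steps
        self.capture_steps = capture_steps
        self.emit = emit          # stand-in for the roctx resume/push/pop/pause calls
        self.calls = 0
        self.active = False

    @classmethod
    def from_env(cls, emit: Callable[[str], None] = print) -> "DecodeRegionGate":
        return cls(
            enabled=os.environ.get("VLLM_ROCTX_DECODE_REGION") == "1",
            warmup_steps=int(os.environ.get("VLLM_ROCTX_WARMUP_STEPS", "8")),
            capture_steps=int(os.environ.get("VLLM_ROCTX_CAPTURE_STEPS", "4")),
            emit=emit,
        )

    def begin(self) -> None:
        if not self.enabled:
            return
        self.calls += 1
        first = self.warmup_steps + 1
        last = self.warmup_steps + self.capture_steps
        self.active = first <= self.calls <= last
        if self.active:
            self.emit("resume")
            self.emit("push")

    def end(self) -> None:
        if self.active:
            self.emit("pop")
            self.emit("pause")
            self.active = False


def model_forward(gate: DecodeRegionGate, step_fn: Callable[[], object]) -> object:
    """try/finally keeps push/pop balanced even if the step raises."""
    gate.begin()
    try:
        return step_fn()
    finally:
        gate.end()
```

With the defaults, 20 decode steps produce exactly four resume/push/pop/pause groups, covering steps 9 through 12.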

Triton autotune sweep over the prefill MoE GEMM shapes (M=128 N=1024 K=2048 + M=128 N=2048 K=1024) found a joint-best config (BLOCK_M=128, BLOCK_N=32, BLOCK_K=128, GROUP_SIZE_M=8, num_warps=4, num_stages=1) that beats the prior default (BM=64, BN=64, BK=32, GM=8, nw=4, ns=2) by -19% joint kernel time.

Root-cause: the prior _triton_config() clamped BLOCK_K to min(group_size, 32) = 32, which is overly conservative because BLOCK_K can equal group_size (one scale per K-tile) and BLOCK_K=128 is uniformly faster on Strix Halo (gfx1151).

This is a TTFT-only lever: the kernel is dispatched only when num_tokens > MAX_SKINNY_BATCH_SIZE (= 5), i.e. during prefill, never during decode. The decode path uses HIP wvSplitK_int4 and is untouched.

Changes:

- hybrid_w4a16_moe.py: TRITON_BLOCK_SIZE_M bumped 64->128 and _triton_config() returns the tuned dict, both gated by VLLM_HYBRID_W4A16_TRITON_TUNED_P1=1 (default ON; toggle to 0 to revert).
- p1-investigation.md, p1-sweep.py, P-1-sweep.csv (1242 raw configs), RESULTS.md verdict.

Verified end-to-end: tuned config dispatches cleanly via invoke_fused_moe_kernel_hybrid_triton (no compile error, no shape mismatch); post-edit standalone bench confirms GEMM1 114.7 us / GEMM2 53.2 us, matching sweep top-3 within do_bench noise. No TPOT bench (not a TPOT lever); no accuracy gate (autotune is correctness-invariant for Triton GEMMs).

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
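The config selection and dispatch rule above can be summarized in a short sketch. The function names are hypothetical and the dictionary keys follow the spelling used in this commit message rather than the actual kernel's keyword arguments. Clamping the tuned BLOCK_K to group_size is an assumption based on the "one scale per K-tile" remark.

```python
import os
from typing import Dict

MAX_SKINNY_BATCH_SIZE = 5  # from the commit message


def prior_config(group_size: int) -> Dict[str, int]:
    """Prior default: BLOCK_K was clamped to min(group_size, 32)."""
    return {"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": min(group_size, 32),
            "GROUP_SIZE_M": 8, "num_warps": 4, "num_stages": 2}


def tuned_config(group_size: int) -> Dict[str, int]:
    """Joint-best config from the sweep; BLOCK_K may equal group_size."""
    return {"BLOCK_M": 128, "BLOCK_N": 32,
            "BLOCK_K": min(group_size, 128),  # assumed clamp: one scale per K-tile
            "GROUP_SIZE_M": 8, "num_warps": 4, "num_stages": 1}


def triton_config(group_size: int) -> Dict[str, int]:
    """Env-gated choice, default ON as described in the commit."""
    tuned = os.environ.get("VLLM_HYBRID_W4A16_TRITON_TUNED_P1", "1") != "0"
    return tuned_config(group_size) if tuned else prior_config(group_size)


def uses_triton_prefill_path(num_tokens: int) -> bool:
    """The Triton kernel only runs above the skinny-batch threshold (prefill)."""
    return num_tokens > MAX_SKINNY_BATCH_SIZE


print(triton_config(128), uses_triton_prefill_path(1))
```

Since single-token decode steps never exceed the threshold, a change to this dictionary cannot move TPOT, which is why the commit skips the TPOT bench.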

Quantize lm_head to per-group-32 INT8 at model-load time, dispatched through a new GROUP_SIZE template parameter on wvSplitK_int8_hf_sml_ that absorbs a 2-D [M, K/G] scale tensor inside the K reduction loop.

Activation:

  --dynamic-lm-head-quantization int8       # per-channel (production-default)
  --dynamic-lm-head-quantization int8:g32   # per-group-32 along K (recommended)
  --dynamic-lm-head-quantization int8:g64
  --dynamic-lm-head-quantization int8:g128

Per-group-32 was selected from a sweep at gsm8k N=250: G=32 within ±1pp of baseline; G=64 -5.2pp (2.3σ regression); G=128 -4.4pp strict (1.9σ). Per-channel collapses gsm8k by ~22pp on the 248320-row Qwen3.5 lm_head.

Activation perf: text 128/128 -12.9%, mm 1024×800 -12.4%, text 4096/128 -12.2% TPOT vs the pre-feature merged baseline; gsm8k unchanged.

A debug-only fallback (VLLM_DYNAMIC_INT8_LM_HEAD_SIM=1) quantize-dequantizes back to bf16 and dispatches via dispatch_unquantized_gemm (perf-neutral, accuracy-equivalent), which is useful for validating quant schemes before kernel work.

Implementation:

- csrc/rocm/skinny_gemms_int8.cu: GROUP_SIZE template on wvSplitK_int8_hf_sml_, per-thread partial dot accumulated against scale[(m+y)*num_groups+group_idx]
- csrc/rocm/torch_bindings.cpp + ops.h + vllm/_custom_ops.py: 2-D scale arg
- vllm/model_executor/kernels/linear/mixed_precision/hip_w8a16.py: HipW8A16LinearKernel.can_implement relaxed to accept group_size in {32,64,128}
- vllm/model_executor/layers/quantization/dynamic_int8_lm_head.py: parser + _resolve_group_size + per-group quant in process_weights_after_loading
- vllm/config/model.py: docstring lists the new int8:gN forms

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
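To illustrate why a per-group scale helps, here is a minimal pure-Python sketch of symmetric INT8 quantization along K with one scale per group, plus a dot product that applies each group's scale to its partial sum. It covers a single weight row with absmax/127 scaling, which is an assumed scheme. The function names are hypothetical and the real kernel indexes a 2-D scale tensor across rows.

```python
from typing import List, Tuple


def quantize_row_per_group(row: List[float], group_size: int) -> Tuple[List[int], List[float]]:
    """Symmetric INT8 quantization with one absmax-based scale per group of K."""
    q: List[int] = []
    scales: List[float] = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / 127.0 if amax > 0 else 1.0
        scales.append(scale)
        q.extend(max(-127, min(127, round(v / scale))) for v in group)
    return q, scales


def dequantize(q: List[int], scales: List[float], group_size: int) -> List[float]:
    return [qi * scales[k // group_size] for k, qi in enumerate(q)]


def dot_per_group(q: List[int], scales: List[float], x: List[float], group_size: int) -> float:
    """Accumulate an integer-weight partial dot per group, then apply its scale."""
    acc = 0.0
    for g, scale in enumerate(scales):
        start = g * group_size
        stop = min(start + group_size, len(q))
        acc += scale * sum(q[k] * x[k] for k in range(start, stop))
    return acc


# One outlier in the first group; the second group holds only small weights.
row = [10.0] + [0.01 * i for i in range(1, 64)]
for g in (64, 32):  # 64 = one scale for the whole row (per-channel), 32 = per-group
    q, s = quantize_row_per_group(row, g)
    err = max(abs(a - b) for a, b in zip(row[32:], dequantize(q, s, g)[32:]))
    print(f"group_size={g}: max error on outlier-free half = {err:.4f}")
```

With a single scale per row, the outlier sets the step size for every weight. With per-group scales, only the group containing the outlier pays for it.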

mgehre-amd changed the base branch from gfx11 to matthias.kernel-fusion-experiments May 12, 2026 07:06

mgehre-amd added 9 commits May 12, 2026 01:10

Revert "Revise HIP_VERSION guard on atomicAdd (#921)"

bee5f4a

mgehre-amd force-pushed the matthias.qwen35-perf-merged branch from f4bdadb to eefad19 Compare May 12, 2026 07:10

mgehre-amd closed this May 28, 2026

mgehre-amd wants to merge 9 commits into matthias.kernel-fusion-experiments from matthias.qwen35-perf-merged
