Qwen3.5 MoE prototype #933

Closed
mgehre-amd wants to merge 9 commits into matthias.kernel-fusion-experiments from matthias.qwen35-perf-merged

@mgehre-amd mgehre-amd commented May 11, 2026

On top of #932
Includes #929

Headline table

On Intel/Qwen3.5-35B-A3B-int4-AutoRound

| Workload | matthias.qwen35-perf-merged | + MTP | lm_head int8:g32 | + both |
|---|---|---|---|---|
| Text 128 in / 128 out | 62.9 tok/s | 87.2 (+39%) | 72.1 (+15%) | **105.3** (+67%) |
| Text 128 + image 1024×800 / 128 out | 62.8 | 56.5 (−10%) | **71.7** (+14%) | 67.3 (+7%) |
| Text 4096 in / 128 out | 61.1 | 57.6 (−6%) | **69.3** (+13%) | 64.5 (+6%) |

Numbers are decode tok/s (= 1000 / TPOT_ms; excludes prefill / TTFT). Higher is better. Best per row in bold.
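
The conversion and the percentages in the table can be re-derived with a few lines of Python. This is only a sketch; the helper names `decode_tok_per_s` and `pct_change` are illustrative and not part of this PR.

```python
def decode_tok_per_s(tpot_ms: float) -> float:
    """Decode throughput from time-per-output-token (TPOT) in milliseconds."""
    return 1000.0 / tpot_ms


def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs. baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Text 128/128 row: baseline 62.9 tok/s vs. "+ both" at 105.3 tok/s.
print(f"{pct_change(62.9, 105.3):+.0f}%")
```

Because tok/s is the reciprocal of TPOT, a TPOT reduction of x% corresponds to a throughput gain of 1/(1 - x/100) - 1, so the two kinds of percentages quoted in this PR are not interchangeable.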

Accuracy (gsm8k, N=80, greedy, temperature=0)

| Configuration | flexible-extract (%) | strict-match (%) | Δ flex vs 84.0 | Δ strict vs 82.0 |
|---|---|---|---|---|
| matthias.qwen35-perf-merged | 85.00 | 85.00 | +1.00 pp | +3.00 pp |
| + MTP | 85.00 | 85.00 | +1.00 pp | +3.00 pp |
| lm_head int8:g32 | 81.25 | 81.25 | −2.75 pp | −0.75 pp |
| + both | 82.50 | 82.50 | −1.50 pp | +0.50 pp |

1. Environment

| Component | Version |
|---|---|
| vLLM | branch matthias.qwen35-perf-merged, commit f4bdadbb06 |
| torch | 2.10.0+rocm7.13.0a20260419 |
| transformers | 5.5.3 |
| GPU | AMD Strix Halo (gfx1151) |
| Model | Intel/Qwen3.5-35B-A3B-int4-AutoRound |

3. TPOT bench

For each cell of the table:

  --model Intel/Qwen3.5-35B-A3B-int4-AutoRound \
  --trust-remote-code \
  --max-num-seqs 1 \
  --num-prompts 10 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95 \
  --target-gpu-memory-gb 25 \
  --input-len <128|4096> \
  --output-len 128 \
  <column-specific flags from the table above>

For the multimodal row (1024×800 image, 128 text tokens, 128 output): use the same script with random-mm.

4. gsm8k accuracy

For each column:

lm_eval run --model vllm \
  --model_args '{"pretrained":"Intel/Qwen3.5-35B-A3B-int4-AutoRound",
                 "trust_remote_code":true,
                 "dtype":"auto",
                 "gpu_memory_utilization":0.95,
                 "max_model_len":4096,
                 "max_num_seqs":1,
                 "enforce_eager":false
                 <plus column-specific JSON keys, e.g.:
                  ,"speculative_config":{"method":"mtp","num_speculative_tokens":2}
                  ,"dynamic_lm_head_quantization":"int8:g32"
                 >}' \
  --tasks gsm8k --limit 80 --gen_kwargs 'temperature=0,seed=0' \
  --batch_size 1 --seed 0,1234,1234,1234 --log_samples \
  --output_path /tmp/gsm8k-col-<N>.json

(No trailing backslashes inside the single-quoted JSON: the shell keeps them literally, which would make the JSON invalid.)

@mgehre-amd mgehre-amd changed the base branch from gfx11 to matthias.kernel-fusion-experiments May 12, 2026 07:06
The previous bench measured each kernel call with its own CUDA event pair
and a synchronize() afterward. For sub-100us kernels on Strix Halo, the
~50us idle gap between iters lets the iGPU drop clock, inflating per-call
time by 5-15% vs what the model run actually sees (the model launches
back-to-back on a stream so the GPU never idles). This made the bench
under-report bandwidth and overstate "improvements" from heuristic
changes that simply pushed kernel time below the DVFS-induced floor.

Switch bench_dynamic to capture iters_per_replay launches (sized so a
single replay runs ~target_replay_ms wall) into a CUDA graph and time
the replay end-to-end. Adaptive replay count keeps the same
target_se_pct convergence behavior. Buffers still rotate via fn(i),
so the cache-busting properties of the old loop are preserved.
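
A minimal sketch of the adaptive replay loop described above, in plain Python. It assumes a `replay()` callable that launches one captured graph replay and returns its elapsed time in milliseconds; that callable, the function name `adaptive_replay_time_us`, and `min_replays` are hypothetical and not taken from the bench code. The parameter names `iters_per_replay`, `target_se_pct`, `max_replays`, and `max_time_s` follow this commit message.

```python
import statistics
import time
from typing import Callable


def adaptive_replay_time_us(
    replay: Callable[[], float],
    iters_per_replay: int,
    target_se_pct: float = 0.2,
    max_replays: int = 40,
    max_time_s: float = 1.0,
    min_replays: int = 3,  # hypothetical: need a few samples before trusting the SE
) -> float:
    """Mean per-kernel time in microseconds.

    Each sample times a whole replay of `iters_per_replay` back-to-back
    launches, so there is no idle gap between kernels inside a sample.
    """
    samples_us: list[float] = []
    start = time.monotonic()
    while len(samples_us) < max_replays:
        samples_us.append(replay() * 1000.0 / iters_per_replay)
        if len(samples_us) >= min_replays:
            mean = statistics.fmean(samples_us)
            se = statistics.stdev(samples_us) / len(samples_us) ** 0.5
            # Stop once the standard error is small relative to the mean.
            if mean > 0 and 100.0 * se / mean <= target_se_pct:
                break
        if time.monotonic() - start > max_time_s:
            break
    return statistics.fmean(samples_us)
```

In a real bench the `replay` callable would presumably wrap one graph replay in a single start/end event pair, rather than one pair per kernel as in the old loop.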

Validated on bf16 against the in-model profile of
Intel/Qwen3.5-35B-A3B-int4-AutoRound (--no-cudagraph, --profile):
  wvSplitK 1x1024x2048   bench old=30.1 us  new=27.0 us  profile=26.8 us
  wvSplitK 1x248320x2048 bench old=4357 us  new=4329 us  profile=4430 us
The bench now matches the model-run time within ~1% on the small shape
and within ~2% on the large one.

Tuning: target_se_pct=0.2, max_replays=40, target_replay_ms=20.0,
max_time_s=1.0. Wall time on the full 12-shape x 4-batch sweep is
~30s (was ~9s). Repeated runs (with a 60s cooldown between to let
the iGPU stay near 60C) agree on 46/48 shapes within 1%; the
remaining outliers are thermal noise floor that no measurement
setting can remove without locking the GPU clock.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Wires INC (Intel Neural Compressor / auto-round) quantized models into
the same HIP HybridW4A16 MoE path that compressed-tensors w4a16 already
uses on ROCm. Auto-round emits its checkpoints in
`auto_round:auto_gptq` packing (same on-disk layout as compressed-tensors
`pack_quantized`), so the only INC-specific piece is registering the
parameters under the GPTQ names (`w*_qweight` / `w*_scales` / `w*_qzeros`)
that the standard FusedMoE expert-name mapping resolves; the conversion
to ExLlama-shuffled `[E, N, K//8]` and the `HybridW4A16MoEExperts`
modular-kernel install are reused from compressed-tensors.

Verified on Strix Halo (gfx1151) with `Intel/Qwen3.5-35B-A3B-int4-AutoRound`:
the `_rocm_C::fused_moe_wvSplitK_int4_gemm` kernel now drives the per-token
MoE GEMMs on decode; non-MoE INT4 linears were already going through
HybridW4A16LinearKernel via `choose_mp_linear_kernel`.

Changes:
- `vllm/platforms/rocm.py`: add `"inc"` to `supported_quantization`
  (the dispatcher behind it ultimately picks AWQ/GPTQ kernels through
  `choose_mp_linear_kernel`, so ROCm support is no longer
  unconditionally rejected at config validation).
- `vllm/model_executor/layers/quantization/inc.py`: in
  `apply_awq_quant_layer` / `apply_gptq_quant_layer`, when the gate
  passes (`is_rocm`, 4-bit, sym, group_size>0, FusedMoE, non-marlin),
  return the new `INCHybridW4A16MoEMethod` instead of falling back to
  the generic `MoeWNA16Method`.
- `vllm/model_executor/layers/quantization/inc_moe.py` (new):
  `INCHybridW4A16MoEMethod` registers GPTQ-named params, drops the
  sym `qzeros` (7-sentinel) before the kernel sees them, aliases to
  the names the helper expects, and installs the modular kernel.
  Originals are freed after the repack so weight memory stays at the
  checkpoint footprint instead of doubling.
- `vllm/model_executor/layers/fused_moe/hybrid_w4a16_moe_helper.py`
  (new): shared `setup_hybrid_w4a16_moe(method, layer)` extracted from
  compressed-tensors; called by both backends so the conversion +
  modular-kernel install lives in one place.
- `vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_wna16.py`:
  `_process_weights_hybrid_w4a16` now delegates to the shared helper.
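
The gate listed for `inc.py` above can be sketched as a single predicate. This is an illustration only: `QuantLayerInfo` and `use_hybrid_w4a16_moe` are hypothetical names, and the real code may read these properties from different objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantLayerInfo:
    """Hypothetical bundle of the properties the gate inspects."""
    is_rocm: bool
    weight_bits: int
    sym: bool
    group_size: int
    is_fused_moe: bool
    uses_marlin: bool


def use_hybrid_w4a16_moe(info: QuantLayerInfo) -> bool:
    """True when the layer qualifies for the HIP HybridW4A16 MoE path.

    Mirrors the conditions named in the commit message: ROCm, 4-bit,
    symmetric, grouped (group_size > 0), FusedMoE layer, non-marlin.
    """
    return (
        info.is_rocm
        and info.weight_bits == 4
        and info.sym
        and info.group_size > 0
        and info.is_fused_moe
        and not info.uses_marlin
    )
```

Anything that fails the predicate would keep falling back to the generic `MoeWNA16Method`, as before this change.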

Bench (gfx1151, Intel/Qwen3.5-35B-A3B-int4-AutoRound,
synthetic-mm 640x480, ISL/OSL=100/128, conc=1, --enforce-eager):
  decode 25.9 -> 36.4 tok/s (+40%), TPOT 38.7 -> 27.5 ms.
  Profile confirms `_rocm_C::fused_moe_wvSplitK_int4_gemm` is now
  on the decode hot path (was Triton MoeWNA16 before).

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
The K<1024 (gemm2 down-proj, e.g. Qwen3.5-35B-A3B K=512) branch of the
moe_wvSplitK_int4 dispatcher routes to the (W=32, AC=32, U=2) template
which compiles to 157 VGPRs/wave. Combined with WG=1024 threads
(32 wave32) this caps occupancy at 32 active wave32/CU = 50% of peak on
gfx1151 (1536 VGPRs/SIMD). Roofline analysis (notes/.../baseline.md)
shows this kernel hits only 52% of LPDDR5X peak BW vs 78% for the
sibling K>=1024 variant, consistent with insufficient in-flight memory
ops to hide LDS+DRAM latency (Strix Halo has no HBM).

Switching this branch to the existing (W=16, AC=16, U=2) template drops
VGPRs to 113/wave and WG to 512 threads (16 wave32), letting 3 WGs fit
per CU = 48 active wave32/CU = 75% of peak (matching the K>=1024 sibling).
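
The occupancy arithmetic above can be reproduced with a small model. This is a simplified sketch under stated assumptions: the 50% and 75% figures imply a peak of 64 wave32 per CU, modelled here as 4 SIMDs with 16 wave slots each (both values are assumptions, not taken from the source), and VGPR allocation granularity and LDS limits are ignored. The function name is hypothetical.

```python
def active_waves_per_cu(
    vgprs_per_wave: int,
    waves_per_wg: int,
    vgprs_per_simd: int = 1536,   # from the commit message (gfx1151)
    simds_per_cu: int = 4,        # assumption: gives a 64-wave peak
    max_waves_per_simd: int = 16, # assumption: gives a 64-wave peak
) -> int:
    """Active wave32 per CU when only whole workgroups can be resident."""
    waves_per_simd = min(max_waves_per_simd, vgprs_per_simd // vgprs_per_wave)
    wave_budget = waves_per_simd * simds_per_cu
    resident_wgs = wave_budget // waves_per_wg
    return resident_wgs * waves_per_wg


PEAK = 4 * 16  # 64 wave32 per CU under the assumptions above

old = active_waves_per_cu(vgprs_per_wave=157, waves_per_wg=32)  # (W=32, AC=32, U=2)
new = active_waves_per_cu(vgprs_per_wave=113, waves_per_wg=16)  # (W=16, AC=16, U=2)
print(old, old / PEAK)
print(new, new / PEAK)
```

Under this model the gain comes from two effects together: fewer VGPRs per wave raises the per-SIMD wave budget, and the smaller workgroup lets that budget be filled in finer steps.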

Gated behind VLLM_MOE_WVSPLITK_INT4_TINY_K_LOW_VGPR (default ON).
Set =0 to revert to the original (W=32, AC=32) heuristic that was
optimal for K=768 on a different model.

Changes:
- csrc/rocm/skinny_gemms_int4.cu: env-gated runtime branch in
  MOE_WVSPLIT_INT4G_TILE; only affects N=1, K<1024, gfx1x dispatch.
- ideas-research/p2-investigation.md: per-template VGPR table extracted
  from compiled binary, occupancy math, decision rationale.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
The Qwen MoE shared_expert_gate ReplicatedLinear(hidden_size, 1)
produces a 1x1 scalar, yet it currently goes through hipBLASLt's
MT64x96x32 macro tile plus a SplitK post-pass, which is far more
machinery than a single dot product needs.

Replace the M=N=1 path inside rocm_unquantized_gemm_impl with a
fused elementwise-mul + reduction. Gated by VLLM_DISABLE_TINY_DOT_GEMM
(default off = enabled); restricted to ROCm fp16/bf16 dispatched via
rocm_unquantized_gemm.
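
A pure-Python reference for the idea, assuming a row-major `[N][K]` weight as in a linear layer. The function names and the `disable_tiny_dot` flag are illustrative stand-ins (the flag mimics the effect of VLLM_DISABLE_TINY_DOT_GEMM); the real path is a GPU kernel, not Python.

```python
def gemm_reference(x, w):
    """Generic GEMM: x is [M][K], w is [N][K]; returns [M][N]."""
    return [[sum(a * b for a, b in zip(x_row, w_row)) for w_row in w] for x_row in x]


def unquantized_gemm(x, w, disable_tiny_dot=False):
    """Dispatch sketch: M=N=1 takes the fused multiply + reduce path."""
    m, n = len(x), len(w)
    if m == 1 and n == 1 and not disable_tiny_dot:
        # Elementwise multiply and reduction in one pass, one scalar out.
        acc = 0.0
        for a, b in zip(x[0], w[0]):
            acc += a * b
        return [[acc]]
    return gemm_reference(x, w)
```

For M=N=1 both paths compute the same dot product; only the amount of launched work differs.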

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
…hared_expert

Two-phase epilogue fusion on the bf16 wvSplitK GEMM used by Qwen2MoE
shared_expert:

* Phase 1 (FUSED_SILU_MUL): fuse the silu·multiply elementwise op into
  the gate_up GEMM epilogue (one fewer kernel and one fewer node-floor
  per shared-expert block × 40 layers).

* Phase 2 (FUSED_GATE_MUL): fold the trailing per-token scalar mul
  (F.sigmoid(self.expert_gate(x)) * out) into the down GEMM epilogue
  (4 kernels → 2 kernels per shared-expert block when combined with P1).
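
The two phases can be checked against an unfused reference in plain Python. This is a numerical sketch only, assuming the merged gate_up weight stores all gate rows followed by all up rows (an assumption about layout); the function names are hypothetical and the real fusion lives in the GEMM kernel epilogues.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dot(row, x):
    return sum(a * b for a, b in zip(row, x))


def shared_expert_unfused(x, w_gate_up, w_down, w_expert_gate):
    gate_up = [dot(row, x) for row in w_gate_up]                      # kernel 1: gate_up GEMM
    half = len(gate_up) // 2
    act = [silu(g) * u for g, u in zip(gate_up[:half], gate_up[half:])]  # kernel 2: silu * mul
    out = [dot(row, act) for row in w_down]                           # kernel 3: down GEMM
    scale = sigmoid(dot(w_expert_gate, x))                            # scalar gate (computed separately)
    return [scale * o for o in out]                                   # kernel 4: scalar mul


def shared_expert_fused(x, w_gate_up, w_down, w_expert_gate):
    half = len(w_gate_up) // 2
    # Phase 1: silu(gate) * up is applied as each output pair is finalized.
    act = [silu(dot(g_row, x)) * dot(u_row, x)
           for g_row, u_row in zip(w_gate_up[:half], w_gate_up[half:])]
    scale = sigmoid(dot(w_expert_gate, x))
    # Phase 2: the per-token scalar is applied in the down GEMM epilogue.
    return [scale * dot(row, act) for row in w_down]
```

Both functions perform the same floating-point operations in the same order, which is the property the bit-identical claim relies on; the fused variant simply never materializes the intermediate tensors as separate kernel outputs.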

Activation: env knob VLLM_BF16_WVSPLITK_FUSED_SILU=1 (default OFF).
Bit-identical to unfused on canonical M=2048,K=512,N=1 bf16 micro-tests.
gsm8k 84.0/82.0 (vs baseline 82.0/80.0; +2pp strict).
TPOT measured impact across 3 standard workloads:
  text 128/128:    -1.4% (Phase 1 -0.4%, Phase 2 -1.0%)
  mm 1024×800/128: -1.6% (Phase 1 -0.7%, Phase 2 -0.9%)
  text 4096/128:   -1.5% (Phase 1 -0.8%, Phase 2 -0.7%)

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Adds an optional, default-OFF roctx wrapper around the per-step decode
forward pass to unblock single-pass PMC under cudagraph + torch.compile
(rocprofv3 --selected-regions). Loads librocprofiler-sdk-roctx.so via
ctypes only when VLLM_ROCTX_DECODE_REGION=1 is set; otherwise no library
is loaded and the helper is a noop.

The wrapper skips the first VLLM_ROCTX_WARMUP_STEPS (default 8) calls
into _model_forward (warmup + first cudagraph replays) and then emits
roctxProfilerResume / roctxRangePushA + Pop / Pause around the next
VLLM_ROCTX_CAPTURE_STEPS (default 4) steady-state decode steps. After
that, profile-region stays paused for the rest of the run, so PMC only
samples the bounded post-warmup region.

Companion launcher: tmp/qwen35-pmc/run-p5-single-pass.sh (not committed)
runs rocprofv3 with --pmc <single-pass set> --marker-trace
--selected-regions and a kernel-include-regex for the wvSplitK families
the campaign cares about.

Changes:
- New file vllm/v1/worker/_roctx_decode_region.py with begin/end helpers
- gpu_model_runner._model_forward calls begin/end via try/finally so an
  exception still pops the range and pauses the profiler.
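
The warmup/capture windowing and the try/finally guarantee can be sketched without loading any profiler library. Everything below is hypothetical naming: `DecodeRegion`, `model_forward`, and the `backend` object (whose `resume`/`push`/`pop`/`pause` methods stand in for roctxProfilerResume, roctxRangePushA, roctxRangePop, and roctxProfilerPause) are illustrations, not the contents of `_roctx_decode_region.py`.

```python
class DecodeRegion:
    """Marks a bounded window of steady-state decode steps for profiling."""

    def __init__(self, backend=None, warmup_steps=8, capture_steps=4):
        self.backend = backend          # None means disabled: every call is a no-op
        self.warmup_steps = warmup_steps
        self.capture_steps = capture_steps
        self.calls = 0
        self.active = False

    def begin(self):
        if self.backend is None:
            return
        step = self.calls
        self.calls += 1
        # Only steps in [warmup, warmup + capture) are profiled.
        if self.warmup_steps <= step < self.warmup_steps + self.capture_steps:
            self.backend.resume()
            self.backend.push("decode_step")
            self.active = True

    def end(self):
        if self.active:
            self.backend.pop()
            self.backend.pause()
            self.active = False


def model_forward(region, forward):
    """Run one step; the range is closed even if `forward` raises."""
    region.begin()
    try:
        return forward()
    finally:
        region.end()
```

With the defaults, steps 0 to 7 are skipped, steps 8 to 11 are bracketed, and every later step leaves the profiler paused.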

Co-authored-by: Claude

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Triton autotune sweep over the prefill MoE GEMM shapes (M=128 N=1024 K=2048
+ M=128 N=2048 K=1024) found a joint-best config (BLOCK_M=128, BLOCK_N=32,
BLOCK_K=128, GROUP_SIZE_M=8, num_warps=4, num_stages=1) that beats the prior
default (BM=64, BN=64, BK=32, GM=8, nw=4, ns=2) by 19% in joint kernel time.

Root-cause: the prior _triton_config() clamped BLOCK_K to min(group_size, 32)
= 32 — overly conservative because BLOCK_K can equal group_size (one scale
per K-tile) and BLOCK_K=128 is uniformly faster on Strix Halo (gfx1151).

This is a TTFT-only lever: the kernel is dispatched only when num_tokens >
MAX_SKINNY_BATCH_SIZE (= 5), i.e. during prefill, never during decode. The
decode path uses HIP wvSplitK_int4 and is untouched.

Changes:
- hybrid_w4a16_moe.py: TRITON_BLOCK_SIZE_M bumped 64->128 and _triton_config()
  returns the tuned dict, both gated by VLLM_HYBRID_W4A16_TRITON_TUNED_P1=1
  (default ON; toggle to 0 to revert).
- p1-investigation.md, p1-sweep.py, P-1-sweep.csv (1242 raw configs),
  RESULTS.md verdict.
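
A sketch of the env-gated config selection, using the values quoted in this commit message. The function name `triton_config`, the exact dict keys, and the way the environment variable is parsed are assumptions; only the variable name, the default-ON behavior, and the numbers come from the text above.

```python
import os


def triton_config(group_size: int) -> dict:
    """Return the prefill MoE GEMM tile config (tuned unless disabled)."""
    tuned = os.environ.get("VLLM_HYBRID_W4A16_TRITON_TUNED_P1", "1") != "0"
    if tuned:
        # Joint-best config from the sweep.
        return {
            "BLOCK_M": 128, "BLOCK_N": 32, "BLOCK_K": 128,
            "GROUP_SIZE_M": 8, "num_warps": 4, "num_stages": 1,
        }
    # Prior default: BLOCK_K clamped to min(group_size, 32).
    return {
        "BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": min(group_size, 32),
        "GROUP_SIZE_M": 8, "num_warps": 4, "num_stages": 2,
    }
```

Setting the variable to 0 restores the prior clamp, which makes an A/B comparison possible without rebuilding.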

Verified end-to-end: tuned config dispatches cleanly via
invoke_fused_moe_kernel_hybrid_triton (no compile error, no shape mismatch);
post-edit standalone bench confirms GEMM1 114.7 us / GEMM2 53.2 us, matching
sweep top-3 within do_bench noise. No TPOT bench (not a TPOT lever); no
accuracy gate (autotune is correctness-invariant for Triton GEMMs).

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Quantize lm_head to per-group-32 INT8 at model-load time, dispatched through
a new GROUP_SIZE template parameter on wvSplitK_int8_hf_sml_ that absorbs
a 2-D [M, K/G] scale tensor inside the K reduction loop. Activation:

    --dynamic-lm-head-quantization int8         # per-channel (production-default)
    --dynamic-lm-head-quantization int8:g32     # per-group-32 along K (recommended)
    --dynamic-lm-head-quantization int8:g64
    --dynamic-lm-head-quantization int8:g128

Per-group-32 was selected from a sweep at gsm8k N=250: G=32 within ±1pp of
baseline; G=64 -5.2pp (2.3σ regression); G=128 -4.4pp strict (1.9σ).
Per-channel collapses gsm8k by ~22pp on the 248320-row Qwen3.5 lm_head.
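
To illustrate why a smaller group size can help, here is a pure-Python sketch of per-group INT8 quantization with a 2-D `[M, K/G]` scale and a GEMV that applies each group's scale to its partial dot product. It assumes symmetric absmax scaling, which this commit message does not spell out; the function names are hypothetical and the real implementation is the HIP kernel.

```python
def quantize_per_group(weight, group_size):
    """weight: [M][K] floats -> (q: [M][K] ints in [-127, 127], scales: [M][K // G])."""
    q, scales = [], []
    for row in weight:
        assert len(row) % group_size == 0
        q_row, s_row = [], []
        for lo in range(0, len(row), group_size):
            group = row[lo:lo + group_size]
            amax = max(abs(v) for v in group)
            scale = amax / 127.0 if amax > 0 else 1.0  # assumption: symmetric absmax
            s_row.append(scale)
            q_row.extend(max(-127, min(127, round(v / scale))) for v in group)
        q.append(q_row)
        scales.append(s_row)
    return q, scales


def gemv_int8_grouped(q, scales, x, group_size):
    """y[m] = sum over groups of (partial dot within the group) * scale[m][group]."""
    out = []
    for q_row, s_row in zip(q, scales):
        acc = 0.0
        for gi, scale in enumerate(s_row):
            lo = gi * group_size
            partial = sum(q_row[k] * x[k] for k in range(lo, lo + group_size))
            acc += partial * scale
        out.append(acc)
    return out


# One row with a single outlier: small values in the outlier's group lose precision.
row = [0.01] * 32 + [10.0] + [0.01] * 31
x = [1.0] * 64
for g in (32, 64):  # g == 64 equals per-channel for this K
    q, s = quantize_per_group([row], g)
    print(g, gemv_int8_grouped(q, s, x, g)[0])
```

With one scale per row, a single outlier sets the step size for all K weights; with groups, only the outlier's own group pays that cost.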

Activation perf: text 128/128 -12.9%, mm 1024×800 -12.4%, text 4096/128
-12.2% TPOT vs the pre-feature merged baseline; gsm8k unchanged.

A debug-only fallback (VLLM_DYNAMIC_INT8_LM_HEAD_SIM=1) quantize-dequantizes
back to bf16 and dispatches via dispatch_unquantized_gemm — perf-neutral,
accuracy-equivalent — useful for validating quant schemes before kernel work.

Implementation:
- csrc/rocm/skinny_gemms_int8.cu: GROUP_SIZE template on wvSplitK_int8_hf_sml_,
  per-thread partial dot accumulated against scale[(m+y)*num_groups+group_idx]
- csrc/rocm/torch_bindings.cpp + ops.h + vllm/_custom_ops.py: 2-D scale arg
- vllm/model_executor/kernels/linear/mixed_precision/hip_w8a16.py:
  HipW8A16LinearKernel.can_implement relaxed to accept group_size in {32,64,128}
- vllm/model_executor/layers/quantization/dynamic_int8_lm_head.py: parser
  + _resolve_group_size + per-group quant in process_weights_after_loading
- vllm/config/model.py: docstring lists the new int8:gN forms
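
A minimal sketch of how the `int8` / `int8:gN` forms could be parsed. The function name `parse_lm_head_quant` and its return shape are hypothetical and not the signature of the parser or `_resolve_group_size` in this PR; the accepted group sizes follow the {32,64,128} set named above.

```python
from typing import Optional, Tuple

SUPPORTED_GROUP_SIZES = (32, 64, 128)


def parse_lm_head_quant(spec: str) -> Tuple[str, Optional[int]]:
    """'int8' -> ('int8', None) for per-channel; 'int8:g32' -> ('int8', 32)."""
    dtype, sep, group = spec.partition(":")
    if dtype != "int8":
        raise ValueError(f"unsupported lm_head quantization: {spec!r}")
    if not sep:
        return dtype, None  # per-channel: one scale per output row
    if not group.startswith("g") or not group[1:].isdigit():
        raise ValueError(f"expected 'int8:gN', got {spec!r}")
    size = int(group[1:])
    if size not in SUPPORTED_GROUP_SIZES:
        raise ValueError(f"group size must be one of {SUPPORTED_GROUP_SIZES}, got {size}")
    return dtype, size
```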

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
@mgehre-amd mgehre-amd force-pushed the matthias.qwen35-perf-merged branch from f4bdadb to eefad19 Compare May 12, 2026 07:10
@mgehre-amd mgehre-amd closed this May 28, 2026