Qwen3.5 MoE prototype #933
Closed
mgehre-amd wants to merge 9 commits into
The previous bench measured each kernel call with its own CUDA event pair and a synchronize() afterward. For sub-100us kernels on Strix Halo, the ~50us idle gap between iters lets the iGPU drop clock, inflating per-call time by 5-15% vs what the model run actually sees (the model launches back-to-back on a stream so the GPU never idles). This made the bench under-report bandwidth and overstate "improvements" from heuristic changes that simply pushed kernel time below the DVFS-induced floor.

Switch bench_dynamic to capture iters_per_replay launches (sized so a single replay runs ~target_replay_ms wall) into a CUDA graph and time the replay end-to-end. Adaptive replay count keeps the same target_se_pct convergence behavior. Buffers still rotate via fn(i), so the cache-busting properties of the old loop are preserved.

Validated on bf16 against the in-model profile of Intel/Qwen3.5-35B-A3B-int4-AutoRound (--no-cudagraph, --profile):

  wvSplitK 1x1024x2048    bench old=30.1 us  new=27.0 us  profile=26.8 us
  wvSplitK 1x248320x2048  bench old=4357 us  new=4329 us  profile=4430 us

The bench now matches the model-run time within ~1% on the small shape and within ~2.5% on the large one.

Tuning: target_se_pct=0.2, max_replays=40, target_replay_ms=20.0, max_time_s=1.0. Wall time on the full 12-shape x 4-batch sweep is ~30s (was ~9s). Repeated runs (with a 60s cooldown in between to let the iGPU stay near 60C) agree on 46/48 shapes within 1%; the remaining outliers are at the thermal noise floor, which no measurement setting can remove without locking the GPU clock.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
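The replay sizing and the convergence rule described above can be sketched in plain Python. This is a minimal illustration, not the PR's bench_dynamic: `replay_ms` is a hypothetical callable that performs one graph replay and returns its wall time in milliseconds, `min_replays` is an invented parameter, and the standard-error test is an assumed reading of "target_se_pct convergence".

```python
import math
import statistics
import time


def size_iters_per_replay(est_kernel_us, target_replay_ms=20.0):
    """Number of launches to capture so one replay lasts ~target_replay_ms."""
    return max(1, round(target_replay_ms * 1000.0 / est_kernel_us))


def time_adaptive(replay_ms, iters_per_replay, target_se_pct=0.2,
                  max_replays=40, max_time_s=1.0, min_replays=3):
    """Replay until the standard error of the mean is small enough.

    replay_ms is a hypothetical callable: it runs one captured replay and
    returns its wall time in ms. Returns (per-kernel time in us, replays used).
    """
    samples = []
    start = time.perf_counter()
    while len(samples) < max_replays:
        samples.append(replay_ms())
        if len(samples) >= min_replays:
            mean = statistics.fmean(samples)
            se = statistics.stdev(samples) / math.sqrt(len(samples))
            if 100.0 * se / mean <= target_se_pct:
                break  # converged: SE is below target_se_pct of the mean
        if time.perf_counter() - start >= max_time_s:
            break  # wall-time budget exhausted
    per_kernel_us = statistics.fmean(samples) * 1000.0 / iters_per_replay
    return per_kernel_us, len(samples)
```

Because every sample covers many back-to-back launches, the idle gap that lets the clock drop occurs once per replay rather than once per kernel call.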
Wires INC (Intel Neural Compressor / auto-round) quantized models into the same HIP HybridW4A16 MoE path that compressed-tensors w4a16 already uses on ROCm. Auto-round emits its checkpoints in `auto_round:auto_gptq` packing (same on-disk layout as compressed-tensors `pack_quantized`), so the only INC-specific piece is registering the parameters under the GPTQ names (`w*_qweight` / `w*_scales` / `w*_qzeros`) that the standard FusedMoE expert-name mapping resolves; the conversion to ExLlama-shuffled `[E, N, K//8]` and the `HybridW4A16MoEExperts` modular-kernel install are reused from compressed-tensors.

Verified on Strix Halo (gfx1151) with `Intel/Qwen3.5-35B-A3B-int4-AutoRound`: the `_rocm_C::fused_moe_wvSplitK_int4_gemm` kernel now drives the per-token MoE GEMMs on decode; non-MoE INT4 linears were already going through HybridW4A16LinearKernel via `choose_mp_linear_kernel`.

Changes:
- `vllm/platforms/rocm.py`: add `"inc"` to `supported_quantization` (the dispatcher behind it ultimately picks AWQ/GPTQ kernels through `choose_mp_linear_kernel`, so ROCm support is no longer unconditionally rejected at config validation).
- `vllm/model_executor/layers/quantization/inc.py`: in `apply_awq_quant_layer` / `apply_gptq_quant_layer`, when the gate passes (`is_rocm`, 4-bit, sym, group_size>0, FusedMoE, non-marlin), return the new `INCHybridW4A16MoEMethod` instead of falling back to the generic `MoeWNA16Method`.
- `vllm/model_executor/layers/quantization/inc_moe.py` (new): `INCHybridW4A16MoEMethod` registers GPTQ-named params, drops the sym `qzeros` (7-sentinel) before the kernel sees them, aliases to the names the helper expects, and installs the modular kernel. Originals are freed after the repack so weight memory stays at the checkpoint footprint instead of doubling.
- `vllm/model_executor/layers/fused_moe/hybrid_w4a16_moe_helper.py` (new): shared `setup_hybrid_w4a16_moe(method, layer)` extracted from compressed-tensors; called by both backends so the conversion + modular-kernel install lives in one place.
- `vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_wna16.py`: `_process_weights_hybrid_w4a16` now delegates to the shared helper.

Bench (gfx1151, Intel/Qwen3.5-35B-A3B-int4-AutoRound, synthetic-mm 640x480, ISL/OSL=100/128, conc=1, --enforce-eager): decode 25.9 -> 36.4 tok/s (+40%), TPOT 38.7 -> 27.5 ms. Profile confirms `_rocm_C::fused_moe_wvSplitK_int4_gemm` is now on the decode hot path (was Triton MoeWNA16 before).

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
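To illustrate the "7-sentinel" mentioned for symmetric `qzeros`, here is a small sketch that unpacks GPTQ-style int32 words into 4-bit values. It assumes eight nibbles per word with the lowest nibble first, and the auto_gptq convention of storing the zero point minus one (so the symmetric zero point 8 appears as 7). Both function names are hypothetical and not taken from the PR.

```python
def unpack_int4(word):
    """Split one 32-bit word into eight 4-bit values, lowest nibble first."""
    word &= 0xFFFFFFFF  # treat a signed int32 as its unsigned bit pattern
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def is_sym_sentinel(qzeros_words, stored_zero=7):
    """True if every packed zero point equals the assumed symmetric sentinel."""
    return all(z == stored_zero
               for w in qzeros_words
               for z in unpack_int4(w))
```

Under these assumptions a symmetric checkpoint's `qzeros` words are all `0x77777777`. A check like this is one way to confirm the tensor carries no information before it is dropped.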
This reverts commit f6bcb0b.
The K<1024 (gemm2 down-proj, e.g. Qwen3.5-35B-A3B K=512) branch of the moe_wvSplitK_int4 dispatcher routes to the (W=32, AC=32, U=2) template, which compiles to 157 VGPRs/wave. Combined with WG=1024 threads (32 wave32) this caps occupancy at 32 active wave32/CU = 50% of peak on gfx1151 (1536 VGPRs/SIMD). Roofline analysis (notes/.../baseline.md) shows this kernel hits only 52% of LPDDR5X peak BW vs 78% for the sibling K>=1024 variant, consistent with insufficient in-flight memory ops to hide LDS+DRAM latency.

Switching this branch to the existing (W=16, AC=16, U=2) template drops VGPRs to 113/wave and WG to 512 threads (16 wave32), letting 3 WGs fit per CU = 48 active wave32/CU = 75% of peak (matching the K>=1024 sibling).

Gated behind VLLM_MOE_WVSPLITK_INT4_TINY_K_LOW_VGPR (default ON). Set =0 to revert to the original (W=32, AC=32) heuristic that was optimal for K=768 on a different model.

Changes:
- csrc/rocm/skinny_gemms_int4.cu: env-gated runtime branch in MOE_WVSPLIT_INT4G_TILE; only affects N=1, K<1024, gfx1x dispatch.
- ideas-research/p2-investigation.md: per-template VGPR table extracted from compiled binary, occupancy math, decision rationale.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
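The occupancy arithmetic above can be reproduced with a simplified model. The sketch assumes 4 SIMDs per CU and at most 16 wave32 per SIMD (which gives the 64-wave peak the percentages imply), ignores VGPR allocation granularity, and only lets whole workgroups become resident. The function name and defaults are illustrative, not taken from the PR.

```python
def vgpr_occupancy(vgprs_per_wave, waves_per_wg, vgprs_per_simd=1536,
                   simds_per_cu=4, max_waves_per_simd=16):
    """Return (active waves per CU, fraction of peak) under a VGPR limit."""
    # Waves one SIMD can hold given the register budget.
    waves_per_simd = min(max_waves_per_simd, vgprs_per_simd // vgprs_per_wave)
    wave_slots = waves_per_simd * simds_per_cu
    # Only whole workgroups can be resident.
    resident_wgs = wave_slots // waves_per_wg
    active = resident_wgs * waves_per_wg
    return active, active / (max_waves_per_simd * simds_per_cu)


print(vgpr_occupancy(157, 32))  # (W=32, AC=32, U=2): WG of 1024 threads
print(vgpr_occupancy(113, 16))  # (W=16, AC=16, U=2): WG of 512 threads
```

With 157 VGPRs a SIMD holds 9 waves, so only one 32-wave workgroup fits (32 waves, 50%). With 113 VGPRs it holds 13, so three 16-wave workgroups fit (48 waves, 75%).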
The Qwen MoE shared_expert_gate ReplicatedLinear(hidden_size, 1) produces a 1x1 scalar via hipBLASLt's MT64x96x32 macro tile + SplitK post-pass. Replace the M=N=1 path inside rocm_unquantized_gemm_impl with a fused elementwise-mul + reduction.

Gated by VLLM_DISABLE_TINY_DOT_GEMM (default off, i.e. the fused path is enabled); restricted to ROCm fp16/bf16 dispatched via rocm_unquantized_gemm.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
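The reason the M=N=1 case can bypass a tiled GEMM is that a 1xK by Kx1 product is a single dot product. A minimal pure-Python sketch of that equivalence follows; `matmul` and `tiny_dot` are illustrative names, not the functions in rocm_unquantized_gemm_impl.

```python
def matmul(a, b):
    """Reference row-major matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def tiny_dot(x_row, w_row):
    """Elementwise multiply then reduce: the M=N=1 special case."""
    return sum(x * w for x, w in zip(x_row, w_row))


x = [1.0, 2.0, 3.0]    # one token's hidden state (1 x K)
w = [0.5, -1.0, 2.0]   # the single output row of the gate weight
assert matmul([x], [[v] for v in w]) == [[tiny_dot(x, w)]]
```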
…hared_expert

Two-phase epilogue fusion on the bf16 wvSplitK GEMM used by Qwen2MoE shared_expert:

* Phase 1 (FUSED_SILU_MUL): fuse silu·multiply elementwise into the gate_up GEMM epilogue (one less kernel + one less node-floor per shared-expert block × 40 layers).
* Phase 2 (FUSED_GATE_MUL): fold the trailing per-token scalar mul (F.sigmoid(self.expert_gate(x)) * out) into the down GEMM epilogue (4 kernels → 2 kernels per shared-expert block when combined with P1).

Activation: env knob VLLM_BF16_WVSPLITK_FUSED_SILU=1 (default OFF).

Bit-identical to unfused on canonical M=2048,K=512,N=1 bf16 micro-tests. gsm8k 84.0/82.0 (vs baseline 82.0/80.0; +2pp strict).

TPOT measured impact across 3 standard workloads:

  text 128/128:     -1.4% (Phase 1 -0.4%, Phase 2 -1.0%)
  mm 1024×800/128:  -1.6% (Phase 1 -0.7%, Phase 2 -0.9%)
  text 4096/128:    -1.5% (Phase 1 -0.8%, Phase 2 -0.7%)

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
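The two fusion phases can be illustrated with a single-token, pure-Python reference. This sketch assumes the usual gate_up layout (first half gate, second half up) and treats the scalar expert gate as already computed. All function names are illustrative, and the real work happens inside the HIP kernel epilogues.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dot(row, x):
    return sum(r * xi for r, xi in zip(row, x))


def shared_expert_unfused(x, w_gate_up, w_down, gate_logit):
    gate_up = [dot(row, x) for row in w_gate_up]       # kernel 1: gate_up GEMM
    d = len(gate_up) // 2
    act = [silu(g) * u                                  # kernel 2: silu * mul
           for g, u in zip(gate_up[:d], gate_up[d:])]
    out = [dot(row, act) for row in w_down]             # kernel 3: down GEMM
    return [sigmoid(gate_logit) * o for o in out]       # kernel 4: scalar mul


def shared_expert_fused(x, w_gate_up, w_down, gate_logit):
    d = len(w_gate_up) // 2
    # Phase 1: the epilogue applies silu(gate) * up before writing the output.
    act = [silu(dot(w_gate_up[j], x)) * dot(w_gate_up[d + j], x)
           for j in range(d)]
    # Phase 2: the epilogue scales each output by the per-token scalar.
    s = sigmoid(gate_logit)
    return [s * dot(row, act) for row in w_down]
```

Both functions perform the same arithmetic in the same order, so they return identical values. What the fusion removes is the separate elementwise passes over intermediate tensors.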
Adds an optional, default-OFF roctx wrapper around the per-step decode forward pass to unblock single-pass PMC under cudagraph + torch.compile (rocprofv3 --selected-regions). Loads librocprofiler-sdk-roctx.so via ctypes only when VLLM_ROCTX_DECODE_REGION=1 is set; otherwise no library is loaded and the helper is a noop.

The wrapper skips the first VLLM_ROCTX_WARMUP_STEPS (default 8) calls into _model_forward (warmup + first cudagraph replays) and then emits roctxProfilerResume / roctxRangePushA + Pop / Pause around the next VLLM_ROCTX_CAPTURE_STEPS (default 4) steady-state decode steps. After that, profile-region stays paused for the rest of the run, so PMC only samples the bounded post-warmup region.

Companion launcher: tmp/qwen35-pmc/run-p5-single-pass.sh (not committed) runs rocprofv3 with --pmc <single-pass set> --marker-trace --selected-regions and a kernel-include-regex for the wvSplitK families the campaign cares about.

Changes:
- New file vllm/v1/worker/_roctx_decode_region.py with begin/end helpers
- gpu_model_runner._model_forward calls begin/end via try/finally so an exception still pops the range and pauses the profiler.

Co-authored-by: Claude
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
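The warmup/capture gating can be sketched as a small step counter. To avoid guessing at ctypes signatures, the roctx calls are injected as plain callables. The class name, the range label, and the choice to resume and pause around each captured step (rather than once around all of them) are assumptions, not the contents of _roctx_decode_region.py.

```python
class DecodeRegion:
    """Emit profiler markers only for a bounded window of decode steps."""

    def __init__(self, resume, push, pop, pause,
                 warmup_steps=8, capture_steps=4):
        self._resume, self._push = resume, push
        self._pop, self._pause = pop, pause
        self._first = warmup_steps
        self._last = warmup_steps + capture_steps
        self._step = 0
        self._active = False

    def begin(self):
        idx = self._step
        self._step += 1
        self._active = self._first <= idx < self._last
        if self._active:
            self._resume()
            self._push("decode_step")  # hypothetical range label

    def end(self):
        if self._active:
            self._pop()
            self._pause()
            self._active = False


def run_step(region, forward):
    """Wrap one forward pass; try/finally keeps push/pop balanced on errors."""
    region.begin()
    try:
        return forward()
    finally:
        region.end()
```

Outside the capture window both helpers reduce to a counter increment and a boolean check, which keeps the default path cheap.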
Triton autotune sweep over the prefill MoE GEMM shapes (M=128 N=1024 K=2048 + M=128 N=2048 K=1024) found a joint-best config (BLOCK_M=128, BLOCK_N=32, BLOCK_K=128, GROUP_SIZE_M=8, num_warps=4, num_stages=1) that beats the prior default (BM=64, BN=64, BK=32, GM=8, nw=4, ns=2) by 19% in joint kernel time.

Root cause: the prior _triton_config() clamped BLOCK_K to min(group_size, 32) = 32, which is overly conservative because BLOCK_K can equal group_size (one scale per K-tile) and BLOCK_K=128 is uniformly faster on Strix Halo (gfx1151).

This is a TTFT-only lever: the kernel is dispatched only when num_tokens > MAX_SKINNY_BATCH_SIZE (= 5), i.e. during prefill, never during decode. The decode path uses HIP wvSplitK_int4 and is untouched.

Changes:
- hybrid_w4a16_moe.py: TRITON_BLOCK_SIZE_M bumped 64->128 and _triton_config() returns the tuned dict, both gated by VLLM_HYBRID_W4A16_TRITON_TUNED_P1=1 (default ON; toggle to 0 to revert).
- p1-investigation.md, p1-sweep.py, P-1-sweep.csv (1242 raw configs), RESULTS.md verdict.

Verified end-to-end: tuned config dispatches cleanly via invoke_fused_moe_kernel_hybrid_triton (no compile error, no shape mismatch); post-edit standalone bench confirms GEMM1 114.7 us / GEMM2 53.2 us, matching sweep top-3 within do_bench noise. No TPOT bench (not a TPOT lever); no accuracy gate (autotune is correctness-invariant for Triton GEMMs).

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
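A hedged sketch of the config selection and the prefill-only dispatch rule described above. The dictionary keys mirror the names used in this message rather than the actual keys in hybrid_w4a16_moe.py, the env parsing is an assumption, and BLOCK_K=128 presumes a group_size of at least 128, which is an inference from the text.

```python
import os

MAX_SKINNY_BATCH_SIZE = 5  # value quoted in the commit message


def triton_config(group_size, tuned=None):
    """Return the tuned tile config, or the prior default when disabled."""
    if tuned is None:
        # Assumed parsing: anything other than "0" keeps the tuned config on.
        tuned = os.environ.get("VLLM_HYBRID_W4A16_TRITON_TUNED_P1", "1") != "0"
    if tuned:
        return {"BLOCK_M": 128, "BLOCK_N": 32, "BLOCK_K": 128,
                "GROUP_SIZE_M": 8, "num_warps": 4, "num_stages": 1}
    # Prior default: BLOCK_K clamped to at most 32.
    return {"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": min(group_size, 32),
            "GROUP_SIZE_M": 8, "num_warps": 4, "num_stages": 2}


def uses_triton_prefill_path(num_tokens):
    """Small batches (decode) stay on the HIP wvSplitK_int4 path."""
    return num_tokens > MAX_SKINNY_BATCH_SIZE
```

For example, `uses_triton_prefill_path(1)` is false for single-token decode, which is why this change cannot move TPOT.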
Quantize lm_head to per-group-32 INT8 at model-load time, dispatched through
a new GROUP_SIZE template parameter on wvSplitK_int8_hf_sml_ that absorbs
a 2-D [M, K/G] scale tensor inside the K reduction loop. Activation:
--dynamic-lm-head-quantization int8 # per-channel (production-default)
--dynamic-lm-head-quantization int8:g32 # per-group-32 along K (recommended)
--dynamic-lm-head-quantization int8:g64
--dynamic-lm-head-quantization int8:g128
Per-group-32 was selected from a sweep at gsm8k N=250: G=32 within ±1pp of
baseline; G=64 -5.2pp (2.3σ regression); G=128 -4.4pp strict (1.9σ).
Per-channel collapses gsm8k by ~22pp on the 248320-row Qwen3.5 lm_head.
Activation perf: text 128/128 -12.9%, mm 1024×800 -12.4%, text 4096/128
-12.2% TPOT vs the pre-feature merged baseline; gsm8k unchanged.
A debug-only fallback (VLLM_DYNAMIC_INT8_LM_HEAD_SIM=1) quantize-dequantizes
back to bf16 and dispatches via dispatch_unquantized_gemm — perf-neutral,
accuracy-equivalent — useful for validating quant schemes before kernel work.
Implementation:
- csrc/rocm/skinny_gemms_int8.cu: GROUP_SIZE template on wvSplitK_int8_hf_sml_,
per-thread partial dot accumulated against scale[(m+y)*num_groups+group_idx]
- csrc/rocm/torch_bindings.cpp + ops.h + vllm/_custom_ops.py: 2-D scale arg
- vllm/model_executor/kernels/linear/mixed_precision/hip_w8a16.py:
HipW8A16LinearKernel.can_implement relaxed to accept group_size in {32,64,128}
- vllm/model_executor/layers/quantization/dynamic_int8_lm_head.py: parser
+ _resolve_group_size + per-group quant in process_weights_after_loading
- vllm/config/model.py: docstring lists the new int8:gN forms
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
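Per-group quantization and the scaled reduction can be sketched for a single weight row in pure Python. The symmetric absmax/127 scale is an assumed convention (the message does not state how scales are derived), and the function names are illustrative. In the kernel the scale is read at the flat index (m+y)*num_groups+group_idx, which for one row reduces to the group index used here.

```python
def quantize_per_group(row, group_size):
    """Symmetric int8 quantization of one weight row, one scale per group."""
    assert len(row) % group_size == 0
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / 127.0 if amax > 0 else 1.0  # assumed absmax scaling
        scales.append(scale)
        q.extend(max(-127, min(127, round(v / scale))) for v in group)
    return q, scales


def grouped_dot(q_row, scales, x, group_size):
    """Dot product that applies each group's scale inside the K loop."""
    total = 0.0
    for group_idx, scale in enumerate(scales):
        lo = group_idx * group_size
        partial = sum(q_row[k] * x[k] for k in range(lo, lo + group_size))
        total += partial * scale
    return total
```

A single outlier only inflates the scale of its own group, which is one plausible reason smaller groups held gsm8k accuracy where per-channel scaling did not.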
f4bdadb to eefad19
On top of #932
Includes #929
Headline table
On Intel/Qwen3.5-35B-A3B-int4-AutoRound
Numbers are decode tok/s (= 1000 / TPOT_ms; excludes prefill / TTFT). Higher is better. Best per row in bold.
Accuracy (gsm8k, N=80, greedy, temperature=0)
1. Environment
matthias.qwen35-perf-merged, commit f4bdadbb06
2.10.0+rocm7.13.0a20260419
5.5.3
Intel/Qwen3.5-35B-A3B-int4-AutoRound
3. TPOT bench
For each cell of the table:
For the multimodal row (1024×800 image, 128 text tokens, 128 output): use the same script with random-mm.
4. gsm8k accuracy
For each column: