[bench] wvSplitK skinny GEMM: capture timed iters into a CUDA graph by mgehre-amd · Pull Request #928 · ROCm/vllm

mgehre-amd · 2026-05-08T15:25:04Z

The previous bench measured each kernel call with its own CUDA event pair and a synchronize() afterward. For sub-100us kernels on Strix Halo, the ~50us idle gap between iters lets the iGPU drop clock, inflating per-call time by 5-15% vs what the model run actually sees (the model launches back-to-back on a stream so the GPU never idles). This made the bench under-report bandwidth and overstate "improvements" from heuristic changes that simply pushed kernel time below the DVFS-induced floor.

Switch bench_dynamic to capture iters_per_replay launches (sized so a single replay runs ~target_replay_ms wall) into a CUDA graph and time the replay end-to-end. Adaptive replay count keeps the same target_se_pct convergence behavior. Buffers still rotate via fn(i), so the cache-busting properties of the old loop are preserved.
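The capture scheme described above could be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `size_iters_per_replay` and `capture_and_time_once` are hypothetical names, the warmup count is arbitrary, and the estimate of the per-kernel time is assumed to come from a short untimed probe. PyTorch is imported lazily so the sizing helper can be used without a GPU.

```python
import math


def size_iters_per_replay(est_kernel_us: float, target_replay_ms: float) -> int:
    """Number of launches to capture so one replay lasts about target_replay_ms."""
    if est_kernel_us <= 0:
        raise ValueError("est_kernel_us must be positive")
    return max(1, math.ceil(target_replay_ms * 1000.0 / est_kernel_us))


def capture_and_time_once(fn, iters_per_replay: int) -> float:
    """Capture iters_per_replay calls of fn(i) into one graph and time one replay.

    Returns the mean time per call in microseconds. Needs PyTorch with a
    CUDA/ROCm device (assumption: fn(i) launches the kernel on rotating buffers).
    """
    import torch  # lazy import: only needed when actually timing on a GPU

    # Warm up on a side stream before capture, as the PyTorch docs recommend.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for i in range(3):
            fn(i)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        for i in range(iters_per_replay):
            fn(i)  # each captured launch keeps its own buffer index

    start = torch.cuda.Event(enable_timing=True)
    stop = torch.cuda.Event(enable_timing=True)
    start.record()
    graph.replay()  # all launches run back-to-back, with no host gap between them
    stop.record()
    torch.cuda.synchronize()
    return start.elapsed_time(stop) * 1000.0 / iters_per_replay
```

With target_replay_ms=20.0, a ~27 us kernel would be captured about 741 times per replay, while a ~4.3 ms kernel would be captured only 5 times.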

Validated on bf16 against the in-model profile of Intel/Qwen3.5-35B-A3B-int4-AutoRound (--no-cudagraph, --profile):

  wvSplitK 1x1024x2048     bench old=30.1 us   new=27.0 us   profile=26.8 us
  wvSplitK 1x248320x2048   bench old=4357 us   new=4329 us   profile=4430 us

The bench now matches the model-run time to within ~1% on the small shape (27.0 vs 26.8 us) and to within ~2% on the large one (4329 vs 4430 us).
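For reference, the relative deviation of each bench figure from the profiled time can be recomputed from the numbers quoted above. The helper name `pct_dev` is only illustrative.

```python
def pct_dev(bench_us: float, profile_us: float) -> float:
    """Signed deviation of a bench time from the profiled time, in percent."""
    return (bench_us - profile_us) / profile_us * 100.0


# (shape, old bench us, new bench us, in-model profile us), as quoted in the PR text
rows = [
    ("1x1024x2048", 30.1, 27.0, 26.8),
    ("1x248320x2048", 4357.0, 4329.0, 4430.0),
]

for shape, old, new, prof in rows:
    print(f"{shape}: old {pct_dev(old, prof):+.1f}%  new {pct_dev(new, prof):+.1f}%")
# 1x1024x2048: old +12.3%  new +0.7%
# 1x248320x2048: old -1.6%  new -2.3%
```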

Tuning: target_se_pct=0.2, max_replays=40, target_replay_ms=20.0, max_time_s=1.0. Wall time on the full 12-shape x 4-batch sweep is ~30s (was ~9s). Repeated runs (with a 60s cooldown between runs to let the iGPU stay near 60C) agree on 46/48 shapes within 1%; the remaining outliers sit at the thermal noise floor, which no measurement setting can remove without locking the GPU clock.
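One way the adaptive replay count could interact with these tuning knobs is sketched below. The stopping rule (standard error of the mean as a percentage of the mean), the `min_replays` floor, and the choice to report the median are assumptions made for illustration; `time_one_replay` is a hypothetical callable returning the per-call time of one graph replay.

```python
import math
import statistics
import time


def adaptive_replays(time_one_replay, target_se_pct=0.2, max_replays=40,
                     max_time_s=1.0, min_replays=3):
    """Replay until the standard error falls below target_se_pct percent of the
    mean, or until max_replays / max_time_s is reached.

    Returns (median sample, number of replays taken).
    """
    samples = []
    t0 = time.perf_counter()
    while len(samples) < max_replays:
        samples.append(time_one_replay())
        if len(samples) >= min_replays:
            mean = statistics.fmean(samples)
            se = statistics.stdev(samples) / math.sqrt(len(samples))
            if mean > 0 and 100.0 * se / mean <= target_se_pct:
                break  # converged: further replays would add little
        if time.perf_counter() - t0 >= max_time_s:
            break  # wall-clock budget for this shape is spent
    return statistics.median(samples), len(samples)
```

A perfectly steady source stops after `min_replays` samples, while a noisy one runs until one of the caps is hit.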

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>

…mm.py

The script already handles both bf16 and fp16 via --dtype, so the _bf16 suffix in the filename was misleading. Drop it and update the docstring examples.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>

Drops AC=8 and AC=16 sweep instantiations to make room for AC=32, which has not been tested. Production dispatcher now uses AC=16 for every K%2048==0 cell (prior commit), so AC=8 is no longer interesting to sweep.

AC=32 is the next plausible knob: with THRDS=32 and UNRL=4 the main GEMM loop stride becomes 32x32x4=4096 bytes, perfectly aligning with K=4096 and dividing K=8192 evenly. sizeof(bigType) becomes 64 B (one cache line) after the AIESW-34083 fix, which is the widest valid HIP vector_size for the bigType union members.

Cost: doubles VGPR pressure vs AC=16 (16 VGPRs/thread for the register file vs 8) -- may drop occupancy. Empirical question.

Grid: YT={1,2} x UR={1,2,4} x AC={32} x W={16,32} = 12 combos x 4 N x 3 kernel variants = 144 instantiations (down from 288).

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
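The grid and stride arithmetic in this commit message can be checked with a few lines. The function names are illustrative, and the previous grid is assumed to have differed only in using AC={8,16}.

```python
from itertools import product

YT = (1, 2)
UR = (1, 2, 4)
W = (16, 32)
N_VALUES = 4          # number of N (batch) values per combo
KERNEL_VARIANTS = 3   # kernel variants instantiated per combo


def count_instantiations(ac_values):
    """Return (number of parameter combos, total kernel instantiations)."""
    combos = list(product(YT, UR, ac_values, W))
    return len(combos), len(combos) * N_VALUES * KERNEL_VARIANTS


def loop_stride(thrds, ac, unrl):
    """Main-loop stride as the product THRDS x AC x UNRL."""
    return thrds * ac * unrl


print(count_instantiations((32,)))    # (12, 144)
print(count_instantiations((8, 16)))  # (24, 288)
print(loop_stride(32, 32, 4))         # 4096
```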

Replaces the whole-replay event timing with per-kernel timing recorded inside the captured graph.

PyTorch's torch.cuda.Event blocks the path that records queryable timestamps inside graph capture on ROCm (TORCH_CHECK(!external_) in c10/cuda/CUDAEvent.h, AIESW-34641), so the bench drops to a small ctypes shim that calls hipEventRecordWithFlags(hipEventRecordExternal) and hipEventElapsedTime directly on the raw hipEvent_t handle pulled out of torch.cuda.Event.cuda_event.

Capture layout uses a single event chain of iters_per_replay+1 events (one per kernel boundary) rather than 2*iters_per_replay start/end pairs, halving the event count in the graph for the same number of per-kernel samples.

Reduces run-to-run noise from 0.88% to 0.51% median (6.27% -> 2.54% max) across the full 76-cell sweep, and takes ~45 s wall per run. Numbers match the in-model profile better than the previous whole-replay median.

The ctypes shim can be removed once PyTorch upstream lifts the ROCm external-event guard.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
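The event-chain layout can be illustrated without a GPU. In the sketch below, `elapsed_ms(i, j)` is a hypothetical stand-in for a call such as hipEventElapsedTime on events i and j of the chain; the ctypes shim itself is not reproduced here, and the function names are illustrative only.

```python
import statistics


def events_needed(iters_per_replay: int, chained: bool = True) -> int:
    """Events recorded in the graph: one chain of N+1, or N start/end pairs."""
    return iters_per_replay + 1 if chained else 2 * iters_per_replay


def per_kernel_us(n_events: int, elapsed_ms):
    """Per-kernel durations (us) from consecutive events of a single chain."""
    if n_events < 2:
        raise ValueError("need at least two events to time one kernel")
    # Event k closes kernel k-1 and opens kernel k, so N+1 events give N samples.
    return [elapsed_ms(k, k + 1) * 1000.0 for k in range(n_events - 1)]


# Usage with made-up timestamps (ms) standing in for recorded events.
fake_ts = [0.0, 0.027, 0.054, 0.082]
samples = per_kernel_us(len(fake_ts), lambda i, j: fake_ts[j] - fake_ts[i])
print(len(samples), round(statistics.median(samples), 3))
```

Because consecutive kernels share a boundary event, the chain yields the same number of samples as start/end pairs with roughly half the events.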

mgehre-amd added 4 commits June 2, 2026 04:06

mgehre-amd force-pushed the matthias.bench-skinny-gemm-cudagraph branch from 8a6fc6b to c52ff9e Compare June 2, 2026 11:43

mgehre-amd wants to merge 4 commits into gfx11 from matthias.bench-skinny-gemm-cudagraph
