Improve int4 AWQ GEMV performance on Radeon 8060S and similar GPUs by eble-amd · Pull Request #886 · ROCm/vllm

eble-amd · 2026-04-17T18:45:22Z

Purpose

Improve GEMV performance on Radeon 8060S and similar GPUs.

Test Plan

vllm benchmark with Gemma 2B AWQ
pytest performance tests (new golden values included)

Test Results

Copied from commit message:

Measured on Radeon 8060S (gfx1151, 2 MiB L2):
- 1x2048x16384: 141 -> 149 GiB/s (+5%)
- 1x2560x2048:  162 -> 166 GiB/s (+2%)
- 1x32768x2048: 199 -> 200 GiB/s (+1%)
- Overall for Gemma-2B AWQ int4: 67.5% -> 69.3% roofline
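The relative gains can be recomputed from the before/after figures listed above. The following is a minimal sketch; `relative_gain_percent` is a hypothetical helper, not part of the PR. Because the GiB/s values in the list are already rounded, the recomputed percentages may differ by about a point from the ones quoted in the commit message.

```cpp
// Relative change from `before` to `after`, expressed in percent.
// Hypothetical helper for illustration only; not part of the PR.
constexpr double relative_gain_percent(double before, double after) {
  return (after - before) / before * 100.0;
}
```

For example, `relative_gain_percent(141.0, 149.0)` evaluates the first shape's gain from its rounded GiB/s figures. The roofline change (67.5% -> 69.3%) is a difference of 1.8 percentage points.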

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mgehre-amd · 2026-05-28T11:03:14Z

d6fd7b6 had tuned the non-quantized version of the skinny GEMM for K=2048 by increasing the AC template argument from 8 to 16. Would the same help here?

mgehre-amd · 2026-05-29T09:41:35Z

The effect of the delay might be the same as the effect of staggering the start addresses of each block

eble-amd · 2026-05-29T19:01:41Z

> d6fd7b6 had tuned the non-quantized version of the skinny GEMM for K=2048 by increasing the AC template argument from 8 to 16. Would the same help here?

Unless I'm looking at the wrong thing, this MR bumps it from 16 to 32.

When the int4 weight matrix exceeds L2 cache, wider memory loads (ACHUNK=32 vs 16) improve bandwidth on the wvSplitK_int4_g kernel. The L2 size is queried at runtime via hipDeviceProp, so the threshold adapts to different GPUs.

Measured on Radeon 8060S (gfx1151, 2 MiB L2):
- 1x2048x16384: 141 -> 149 GiB/s (+5%)
- 1x2560x2048:  162 -> 166 GiB/s (+2%)
- 1x32768x2048: 199 -> 200 GiB/s (+1%)
- Overall for Gemma-2B AWQ int4: 67.5% -> 69.3% roofline

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
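The dispatch rule described in this commit message can be sketched in plain C++ as follows. This is an illustration under stated assumptions, not the PR's code: `int4_weight_bytes`, `pick_achunk`, and `kL2Bytes8060S` are hypothetical names, and the L2 size is passed in as a parameter so the sketch runs without a GPU. In the PR the size is reportedly queried at runtime via hipDeviceProp. The byte count mirrors the `M_in * K_in / 2` expression in the diff, since two int4 values pack into one byte.

```cpp
#include <cstdint>

// Bytes occupied by an M x K int4 weight matrix (two 4-bit values per byte).
// Mirrors the `M_in * K_in / 2` expression from the diff under review.
constexpr std::int64_t int4_weight_bytes(std::int64_t m, std::int64_t k) {
  return m * k / 2;
}

// Hypothetical helper: pick the load width (ACHUNK) for the N=1 case.
// Returns 32 when the weights exceed the L2 size, otherwise 16,
// following the rule stated in the commit message.
constexpr int pick_achunk(std::int64_t m, std::int64_t k,
                          std::int64_t l2_bytes) {
  return int4_weight_bytes(m, k) > l2_bytes ? 32 : 16;
}

// 2 MiB L2, the figure quoted for Radeon 8060S (gfx1151) in the PR.
constexpr std::int64_t kL2Bytes8060S = 2 * 1024 * 1024;
```

Under this rule, all three benchmarked shapes have int4 weights larger than 2 MiB (16 MiB, 2.5 MiB, and 32 MiB), so each would take the ACHUNK=32 path on this GPU. A matrix of exactly 2 MiB would not, because the comparison is strict.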

eble-amd · 2026-05-29T21:23:36Z

> The effect of the delay might be the same as the effect of staggering the start addresses of each block

Since we discussed (elsewhere) spending time to dig deeper into why the staggering is helping, I removed the staggering commit from this PR so that it doesn't delay merging the other improvement.

mgehre-amd · 2026-06-01T10:15:56Z

-    else /* N=1: YTILE=2 beats YTILE=1 across all CuCount values */   \
-      WVSPLIT_INT4G_GS(2, 4, __N, _HAS_ZP)                            \
+    else { /* N=1: YTILE=2 beats YTILE=1 across all CuCount values */ \
+      if (M_in * K_in / 2 > get_l2_cache_size_int4()) {               \


AFAIK, we cache the activations in LDS and read the weights with non-temporal loads (i.e., bypassing the L1/L2 caches). Why is the L2 cache size relevant here?

eble-amd · 2026-06-02T17:23:20Z

I've been rerunning more detailed parameter sweeps since rebasing, and I've seen some differences from what I saw in April. I also see that some changes to dispatch conditions/parameters strangely appear to improve the end-to-end benchmark but slow down specific shapes. I'm marking this as a draft again because I don't understand these things.

eble-amd requested review from mgehre-amd and roberteg16 April 17, 2026 18:45

eble-amd changed the title from "Skinny int4 perf" to "Improve int4 AWQ GEMV performance on Radeon 8060S and similar GPUs" Apr 20, 2026

eble-amd force-pushed the skinny-int4-perf branch from 78a263f to acbca20 Compare May 29, 2026 21:16

eble-amd marked this pull request as ready for review May 29, 2026 21:26

mgehre-amd reviewed Jun 1, 2026

eble-amd marked this pull request as draft June 2, 2026 17:23

eble-amd wants to merge 1 commit into ROCm:gfx11 from eble-amd:skinny-int4-perf
