Improve int4 AWQ GEMV performance on Radeon 8060S and similar GPUs #886

Draft
eble-amd wants to merge 1 commit into ROCm:gfx11 from eble-amd:skinny-int4-perf

Conversation

@eble-amd

@eble-amd eble-amd commented Apr 17, 2026

Purpose

Improve GEMV performance on Radeon 8060S and similar GPUs.

Test Plan

  • vllm benchmark with Gemma 2B AWQ
  • pytest performance tests (new golden values included)

Test Results

Copied from commit message:

Measured on Radeon 8060S (gfx1151, 2 MiB L2):
- 1x2048x16384: 141 -> 149 GiB/s (+5%)
- 1x2560x2048:  162 -> 166 GiB/s (+2%)
- 1x32768x2048: 199 -> 200 GiB/s (+1%)
- Overall for Gemma-2B AWQ int4: 67.5% -> 69.3% roofline
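For readers who want to sanity-check figures like these, the sketch below shows one way a GEMV bandwidth number and a roofline fraction could be derived from a measured kernel time. It is not the benchmark code from this PR: the function names, the choice to count only the packed int4 weight bytes (ignoring scales, zero points, and activations), and the peak-bandwidth parameter are all assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: bytes occupied by an M x K int4 weight matrix,
// assuming two 4-bit weights are packed per byte. Scales and zero points
// are deliberately ignored in this sketch.
constexpr std::uint64_t int4_weight_bytes(std::uint64_t m, std::uint64_t k) {
  return m * k / 2;
}

// Achieved bandwidth in GiB/s for `bytes` moved in `seconds`.
constexpr double gib_per_second(std::uint64_t bytes, double seconds) {
  return static_cast<double>(bytes) / seconds / (1024.0 * 1024.0 * 1024.0);
}

// Fraction of a peak bandwidth that was achieved. The peak value is an input
// chosen by the caller (measured or theoretical); none is assumed here.
constexpr double roofline_fraction(double achieved_gib_s, double peak_gib_s) {
  return achieved_gib_s / peak_gib_s;
}
```

Under these assumptions, a shape with dimensions 2048 and 16384 holds `int4_weight_bytes(2048, 16384)` = 16 MiB of packed weights, so the elapsed time of one GEMV call is enough to estimate its effective bandwidth.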

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@eble-amd eble-amd changed the title from "Skinny int4 perf" to "Improve int4 AWQ GEMV performance on Radeon 8060S and similar GPUs" Apr 20, 2026
@mgehre-amd

d6fd7b6 had tuned the non-quantized version of the skinny GEMM for K=2048 by increasing the AC template argument from 8 to 16. Would the same help here?

@mgehre-amd

The effect of the delay might be the same as the effect of staggering the start addresses of each block

@eble-amd

Author

> d6fd7b6 had tuned the non-quantized version of the skinny GEMM for K=2048 by increasing the AC template argument from 8 to 16. Would the same help here?

Unless I'm looking at the wrong thing, this MR bumps it from 16 to 32.

When the int4 weight matrix exceeds L2 cache, wider memory loads
(ACHUNK=32 vs 16) improve bandwidth on the wvSplitK_int4_g kernel.  The
L2 size is queried at runtime via hipDeviceProp, so the threshold adapts
to different GPUs.

Measured on Radeon 8060S (gfx1151, 2 MiB L2):
- 1x2048x16384: 141 -> 149 GiB/s (+5%)
- 1x2560x2048:  162 -> 166 GiB/s (+2%)
- 1x32768x2048: 199 -> 200 GiB/s (+1%)
- Overall for Gemma-2B AWQ int4: 67.5% -> 69.3% roofline

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
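The selection rule described in the commit message can be sketched as a small host-side function. This is an illustrative reading of the description, not the dispatch code from the PR: `select_achunk` and the two constants are hypothetical names, and the comparison simply mirrors the `M * K / 2 > L2 size` form. In HIP the L2 size is exposed as the `l2CacheSize` field of `hipDeviceProp_t`; here it is passed in as a plain argument so the sketch compiles without the HIP runtime.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the two load widths mentioned in the commit
// message (ACHUNK=16 and ACHUNK=32).
constexpr int kNarrowChunk = 16;
constexpr int kWideChunk = 32;

// Pick the wider load when the packed int4 weight matrix does not fit in L2.
// `l2_bytes` is assumed to come from a runtime device query.
constexpr int select_achunk(std::uint64_t m, std::uint64_t k,
                            std::uint64_t l2_bytes) {
  const std::uint64_t weight_bytes = m * k / 2;  // two int4 values per byte
  return weight_bytes > l2_bytes ? kWideChunk : kNarrowChunk;
}
```

Assuming the leading 1 in each benchmark shape is N and the other two numbers are the weight dimensions, the packed weights for the three shapes come to 16 MiB, 2.5 MiB, and 32 MiB. With a 2 MiB L2, all three would take the wider path under this rule.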
@eble-amd eble-amd force-pushed the skinny-int4-perf branch from 78a263f to acbca20 Compare May 29, 2026 21:16
@eble-amd

Author

> The effect of the delay might be the same as the effect of staggering the start addresses of each block

Since we discussed (elsewhere) spending time to dig deeper into why the staggering is helping, I removed the staggering commit from this PR so that it doesn't delay merging the other improvement.

@eble-amd eble-amd marked this pull request as ready for review May 29, 2026 21:26
else /* N=1: YTILE=2 beats YTILE=1 across all CuCount values */ \
WVSPLIT_INT4G_GS(2, 4, __N, _HAS_ZP) \
else { /* N=1: YTILE=2 beats YTILE=1 across all CuCount values */ \
if (M_in * K_in / 2 > get_l2_cache_size_int4()) { \

AFAIK, we cache the activations in LDS and read the weights with non-temporal loads (i.e., bypassing the L1/L2 caches). Why is the L2 cache size relevant here?

@eble-amd

eble-amd commented Jun 2, 2026

Author

I've been rerunning more detailed parameter sweeps since rebasing, and I've seen some differences from what I saw in April. I also see that some changes to dispatch conditions/parameters strangely appear to improve the end-to-end benchmark but slow down specific shapes. I'm marking this as a draft again because I don't understand these things.

@eble-amd eble-amd marked this pull request as draft June 2, 2026 17:23