Improve int4 AWQ GEMV performance on Radeon 8060S and similar GPUs #886
eble-amd wants to merge 1 commit into
d6fd7b6 had tuned the non-quantized version of the skinny GEMM for K=2048 by increasing the AC template argument from 8 to 16. Would the same help here?
The effect of the delay might be the same as the effect of staggering the start addresses of each block.
Unless I'm looking at the wrong thing, this MR bumps it from 16 to 32.
When the int4 weight matrix exceeds L2 cache, wider memory loads (ACHUNK=32 vs 16) improve bandwidth on the wvSplitK_int4_g kernel. The L2 size is queried at runtime via hipDeviceProp, so the threshold adapts to different GPUs.

Measured on Radeon 8060S (gfx1151, 2 MiB L2):
- 1x2048x16384: 141 -> 149 GiB/s (+5%)
- 1x2560x2048: 162 -> 166 GiB/s (+2%)
- 1x32768x2048: 199 -> 200 GiB/s (+1%)
- Overall for Gemma-2B AWQ int4: 67.5% -> 69.3% roofline

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
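The dispatch rule described in the commit message can be sketched roughly as follows. This is a minimal host-side illustration, not the PR's code: `int4_weight_bytes` and `pick_achunk` are hypothetical helper names, and the L2 size is passed in as a plain argument, whereas the PR is said to query it at runtime (presumably from the `l2CacheSize` field of `hipDeviceProp_t`).

```cpp
#include <cstdint>

// Hypothetical helper (not from the PR): bytes occupied by an M x K int4
// weight matrix, assuming two 4-bit weights are packed into each byte.
constexpr std::uint64_t int4_weight_bytes(std::uint64_t m, std::uint64_t k) {
    return m * k / 2;
}

// Hypothetical helper (not from the PR): choose the load width (ACHUNK).
// Mirrors the condition `M_in * K_in / 2 > L2 size` from the diff:
// use the wider 32-element chunk only when the weights do not fit in L2.
constexpr int pick_achunk(std::uint64_t m, std::uint64_t k,
                          std::uint64_t l2_bytes) {
    return int4_weight_bytes(m, k) > l2_bytes ? 32 : 16;
}

// Assumed L2 size for illustration: 2 MiB, as reported for gfx1151 above.
constexpr std::uint64_t kL2Bytes = 2ull * 1024 * 1024;

// 2048 x 16384 int4 weights are 16 MiB, so the wider loads are selected.
static_assert(pick_achunk(2048, 16384, kL2Bytes) == 32, "exceeds L2");
// 1024 x 2048 int4 weights are 1 MiB, so the narrower loads are kept.
static_assert(pick_achunk(1024, 2048, kL2Bytes) == 16, "fits in L2");
```

Under this reading, all three benchmarked shapes (16 MiB, 2.5 MiB, and 32 MiB of packed weights) exceed the 2 MiB L2 and would take the ACHUNK=32 path.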
78a263f to acbca20
Since we discussed (elsewhere) spending time to dig deeper into why the staggering is helping, I removed the staggering commit from this PR so that it doesn't delay merging the other improvement.
else /* N=1: YTILE=2 beats YTILE=1 across all CuCount values */ \
WVSPLIT_INT4G_GS(2, 4, __N, _HAS_ZP) \
else { /* N=1: YTILE=2 beats YTILE=1 across all CuCount values */ \
if (M_in * K_in / 2 > get_l2_cache_size_int4()) { \
AFAIK, we cache the activations in LDS and read the weights with non-temporal loads (i.e. bypassing the L1/L2 caches). Why is the L2 cache size relevant here?
I've been rerunning more detailed parameter sweeps since rebasing, and I've seen some differences from what I saw in April. I also see that some changes to dispatch conditions/parameters strangely appear to improve the end-to-end benchmark but slow down specific shapes. I'm marking this as a draft again because I don't understand these things.
Purpose
Improve GEMV performance on Radeon 8060S and similar GPUs.
Test Plan
Test Results
Copied from commit message: