[ROCm][gfx11] Restore TRITON_ATTN priority on gfx11 #994
Merged
mgehre-amd
approved these changes
Jun 10, 2026
mgehre-amd
reviewed
Jun 10, 2026
_get_backend_priorities unconditionally prepended ROCM_ATTN at top priority, which overrode the gfx1x (RDNA) block that intentionally ranks TRITON_ATTN ahead of ROCM_ATTN. On gfx1151 this selected ROCM_ATTN and routed decode through the slower kernel_paged_attention_2d path instead of TRITON_ATTN's kernel_unified_attention. Remove the prepend so the RDNA ordering is honored again.

KV-connector correctness for ROCM_ATTN is already enforced by its supports_kv_connector()=False, so the prepend's use_kv_connector guard was redundant; drop it and the now-unused parameter.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Robert Esclapez Garcia <robert.garcia@amd.com>
On RDNA (gfx11/gfx12), the attention backend selector was picking `ROCM_ATTN` instead of `TRITON_ATTN` in the auto-selected case (no `--attention-backend`). `ROCM_ATTN` routes decode through `chunked_prefill_paged_decode` → `kernel_paged_attention_2d`, which is significantly slower on RDNA than `TRITON_ATTN`'s `kernel_unified_attention`. On gfx1151 (Strix Halo) this is a ~7× per-decode regression.

This PR:
- Removes the `ROCM_ATTN` prepend in `_get_backend_priorities` (ROCm) so that the existing gfx1x block, which intentionally ranks `TRITON_ATTN` ahead of `ROCM_ATTN`, is honored again.

Root cause
The regression comes from an incorrect upstream merge of commit 95b4d2b, the commit that changed the preferred backend order for gfx11.
`_get_backend_priorities` (in `vllm/platforms/rocm.py`) was prepending `ROCM_ATTN` at priority 0 whenever `use_kv_connector` was false.

Because `get_valid_backends` assigns priority by list index (lowest wins), the priority-0 prepend overrode the gfx1x preference, so `ROCM_ATTN` was selected on RDNA. The prepend's `not use_kv_connector` guard was also redundant: connector compatibility is already enforced in `validate_configuration` via `supports_kv_connector()`, which `RocmAttentionBackend` overrides to `False`. Removing the prepend (and the now-unused `use_kv_connector` parameter) is therefore safe for the connector case and restores the correct RDNA ordering.

Profiling evidence (gfx1151, Qwen3 W4A16 MoE)
Same op (`vllm::unified_attention_with_output`), same input shapes; only the selected backend's kernel differs:

| Backend | Kernel(s) |
| --- | --- |
| `TRITON_ATTN` (expected) | `kernel_unified_attention` + `reduce_segments` |
| `ROCM_ATTN` (regressed) | `kernel_paged_attention_2d` |

W4A16 GEMM time was unchanged between the two runs; the entire end-to-end delta came from this attention-kernel difference.
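The selection logic from the root-cause analysis can be illustrated with a small, self-contained sketch. It is not vLLM's code: the `Backend` dataclass and the helpers `rdna_priorities`, `priorities_with_prepend`, and `select_backend` are hypothetical stand-ins, and the exact shape of the original prepend is an assumption. The sketch shows only the two properties described above: priority is the list index (lowest wins), and a backend that does not support KV connectors is filtered out during validation regardless of where it sits in the list.

```python
from dataclasses import dataclass


# Hypothetical stand-in for an attention backend; NOT vLLM's real class.
@dataclass(frozen=True)
class Backend:
    name: str
    supports_kv_connector: bool = True


TRITON_ATTN = Backend("TRITON_ATTN")
# Mirrors the PR text: the ROCM_ATTN backend reports no KV-connector support.
ROCM_ATTN = Backend("ROCM_ATTN", supports_kv_connector=False)


def rdna_priorities() -> list[Backend]:
    # Assumed shape of the gfx1x block: TRITON_ATTN ranked ahead of ROCM_ATTN.
    return [TRITON_ATTN, ROCM_ATTN]


def priorities_with_prepend(use_kv_connector: bool) -> list[Backend]:
    # Assumed shape of the regressed behavior: ROCM_ATTN is pushed to
    # index 0 whenever no KV connector is in use.
    base = rdna_priorities()
    if not use_kv_connector:
        return [ROCM_ATTN, *base]
    return base


def select_backend(priorities: list[Backend], use_kv_connector: bool) -> Backend:
    # Priority is the list index; the lowest index among valid backends wins.
    # Backends without connector support are dropped when a connector is used.
    valid = [
        (index, backend)
        for index, backend in enumerate(priorities)
        if backend.supports_kv_connector or not use_kv_connector
    ]
    return min(valid, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    # With the prepend, ROCM_ATTN wins on RDNA when no connector is used.
    print(select_backend(priorities_with_prepend(False), False).name)
    # Without the prepend, the RDNA ordering is honored.
    print(select_backend(rdna_priorities(), False).name)
    # Even if ROCM_ATTN were ranked first, validation removes it for connectors.
    print(select_backend([ROCM_ATTN, TRITON_ATTN], True).name)
```

The last call shows why a `not use_kv_connector` guard on the prepend adds nothing in this model: the validation filter already excludes `ROCM_ATTN` whenever a connector is in use.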
Test plan
`kernel_unified_attention` path.