[ROCm][gfx11] Restore TRITON_ATTN priority on gfx11#994

Merged
mgehre-amd merged 1 commit into gfx11 from rogarcia.fix_priority_attn_backend_gfx11
Jun 10, 2026

@roberteg16 roberteg16 commented Jun 10, 2026

On RDNA (gfx11/gfx12), the attention backend selector was picking ROCM_ATTN instead of TRITON_ATTN in the auto-selected case (no --attention-backend given). ROCM_ATTN routes decode through chunked_prefill_paged_decode → kernel_paged_attention_2d, which is significantly slower on RDNA than TRITON_ATTN's kernel_unified_attention. On gfx1151 (Strix Halo) this is a ~7× per-decode regression.

This PR:

  1. Removes the unconditional ROCM_ATTN prepend in _get_backend_priorities (ROCm) so the existing gfx1x block — which intentionally ranks TRITON_ATTN ahead of ROCM_ATTN — is honored again.

Root cause

An incorrect upstream merge of commit 95b4d2b, which changed the preferred backend order for gfx11.

_get_backend_priorities (in vllm/platforms/rocm.py) was prepending ROCM_ATTN at priority 0 whenever use_kv_connector was false:

backends = []
# ROCM_ATTN uses (2, num_blocks, ...) KV cache layout which is
# incompatible with KV connectors that require blocks-first layout.
if not use_kv_connector:
    backends.append(AttentionBackendEnum.ROCM_ATTN)
...
if on_gfx1x():
    # On RDNA (gfx11/gfx12), TRITON_ATTN is faster than ROCM_ATTN ...
    backends.append(AttentionBackendEnum.TRITON_ATTN)
    backends.append(AttentionBackendEnum.ROCM_ATTN)

Because get_valid_backends assigns priority by list index (lowest wins), the priority-0 prepend overrode the gfx1x preference, so ROCM_ATTN was selected on RDNA. The guard on the prepend (not use_kv_connector) was also redundant: connector compatibility is already enforced in validate_configuration via supports_kv_connector(), which RocmAttentionBackend overrides to return False. Removing the prepend (and the now-unused use_kv_connector parameter) is therefore safe for the connector case and restores the correct RDNA ordering.
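
The ordering argument above can be illustrated with a small standalone model. This is only a sketch, not the code in vllm/platforms/rocm.py: the Backend enum, the SUPPORTS_KV_CONNECTOR table, and the functions priorities_before, priorities_after, and select are hypothetical stand-ins, and the assumption that TRITON_ATTN supports KV connectors is made purely for illustration.

```python
from enum import Enum, auto


class Backend(Enum):
    # Hypothetical stand-in for AttentionBackendEnum.
    ROCM_ATTN = auto()
    TRITON_ATTN = auto()


# Toy capability table. ROCM_ATTN -> False mirrors the PR text;
# TRITON_ATTN -> True is an assumption made for this sketch.
SUPPORTS_KV_CONNECTOR = {
    Backend.ROCM_ATTN: False,
    Backend.TRITON_ATTN: True,
}


def priorities_before(use_kv_connector: bool) -> list[Backend]:
    """Model of the regressed ordering on gfx1x."""
    backends = []
    if not use_kv_connector:
        # The prepend: ROCM_ATTN lands at index 0.
        backends.append(Backend.ROCM_ATTN)
    # The gfx1x block, which intends TRITON_ATTN to come first.
    backends.append(Backend.TRITON_ATTN)
    backends.append(Backend.ROCM_ATTN)
    return backends


def priorities_after() -> list[Backend]:
    """Model of the fixed ordering on gfx1x: only the gfx1x block remains."""
    return [Backend.TRITON_ATTN, Backend.ROCM_ATTN]


def select(backends: list[Backend], use_kv_connector: bool) -> Backend:
    """Pick the valid backend with the lowest list index."""
    for backend in backends:
        # Validation step: skip backends that cannot serve a KV connector.
        if use_kv_connector and not SUPPORTS_KV_CONNECTOR[backend]:
            continue
        return backend
    raise RuntimeError("no valid attention backend")


if __name__ == "__main__":
    print(select(priorities_before(False), False))  # regressed choice
    print(select(priorities_after(), False))        # restored choice
    print(select(priorities_after(), True))         # connector case
```

In this model the connector case picks the same backend before and after the change, because the validation step already filters ROCM_ATTN out; only the no-connector case changes.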

Profiling evidence (gfx1151, Qwen3 W4A16 MoE)

Same op (vllm::unified_attention_with_output), same input shapes; only the selected backend's kernel differs:

| Path | Decode kernel | Per-instance | Total (×1270 decodes) |
| --- | --- | --- | --- |
| TRITON_ATTN (expected) | kernel_unified_attention + reduce_segments | ~35 µs | ~44 ms |
| ROCM_ATTN (regressed) | kernel_paged_attention_2d | ~248 µs | ~315 ms |

W4A16 GEMM time was unchanged between the two runs; the entire end-to-end delta came from this attention-kernel difference.
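
As a quick consistency check on the profiling numbers, the totals and the ~7× figure can be reproduced from the per-instance times and the decode count. The variable names below are illustrative only; the inputs are the approximate values reported above.

```python
# Approximate per-instance decode attention times from the profile, in microseconds.
per_instance_us = {"TRITON_ATTN": 35.0, "ROCM_ATTN": 248.0}
DECODES = 1270  # number of decode steps in the profiled run

# Total attention time per path, in milliseconds.
totals_ms = {name: us * DECODES / 1000.0 for name, us in per_instance_us.items()}

# Per-decode slowdown of ROCM_ATTN relative to TRITON_ATTN.
slowdown = per_instance_us["ROCM_ATTN"] / per_instance_us["TRITON_ATTN"]

if __name__ == "__main__":
    for name, total in totals_ms.items():
        print(f"{name}: {total:.1f} ms")
    print(f"slowdown: {slowdown:.1f}x")
```

The computed totals round to the ~44 ms and ~315 ms in the table, and the ratio comes out slightly above 7.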

Test plan

  • On-device gfx1151 benchmark confirming decode attention returns to the kernel_unified_attention path.

@roberteg16 roberteg16 requested a review from mgehre-amd June 10, 2026 13:57
@roberteg16 roberteg16 requested a review from dllehr-amd as a code owner June 10, 2026 13:57
@roberteg16 roberteg16 changed the title from "[ROCm][gfx11] Restore TRITON_ATTN priority on RDNA; mark TURBOQUANT KV-connector incompatible" to "[ROCm][gfx11] Restore TRITON_ATTN priority on RDNA" Jun 10, 2026
@roberteg16 roberteg16 force-pushed the rogarcia.fix_priority_attn_backend_gfx11 branch from f60d86e to 5bc109d Compare June 10, 2026 13:59
@roberteg16 roberteg16 requested a review from amd-callumm June 10, 2026 14:00
@roberteg16 roberteg16 changed the title from "[ROCm][gfx11] Restore TRITON_ATTN priority on RDNA" to "[ROCm][gfx11] Restore TRITON_ATTN priority on gfx11" Jun 10, 2026
@mgehre-amd

Your CI failure might be fixed by #991

Comment thread vllm/platforms/rocm.py
@mgehre-amd mgehre-amd removed the request for review from dllehr-amd June 10, 2026 14:47
_get_backend_priorities unconditionally prepended ROCM_ATTN at top
priority, which overrode the gfx1x (RDNA) block that intentionally ranks
TRITON_ATTN ahead of ROCM_ATTN. On gfx1151 this selected ROCM_ATTN and
routed decode through the slower kernel_paged_attention_2d path instead
of TRITON_ATTN's kernel_unified_attention. Remove the prepend so the
RDNA ordering is honored again. KV-connector correctness for ROCM_ATTN
is already enforced by its supports_kv_connector()=False, so the prepend's
use_kv_connector guard was redundant; drop it and the now-unused parameter.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Robert Esclapez Garcia <robert.garcia@amd.com>
@roberteg16 roberteg16 force-pushed the rogarcia.fix_priority_attn_backend_gfx11 branch from 5bc109d to 08b7a33 Compare June 10, 2026 14:54
@mgehre-amd mgehre-amd merged commit 535c582 into gfx11 Jun 10, 2026
4 of 5 checks passed

2 participants