[ROCm][gfx11] Restore TRITON_ATTN priority on gfx11 by roberteg16 · Pull Request #994 · ROCm/vllm

roberteg16 · 2026-06-10T13:57:14Z

On RDNA (gfx11/gfx12), the attention backend selector was picking ROCM_ATTN instead of TRITON_ATTN for the auto-selected (no --attention-backend) case. ROCM_ATTN routes decode through chunked_prefill_paged_decode → kernel_paged_attention_2d, which is significantly slower on RDNA than TRITON_ATTN's kernel_unified_attention. On gfx1151 (Strix Halo) this is a ~7× per-decode regression.

This PR:

Removes the unconditional ROCM_ATTN prepend in _get_backend_priorities (ROCm) so the existing gfx1x block — which intentionally ranks TRITON_ATTN ahead of ROCM_ATTN — is honored again.

Root cause

An incorrect merge of upstream commit 95b4d2b, which changed the preferred backend order for gfx11.

_get_backend_priorities (in vllm/platforms/rocm.py) was prepending ROCM_ATTN at priority 0 whenever use_kv_connector was false:

backends = []
# ROCM_ATTN uses (2, num_blocks, ...) KV cache layout which is
# incompatible with KV connectors that require blocks-first layout.
if not use_kv_connector:
    backends.append(AttentionBackendEnum.ROCM_ATTN)
...
if on_gfx1x():
    # On RDNA (gfx11/gfx12), TRITON_ATTN is faster than ROCM_ATTN ...
    backends.append(AttentionBackendEnum.TRITON_ATTN)
    backends.append(AttentionBackendEnum.ROCM_ATTN)

Because get_valid_backends assigns priority by list index (lowest wins), the priority-0 prepend overrode the gfx1x preference, so ROCM_ATTN was selected on RDNA. The guard on that prepend (not use_kv_connector) was also redundant: connector compatibility is already enforced in validate_configuration via supports_kv_connector(), which RocmAttentionBackend overrides to return False. Removing the prepend (and the now-unused use_kv_connector parameter) is therefore safe for the connector case and restores the correct RDNA ordering.
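The ordering problem can be reproduced in isolation. The following is a minimal sketch, not the vLLM code: `Backend`, `SUPPORTS_KV_CONNECTOR`, `gfx1x_priorities_before`, `gfx1x_priorities_after`, and `select` are hypothetical stand-ins, only the gfx1x branch is modeled, and the assumption that TRITON_ATTN accepts KV connectors is made purely for illustration.

```python
from enum import Enum, auto


class Backend(Enum):
    """Hypothetical stand-in for AttentionBackendEnum (two members only)."""
    ROCM_ATTN = auto()
    TRITON_ATTN = auto()


# Hypothetical capability table. ROCM_ATTN -> False mirrors the
# supports_kv_connector() override described above; the TRITON_ATTN
# value is an assumption made for this sketch.
SUPPORTS_KV_CONNECTOR = {
    Backend.ROCM_ATTN: False,
    Backend.TRITON_ATTN: True,
}


def gfx1x_priorities_before(use_kv_connector: bool) -> list:
    """gfx1x ordering with the unconditional prepend (regressed behavior)."""
    backends = []
    if not use_kv_connector:
        backends.append(Backend.ROCM_ATTN)  # lands at index 0
    backends.append(Backend.TRITON_ATTN)
    backends.append(Backend.ROCM_ATTN)
    return backends


def gfx1x_priorities_after() -> list:
    """gfx1x ordering once the prepend is removed."""
    return [Backend.TRITON_ATTN, Backend.ROCM_ATTN]


def select(backends: list, use_kv_connector: bool) -> Backend:
    """Return the first valid backend, i.e. the one with the lowest index."""
    for backend in backends:
        if use_kv_connector and not SUPPORTS_KV_CONNECTOR[backend]:
            continue  # rejected by the capability check, not by ordering
        return backend
    raise ValueError("no valid attention backend")


if __name__ == "__main__":
    print(select(gfx1x_priorities_before(False), False).name)
    print(select(gfx1x_priorities_after(), False).name)
    print(select(gfx1x_priorities_after(), True).name)
```

In this sketch the old ordering picks ROCM_ATTN when no connector is in use, while the new ordering picks TRITON_ATTN. With a connector enabled, ROCM_ATTN is skipped by the capability check alone, which is why the guard on the prepend added nothing.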

Profiling evidence (gfx1151, Qwen3 W4A16 MoE)

Same op (vllm::unified_attention_with_output), same input shapes; only the selected backend's kernel differs:

| Path | Decode kernel | Per-instance | Total (×1270 decodes) |
| --- | --- | --- | --- |
| `TRITON_ATTN` (expected) | `kernel_unified_attention` + `reduce_segments` | ~35 µs | ~44 ms |
| `ROCM_ATTN` (regressed) | `kernel_paged_attention_2d` | ~248 µs | ~315 ms |

W4A16 GEMM time was unchanged between the two runs; the entire end-to-end delta came from this attention-kernel difference.
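As a quick consistency check on the figures above, the totals and the ~7× ratio follow directly from the per-instance times and the decode count. The snippet below only redoes that arithmetic with the rounded values from the table; the variable names are illustrative.

```python
# Rounded per-instance decode times from the profiling table, in microseconds.
per_instance_us = {"TRITON_ATTN": 35.0, "ROCM_ATTN": 248.0}
DECODES = 1270  # number of decode steps in the profiled run

# Total attention time per path, converted from microseconds to milliseconds.
totals_ms = {name: us * DECODES / 1000.0 for name, us in per_instance_us.items()}

# Per-decode slowdown of the regressed path relative to the expected one.
slowdown = per_instance_us["ROCM_ATTN"] / per_instance_us["TRITON_ATTN"]

if __name__ == "__main__":
    for name, total in totals_ms.items():
        print(f"{name}: {total:.2f} ms")
    print(f"slowdown: {slowdown:.2f}x")
```

This gives about 44 ms and 315 ms for the two paths and a ratio of roughly 7.1, matching the table and the per-decode regression quoted in the description.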

Test plan

On-device gfx1151 benchmark confirming decode attention returns to the kernel_unified_attention path.

mgehre-amd · 2026-06-10T14:33:42Z

Your CI failure might be fixed by #991

_get_backend_priorities unconditionally prepended ROCM_ATTN at top priority, which overrode the gfx1x (RDNA) block that intentionally ranks TRITON_ATTN ahead of ROCM_ATTN. On gfx1151 this selected ROCM_ATTN and routed decode through the slower kernel_paged_attention_2d path instead of TRITON_ATTN's kernel_unified_attention.

Remove the prepend so the RDNA ordering is honored again. KV-connector correctness for ROCM_ATTN is already enforced by its supports_kv_connector()=False, so the prepend's use_kv_connector guard was redundant; drop it and the now-unused parameter.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Robert Esclapez Garcia <robert.garcia@amd.com>

roberteg16 requested a review from mgehre-amd June 10, 2026 13:57

roberteg16 requested a review from dllehr-amd as a code owner June 10, 2026 13:57

roberteg16 changed the title ~~[ROCm][gfx11] Restore TRITON_ATTN priority on RDNA; mark TURBOQUANT KV-connector incompatible~~ [ROCm][gfx11] Restore TRITON_ATTN priority on RDNA Jun 10, 2026

roberteg16 force-pushed the rogarcia.fix_priority_attn_backend_gfx11 branch from f60d86e to 5bc109d Compare June 10, 2026 13:59

roberteg16 requested a review from amd-callumm June 10, 2026 14:00

roberteg16 changed the title ~~[ROCm][gfx11] Restore TRITON_ATTN priority on RDNA~~ [ROCm][gfx11] Restore TRITON_ATTN priority on gfx11 Jun 10, 2026

mgehre-amd approved these changes Jun 10, 2026

mgehre-amd reviewed Jun 10, 2026

Comment thread vllm/platforms/rocm.py

mgehre-amd removed the request for review from dllehr-amd June 10, 2026 14:47

roberteg16 force-pushed the rogarcia.fix_priority_attn_backend_gfx11 branch from 5bc109d to 08b7a33 Compare June 10, 2026 14:54

mgehre-amd merged commit 535c582 into gfx11 Jun 10, 2026
4 of 5 checks passed
