[Triton] Unified Attention support by k50112113 · Pull Request #1108 · ROCm/ATOM

k50112113 · 2026-06-05T19:46:56Z

Current AITER main supports both the key and value caches in either flash (un-shuffled) or non-flash (shuffled) layout in Triton unified attention.

This PR updates the behavior of ATOM_USE_UNIFIED_ATTN=1: build_kv_cache_tensor is now fixed to the shuffled layout and use_flash_layout is set to False, which propagates to PagedAttentionImpl.

In paged_attention_triton, the switch logic now becomes:

if envs.ATOM_USE_UNIFIED_ATTN or self.use_flash_layout:
    unified_attention(...)
else:
    run_pa_decode_gluon(...)
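
To make the effect of the condition above concrete, here is a minimal, self-contained sketch of the same dispatch. It is not the ATOM implementation: `DispatchInputs`, `select_kernel`, and `dispatch_table` are hypothetical names introduced for illustration, and the kernels are represented by their names as strings rather than called. The point it illustrates is that with ATOM_USE_UNIFIED_ATTN=1 the unified path is taken even though use_flash_layout is False.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchInputs:
    """Hypothetical container for the two flags the PR description mentions."""
    atom_use_unified_attn: bool  # stand-in for envs.ATOM_USE_UNIFIED_ATTN
    use_flash_layout: bool       # stand-in for self.use_flash_layout


def select_kernel(inputs: DispatchInputs) -> str:
    """Return the name of the kernel the switch logic would pick.

    Mirrors the condition quoted in the PR description; the real code
    calls the kernels, this sketch only reports which one is chosen.
    """
    if inputs.atom_use_unified_attn or inputs.use_flash_layout:
        return "unified_attention"
    return "run_pa_decode_gluon"


def dispatch_table() -> dict:
    """Enumerate all four flag combinations and the kernel each selects."""
    return {
        (env_flag, flash): select_kernel(DispatchInputs(env_flag, flash))
        for env_flag in (False, True)
        for flash in (False, True)
    }


if __name__ == "__main__":
    for (env_flag, flash), kernel in dispatch_table().items():
        print(f"ATOM_USE_UNIFIED_ATTN={env_flag} use_flash_layout={flash} -> {kernel}")
```

Under these assumptions, only the combination with both flags False falls through to run_pa_decode_gluon.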

lm_eval on gpt-oss-120b

local-completions ({'model': '/data/openai/gpt-oss-120b', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: 2000.0, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.4405|±  |0.0137|
|     |       |strict-match    |     3|exact_match|↑  |0.2055|±  |0.0111|

k50112113 added 2 commits June 5, 2026 19:39

update TritonMHAMetadataBuilder, with use_flash_layout=False

7451d8f

temp set ATOM_USE_UNIFIED_ATTN=1 by default

648be58

k50112113 changed the title ~~update TritonMHAMetadataBuilder, with use_flash_layout=False~~ [Triton] Unified Attention support Jun 5, 2026

k50112113 wants to merge 2 commits into main from shaoclee/triton_attention_non_flash_support
