[Triton] Unified Attention support #1108

Open
k50112113 wants to merge 2 commits into main from shaoclee/triton_attention_non_flash_support


@k50112113
Contributor

Current AITER main supports both the flash (un-shuffled) and non-flash (shuffled) layouts for the key and value caches in Triton unified attention.

This PR updates the behavior of ATOM_USE_UNIFIED_ATTN=1: build_kv_cache_tensor is now fixed to the shuffled layout and use_flash_layout is set to False, which propagates to PagedAttentionImpl.

In paged_attention_triton, the switch logic now becomes:

if envs.ATOM_USE_UNIFIED_ATTN or self.use_flash_layout:
    unified_attention(...)
else:
    run_pa_decode_gluon(...)
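
The branch above can be read as a truth table over the two flags. The following is a minimal sketch for illustration only, not the ATOM implementation: `select_decode_kernel` is a hypothetical helper, and the returned strings merely name the kernels from the snippet above. The real code reads the flag from `envs` and calls the kernels directly.

```python
def select_decode_kernel(use_unified_attn: bool, use_flash_layout: bool) -> str:
    """Return the name of the kernel that the branch above would call.

    Hypothetical stand-in: the two arguments mirror
    envs.ATOM_USE_UNIFIED_ATTN and self.use_flash_layout.
    """
    if use_unified_attn or use_flash_layout:
        return "unified_attention"
    return "run_pa_decode_gluon"


# Enumerate all four flag combinations.
for unified in (False, True):
    for flash in (False, True):
        print(f"unified={unified} flash={flash} -> "
              f"{select_decode_kernel(unified, flash)}")
```

With ATOM_USE_UNIFIED_ATTN=1 and use_flash_layout=False (the combination this PR describes), the sketch selects unified_attention. Only when both flags are false does it fall through to run_pa_decode_gluon.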

lm_eval on gpt-oss-120b

local-completions ({'model': '/data/openai/gpt-oss-120b', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: 2000.0, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.4405|±  |0.0137|
|     |       |strict-match    |     3|exact_match|↑  |0.2055|±  |0.0111|

k50112113 changed the title from "update TritonMHAMetadataBuilder, with use_flash_layout=False" to "[Triton] Unified Attention support" on Jun 5, 2026