[AMDGPU] comgr: wrap standalone VOP3PX2 WMMA for GFX1250 hotswap by jammm · Pull Request #2519 · ROCm/llvm-project

jammm · 2026-05-13T14:50:26Z

Summary

This PR adds a COMGR hotswap pass that wraps standalone
v_wmma_f32_16x16x128_f8f6f4 instructions during the GFX1250 B0-to-A0
rewrite path.

The sethalt fix that was previously part of this branch has been split out into
a separate PR/branch. This PR is now scoped only to the VOP3PX2 wrap behavior.

Details

On A0, an async trap between an LD_SCALE prefix and the WMMA half of a VOP3PX2
instruction is not safely recoverable unless the trap rewind path can identify
the pair. This pass makes standalone f8f6f4 WMMA instructions follow the paired
form by prepending an inline-zero LD_SCALE prefix.

The pass:

Adds comgr-hotswap-patch-vop3px-wrap.cpp
Registers the pass through the existing hotswap patch registry/vtable
Runs after the per-instruction splitters, so it can also scan trampoline bodies
Wraps standalone v_wmma_f32_16x16x128_f8f6f4 instructions
Preserves the original WMMA bytes and matrix format modifiers
Leaves already wrapped v_wmma_scale_f32_16x16x128_f8f6f4 instructions alone
Updates affected WMMA split tests to expect wrapped trampoline output
Keeps the existing VOP3PX2 source-2 patch separate from the new wrap pass

Tests

Added/updated COMGR lit coverage for:

standalone f8f6f4 WMMA wrapping
FP8/FP8 and mixed FP6/FP4 modifier preservation
already wrapped VOP3PX2 no-op behavior
K=128 splitter interaction where generated trampoline WMMAs are wrapped
idempotent second rewrite behavior
existing WMMA split tests whose trampoline output is now wrapped

Local validation:

cmake --build /jam/TheRock-build --target amd-comgr+stage

PATH="/jam/TheRock-build/compiler/amd-llvm/build/bin:$PATH" \
  /jam/TheRock-build/compiler/amd-llvm/build/bin/llvm-lit -sv \
  /jam/TheRock-build/compiler/amd-comgr/build/test-lit --filter='hotswap'

Result: 51/51 hotswap lit tests passed.

Additional FFM validation:

- hip/ FFM hotswap smoke subset: 6 passed
- Full tests/test_ffm.py -m ffm_hotswap: 89 passed, 66 skipped, 1 failed, 75 deselected

The single full-suite failure was
gpt_oss_graph_mode/fwd_kernel__4f64b8, which timed out at the 120s harness
limit. The same manifest also timed out in FFM baseline without hotswap, so this
does not appear to be caused by this PR.


Reference: https://github.com/ROCm/llvm-project/pull/2519

Wraps every standalone v_wmma_f32_16x16x128_f8f6f4 into a fused VOP3PX2 by prepending an inline-0 LD_SCALE prefix (numerically a no-op, scale=1.0). The 8-byte prefix is byte-identical for every wrap site, so the WMMA portion stays bit-identical and modifier-rich operand layouts (matrix_a_fmt, matrix_b_fmt, neg_lo, neg_hi, matrix_a_reuse, matrix_b_reuse, ...) are preserved without re-encoding. The prefix bytes also carry scale_src2 = VGPR0 (0x100), matching the encoding the in-place vop3px2-src2 patch produces, so no second pass is needed on wrap-emitted forms. Two-pass operation: 1. Decoded[] scan -- wraps user-written standalone WMMAs. 2. Trampoline scan -- wraps splitter-emitted WMMAs sitting in trampoline bytes (the K=128 32x16x128_f4 splitter from #2379 emits f8f6f4 into trampolines). Defensive checks: - Refuses to retarget if any v_wmma_f32_32x16x128_f4 (0xCC88) leftover exists in Decoded[] -- the K=128 splitter must eliminate these before the wrap pass runs. - Already-wrapped detection: skips wrapping if the preceding decoded instruction is a SCALE form (per ISA doc p.158, the SCALE prefix must be contiguous with the WMMA). Wired into b0a0.cpp dispatcher AFTER all per-instruction patches and after applyVop3px2Src2Fix, so the wrap pass sees the final post-split trampoline layout. Pairs with the existing trap-handler split path in rocm-systems (commit 74c647e6605, "rocr: GFX12.5 : VOP3PX instruction split in trap handler"). LIT tests: - hotswap-vop3px-wrap.s: 5 cases (bare, FP8/FP8 explicit, FP6/FP4, already-wrapped, K=128 split-then-wrap chain). - hotswap-wmma-split{,-msplit-fp-imm,-msplit-neg-lo}.s: CHECK lines updated to expect post-wrap disassembly (v_wmma_scale_* instead of bare f8f6f4) since the wrap pass now wraps splitter outputs. - hotswap-vop3px2-src2-noop.s: comments updated to describe the new wrap interaction.

Rewrites every s_sethalt to s_nop 0 in-place (4-byte overwrite, no length change, no offset shifts). Defensive: LLVM almost never emits s_sethalt in production code (only via the int_amdgcn_s_sethalt intrinsic, mostly in debug builds), so this pass is dormant on most inputs -- it exists for defense-in-depth against any input ELF, compiler-emitted or hand-written, that ships s_sethalt. The replacement bytes come from LS.SNopBytes (pre-encoded at initLLVM() time, the same source the splitter uses for trampoline padding). Wired into b0a0.cpp dispatcher AFTER applyVop3pxWrapPatch (the order doesn't matter functionally -- sethalt-fix and the wrap pass don't interact -- but keeping all the A0-only post-loop passes adjacent makes the dispatch order easy to read). LIT test (hotswap-sethalt-fix.s): 4 cases -- bare s_sethalt is neutralized, s_sethalt before VOP3PX2 is neutralized, multiple s_sethalt instances are all neutralized, kernels without s_sethalt are not modified.

…to users/jam/a0-atomicity-hotswap-gfx1250-staging # Conflicts: # amd/comgr/src/comgr-hotswap-patches.def

chinmaydd · 2026-06-17T18:55:29Z

+  // sethalt-fix: defensive in-place s_sethalt -> s_nop. LLVM almost
+  // never emits s_sethalt (only via int_amdgcn_s_sethalt in debug
+  // builds), so this pass is dormant on production code; it exists to
+  // catch hand-written assembly / inline asm that ships s_sethalt.


We should ideally document all of the pass pipeline in a single place, and I dont think this is the right place to do it.

Something to remember for the refactor.

chinmaydd · 2026-06-17T18:56:09Z

+  if (Ctx.LS.SNopBytes.size() != MinInstSize) {
+    log() << "hotswap: error: sethalt-fix: LS.SNopBytes is not "
+          << MinInstSize << " bytes (got " << Ctx.LS.SNopBytes.size() << ")\n";
+    return 0;
+  }


Can this ever happen ? Just curious

In the expected GFX1250 path, no. SNopBytes is produced by the MC encoder for s_nop 0, and scalar s_nop should encode to the 4-byte MinInstSize.

I kept this as a defensive invariant check before the memcpy: if LLVMState/MC encoding initialization ever fails, or a future target path returns an unexpected encoding size, we should skip the patch instead of writing malformed bytes into .text. This is not expected to be
triggered by a valid code object.

Would you prefer we remove this? it seems codex is more defensive than usual here.

Hmm, thanks for the context.

if LLVMState/MC encoding initialization ever fails

I think for this one, we should catch init failure when it happens.

future target path returns an unexpected encoding size

I'd find it hard to envision a case where we update the encoding for s_nop. Maybe we should replace this with an assert ? No strong opinions there IMO. I'll leave it up to you.

chinmaydd · 2026-06-17T19:02:04Z

I dont really have any other blocking comments as such. If we rebase and fix above, we should be good for now

…to users/jam/a0-atomicity-hotswap-gfx1250-staging

jammm · 2026-06-25T13:12:04Z

• Addressed the review feedback in this update:

Reworked the VOP3PX2 wrap scan to use the AMDGPU MC decoder instead of raw 4-byte pattern matching. The patch now decodes instructions, checks the decoded mnemonic/size, and only treats decoded standalone WMMA instructions as candidates.
Made the leftover check stricter: if B0-only F4/VOP3PX2 patterns remain after patching, the pass now reports failure instead of continuing with a partially rewritten code object.
Added clearer diagnostics around s_sethalt rewriting, including unexpected instruction sizes, bounds guards, and kernel/text-offset context in the logs.
Trimmed the local B0-to-A0 pipeline comment so it documents ordering without duplicating design notes.

Validation:

Clean rebuilt amd-llvm and amd-comgr after merging latest amd-staging.
COMGR hotswap lit tests: 52/52 passed.
FFM WMMA hotswap tests passed for both K=128 and Scale16, including runtime binary proof that the B0 input code object is rewritten and B0-only opcodes are removed.

jmmartinez · 2026-06-25T14:54:17Z

+  if (VT.applyVop3pxWrapPatch)
+    Patched += VT.applyVop3pxWrapPatch(Ctx);
+  if (VT.applySethaltFixPatch)
+    Patched += VT.applySethaltFixPatch(Ctx);


The two patches seem unrelated. Could we split them in 2 PRs? The sethalt path looks good to me.

The PR is mainly about VOP3PX2. I could make a separate PR for sethalt one.

Yes, would prefer if we split

Okay i'll do a separate PR for the sethalt patch

jammm · 2026-06-26T15:26:34Z

Split PR made for sethalt:
#3087

cc @chinmaydd @jmmartinez

chinmaydd

Title and description need an update

chinmaydd

Mostly LGTM. I have issues with error handling and some other things, but I suppose its fine for now

jammm · 2026-06-27T08:39:41Z

Mostly LGTM. I have issues with error handling and some other things, but I suppose its fine for now

Is there anything in particular I should fix? I'll gladly do that before we merge.

jammm added 2 commits May 13, 2026 07:54

jammm force-pushed the users/jam/a0-atomicity-hotswap-gfx1250-staging branch 2 times, most recently from 8faa4fc to 6fd776d Compare May 13, 2026 15:14

jammm added comgr Related to Code Object Manager hotswap Related to the Comgr Hotswap feature ci:skip Skip all CI builds/tests for this PR labels May 13, 2026

jammm marked this pull request as ready for review June 11, 2026 19:42

jammm requested review from chinmaydd and lamb-j as code owners June 11, 2026 19:42

Merge branch 'amd-staging' of https://github.com/ROCm/llvm-project in…

effc40b

…to users/jam/a0-atomicity-hotswap-gfx1250-staging # Conflicts: # amd/comgr/src/comgr-hotswap-patches.def

chinmaydd reviewed Jun 17, 2026

View reviewed changes

Comment thread amd/comgr/src/comgr-hotswap-patch-vop3px-wrap.cpp Outdated

chinmaydd reviewed Jun 17, 2026

View reviewed changes

Comment thread amd/comgr/src/comgr-hotswap-patch-vop3px-wrap.cpp

chinmaydd reviewed Jun 17, 2026

View reviewed changes

Comment thread amd/comgr/src/comgr-hotswap-patch-sethalt-fix.cpp Outdated

chinmaydd requested a review from jmmartinez June 17, 2026 19:02

jammm added 3 commits June 25, 2026 03:04

Merge branch 'amd-staging' of https://github.com/ROCm/llvm-project in…

0ab5d77

…to users/jam/a0-atomicity-hotswap-gfx1250-staging

[AMDGPU] comgr: address hotswap review feedback

c6012e9

Merge branch 'amd-staging' of https://github.com/ROCm/llvm-project in…

d8f5ebc

…to users/jam/a0-atomicity-hotswap-gfx1250-staging

jmmartinez reviewed Jun 25, 2026

View reviewed changes

chinmaydd removed the ci:skip Skip all CI builds/tests for this PR label Jun 25, 2026

[AMDGPU] comgr: split sethalt hotswap fix from VOP3PX2 patch

b714ace

jammm mentioned this pull request Jun 26, 2026

[AMDGPU] comgr: rewrite s_sethalt to s_nop for GFX1250 A0 #3087

Open

chinmaydd reviewed Jun 26, 2026

View reviewed changes

chinmaydd approved these changes Jun 26, 2026

View reviewed changes

jammm changed the title ~~[AMDGPU] comgr: VOP3PX2 wrap and s_sethalt rewrite for GFX1250 B0-to-A0 hotswap~~ [AMDGPU] comgr: wrap standalone VOP3PX2 WMMA for GFX1250 hotswap Jun 27, 2026

Uh oh!

Conversation

jammm commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

Tests

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

chinmaydd commented Jun 17, 2026

Uh oh!

jammm commented Jun 25, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jammm commented Jun 26, 2026

Uh oh!

chinmaydd left a comment

Choose a reason for hiding this comment

Uh oh!

chinmaydd left a comment

Choose a reason for hiding this comment

Uh oh!

jammm commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jammm commented May 13, 2026 •

edited

Loading