Skip to content

[AMDGPU] comgr: wrap standalone VOP3PX2 WMMA for GFX1250 hotswap#2519

Open
jammm wants to merge 7 commits into
amd-stagingfrom
users/jam/a0-atomicity-hotswap-gfx1250-staging
Open

[AMDGPU] comgr: wrap standalone VOP3PX2 WMMA for GFX1250 hotswap#2519
jammm wants to merge 7 commits into
amd-stagingfrom
users/jam/a0-atomicity-hotswap-gfx1250-staging

Conversation

@jammm

@jammm jammm commented May 13, 2026

Copy link
Copy Markdown

Summary

This PR adds a COMGR hotswap pass that wraps standalone
v_wmma_f32_16x16x128_f8f6f4 instructions during the GFX1250 B0-to-A0
rewrite path.

The sethalt fix that was previously part of this branch has been split out into
a separate PR/branch. This PR is now scoped only to the VOP3PX2 wrap behavior.

Details

On A0, an async trap between an LD_SCALE prefix and the WMMA half of a VOP3PX2
instruction is not safely recoverable unless the trap rewind path can identify
the pair. This pass makes standalone f8f6f4 WMMA instructions follow the paired
form by prepending an inline-zero LD_SCALE prefix.

The pass:

  • Adds comgr-hotswap-patch-vop3px-wrap.cpp
  • Registers the pass through the existing hotswap patch registry/vtable
  • Runs after the per-instruction splitters, so it can also scan trampoline bodies
  • Wraps standalone v_wmma_f32_16x16x128_f8f6f4 instructions
  • Preserves the original WMMA bytes and matrix format modifiers
  • Leaves already wrapped v_wmma_scale_f32_16x16x128_f8f6f4 instructions alone
  • Updates affected WMMA split tests to expect wrapped trampoline output
  • Keeps the existing VOP3PX2 source-2 patch separate from the new wrap pass

Tests

Added/updated COMGR lit coverage for:

  • standalone f8f6f4 WMMA wrapping
  • FP8/FP8 and mixed FP6/FP4 modifier preservation
  • already wrapped VOP3PX2 no-op behavior
  • K=128 splitter interaction where generated trampoline WMMAs are wrapped
  • idempotent second rewrite behavior
  • existing WMMA split tests whose trampoline output is now wrapped

Local validation:

cmake --build /jam/TheRock-build --target amd-comgr+stage

PATH="/jam/TheRock-build/compiler/amd-llvm/build/bin:$PATH" \
  /jam/TheRock-build/compiler/amd-llvm/build/bin/llvm-lit -sv \
  /jam/TheRock-build/compiler/amd-comgr/build/test-lit --filter='hotswap'

Result: 51/51 hotswap lit tests passed.

Additional FFM validation:

- hip/ FFM hotswap smoke subset: 6 passed
- Full tests/test_ffm.py -m ffm_hotswap: 89 passed, 66 skipped, 1 failed, 75 deselected

The single full-suite failure was
gpt_oss_graph_mode/fwd_kernel__4f64b8, which timed out at the 120s harness
limit. The same manifest also timed out in FFM baseline without hotswap, so this
does not appear to be caused by this PR.


Reference: https://github.com/ROCm/llvm-project/pull/2519

jammm added 2 commits May 13, 2026 07:54
Wraps every standalone v_wmma_f32_16x16x128_f8f6f4 into a fused
VOP3PX2 by prepending an inline-0 LD_SCALE prefix (numerically a
no-op, scale=1.0). The 8-byte prefix is byte-identical for every
wrap site, so the WMMA portion stays bit-identical and modifier-rich
operand layouts (matrix_a_fmt, matrix_b_fmt, neg_lo, neg_hi,
matrix_a_reuse, matrix_b_reuse, ...) are preserved without
re-encoding. The prefix bytes also carry scale_src2 = VGPR0 (0x100),
matching the encoding the in-place vop3px2-src2 patch produces, so
no second pass is needed on wrap-emitted forms.

Two-pass operation:
  1. Decoded[] scan -- wraps user-written standalone WMMAs.
  2. Trampoline scan -- wraps splitter-emitted WMMAs sitting in
     trampoline bytes (the K=128 32x16x128_f4 splitter from #2379
     emits f8f6f4 into trampolines).

Defensive checks:
  - Refuses to retarget if any v_wmma_f32_32x16x128_f4 (0xCC88)
    leftover exists in Decoded[] -- the K=128 splitter must
    eliminate these before the wrap pass runs.
  - Already-wrapped detection: skips wrapping if the preceding
    decoded instruction is a SCALE form (per ISA doc p.158, the
    SCALE prefix must be contiguous with the WMMA).

Wired into b0a0.cpp dispatcher AFTER all per-instruction patches
and after applyVop3px2Src2Fix, so the wrap pass sees the final
post-split trampoline layout.

Pairs with the existing trap-handler split path in rocm-systems
(commit 74c647e6605, "rocr: GFX12.5 : VOP3PX instruction split in
trap handler").

LIT tests:
  - hotswap-vop3px-wrap.s: 5 cases (bare, FP8/FP8 explicit, FP6/FP4,
    already-wrapped, K=128 split-then-wrap chain).
  - hotswap-wmma-split{,-msplit-fp-imm,-msplit-neg-lo}.s: CHECK lines
    updated to expect post-wrap disassembly (v_wmma_scale_* instead
    of bare f8f6f4) since the wrap pass now wraps splitter outputs.
  - hotswap-vop3px2-src2-noop.s: comments updated to describe the
    new wrap interaction.
Rewrites every s_sethalt to s_nop 0 in-place (4-byte overwrite, no
length change, no offset shifts). Defensive: LLVM almost never
emits s_sethalt in production code (only via the int_amdgcn_s_sethalt
intrinsic, mostly in debug builds), so this pass is dormant on most
inputs -- it exists for defense-in-depth against any input ELF,
compiler-emitted or hand-written, that ships s_sethalt.

The replacement bytes come from LS.SNopBytes (pre-encoded at
initLLVM() time, the same source the splitter uses for trampoline
padding).

Wired into b0a0.cpp dispatcher AFTER applyVop3pxWrapPatch (the
order doesn't matter functionally -- sethalt-fix and the wrap pass
don't interact -- but keeping all the A0-only post-loop passes
adjacent makes the dispatch order easy to read).

LIT test (hotswap-sethalt-fix.s): 4 cases -- bare s_sethalt is
neutralized, s_sethalt before VOP3PX2 is neutralized, multiple
s_sethalt instances are all neutralized, kernels without s_sethalt
are not modified.
@jammm jammm force-pushed the users/jam/a0-atomicity-hotswap-gfx1250-staging branch 2 times, most recently from 8faa4fc to 6fd776d Compare May 13, 2026 15:14
@jammm jammm added comgr Related to Code Object Manager hotswap Related to the Comgr Hotswap feature ci:skip Skip all CI builds/tests for this PR labels May 13, 2026
@jammm jammm marked this pull request as ready for review June 11, 2026 19:42
@jammm jammm requested review from chinmaydd and lamb-j as code owners June 11, 2026 19:42
…to users/jam/a0-atomicity-hotswap-gfx1250-staging

# Conflicts:
#	amd/comgr/src/comgr-hotswap-patches.def
Comment thread amd/comgr/src/comgr-hotswap-patch-vop3px-wrap.cpp Outdated
Comment thread amd/comgr/src/comgr-hotswap-patch-vop3px-wrap.cpp
Comment thread amd/comgr/src/comgr-hotswap-b0a0.cpp Outdated
Comment on lines +410 to +413
// sethalt-fix: defensive in-place s_sethalt -> s_nop. LLVM almost
// never emits s_sethalt (only via int_amdgcn_s_sethalt in debug
// builds), so this pass is dormant on production code; it exists to
// catch hand-written assembly / inline asm that ships s_sethalt.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should ideally document all of the pass pipeline in a single place, and I dont think this is the right place to do it.

Something to remember for the refactor.

Comment on lines +64 to +68
if (Ctx.LS.SNopBytes.size() != MinInstSize) {
log() << "hotswap: error: sethalt-fix: LS.SNopBytes is not "
<< MinInstSize << " bytes (got " << Ctx.LS.SNopBytes.size() << ")\n";
return 0;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this ever happen ? Just curious

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the expected GFX1250 path, no. SNopBytes is produced by the MC encoder for s_nop 0, and scalar s_nop should encode to the 4-byte MinInstSize.

I kept this as a defensive invariant check before the memcpy: if LLVMState/MC encoding initialization ever fails, or a future target path returns an unexpected encoding size, we should skip the patch instead of writing malformed bytes into .text. This is not expected to be
triggered by a valid code object.

Would you prefer we remove this? it seems codex is more defensive than usual here.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, thanks for the context.

if LLVMState/MC encoding initialization ever fails

I think for this one, we should catch init failure when it happens.

future target path returns an unexpected encoding size

I'd find it hard to envision a case where we update the encoding for s_nop. Maybe we should replace this with an assert ? No strong opinions there IMO. I'll leave it up to you.

Comment thread amd/comgr/src/comgr-hotswap-patch-sethalt-fix.cpp Outdated
@chinmaydd

Copy link
Copy Markdown

I dont really have any other blocking comments as such. If we rebase and fix above, we should be good for now

@chinmaydd chinmaydd requested a review from jmmartinez June 17, 2026 19:02
@jammm

jammm commented Jun 25, 2026

Copy link
Copy Markdown
Author

• Addressed the review feedback in this update:

  • Reworked the VOP3PX2 wrap scan to use the AMDGPU MC decoder instead of raw 4-byte pattern matching. The patch now decodes instructions, checks the decoded mnemonic/size, and only treats decoded standalone WMMA instructions as candidates.
  • Made the leftover check stricter: if B0-only F4/VOP3PX2 patterns remain after patching, the pass now reports failure instead of continuing with a partially rewritten code object.
  • Added clearer diagnostics around s_sethalt rewriting, including unexpected instruction sizes, bounds guards, and kernel/text-offset context in the logs.
  • Trimmed the local B0-to-A0 pipeline comment so it documents ordering without duplicating design notes.

Validation:

  • Clean rebuilt amd-llvm and amd-comgr after merging latest amd-staging.
  • COMGR hotswap lit tests: 52/52 passed.
  • FFM WMMA hotswap tests passed for both K=128 and Scale16, including runtime binary proof that the B0 input code object is rewritten and B0-only opcodes are removed.

Comment thread amd/comgr/src/comgr-hotswap-b0a0.cpp Outdated
Comment on lines +409 to +412
if (VT.applyVop3pxWrapPatch)
Patched += VT.applyVop3pxWrapPatch(Ctx);
if (VT.applySethaltFixPatch)
Patched += VT.applySethaltFixPatch(Ctx);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two patches seem unrelated. Could we split them in 2 PRs? The sethalt path looks good to me.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is mainly about VOP3PX2. I could make a separate PR for sethalt one.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, would prefer if we split

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay i'll do a separate PR for the sethalt patch

@chinmaydd chinmaydd removed the ci:skip Skip all CI builds/tests for this PR label Jun 25, 2026
@jammm

jammm commented Jun 26, 2026

Copy link
Copy Markdown
Author

Split PR made for sethalt:
#3087

cc @chinmaydd @jmmartinez

@chinmaydd chinmaydd left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Title and description need an update

@chinmaydd chinmaydd left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM. I have issues with error handling and some other things, but I suppose its fine for now

@jammm

jammm commented Jun 27, 2026

Copy link
Copy Markdown
Author

Mostly LGTM. I have issues with error handling and some other things, but I suppose its fine for now

Is there anything in particular I should fix? I'll gladly do that before we merge.

@jammm jammm changed the title [AMDGPU] comgr: VOP3PX2 wrap and s_sethalt rewrite for GFX1250 B0-to-A0 hotswap [AMDGPU] comgr: wrap standalone VOP3PX2 WMMA for GFX1250 hotswap Jun 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comgr Related to Code Object Manager hotswap Related to the Comgr Hotswap feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants