[AMDGPU] comgr: wrap standalone VOP3PX2 WMMA for GFX1250 hotswap#2519
[AMDGPU] comgr: wrap standalone VOP3PX2 WMMA for GFX1250 hotswap#2519jammm wants to merge 7 commits into
Conversation
Wraps every standalone v_wmma_f32_16x16x128_f8f6f4 into a fused
VOP3PX2 by prepending an inline-0 LD_SCALE prefix (numerically a
no-op, scale=1.0). The 8-byte prefix is byte-identical for every
wrap site, so the WMMA portion stays bit-identical and modifier-rich
operand layouts (matrix_a_fmt, matrix_b_fmt, neg_lo, neg_hi,
matrix_a_reuse, matrix_b_reuse, ...) are preserved without
re-encoding. The prefix bytes also carry scale_src2 = VGPR0 (0x100),
matching the encoding the in-place vop3px2-src2 patch produces, so
no second pass is needed on wrap-emitted forms.
Two-pass operation:
1. Decoded[] scan -- wraps user-written standalone WMMAs.
2. Trampoline scan -- wraps splitter-emitted WMMAs sitting in
trampoline bytes (the K=128 32x16x128_f4 splitter from #2379
emits f8f6f4 into trampolines).
Defensive checks:
- Refuses to retarget if any v_wmma_f32_32x16x128_f4 (0xCC88)
leftover exists in Decoded[] -- the K=128 splitter must
eliminate these before the wrap pass runs.
- Already-wrapped detection: skips wrapping if the preceding
decoded instruction is a SCALE form (per ISA doc p.158, the
SCALE prefix must be contiguous with the WMMA).
Wired into b0a0.cpp dispatcher AFTER all per-instruction patches
and after applyVop3px2Src2Fix, so the wrap pass sees the final
post-split trampoline layout.
Pairs with the existing trap-handler split path in rocm-systems
(commit 74c647e6605, "rocr: GFX12.5 : VOP3PX instruction split in
trap handler").
LIT tests:
- hotswap-vop3px-wrap.s: 5 cases (bare, FP8/FP8 explicit, FP6/FP4,
already-wrapped, K=128 split-then-wrap chain).
- hotswap-wmma-split{,-msplit-fp-imm,-msplit-neg-lo}.s: CHECK lines
updated to expect post-wrap disassembly (v_wmma_scale_* instead
of bare f8f6f4) since the wrap pass now wraps splitter outputs.
- hotswap-vop3px2-src2-noop.s: comments updated to describe the
new wrap interaction.
Rewrites every s_sethalt to s_nop 0 in-place (4-byte overwrite, no length change, no offset shifts). Defensive: LLVM almost never emits s_sethalt in production code (only via the int_amdgcn_s_sethalt intrinsic, mostly in debug builds), so this pass is dormant on most inputs -- it exists for defense-in-depth against any input ELF, compiler-emitted or hand-written, that ships s_sethalt. The replacement bytes come from LS.SNopBytes (pre-encoded at initLLVM() time, the same source the splitter uses for trampoline padding). Wired into b0a0.cpp dispatcher AFTER applyVop3pxWrapPatch (the order doesn't matter functionally -- sethalt-fix and the wrap pass don't interact -- but keeping all the A0-only post-loop passes adjacent makes the dispatch order easy to read). LIT test (hotswap-sethalt-fix.s): 4 cases -- bare s_sethalt is neutralized, s_sethalt before VOP3PX2 is neutralized, multiple s_sethalt instances are all neutralized, kernels without s_sethalt are not modified.
8faa4fc to
6fd776d
Compare
…to users/jam/a0-atomicity-hotswap-gfx1250-staging # Conflicts: # amd/comgr/src/comgr-hotswap-patches.def
| // sethalt-fix: defensive in-place s_sethalt -> s_nop. LLVM almost | ||
| // never emits s_sethalt (only via int_amdgcn_s_sethalt in debug | ||
| // builds), so this pass is dormant on production code; it exists to | ||
| // catch hand-written assembly / inline asm that ships s_sethalt. |
There was a problem hiding this comment.
We should ideally document all of the pass pipeline in a single place, and I dont think this is the right place to do it.
Something to remember for the refactor.
| if (Ctx.LS.SNopBytes.size() != MinInstSize) { | ||
| log() << "hotswap: error: sethalt-fix: LS.SNopBytes is not " | ||
| << MinInstSize << " bytes (got " << Ctx.LS.SNopBytes.size() << ")\n"; | ||
| return 0; | ||
| } |
There was a problem hiding this comment.
In the expected GFX1250 path, no. SNopBytes is produced by the MC encoder for s_nop 0, and scalar s_nop should encode to the 4-byte MinInstSize.
I kept this as a defensive invariant check before the memcpy: if LLVMState/MC encoding initialization ever fails, or a future target path returns an unexpected encoding size, we should skip the patch instead of writing malformed bytes into .text. This is not expected to be
triggered by a valid code object.
Would you prefer we remove this? it seems codex is more defensive than usual here.
There was a problem hiding this comment.
Hmm, thanks for the context.
if LLVMState/MC encoding initialization ever fails
I think for this one, we should catch init failure when it happens.
future target path returns an unexpected encoding size
I'd find it hard to envision a case where we update the encoding for s_nop. Maybe we should replace this with an assert ? No strong opinions there IMO. I'll leave it up to you.
|
I dont really have any other blocking comments as such. If we rebase and fix above, we should be good for now |
…to users/jam/a0-atomicity-hotswap-gfx1250-staging
…to users/jam/a0-atomicity-hotswap-gfx1250-staging
|
• Addressed the review feedback in this update:
Validation:
|
| if (VT.applyVop3pxWrapPatch) | ||
| Patched += VT.applyVop3pxWrapPatch(Ctx); | ||
| if (VT.applySethaltFixPatch) | ||
| Patched += VT.applySethaltFixPatch(Ctx); |
There was a problem hiding this comment.
The two patches seem unrelated. Could we split them in 2 PRs? The sethalt path looks good to me.
There was a problem hiding this comment.
The PR is mainly about VOP3PX2. I could make a separate PR for sethalt one.
There was a problem hiding this comment.
Okay i'll do a separate PR for the sethalt patch
|
Split PR made for sethalt: |
chinmaydd
left a comment
There was a problem hiding this comment.
Title and description need an update
chinmaydd
left a comment
There was a problem hiding this comment.
Mostly LGTM. I have issues with error handling and some other things, but I suppose its fine for now
Is there anything in particular I should fix? I'll gladly do that before we merge. |
Summary
This PR adds a COMGR hotswap pass that wraps standalone
v_wmma_f32_16x16x128_f8f6f4instructions during the GFX1250 B0-to-A0rewrite path.
The sethalt fix that was previously part of this branch has been split out into
a separate PR/branch. This PR is now scoped only to the VOP3PX2 wrap behavior.
Details
On A0, an async trap between an
LD_SCALEprefix and the WMMA half of a VOP3PX2instruction is not safely recoverable unless the trap rewind path can identify
the pair. This pass makes standalone f8f6f4 WMMA instructions follow the paired
form by prepending an inline-zero
LD_SCALEprefix.The pass:
comgr-hotswap-patch-vop3px-wrap.cppv_wmma_f32_16x16x128_f8f6f4instructionsv_wmma_scale_f32_16x16x128_f8f6f4instructions aloneTests
Added/updated COMGR lit coverage for:
Local validation: