codegen: add dest-mask support for VU0 VMOVE and VMR32 #116
Closed
jlsandri wants to merge 2 commits into ran-j:main from
Conversation
Copy-paste errors in the VU0 codegen path across ~36 instruction sites:
1. dest_mask lane-bit ordering was inverted: many emitters passed
{0x8, 0x4, 0x2, 0x1} to _mm_set_epi32() instead of the correct
{0x1, 0x2, 0x4, 0x8}, and the VSQD emitter had the further typo
{0x1, 0x2, 0x2, 0x1}. Because _mm_set_epi32() takes its arguments in
reverse lane order (highest lane first), the X bit was gating the W
lane and the Y bit the Z lane, and vice versa, causing masked writes
to silently misroute. The
affected instructions include VSQD, VADD/VSUB/VMUL/VMADD/VMSUB
family _Field variants, VMINI/VMAX _Field, VFTOI/VITOF family,
VADD_bc/VSUB_bc/VMUL_bc/VMADD_bc/VMSUB_bc, VADDq/VSUBq/VMULq/
VMADDq/VMSUBq, VADDi/VSUBi/VMULi/VMADDi/VMSUBi, and a number of
ADDAq/VMULA variants.
2. VU0 VMR32 was emitting _MM_SHUFFLE(0,0,0,1) instead of
_MM_SHUFFLE(0,3,2,1): rather than a true one-lane rotate, it moved
lane 1 into lane 0 and broadcast lane 0 into the remaining output lanes.
Both classes of bug are pure copy-paste / typo fixes with no runtime
behavioural change when dest_mask == 0xF (the most common case), but
they silently produce wrong results for any partial dest_mask.
Previously VMOVE and VMR32 emitted unconditional moves/shuffles that ignored the instruction's dest_mask field, so any masked VMOVE/VMR32 would incorrectly overwrite lanes the mask was meant to preserve. This patch:
- Fast-paths dest_mask == 0xF to the existing unconditional codegen.
- For any other mask, emits an _mm_blendv_ps guarded by a per-lane selector built from the dest_mask bits, matching the lane-selection convention used by the rest of the VU0 dest-mask emitters.

Stacks on the VMR32 shuffle pattern fix: without that prerequisite, the masked VMR32 path would rotate to the wrong lane.
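The masked path described above can be sketched as a standalone function. This is a hypothetical rendering under the same assumed lane convention (0x8 = X in SSE lane 0): the function name is illustrative, and the real emitter produces this sequence at JIT time rather than calling a helper.

```cpp
#include <immintrin.h>
#include <cstdint>

// Illustrative sketch of the masked VMR32 path, not the PR's actual code.
__attribute__((target("sse4.1")))  // _mm_blendv_ps requires SSE4.1
static __m128 vmr32_masked(__m128 dst, __m128 src, uint32_t dest_mask) {
    // Corrected one-lane rotate: x<-y, y<-z, z<-w, w<-x.
    __m128 rotated = _mm_shuffle_ps(src, src, _MM_SHUFFLE(0, 3, 2, 1));
    if (dest_mask == 0xF)
        return rotated;  // fast path: full write, same as the old codegen
    // Per-lane selector: all-ones in each lane whose dest_mask bit is set.
    const __m128i lane_bits = _mm_set_epi32(0x1, 0x2, 0x4, 0x8);
    const __m128i m = _mm_set1_epi32(static_cast<int>(dest_mask));
    __m128 sel = _mm_castsi128_ps(
        _mm_cmpeq_epi32(_mm_and_si128(m, lane_bits), lane_bits));
    // Selected lanes take the rotated value; the rest keep dst.
    return _mm_blendv_ps(dst, rotated, sel);
}
```

VMOVE's masked path would be the same sketch without the shuffle step.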
Force-pushed from c589c40 to 0b88fb9
Author
Closing as part of a batch cleanup after #107 landed. The runtime ecosystem refactor in #107 substantially reworked the files this PR touched, and I would like to re-audit the underlying fix against the new code structure before putting it back in front of you. If the fix is still needed after that re-audit, I will re-open as a focused PR rebased onto current main. Thanks for your patience.
Problem
VU0_S2_VMOVE and VU0_S2_VMR32 were emitting unconditional moves/shuffles that ignored the instruction's dest_mask field. Any masked VMOVE/VMR32 therefore overwrote lanes the mask was meant to preserve, silently corrupting state for every masked write.

Fix
Both emitters now:
- Fast-path dest_mask == 0xF to the existing unconditional codegen (identical output for the common case; zero overhead).
- For any other mask, emit an _mm_blendv_ps guarded by a per-lane selector built from the dest_mask bits, using the same {0x1, 0x2, 0x4, 0x8} lane-bit convention as the rest of the VU0 dest-mask emitters.

Why this is stacked on #112
VMR32's masked path emits a shuffle followed by a blend; the shuffle uses _MM_SHUFFLE(0,3,2,1), which is the corrected form from #112. Without that prerequisite merged, the masked VMR32 would rotate to the wrong lane: the same typo #112 fixes in the unmasked path.

Scope
ps2xRecomp/src/lib/code_generator.cpp

Rationale
Parity with the rest of the VU0 macro codegen. Every other vector instruction already respects dest_mask; VMOVE/VMR32 were outliers.