diff --git a/.dev/checklist.md b/.dev/checklist.md index e40a8a74e..5981ee41a 100644 --- a/.dev/checklist.md +++ b/.dev/checklist.md @@ -89,57 +89,71 @@ Prefix: W## (to distinguish from CW's F## items). `@./.dev/w47-investigation.md`. Low priority since 20 other benchmarks improved >10% (GC paths 40–76% faster). -- [ ] W54: `tgo_strops` (and other div-heavy TinyGo workloads) is - ~2.1× slower than wasmtime/cranelift on M4 Pro per - `bench/runtime_comparison.yaml` (2026-03-25, runs=1/warmup=0): - zwasm cached 63.2 ms vs wasmtime cached 30.0 ms. The original - framing — "constant-divisor folding is missing" — was - **disproven** during the 2026-04-29 evening investigation. Both - ARM64 (`src/jit.zig:3582-3666` `tryEmitDivByConstU32`) and - x86_64 (`src/x86.zig`) already emit the Hacker's Delight magic - multiply for `i32.div_u K`; the JIT dump for - `bench/wasm/tgo_string_ops.wasm` shows three MOVZ+MOVK+UMULL+LSR - sequences for the three `i32.div_u 10` sites, with zero `UDIV` - instructions. - - The remaining 2.1× lives in two places: - - 1. The 2-instruction magic-constant load (`MOVZ + MOVK` for - `0xCCCCCCCD`) is re-emitted inside the loop body on every - iteration; cranelift's SSA + GVN hoist it once so only - `UMULH/UMULL + LSR` stay hot. Three div sites in - `tgo_strops` cost ~6 ARM64 instructions per iteration that a - preheader hoist would eliminate. - 2. TinyGo emits a `mov rd = rs1` per `local.set`; cranelift - collapses those into register renames whereas zwasm's - linear-scan regalloc still spills them to LDR/STR against - `regs[]`. - - Single-pass-compatible levers, ranked: - - 1. **Loop-preheader magic hoist.** Extend `emitLoopPreHeader` - (today SIMD-only, `src/jit.zig:4604`) to scan for - `OP_CONST32 K → OP_DIV_U` patterns, allocate a callee-saved - register, and pre-load the magic. `tryEmitDivByConstU32` - short-circuits when the magic is already live. 
Risk: - medium — needs to coexist with the existing physical- - register layout (functions like `string_ops` (func#24, 13 - vregs) saturate the callee-saved set; the prologue would - have to reserve a free slot up front). - 2. **`OP_CONST32` reuse across loop back-edges.** Today - `known_consts` is wiped at every header. Skip unless (1) - lands — saves the 1-instr const itself but not the 2-instr - magic that hangs off it. - 3. **`OP_MOV` coalescing in linear-scan regalloc.** Substantial - surgery; deserves a separate W## entry. - - Out of scope (would break single-pass): SSA + dataflow, - global register allocation, automatic loop unroll / - vectorise. Re-record `runtime_comparison.yaml` at 5 runs / 3 - warmup before claiming any number — the current values are - single-sample. - - Full investigation log: `@./.dev/w54-investigation.md`. +- [x] W54 (substrate): structural cleanup landed via PR #91 from + `develop/w54-loop-info`. Single change: `src/loop_info.zig` is + the single source of truth for branch / loop / vreg liveness. + The two JIT backends used to maintain byte-identical + `scanBranchTargets` implementations; both now consume + `LoopInfo.analyse(...)`. `vreg_first_def[]` / `vreg_last_use[]` + are computed in the same forward sweep, ready for future + consumers (Phase 5 / Phase 4). Behaviour byte-identical to + main (verified via `--dump-jit` diff on tgo_string_ops func#24, + fib func#2, and the realworld suite on Mac aarch64 + Ubuntu + x86_64). Architecture: D138 in `.dev/decisions.md`. Session arc: + `.dev/w54-redesign-postmortem.md`. + +- [ ] W54-coalescer: liveness-driven mov coalescing extension to + `regalloc.copyPropagate`. Built and proven on Mac aarch64 (50/50 + realworld), reverted from the substrate PR after Linux x86_64 + CI flagged a `go_math_big` divergence — BigInt subtraction + result `864197532086419753208641975320` (wasmtime) vs + `864197532160206729503480181784` (zwasm). 
The regalloc itself + is arch-agnostic, so the same `RegFunc` flows through both + backends; the divergence implies an x86_64-specific assumption + in `src/x86.zig`'s codegen that the new IR layout (fewer MOVs, + shifted PCs) violates. Reproducible on OrbStack's + `my-ubuntu-amd64` with `develop/w54-loop-pass-redesign` + checked out and `zig build -Doptimize=ReleaseSafe`. The + coalescer commit (`ec8182f` on the archive branch) cherry-picks + cleanly; debugging is in the x86 backend's interaction with the + redef-stop pattern — likely a getOrLoad / SCRATCH contention + triggered by a specific opcode sequence that the coalesced IR + produces. Recommended bisect: dump `--dump-regir` for the + failing function, identify the first MOV the new coalescer + folds that the old version kept, then dump `--dump-jit` on x86 + to find the codegen mismatch. + +- [ ] W54-hoist-revisit: magic-constant loop-invariant hoist was + built and proven on `develop/w54-loop-pass-redesign` (archived + as `archive/w54-magic-hoist-2026-04-30`). digitCount JIT goes + 196 → 192 with hoist alone. **Held back** from the substrate PR + pending three prerequisites: + 1. **W47** — bench harness with σ < 5% on `tgo_strops`. The + hoist's effect is below the current 10% σ floor. + 2. **W54-x86** — x86_64 `pickHoistPhys` parity. ARM64-only land + forces a reconciling follow-up; bundling makes one coherent + change. + 3. `runtime_comparison.yaml` re-recorded at 5/3 hyperfine on a + thermally-stable rig. + Re-attempt: cherry-pick `1600397` + `c4b806e` from the archive + branch onto a fresh redesign branch once 1+2+3 land. + +- [ ] W54-x86: x86_64 magic-constant hoist parity for + W54-hoist-revisit. The archive branch's `src/x86.zig` already + has the `hoist_phys` / `hoist_displaced_infra` field + scaffolding; `pickHoistPhys` body needs implementing for x86's + reg layout (RBX/RBP/R15 vreg-bound for any reg_count >= 1; no + `inst_ptr_cached` slot to displace; R13/R14 free only when + `!has_memory`). 
Bench-driven decision on whether the win is + worth the displacement cost on functions with `has_memory`. + +- [ ] W54-libm: real-world `rw_c_math` (5.0× wasmtime cold, + 8.7× cached) is dominated by BLR-heavy libm dispatch (`sin`, + `cos`, `pow`, `sqrt` per iteration). Out of scope for the + loop-pass redesign. Single-pass-compatible candidates: intrinsic + recognition for imported function names (`sqrt` → FSQRT inline), + software-libm fallback for sin/cos/pow. Needs imported-function + name resolution on the predecode side. - [ ] W48 Phase 2: Linux binary size 1.56 MB → 1.50 MB (~62 KB more). W48 Phase 1 shipped (2026-04-25): `pub const panic = std.debug.simple_panic` diff --git a/.dev/decisions.md b/.dev/decisions.md index 8f9faa549..c51bb0d49 100644 --- a/.dev/decisions.md +++ b/.dev/decisions.md @@ -941,3 +941,131 @@ Mac binary another 60 KB → cap drops 1.30 → 1.25 MB). Loosening a ceiling requires a CHANGELOG entry naming the regression source so the slack is intentional and visible. + +## D138: Shared `LoopInfo` analysis layer for the JIT pipeline + +**Status**: Accepted — landed via `develop/w54-loop-info` (PR #91). + +**Context**: For a long stretch we treated each JIT optimisation as a +self-contained patch — the SIMD `emitLoopPreHeader`, the +const-divisor magic-multiply fold, the adjacent-MOV `copyPropagate`, +the `vm_ptr_cached` / `inst_ptr_cached` slots in `vregToPhys`. Each +re-derived its own slice of "what's a loop" and "is this vreg dead" +inline at codegen time. That ad-hoc layout was the proximate cause +of the W54 magic-hoist abandonment on +`develop/w54-magic-hoist-attempt` (2026-04-29 evening, see +`.dev/w54-investigation.md`): x21 was simultaneously the inst_ptr +cache slot for `reg_count <= 13 && has_self_call` AND the natural +callee-saved candidate for the magic. Picking a safe boundary was a +design call, not a tail-end commit. + +**Decision**: Ship the substrate first. 
One structural change that +stands on its own merit and unblocks future loop-aware work: + +`src/loop_info.zig` is the single source of truth for the function's +control-flow shape and per-vreg liveness. The two JIT backends used +to maintain byte-for-byte identical `scanBranchTargets` +implementations; both now consume `LoopInfo.analyse(allocator, ir, +reg_count)`, which produces: + +- `branch_targets[]`, `loop_headers[]`, `loop_end[]` (drives JIT + cache eviction and the `known_consts` wipe at merge points). +- `vreg_first_def[]`, `vreg_last_use[]` (one forward sweep, + conservative reads; over-approximation extends last_use later + than necessary, which only shrinks a future coalescer's window; + never breaks correctness). Phase 5+ consumers will read these. + +The opcode classification helpers `opWritesRd`, `opUsesRdAsSource`, +`opUsesRs1AsSource`, `opUsesRs2AsSource` live in `loop_info.zig` +for now (private); they will be promoted to public regalloc API +once the coalescer extension that needs them is debugged on x86_64 +(see W54-coalescer in checklist.md). + +**Effect (ARM64 Mac)**: + +- Both backends drop ~60 lines of duplicated `scanBranchTargets` + body in favour of a thin `LoopInfo.analyse` call. Behaviour is + byte-identical to main: `--dump-jit=24` of + `tgo_string_ops` func#24 (digitCount, 196 ARM64 instrs / 784 + bytes) matches main bit-for-bit. +- All other functions emit byte-identical machine code. No + performance change is expected or observed (Phase 0 + 1 are pure + refactoring; the data is computed but no codegen consumer reads + the new `vreg_first_def[]` / `vreg_last_use[]` arrays yet). + +**Rejected alternatives** (and why they didn't ride along): + +- **Liveness-driven mov coalescing in `regalloc.copyPropagate`**. 
+ Implemented and shipped on the develop branch, then **reverted + on 2026-04-30** after the green Mac gate but RED Linux x86_64 CI + on `go_math_big`: the new "stop at first redef of old_reg" scan + with branch-target / forward-branch / multi-source bail-outs + passes the Mac aarch64 realworld suite (50/50, including + `rust_regex` which the first attempt broke), but produces wrong + results on Linux x86_64's `go_math_big` (BigInt subtraction + mismatch — wasmtime returns + `864197532086419753208641975320`, zwasm returns + `864197532160206729503480181784`). The regalloc itself is + arch-agnostic, so the same `RegFunc` flows through both + backends; the divergence implies an x86_64-specific assumption + in `src/x86.zig`'s codegen that the new IR layout violates. + Diagnosis is bench/CI-bound and not a tail-end fix; tracked as + W54-coalescer for a focused follow-up. + +- **Magic-constant loop-invariant hoist** (`OP_CONST32 K → + OP_DIV_U` pattern, materialise the magic into a callee-saved + register in the prologue, short-circuit + `tryEmitDivByConstU32`). Implemented on + `develop/w54-loop-pass-redesign` (commits `1600397`, `c4b806e`) + and proved out: digitCount JIT 196 → 192 with hoist alone, 192 + → 185 stacked with the (eventually reverted) coalescer. Held + back from this PR for three reasons: + 1. The runtime gain is below the bench σ floor today; without + the W47 harness work the optimisation would land + evidence-free and any later regression would be argued as + noise rather than measured. + 2. The hoist requires displacing `inst_ptr_cached` (x21) on + functions with reg_count >= 5 + has_self_call — an + ARM64-specific behaviour change with no measured benefit + today. Pushing it post-harness keeps the trade-off + reviewable. + 3. x86_64 parity has different free-slot mechanics (no + `inst_ptr_cached` to displace). Bundling hoist with parity + makes one coherent change later. + +- **Loop-invariant `known_consts` survival across loop headers**. 
+ Sketched as Phase 4 of the original plan; dropped after + inspection of digitCount's RegIR showed every `i32.div_u 10` site + has its CONST32 emitted *inside* the loop body (TinyGo reuses + the same vreg `r8` / `r9` / `r12` per div), so the optimisation + would never fire on the W54 target. + +**Affected files**: `src/loop_info.zig` (new), `src/jit.zig` +(replace `scanBranchTargets` with `LoopInfo.analyse`), `src/x86.zig` +(same). + +**Archive**: the magic-hoist + coalescer work is preserved on +`develop/w54-loop-pass-redesign` (last commit `a56d442`) and tagged +`archive/w54-magic-hoist-2026-04-30`. Cherry-pick path: +- `1600397` + `c4b806e` for the ARM64 magic hoist (re-attempt + W54-hoist-revisit). +- `ec8182f` for the redef-aware coalescer (re-attempt W54-coalescer + after diagnosing the x86_64 `go_math_big` regression). + +**Re-evaluation pre-conditions**: +1. **W47** — bench harness with σ < 5% on tgo_strops (currently ~10%) +2. **W54-x86** — symmetric `pickHoistPhys` for x86_64 reg layout +3. **W54-coalescer** — diagnose and fix the x86_64 `go_math_big` + divergence; the diff is in the regalloc-stage IR shape, the + backend assumption that breaks is in `src/x86.zig`. +4. `runtime_comparison.yaml` re-recorded with 5/3 hyperfine on a + thermally-stable rig. + +**Follow-ups** (open W## items in checklist.md): +- `W54-coalescer`: diagnose the x86_64 `go_math_big` regression, + re-land the coalescer. +- `W54-hoist-revisit`: revive the magic-hoist work once W47 + + W54-x86 are ready. +- `W54-libm`: `rw_c_math` is dominated by libm `sin`/`cos`/`pow` + dispatch; intrinsic recognition + ARM64 FSQRT inline + soft-libm + fallback. diff --git a/.dev/memo.md b/.dev/memo.md index 0f5a66bb0..e0505dc59 100644 --- a/.dev/memo.md +++ b/.dev/memo.md @@ -23,7 +23,41 @@ Session handover document. Read at session start. ## Current Task -**W53 done. 
C-g foundation + Mac/Ubuntu baselines done.** Ship-overnight +**W54 substrate landed via PR #91 from `develop/w54-loop-info`** (2026-04-30). +Single structural change: `src/loop_info.zig` is the single source of +truth for branch / loop / vreg liveness. Both backends drop ~60 lines +of byte-identical `scanBranchTargets` in favour of a thin +`LoopInfo.analyse(...)` call. `vreg_first_def[]` / +`vreg_last_use[]` are computed from the same forward sweep, ready +for future consumers. JIT output is byte-identical to main on every +function (verified via `--dump-jit` diff for tgo_string_ops func#24 +and fib func#2). + +### Held back (archive branch) + +`develop/w54-loop-pass-redesign` (tagged +`archive/w54-magic-hoist-2026-04-30`) preserves two further pieces +of work that were built and bench-validated, but held back: + +1. **Magic-constant loop-invariant hoist** (commits `1600397`, + `c4b806e`). digitCount JIT 196 → 192. Re-attempt prerequisites: + W47 (bench harness σ < 5%), W54-x86 (parity). +2. **Liveness-driven mov coalescing** (commit `ec8182f`). digitCount + JIT 192 → 185 stacked on hoist; substrate-only branch JIT 196 → + 189 with just the coalescer. **Reverted from PR #91 after + Linux x86_64 CI failed `go_math_big`** (BigInt subtraction + divergence — wasmtime returns `864197532086419753208641975320`, + zwasm returns `864197532160206729503480181784`). The regalloc + is arch-agnostic, so the divergence is in `src/x86.zig`'s + codegen interaction with the new IR layout. Reproducible on + OrbStack `my-ubuntu-amd64`. Tracked as W54-coalescer. + +Architecture rationale: D138 in decisions.md. Detailed session arc ++ branch names: `.dev/w54-redesign-postmortem.md`. 
+ +### Previous (still on main) + +**C-g foundation + Mac/Ubuntu baselines done.** Ship-overnight session 2026-04-29 evening landed two PRs to main on top of the afternoon's six (#79..#84): diff --git a/.dev/w54-redesign-postmortem.md b/.dev/w54-redesign-postmortem.md new file mode 100644 index 000000000..6600835c6 --- /dev/null +++ b/.dev/w54-redesign-postmortem.md @@ -0,0 +1,219 @@ +# W54 redesign — what shipped, what archived, what to do next + +Captured 2026-04-30 after the deep redesign session, updated after +PR #91's first CI surfaced an x86_64-only regression on the +coalescer. + +## TL;DR + +- **Shipped** (`develop/w54-loop-info` → main, PR #91): the LoopInfo + substrate only. `src/loop_info.zig` shared analysis layer with + branch_targets / loop_headers / loop_end + per-vreg first_def / + last_use. Both backends consume it instead of duplicating + `scanBranchTargets`. Behaviour byte-identical to main on every + benchmark we dump-jit'd. +- **Archived** (`develop/w54-loop-pass-redesign`, tagged + `archive/w54-magic-hoist-2026-04-30`): + - Magic-constant loop-invariant hoist with `inst_ptr_cached` + displacement. digitCount JIT 196 → 192. Held back pending + W47 + W54-x86. + - Liveness-driven mov coalescing extension to + `regalloc.copyPropagate`. digitCount JIT 196 → 189. Reverted + from the substrate PR after Linux x86_64 CI flagged a + `go_math_big` BigInt divergence; tracked as W54-coalescer. +- **Dropped**: Phase 4 ("loop-invariant `known_consts` survival + across loop headers"). The W54 target — digitCount — emits + CONST32 *inside* the loop body for every divisor site, so the + optimisation never fires on it. Re-evaluate when a benchmark + with the defined-outside-loop pattern shows up. + +## Session arc + +The starting point: PR #90 captured the W54 investigation +(`.dev/w54-investigation.md`) which disproved the original framing +("zwasm doesn't fold i32.div_u K"). 
zwasm already emits the +Hacker's Delight magic-multiply for constant divisors on both +arches; the 2.4× wasmtime gap on `tgo_strops` lives in two places +— magic constants re-loaded every iter, and TinyGo's mov-heavy +`local.set` chains. + +The first attempt at the magic hoist +(`develop/w54-magic-hoist-attempt`, abandoned the same evening) +hit a register collision: x21 was simultaneously the inst_ptr cache +for `reg_count <= 13 && has_self_call` AND the natural callee-saved +candidate for the magic. The investigation captured in +`.dev/w54-investigation.md` concluded that picking a safe boundary +was a design call, not a tail-end commit on a long autonomous run. + +The redesign session (this one, 2026-04-29 → 2026-04-30) did the +design work the abandoned attempt avoided: + +1. Built the substrate (LoopInfo + opcode helpers + liveness data). +2. Built the magic hoist on top, with `pickHoistPhys` that + displaces inst_ptr_cached when needed. +3. Built the liveness-driven coalescer. +4. Discovered via Mac bench that the runtime gain of (2)+(3) is + below the σ ≈ 10% noise floor on tgo_strops (W47). +5. Reduced scope to just the substrate (1) for the PR. Tested. + Mac green. Pushed. +6. Linux x86_64 CI flagged `go_math_big` regression on the + coalescer (3). Reproduced on OrbStack `my-ubuntu-amd64`. +7. Reverted (3) from the PR. Re-pushed substrate-only. +8. Mac aarch64 native testing of (3) had passed; the bug is + x86-specific. Tracked as W54-coalescer for diagnosis. 
+ +## Branches and commits + +``` +main (pre-redesign) + └── develop/w54-magic-hoist-attempt abandoned 2026-04-29 evening + │ reason: x21 register collision, deferred for daylight design + │ + └── develop/w54-loop-pass-redesign archived 2026-04-30 + │ tag: archive/w54-magic-hoist-2026-04-30 + │ contents (7 commits): + │ dd450f5 redesign plan (.dev/w54-redesign-plan.md) + │ b65477a Phase 0 scanBranchTargets → LoopInfo + │ 98287ae Phase 1 vreg liveness on LoopInfo + │ 1600397 Phase 2 hoist_phys / hoist_displaced_inst_ptr scaffold + │ c4b806e Phase 3 ARM64 magic-constant hoist + │ ec8182f Phase 5 liveness-driven mov coalescing + │ (Mac green, x86_64 fails go_math_big) + │ a56d442 Phase 6 D138 + checklist + memo + bench record + │ + └── develop/w54-loop-info shipped 2026-04-30 (PR #91) + contents (3 commits, cherry-picked from the archive): + ee10661 Phase 0 scanBranchTargets → LoopInfo + ac2d446 Phase 1 vreg liveness on LoopInfo + D138 + checklist + memo + this postmortem +``` + +## Why the coalescer was reverted from PR #91 + +The Phase 5 commit (`ec8182f`) extended `regalloc.copyPropagate` to +fold a temp-to-local MOV when the temp is killed (redefined) before +any later read — an O(N) bounded scan that stops at the first +redefinition of `old_reg`, with bail-outs for branch targets, +forward branches, and multi-source ops. + +On Mac aarch64 this passed: +- 412/412 unit tests +- spec / e2e / FFI / minimal builds +- 50/50 realworld (including `rust_regex` which the first attempt + broke — the forward-branch bail caught that case) + +On Linux x86_64 CI it failed: +- realworld 49/50: `go_math_big` DIFF +- wasmtime: `diff: 864197532086419753208641975320` +- zwasm: `diff: 864197532160206729503480181784` + +Reproduced on OrbStack `my-ubuntu-amd64` (native x86_64, not +Rosetta). This means the coalesced `RegFunc` (which is identical +across both backends — regalloc is arch-agnostic) gets correctly +emitted on ARM64 but mis-emitted on x86_64. 
The bug is in +`src/x86.zig`'s codegen interaction with the new IR layout (fewer +MOVs, shifted PCs). + +This rules out "the coalescer is wrong": Mac aarch64 passes 50/50 +on the same `RegFunc`. It points at an x86-specific assumption +the new layout violates: likely a getOrLoad / SCRATCH contention +or a spill-around-call sequence whose timing depends on a MOV +that the new coalescer eliminates. + +Diagnosis path (W54-coalescer): +1. `--dump-regir` for `go_math_big`'s offending function on both + the coalescer branch and main; identify the first MOV that the + new fold removed. +2. `--dump-jit=...` for that function on x86_64 main vs branch; + find the codegen difference. +3. Check that x86's getOrLoad caching, scratch_vreg invalidation + on UMULL, and call-site reload loops correctly handle the new + IR shape. +4. Add a regression test (the failing IR pattern, ideally a + minimal wat). + +## Why the hoist was held back from PR #91 + +The Phase 2 + Phase 3 commits (`1600397` + `c4b806e`) implement the magic +hoist. ARM64 dump-jit shows the win: digitCount 196 → 192 with +hoist alone. Stacked with the (now-reverted) coalescer: 192 → +185. + +Three reasons it didn't ride along with the substrate: + +1. **W47**: the bench σ on `tgo_strops` is ~10%. The hoist's + wall-clock effect is below the noise floor. Without harness + improvement the win is unfalsifiable; landing it now would mean + any later regression is argued as noise rather than measured. + +2. **`inst_ptr_cached` displacement**: when no callee-saved slot + is free (digitCount has reg_count=13 + self_call which + saturates), the hoist takes x21 from the inst_ptr cache. Every + `emitLoadInstPtr` site becomes a memory load. ARM64-specific + behaviour change, no measured benefit today. + +3. **W54-x86**: x86_64 has different free-slot mechanics. Land + ARM64 alone and the next x86 PR has to reconcile two arches. + Bundling makes one coherent change later. 
+ +When W47 + W54-x86 + W54-coalescer are all green, the path is: +checkout `archive/w54-magic-hoist-2026-04-30`, cherry-pick +`1600397` + `c4b806e` (hoist) + `ec8182f` (coalescer, after +diagnosing the go_math_big regression). + +## Lessons / signals to remember + +- **Linux x86_64 CI is irreplaceable for arch-asymmetric + regressions.** Mac aarch64 + OrbStack x86_64 (Rosetta) both + pass; only the GitHub-hosted native x86_64 runner caught + go_math_big. OrbStack's "amd64" via Rosetta is x86-emulated on + ARM Mac and somehow doesn't trigger the same codegen path the + CI runner does. **The Mac-only "Mac green ⇒ ship" heuristic + is unsafe** — Linux CI is non-redundant. + + Update 2026-04-30: confirmed reproducible on OrbStack with a + fresh build (`zig build -Doptimize=ReleaseSafe` in the VM, not + cross-compiled from Mac). The earlier "OrbStack passes" reading + was a stale Mach-O binary that wasn't actually executed — + OrbStack Linux can't run aarch64-darwin Mach-O, so the test + fell through to wasmtime's output. + +- **Regalloc-stage IR changes are arch-agnostic, but JIT + consumption isn't.** A new `RegFunc` shape that's correct by + construction can still expose existing backend bugs (or + undocumented backend assumptions). Both backends need to be + exercised before claiming a regalloc-stage refactor is + behaviour-neutral. + +- **Bench σ ≈ 10% on `tgo_strops`** (W47) is the gating + constraint for measuring small JIT optimisations. Until W47, + sub-10% wins are unfalsifiable. + +- **Forward branches are the safety boundary for redef-stop + coalescing.** The `rust_regex` `/h.l+o/ ~ "hallo"` failure was + exactly the "branch over a redef" pattern — without dominator + info, every forward branch in the scan window has to be a bail. + The x86_64 `go_math_big` failure is a different class — same + RegFunc, but the x86 backend mis-emits. + +- **Phase 4 (invariant `known_consts` across loop headers) does + not fire on the W54 target**. 
digitCount's CONST32 is reborn + per iteration. Verify-on-RegIR is cheaper than + implement-and-bench. + +- **`develop/w54-magic-hoist-attempt` was right to defer**. The + collision class (x21 = inst_ptr_cache vs hoist) was a missing + layer in the JIT. The substrate added the layer; the + consequential optimisations stack on top. + +## Pointers + +- Architecture: D138 (`.dev/decisions.md`). +- Investigation log: `.dev/w54-investigation.md` (in main, PR #90). +- Original plan: `.dev/w54-redesign-plan.md` (on the archive + branch only — the shipped scope is much narrower than the plan). +- Bench harness work: `W47` in `.dev/checklist.md`. +- Coalescer re-attempt: `W54-coalescer` in `.dev/checklist.md`. + Diagnose via `--dump-regir` / `--dump-jit` on x86_64 first. +- Hoist re-attempt: `W54-hoist-revisit` + `W54-x86`. Cherry-pick + `1600397` + `c4b806e` from `archive/w54-magic-hoist-2026-04-30`. diff --git a/src/jit.zig b/src/jit.zig index 5be48b801..97a880373 100644 --- a/src/jit.zig +++ b/src/jit.zig @@ -34,6 +34,8 @@ const WasmMemory = @import("memory.zig").Memory; const trace_mod = @import("trace.zig"); const predecode_mod = @import("predecode.zig"); const platform = @import("platform.zig"); +const loop_info_mod = @import("loop_info.zig"); +const LoopInfo = loop_info_mod.LoopInfo; /// JIT-compiled function pointer type. /// Args: regs_ptr, vm_ptr, instance_ptr. @@ -1154,13 +1156,11 @@ pub const Compiler = struct { self_call_entry_idx: u32, /// Saved fast-path pattern from emitBaseCaseFastPath for duplication at self-call entry. fast_path_info: ?FastPathInfo, - /// IR slice and branch targets for peephole fusion (set during compile). + /// IR slice for peephole fusion (set during compile). ir_slice: []const RegInstr = &.{}, - branch_targets_slice: []bool = &.{}, - /// Loop header markers: true for PCs that are targets of backward branches. - loop_headers_slice: []bool = &.{}, - /// For each loop header PC, the max back-edge source PC (defines loop body range). 
- loop_end_map: []u32 = &.{}, + /// Branch / loop / liveness analysis owned by the compile, freed at end. + /// Populated by `LoopInfo.analyse` at the start of `compileMain`. + loop_info: LoopInfo = .{}, const FastPathInfo = struct { param_offset: u16, @@ -2431,76 +2431,10 @@ pub const Compiler = struct { return false; } - /// Pre-scan IR to find all branch targets (PCs that can be jumped to). - /// Also populates loop_headers_slice and loop_end_map for SIMD loop persistence. - fn scanBranchTargets(self: *Compiler, ir: []const RegInstr) ?[]bool { - const targets = self.alloc.alloc(bool, ir.len) catch return null; - @memset(targets, false); - const loop_headers = self.alloc.alloc(bool, ir.len) catch { - self.alloc.free(targets); - return null; - }; - @memset(loop_headers, false); - const loop_end = self.alloc.alloc(u32, ir.len) catch { - self.alloc.free(loop_headers); - self.alloc.free(targets); - return null; - }; - @memset(loop_end, 0); - - var scan_pc: u32 = 0; - while (scan_pc < ir.len) { - const instr = ir[scan_pc]; - const source_pc = scan_pc; - scan_pc += 1; - switch (instr.op) { - regalloc_mod.OP_BR => { - if (instr.operand < ir.len) { - targets[instr.operand] = true; - // Back-edge: target < source → loop header - if (instr.operand <= source_pc) { - loop_headers[instr.operand] = true; - if (source_pc > loop_end[instr.operand]) - loop_end[instr.operand] = source_pc; - } - } - }, - regalloc_mod.OP_BR_IF, regalloc_mod.OP_BR_IF_NOT => { - if (instr.operand < ir.len) { - targets[instr.operand] = true; - if (instr.operand <= source_pc) { - loop_headers[instr.operand] = true; - if (source_pc > loop_end[instr.operand]) - loop_end[instr.operand] = source_pc; - } - } - }, - regalloc_mod.OP_BR_TABLE => { - const count = instr.operand; - var i: u32 = 0; - while (i < count + 1 and scan_pc < ir.len) : (i += 1) { - const entry = ir[scan_pc]; - scan_pc += 1; - if (entry.operand < ir.len) { - targets[entry.operand] = true; - if (entry.operand <= source_pc) { - 
loop_headers[entry.operand] = true; - if (source_pc > loop_end[entry.operand]) - loop_end[entry.operand] = source_pc; - } - } - } - }, - regalloc_mod.OP_BLOCK_END => { - targets[scan_pc - 1] = true; - }, - else => {}, - } - } - self.loop_headers_slice = loop_headers; - self.loop_end_map = loop_end; - return targets; - } + // scanBranchTargets has moved to src/loop_info.zig — both backends now + // call self.loop_info.analyse() in compileMain. The body lives there so + // the analysis can be enriched (liveness, invariant-const classification) + // without diverging the two arches. fn isControlFlowOp(_: *const Compiler, op: u16) bool { return switch (op) { @@ -2606,15 +2540,13 @@ pub const Compiler = struct { // Pre-allocate pc_map indexed by RegInstr PC (not loop iteration) self.pc_map.appendNTimes(self.alloc, 0, ir.len + 1) catch return null; - // Pre-scan: find branch targets for known_consts invalidation - const branch_targets = self.scanBranchTargets(ir) orelse return null; - defer self.alloc.free(branch_targets); - defer self.alloc.free(self.loop_headers_slice); - defer self.alloc.free(self.loop_end_map); + // Pre-scan: branch targets, loop headers, loop body extents. + if (!self.loop_info.analyse(self.alloc, ir, self.reg_count)) return null; + defer self.loop_info.deinit(self.alloc); + const branch_targets = self.loop_info.branch_targets; - // Store IR and branch targets for peephole fusion + // Store IR for peephole fusion (loop_info is read directly via self). 
self.ir_slice = ir; - self.branch_targets_slice = branch_targets; // Detect SIMD presence for v128 sync in MOV/CONST for (ir) |scan_instr| { @@ -2658,7 +2590,7 @@ pub const Compiler = struct { if (pc < branch_targets.len and branch_targets[pc]) { self.known_consts = .{null} ** 128; self.scratch_vreg = null; - if (pc < self.loop_headers_slice.len and self.loop_headers_slice[pc]) { + if (pc < self.loop_info.loop_headers.len and self.loop_info.loop_headers[pc]) { // Loop header: emit pre-loads BEFORE pc_map (first iteration only), // then keep Q-reg cache alive. Back-edges jump to pc_map (after pre-loads). self.fpCacheEvictAll(); @@ -2774,7 +2706,7 @@ pub const Compiler = struct { regalloc_mod.OP_BR => { const target = instr.operand; const is_back_edge = target <= pc.* - 1; - if (is_back_edge and target < self.loop_headers_slice.len and self.loop_headers_slice[target]) { + if (is_back_edge and target < self.loop_info.loop_headers.len and self.loop_info.loop_headers[target]) { // Back-edge to loop header: flush Q-regs (keep cache) for deopt safety self.fpCacheEvictAll(); self.simdQregFlushAll(); @@ -3453,7 +3385,7 @@ pub const Compiler = struct { if (next.op != regalloc_mod.OP_BR_IF and next.op != regalloc_mod.OP_BR_IF_NOT) return false; if (next.rd != rd) return false; // Don't fuse if the BR_IF is a branch target (merge point) - if (pc.* < self.branch_targets_slice.len and self.branch_targets_slice[pc.*]) return false; + if (pc.* < self.loop_info.branch_targets.len and self.loop_info.branch_targets[pc.*]) return false; // Fuse: emit B.cond instead of CSET + CBNZ/CBZ self.evictAllCaches(); @@ -4602,7 +4534,7 @@ pub const Compiler = struct { /// Called BEFORE recording pc_map so back-edges skip pre-loads (only first iteration loads). /// Sets up Q-reg cache entries so the loop body finds inputs already cached. 
fn emitLoopPreHeader(self: *Compiler, ir: []const RegInstr, header_pc: u32) void { - const end_pc = if (header_pc < self.loop_end_map.len) self.loop_end_map[header_pc] else return; + const end_pc = if (header_pc < self.loop_info.loop_end.len) self.loop_info.loop_end[header_pc] else return; if (end_pc == 0) return; // Scan loop body to find v128 vregs that are read (as SIMD op inputs) diff --git a/src/loop_info.zig b/src/loop_info.zig new file mode 100644 index 000000000..b15934dc3 --- /dev/null +++ b/src/loop_info.zig @@ -0,0 +1,448 @@ +//! Shared loop / branch / liveness analysis for the JIT pipeline. +//! +//! Owns the data structures that describe a function's control-flow +//! shape (where the branch targets are, which PCs are loop headers, +//! how long each loop body extends) plus per-vreg liveness (first +//! definition, last use). Phase 4+ extends this with classification of +//! loop-invariant constants. +//! +//! Both JIT backends consume the same `LoopInfo` instead of running +//! their own pre-scans. Cost: one forward sweep over the RegInstr +//! stream per compile (control-flow sweep + liveness sweep are fused). + +const std = @import("std"); +const regalloc = @import("regalloc.zig"); + +const RegInstr = regalloc.RegInstr; + +/// Sentinel used by `vreg_first_def[v]` to mean "vreg v is never written +/// inside the function body". Callers wishing to ask "is v defined before +/// PC X?" should treat NEVER_DEFINED as "no, not defined here". Params +/// and locals are conceptually defined at function entry but their +/// definition is implicit (no RegInstr writes them) — Phase 4 handles +/// that distinction by also treating `v < local_count` as defined-before-loop. +pub const NEVER_DEFINED: u32 = std.math.maxInt(u32); + +pub const LoopInfo = struct { + /// branch_targets[pc] = true iff some control-flow op (BR, BR_IF, + /// BR_IF_NOT, BR_TABLE, BLOCK_END) targets this PC. Drives JIT + /// cache eviction and the known_consts wipe. 
+ branch_targets: []bool = &.{}, + + /// loop_headers[pc] = true iff `pc` is the target of a backward + /// branch (i.e. a loop entry). + loop_headers: []bool = &.{}, + + /// loop_end[header_pc] = max source PC of any back-edge into + /// header_pc. Defines the inclusive range `[header_pc, loop_end]` + /// that the loop body covers. 0 for non-headers. + loop_end: []u32 = &.{}, + + /// vreg_first_def[v] = PC of the first RegInstr that writes vreg v, + /// or `NEVER_DEFINED` if v is never assigned by any instruction in + /// this function body. "Write" here is dataflow-correct: stores + /// (0x36..0x3E), conditional branches (BR_IF / BR_IF_NOT) and + /// RETURN treat rd as a SOURCE, not a destination, and are ignored. + vreg_first_def: []u32 = &.{}, + + /// vreg_last_use[v] = PC of the last RegInstr that reads vreg v + /// (rs1, rs2_field, or rd-as-source for stores / conditional + /// branches / RETURN). 0 if v is never read. + /// Conservative: opcodes that don't actually consume rs1/rs2 (BR, + /// CONST32, CONST64, BLOCK_END, NOP, DELETED) are excluded; + /// everything else treats both rs1 and rs2_field as a read. The + /// over-approximation extends last_use later than necessary, which + /// only shrinks the coalescing window in Phase 5 — safe by design. + vreg_last_use: []u32 = &.{}, + + /// Number of vregs the liveness arrays cover. Equals reg_func.reg_count + /// at analyse() time. Used for bounds checks in callers. + vreg_count: u32 = 0, + + /// Free all owned slices. Safe to call on a default-initialized + /// (empty) LoopInfo. 
+ pub fn deinit(self: *LoopInfo, alloc: std.mem.Allocator) void { + if (self.branch_targets.len > 0) alloc.free(self.branch_targets); + if (self.loop_headers.len > 0) alloc.free(self.loop_headers); + if (self.loop_end.len > 0) alloc.free(self.loop_end); + if (self.vreg_first_def.len > 0) alloc.free(self.vreg_first_def); + if (self.vreg_last_use.len > 0) alloc.free(self.vreg_last_use); + self.* = .{}; + } + + /// Single forward sweep populating branch_targets / loop_headers / + /// loop_end and per-vreg first_def / last_use. Returns false on + /// allocation failure (caller treats the JIT compile as a bail). + pub fn analyse( + self: *LoopInfo, + alloc: std.mem.Allocator, + ir: []const RegInstr, + reg_count: u32, + ) bool { + const targets = alloc.alloc(bool, ir.len) catch return false; + @memset(targets, false); + const loop_headers = alloc.alloc(bool, ir.len) catch { + alloc.free(targets); + return false; + }; + @memset(loop_headers, false); + const loop_end = alloc.alloc(u32, ir.len) catch { + alloc.free(loop_headers); + alloc.free(targets); + return false; + }; + @memset(loop_end, 0); + + const first_def = alloc.alloc(u32, reg_count) catch { + alloc.free(loop_end); + alloc.free(loop_headers); + alloc.free(targets); + return false; + }; + @memset(first_def, NEVER_DEFINED); + const last_use = alloc.alloc(u32, reg_count) catch { + alloc.free(first_def); + alloc.free(loop_end); + alloc.free(loop_headers); + alloc.free(targets); + return false; + }; + @memset(last_use, 0); + + var scan_pc: u32 = 0; + while (scan_pc < ir.len) { + const instr = ir[scan_pc]; + const source_pc = scan_pc; + scan_pc += 1; + + // --- Control-flow shape --- + switch (instr.op) { + regalloc.OP_BR => recordTarget(targets, loop_headers, loop_end, instr.operand, source_pc, ir.len), + regalloc.OP_BR_IF, regalloc.OP_BR_IF_NOT => recordTarget( + targets, + loop_headers, + loop_end, + instr.operand, + source_pc, + ir.len, + ), + regalloc.OP_BR_TABLE => { + const count = instr.operand; + var i: u32 = 
0; + while (i < count + 1 and scan_pc < ir.len) : (i += 1) { + const entry = ir[scan_pc]; + scan_pc += 1; + recordTarget(targets, loop_headers, loop_end, entry.operand, source_pc, ir.len); + // BR_TABLE follow-up NOPs participate in liveness too: + // their operand is a target PC, not a vreg, so we skip + // their rs1/rs2_field/rd entries. + } + }, + regalloc.OP_BLOCK_END => { + targets[scan_pc - 1] = true; + }, + else => {}, + } + + // --- Liveness --- + // + // Update last_use BEFORE first_def so a `mov rd = rs1` that + // happens to have rd == rs1 (degenerate, but legal) records + // the read at PC and the write at PC. That's correct: the + // value is read AT this PC and written AT this PC. + + if (opUsesRdAsSource(instr.op)) { + if (instr.rd < reg_count) last_use[instr.rd] = source_pc; + } + if (opUsesRs1AsSource(instr.op)) { + if (instr.rs1 < reg_count) last_use[instr.rs1] = source_pc; + } + if (opUsesRs2AsSource(instr.op)) { + const r2 = instr.rs2(); + if (r2 < reg_count) last_use[r2] = source_pc; + } + // Multi-source ops (CALL / CALL_INDIRECT / RETURN_MULTI / + // memory.fill / memory.copy) read additional vregs that live + // in the operand field as a count + following NOP slots, or + // in special positions. Phase 1 conservatively treats them + // via the rs1/rs2 fields above (over-approximation only loses + // optimization in Phase 5; never hurts correctness). 
+ + if (opWritesRd(instr.op)) { + if (instr.rd < reg_count and first_def[instr.rd] == NEVER_DEFINED) { + first_def[instr.rd] = source_pc; + } + } + } + + self.* = .{ + .branch_targets = targets, + .loop_headers = loop_headers, + .loop_end = loop_end, + .vreg_first_def = first_def, + .vreg_last_use = last_use, + .vreg_count = reg_count, + }; + return true; + } +}; + +fn recordTarget( + targets: []bool, + loop_headers: []bool, + loop_end: []u32, + target_pc: u32, + source_pc: u32, + ir_len: usize, +) void { + if (target_pc >= ir_len) return; + targets[target_pc] = true; + if (target_pc <= source_pc) { + loop_headers[target_pc] = true; + if (source_pc > loop_end[target_pc]) { + loop_end[target_pc] = source_pc; + } + } +} + +/// True iff this opcode writes a fresh value into rd (the destination +/// vreg). Stores treat rd as a value source; conditional branches +/// (BR_IF / BR_IF_NOT / RETURN) treat rd as the read condition or +/// returned value; control-flow ops without vregs return false. +fn opWritesRd(op: u16) bool { + return switch (op) { + regalloc.OP_BR, + regalloc.OP_BR_IF, + regalloc.OP_BR_IF_NOT, + regalloc.OP_BR_TABLE, + regalloc.OP_BLOCK_END, + regalloc.OP_NOP, + regalloc.OP_DELETED, + regalloc.OP_RETURN, + regalloc.OP_RETURN_VOID, + regalloc.OP_RETURN_MULTI, + regalloc.OP_MEMORY_FILL, + regalloc.OP_MEMORY_COPY, + // Wasm stores: rd is the value source, rs1 is the address. + 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x3E => false, + else => true, + }; +} + +/// True iff this opcode treats rd as a SOURCE (read) rather than a +/// destination. The set is symmetric with `!opWritesRd` for the cases +/// where rd carries a vreg reference at all — control-flow ops with no +/// vreg use return false in both predicates. +fn opUsesRdAsSource(op: u16) bool { + return switch (op) { + regalloc.OP_BR_IF, + regalloc.OP_BR_IF_NOT, + regalloc.OP_RETURN, + regalloc.OP_RETURN_MULTI, + // Wasm stores: rd is the value source. 
+ 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x3E => true, + // memory.fill / memory.copy use rs1/rs2/operand-NOP entries — + // see fuzz_gen / vm.zig for the layout. Phase 1 doesn't model + // their additional sources; Phase 5 will not coalesce around + // them anyway, so the omission is harmless. + else => false, + }; +} + +/// True iff this opcode reads rs1 as a vreg. Default-true for most ops +/// (binary / unary / load / mov), false for control-flow / const / +/// no-vreg ops where rs1 is unused (and defaults to 0, which would +/// otherwise spuriously update last_use[0]). +fn opUsesRs1AsSource(op: u16) bool { + return switch (op) { + regalloc.OP_BR, + regalloc.OP_BR_IF, + regalloc.OP_BR_IF_NOT, + regalloc.OP_BLOCK_END, + regalloc.OP_NOP, + regalloc.OP_DELETED, + regalloc.OP_CONST32, + regalloc.OP_CONST64, + regalloc.OP_RETURN, + regalloc.OP_RETURN_VOID, + => false, + else => true, + }; +} + +/// True iff this opcode reads rs2_field as a vreg. Conservative: any +/// binop-shaped opcode might use it; ops that don't (unary, mov, load, +/// stores) over-approximate harmlessly. The hard exclusions are ops +/// where rs2_field is guaranteed unused so reading it would mark +/// vreg 0 as last-used spuriously. 
+fn opUsesRs2AsSource(op: u16) bool { + return switch (op) { + regalloc.OP_BR, + regalloc.OP_BR_IF, + regalloc.OP_BR_IF_NOT, + regalloc.OP_BLOCK_END, + regalloc.OP_NOP, + regalloc.OP_DELETED, + regalloc.OP_CONST32, + regalloc.OP_CONST64, + regalloc.OP_RETURN, + regalloc.OP_RETURN_VOID, + regalloc.OP_RETURN_MULTI, + regalloc.OP_MOV, + regalloc.OP_BR_TABLE, + => false, + else => true, + }; +} + +const testing = std.testing; + +test "LoopInfo: empty IR yields empty slices" { + var info: LoopInfo = .{}; + defer info.deinit(testing.allocator); + try testing.expect(info.analyse(testing.allocator, &.{}, 0)); + try testing.expectEqual(@as(usize, 0), info.branch_targets.len); + try testing.expectEqual(@as(usize, 0), info.loop_headers.len); + try testing.expectEqual(@as(usize, 0), info.loop_end.len); + try testing.expectEqual(@as(usize, 0), info.vreg_first_def.len); + try testing.expectEqual(@as(usize, 0), info.vreg_last_use.len); +} + +test "LoopInfo: forward branch flagged, no loop header" { + const ir = [_]RegInstr{ + .{ .op = regalloc.OP_NOP, .rd = 0, .rs1 = 0, .operand = 0 }, + .{ .op = regalloc.OP_BR, .rd = 0, .rs1 = 0, .operand = 3 }, + .{ .op = regalloc.OP_NOP, .rd = 0, .rs1 = 0, .operand = 0 }, + .{ .op = regalloc.OP_NOP, .rd = 0, .rs1 = 0, .operand = 0 }, + }; + var info: LoopInfo = .{}; + defer info.deinit(testing.allocator); + try testing.expect(info.analyse(testing.allocator, &ir, 0)); + try testing.expect(info.branch_targets[3]); + try testing.expect(!info.loop_headers[3]); + try testing.expectEqual(@as(u32, 0), info.loop_end[3]); +} + +test "LoopInfo: backward branch is a loop header with end_pc" { + const ir = [_]RegInstr{ + .{ .op = regalloc.OP_NOP, .rd = 0, .rs1 = 0, .operand = 0 }, + .{ .op = regalloc.OP_NOP, .rd = 0, .rs1 = 0, .operand = 0 }, + .{ .op = regalloc.OP_BR_IF, .rd = 0, .rs1 = 0, .operand = 0 }, + }; + var info: LoopInfo = .{}; + defer info.deinit(testing.allocator); + try testing.expect(info.analyse(testing.allocator, &ir, 0)); + try 
testing.expect(info.branch_targets[0]); + try testing.expect(info.loop_headers[0]); + try testing.expectEqual(@as(u32, 2), info.loop_end[0]); +} + +test "LoopInfo: nested back-edges keep max end_pc" { + const ir = [_]RegInstr{ + .{ .op = regalloc.OP_NOP, .rd = 0, .rs1 = 0, .operand = 0 }, + .{ .op = regalloc.OP_NOP, .rd = 0, .rs1 = 0, .operand = 0 }, + .{ .op = regalloc.OP_BR, .rd = 0, .rs1 = 0, .operand = 0 }, + .{ .op = regalloc.OP_NOP, .rd = 0, .rs1 = 0, .operand = 0 }, + .{ .op = regalloc.OP_BR, .rd = 0, .rs1 = 0, .operand = 0 }, + }; + var info: LoopInfo = .{}; + defer info.deinit(testing.allocator); + try testing.expect(info.analyse(testing.allocator, &ir, 0)); + try testing.expectEqual(@as(u32, 4), info.loop_end[0]); +} + +test "LoopInfo: BLOCK_END marks the END pc itself as a target" { + const ir = [_]RegInstr{ + .{ .op = regalloc.OP_NOP, .rd = 0, .rs1 = 0, .operand = 0 }, + .{ .op = regalloc.OP_BLOCK_END, .rd = 0, .rs1 = 0, .operand = 0 }, + }; + var info: LoopInfo = .{}; + defer info.deinit(testing.allocator); + try testing.expect(info.analyse(testing.allocator, &ir, 0)); + try testing.expect(info.branch_targets[1]); + try testing.expect(!info.loop_headers[1]); +} + +test "LoopInfo: liveness — CONST32 writes rd, ADD reads rs1+rs2_field" { + // pc=0: const32 r2 = 42 + // pc=1: const32 r3 = 5 + // pc=2: i32.add r4 = r2 + r3 + // pc=3: return r4 + const ir = [_]RegInstr{ + .{ .op = regalloc.OP_CONST32, .rd = 2, .rs1 = 0, .operand = 42 }, + .{ .op = regalloc.OP_CONST32, .rd = 3, .rs1 = 0, .operand = 5 }, + .{ .op = 0x6A, .rd = 4, .rs1 = 2, .rs2_field = 3, .operand = 0 }, // i32.add + .{ .op = regalloc.OP_RETURN, .rd = 4, .rs1 = 0, .operand = 0 }, + }; + var info: LoopInfo = .{}; + defer info.deinit(testing.allocator); + try testing.expect(info.analyse(testing.allocator, &ir, 5)); + try testing.expectEqual(@as(u32, 5), info.vreg_count); + // r2 first defined at pc 0, last used at pc 2 + try testing.expectEqual(@as(u32, 0), info.vreg_first_def[2]); + try 
testing.expectEqual(@as(u32, 2), info.vreg_last_use[2]); + // r3 first defined at pc 1, last used at pc 2 + try testing.expectEqual(@as(u32, 1), info.vreg_first_def[3]); + try testing.expectEqual(@as(u32, 2), info.vreg_last_use[3]); + // r4 first defined at pc 2, last used at pc 3 (RETURN reads rd) + try testing.expectEqual(@as(u32, 2), info.vreg_first_def[4]); + try testing.expectEqual(@as(u32, 3), info.vreg_last_use[4]); + // r0 / r1 never touched + try testing.expectEqual(NEVER_DEFINED, info.vreg_first_def[0]); + try testing.expectEqual(@as(u32, 0), info.vreg_last_use[0]); + try testing.expectEqual(NEVER_DEFINED, info.vreg_first_def[1]); +} + +test "LoopInfo: liveness — store reads rd, does not write rd" { + // pc=0: const32 r2 = 100 (address) + // pc=1: const32 r3 = 7 (value) + // pc=2: i32.store rd=r3, rs1=r2 (op 0x36) + const ir = [_]RegInstr{ + .{ .op = regalloc.OP_CONST32, .rd = 2, .rs1 = 0, .operand = 100 }, + .{ .op = regalloc.OP_CONST32, .rd = 3, .rs1 = 0, .operand = 7 }, + .{ .op = 0x36, .rd = 3, .rs1 = 2, .operand = 0 }, // i32.store + }; + var info: LoopInfo = .{}; + defer info.deinit(testing.allocator); + try testing.expect(info.analyse(testing.allocator, &ir, 5)); + // The store does not redefine r3; first_def stays at pc=1 + try testing.expectEqual(@as(u32, 1), info.vreg_first_def[3]); + // Both r2 (address) and r3 (value) are read at pc=2 + try testing.expectEqual(@as(u32, 2), info.vreg_last_use[2]); + try testing.expectEqual(@as(u32, 2), info.vreg_last_use[3]); +} + +test "LoopInfo: liveness — BR_IF reads rd as condition" { + // pc=0: const32 r2 = 1 (condition) + // pc=1: br_if rd=r2 -> pc=3 + // pc=2: nop + // pc=3: nop + const ir = [_]RegInstr{ + .{ .op = regalloc.OP_CONST32, .rd = 2, .rs1 = 0, .operand = 1 }, + .{ .op = regalloc.OP_BR_IF, .rd = 2, .rs1 = 0, .operand = 3 }, + .{ .op = regalloc.OP_NOP, .rd = 0, .rs1 = 0, .operand = 0 }, + .{ .op = regalloc.OP_NOP, .rd = 0, .rs1 = 0, .operand = 0 }, + }; + var info: LoopInfo = .{}; + defer 
info.deinit(testing.allocator); + try testing.expect(info.analyse(testing.allocator, &ir, 4)); + // r2 first defined at pc=0, BR_IF reads it (does NOT redefine) at pc=1 + try testing.expectEqual(@as(u32, 0), info.vreg_first_def[2]); + try testing.expectEqual(@as(u32, 1), info.vreg_last_use[2]); +} + +test "LoopInfo: liveness — MOV does not over-read rs2 default 0" { + // pc=0: const32 r5 = 99 + // pc=1: mov r6 = r5 (rs1=5, rs2_field defaults to 0) + // We must NOT mark vreg 0 as last_used at pc=1 just because rs2 + // defaulted to 0. + const ir = [_]RegInstr{ + .{ .op = regalloc.OP_CONST32, .rd = 5, .rs1 = 0, .operand = 99 }, + .{ .op = regalloc.OP_MOV, .rd = 6, .rs1 = 5, .rs2_field = 0, .operand = 0 }, + }; + var info: LoopInfo = .{}; + defer info.deinit(testing.allocator); + try testing.expect(info.analyse(testing.allocator, &ir, 8)); + try testing.expectEqual(@as(u32, 1), info.vreg_last_use[5]); // read at pc=1 + try testing.expectEqual(@as(u32, 0), info.vreg_last_use[0]); // never read +} diff --git a/src/x86.zig b/src/x86.zig index 34fe606cd..7ef84ae19 100644 --- a/src/x86.zig +++ b/src/x86.zig @@ -44,6 +44,8 @@ const JitCode = jit_mod.JitCode; const JitFn = jit_mod.JitFn; const vm_mod = @import("vm.zig"); const platform = @import("platform.zig"); +const loop_info_mod = @import("loop_info.zig"); +const LoopInfo = loop_info_mod.LoopInfo; // ================================================================ // x86_64 register definitions @@ -1728,13 +1730,11 @@ pub const Compiler = struct { osr_target_pc: ?u32, /// Byte offset of the OSR prologue in the code buffer. osr_prologue_offset: u32, - /// IR slice and branch targets for peephole fusion (set during compile). + /// IR slice for peephole fusion (set during compile). ir_slice: []const RegInstr = &.{}, - branch_targets_slice: []bool = &.{}, - /// Loop header markers: true for PCs that are targets of backward branches. 
- loop_headers_slice: []bool = &.{}, - /// For each loop header PC, the max back-edge source PC (defines loop body range). - loop_end_map: []u32 = &.{}, + /// Branch / loop / liveness analysis owned by the compile, freed at end. + /// Populated by `LoopInfo.analyse` at the start of `compileMain`. + loop_info: LoopInfo = .{}, const Patch = struct { rel32_offset: u32, // byte offset of the rel32 field in code @@ -3845,13 +3845,11 @@ pub const Compiler = struct { self.pc_map.appendNTimes(self.alloc, 0, ir.len + 1) catch return null; - // Pre-scan: find branch targets for fusion safety - const branch_targets = self.scanBranchTargets(ir) orelse return null; - defer self.alloc.free(branch_targets); - defer self.alloc.free(self.loop_headers_slice); - defer self.alloc.free(self.loop_end_map); + // Pre-scan: branch targets, loop headers, loop body extents. + if (!self.loop_info.analyse(self.alloc, ir, self.reg_count)) return null; + defer self.loop_info.deinit(self.alloc); + const branch_targets = self.loop_info.branch_targets; self.ir_slice = ir; - self.branch_targets_slice = branch_targets; // Pre-scan: compute written_vregs for the ENTIRE function. // Must cover all instructions (not just those before each call site) @@ -3879,7 +3877,7 @@ pub const Compiler = struct { // Evict SIMD XMM cache at branch targets (merge points) if (pc < branch_targets.len and branch_targets[pc]) { self.scratch_vreg = null; - if (pc < self.loop_headers_slice.len and self.loop_headers_slice[pc]) { + if (pc < self.loop_info.loop_headers.len and self.loop_info.loop_headers[pc]) { // Loop header: emit pre-loads BEFORE pc_map (first iteration only) self.emitLoopPreHeader(ir, pc); } else { @@ -3984,7 +3982,7 @@ pub const Compiler = struct { /// Emit loop pre-header: load v128 input vregs into XMM regs before the loop header. 
fn emitLoopPreHeader(self: *Compiler, ir: []const RegInstr, header_pc: u32) void { - const end_pc = if (header_pc < self.loop_end_map.len) self.loop_end_map[header_pc] else return; + const end_pc = if (header_pc < self.loop_info.loop_end.len) self.loop_info.loop_end[header_pc] else return; if (end_pc == 0) return; var written = [_]bool{false} ** 128; @@ -6009,7 +6007,7 @@ pub const Compiler = struct { regalloc_mod.OP_BR => { const target = instr.operand; const is_back_edge = target <= pc.* - 1; - if (is_back_edge and target < self.loop_headers_slice.len and self.loop_headers_slice[target]) { + if (is_back_edge and target < self.loop_info.loop_headers.len and self.loop_info.loop_headers[target]) { // Back-edge to loop header: flush XMM (keep cache) for deopt safety self.simdXregFlushAll(); } else if (is_back_edge) { @@ -6474,73 +6472,10 @@ pub const Compiler = struct { // --- Peephole fusion: CMP+Jcc --- - fn scanBranchTargets(self: *Compiler, ir: []const RegInstr) ?[]bool { - const targets = self.alloc.alloc(bool, ir.len) catch return null; - @memset(targets, false); - const loop_headers = self.alloc.alloc(bool, ir.len) catch { - self.alloc.free(targets); - return null; - }; - @memset(loop_headers, false); - const loop_end = self.alloc.alloc(u32, ir.len) catch { - self.alloc.free(loop_headers); - self.alloc.free(targets); - return null; - }; - @memset(loop_end, 0); - - var scan_pc: u32 = 0; - while (scan_pc < ir.len) { - const instr = ir[scan_pc]; - const source_pc = scan_pc; - scan_pc += 1; - switch (instr.op) { - regalloc_mod.OP_BR => { - if (instr.operand < ir.len) { - targets[instr.operand] = true; - if (instr.operand <= source_pc) { - loop_headers[instr.operand] = true; - if (source_pc > loop_end[instr.operand]) - loop_end[instr.operand] = source_pc; - } - } - }, - regalloc_mod.OP_BR_IF, regalloc_mod.OP_BR_IF_NOT => { - if (instr.operand < ir.len) { - targets[instr.operand] = true; - if (instr.operand <= source_pc) { - loop_headers[instr.operand] = true; - if 
(source_pc > loop_end[instr.operand]) - loop_end[instr.operand] = source_pc; - } - } - }, - regalloc_mod.OP_BR_TABLE => { - const count = instr.operand; - var i: u32 = 0; - while (i < count + 1 and scan_pc < ir.len) : (i += 1) { - const entry = ir[scan_pc]; - scan_pc += 1; - if (entry.operand < ir.len) { - targets[entry.operand] = true; - if (entry.operand <= source_pc) { - loop_headers[entry.operand] = true; - if (source_pc > loop_end[entry.operand]) - loop_end[entry.operand] = source_pc; - } - } - } - }, - regalloc_mod.OP_BLOCK_END => { - targets[scan_pc - 1] = true; - }, - else => {}, - } - } - self.loop_headers_slice = loop_headers; - self.loop_end_map = loop_end; - return targets; - } + // scanBranchTargets has moved to src/loop_info.zig — see jit.zig comment + // for the rationale. The body of the analysis is shared with the ARM64 + // backend so future enrichments (liveness, invariant-const classification) + // do not need to be duplicated. /// Try to fuse a CMP result with a following BR_IF/BR_IF_NOT. /// Returns true if fused, false if not fuseable. Returns null on OOM. @@ -6549,7 +6484,7 @@ pub const Compiler = struct { const next = self.ir_slice[pc.*]; if (next.op != regalloc_mod.OP_BR_IF and next.op != regalloc_mod.OP_BR_IF_NOT) return false; if (next.rd != rd) return false; - if (pc.* < self.branch_targets_slice.len and self.branch_targets_slice[pc.*]) return false; + if (pc.* < self.loop_info.branch_targets.len and self.loop_info.branch_targets[pc.*]) return false; // Fuse: emit Jcc instead of SETCC + MOVZX + store + load + TEST + Jcc const actual_cc = if (next.op == regalloc_mod.OP_BR_IF) cc else cc.invert();