clojurewasm · chaploud · Apr 29, 2026
diff --git a/.dev/checklist.md b/.dev/checklist.md
@@ -89,57 +89,71 @@ Prefix: W## (to distinguish from CW's F## items).
   `@./.dev/w47-investigation.md`. Low priority since 20 other
   benchmarks improved >10% (GC paths 40–76% faster).
 
-- [ ] W54: `tgo_strops` (and other div-heavy TinyGo workloads) is
-  ~2.1× slower than wasmtime/cranelift on M4 Pro per
-  `bench/runtime_comparison.yaml` (2026-03-25, runs=1/warmup=0):
-  zwasm cached 63.2 ms vs wasmtime cached 30.0 ms. The original
-  framing — "constant-divisor folding is missing" — was
-  **disproven** during the 2026-04-29 evening investigation. Both
-  ARM64 (`src/jit.zig:3582-3666` `tryEmitDivByConstU32`) and
-  x86_64 (`src/x86.zig`) already emit the Hacker's Delight magic
-  multiply for `i32.div_u K`; the JIT dump for
-  `bench/wasm/tgo_string_ops.wasm` shows three MOVZ+MOVK+UMULL+LSR
-  sequences for the three `i32.div_u 10` sites, with zero `UDIV`
-  instructions.
-
-  The remaining 2.1× lives in two places:
-
-  1. The 2-instruction magic-constant load (`MOVZ + MOVK` for
-     `0xCCCCCCCD`) is re-emitted inside the loop body on every
-     iteration; cranelift's SSA + GVN hoist it once so only
-     `UMULH/UMULL + LSR` stay hot. Three div sites in
-     `tgo_strops` cost ~6 ARM64 instructions per iteration that a
-     preheader hoist would eliminate.
-  2. TinyGo emits a `mov rd = rs1` per `local.set`; cranelift
-     collapses those into register renames whereas zwasm's
-     linear-scan regalloc still spills them to LDR/STR against
-     `regs[]`.
-
-  Single-pass-compatible levers, ranked:
-
-  1. **Loop-preheader magic hoist.** Extend `emitLoopPreHeader`
-     (today SIMD-only, `src/jit.zig:4604`) to scan for
-     `OP_CONST32 K → OP_DIV_U` patterns, allocate a callee-saved
-     register, and pre-load the magic. `tryEmitDivByConstU32`
-     short-circuits when the magic is already live. Risk:
-     medium — needs to coexist with the existing physical-
-     register layout (functions like `string_ops` (func#24, 13
-     vregs) saturate the callee-saved set; the prologue would
-     have to reserve a free slot up front).
-  2. **`OP_CONST32` reuse across loop back-edges.** Today
-     `known_consts` is wiped at every header. Skip unless (1)
-     lands — saves the 1-instr const itself but not the 2-instr
-     magic that hangs off it.
-  3. **`OP_MOV` coalescing in linear-scan regalloc.** Substantial
-     surgery; deserves a separate W## entry.
-
-  Out of scope (would break single-pass): SSA + dataflow,
-  global register allocation, automatic loop unroll /
-  vectorise. Re-record `runtime_comparison.yaml` at 5 runs / 3
-  warmup before claiming any number — the current values are
-  single-sample.
-
-  Full investigation log: `@./.dev/w54-investigation.md`.
+- [x] W54 (substrate): structural cleanup landed via PR #91 from
+  `develop/w54-loop-info`. Single change: `src/loop_info.zig` is
+  the single source of truth for branch / loop / vreg liveness.
+  The two JIT backends used to maintain byte-identical
+  `scanBranchTargets` implementations; both now consume
+  `LoopInfo.analyse(...)`. `vreg_first_def[]` / `vreg_last_use[]`
+  are computed in the same forward sweep, ready for future
+  consumers (Phase 5 / Phase 4). Behaviour byte-identical to
+  main (verified via `--dump-jit` diff on tgo_string_ops func#24,
+  fib func#2, and the realworld suite on Mac aarch64 + Ubuntu
+  x86_64). Architecture: D138 in `.dev/decisions.md`. Session arc:
+  `.dev/w54-redesign-postmortem.md`.
+
+- [ ] W54-coalescer: liveness-driven mov coalescing extension to
+  `regalloc.copyPropagate`. Built and proven on Mac aarch64 (50/50
+  realworld), reverted from the substrate PR after Linux x86_64
+  CI flagged a `go_math_big` divergence — BigInt subtraction
+  result `864197532086419753208641975320` (wasmtime) vs
+  `864197532160206729503480181784` (zwasm). The regalloc itself
+  is arch-agnostic, so the same `RegFunc` flows through both
+  backends; the divergence implies an x86_64-specific assumption
+  in `src/x86.zig`'s codegen that the new IR layout (fewer MOVs,
+  shifted PCs) violates. Reproducible on OrbStack's
+  `my-ubuntu-amd64` with `develop/w54-loop-pass-redesign`
+  checked out and `zig build -Doptimize=ReleaseSafe`. The
+  coalescer commit (`ec8182f` on the archive branch) cherry-picks
+  cleanly; debugging is in the x86 backend's interaction with the
+  redef-stop pattern — likely a getOrLoad / SCRATCH contention
+  triggered by a specific opcode sequence that the coalesced IR
+  produces. Recommended bisect: dump `--dump-regir` for the
+  failing function, identify the first MOV the new coalescer
+  folds that the old version kept, then dump `--dump-jit` on x86
+  to find the codegen mismatch.
+
+- [ ] W54-hoist-revisit: magic-constant loop-invariant hoist was
+  built and proven on `develop/w54-loop-pass-redesign` (archived
+  as `archive/w54-magic-hoist-2026-04-30`). digitCount's JIT output
+  shrinks from 196 to 192 ARM64 instructions with the hoist alone.
+  **Held back** from the substrate PR pending three prerequisites:
+  1. **W47** — bench harness with σ < 5% on `tgo_strops`. The
+     hoist's effect is below the current 10% σ floor.
+  2. **W54-x86** — x86_64 `pickHoistPhys` parity. Landing ARM64
+     alone would force a reconciling follow-up; bundling the two
+     makes one coherent change.
+  3. `runtime_comparison.yaml` re-recorded at 5/3 hyperfine on a
+     thermally-stable rig.
+  Re-attempt: cherry-pick `1600397` + `c4b806e` from the archive
+  branch onto a fresh redesign branch once 1+2+3 land.
+
+- [ ] W54-x86: x86_64 magic-constant hoist parity for
+  W54-hoist-revisit. The archive branch's `src/x86.zig` already
+  has the `hoist_phys` / `hoist_displaced_infra` field
+  scaffolding; `pickHoistPhys` body needs implementing for x86's
+  reg layout (RBX/RBP/R15 vreg-bound for any reg_count >= 1; no
+  `inst_ptr_cached` slot to displace; R13/R14 free only when
+  `!has_memory`). Bench-driven decision on whether the win is
+  worth the displacement cost on functions with `has_memory`.
+
+- [ ] W54-libm: real-world `rw_c_math` (5.0× wasmtime cold,
+  8.7× cached) is dominated by BLR-heavy libm dispatch (`sin`,
+  `cos`, `pow`, `sqrt` per iteration). Out of scope for the
+  loop-pass redesign. Single-pass-compatible candidates: intrinsic
+  recognition for imported function names (`sqrt` → FSQRT inline),
+  software-libm fallback for sin/cos/pow. Needs imported-function
+  name resolution on the predecode side.
 
 - [ ] W48 Phase 2: Linux binary size 1.56 MB → 1.50 MB (~62 KB more).
   W48 Phase 1 shipped (2026-04-25): `pub const panic = std.debug.simple_panic`

diff --git a/.dev/decisions.md b/.dev/decisions.md
@@ -941,3 +941,131 @@ Mac binary another 60 KB → cap drops 1.30 → 1.25 MB). Loosening a
 ceiling requires a CHANGELOG entry naming the regression source so
 the slack is intentional and visible.
 
+
+## D138: Shared `LoopInfo` analysis layer for the JIT pipeline
+
+**Status**: Accepted — landed via `develop/w54-loop-info` (PR #91).
+
+**Context**: For a long stretch we treated each JIT optimisation as a
+self-contained patch — the SIMD `emitLoopPreHeader`, the
+const-divisor magic-multiply fold, the adjacent-MOV `copyPropagate`,
+the `vm_ptr_cached` / `inst_ptr_cached` slots in `vregToPhys`. Each
+re-derived its own slice of "what's a loop" and "is this vreg dead"
+inline at codegen time. That ad-hoc layout was the proximate cause
+of the W54 magic-hoist abandonment on
+`develop/w54-magic-hoist-attempt` (2026-04-29 evening, see
+`.dev/w54-investigation.md`): x21 was simultaneously the inst_ptr
+cache slot for `reg_count <= 13 && has_self_call` AND the natural
+callee-saved candidate for the magic. Picking a safe boundary was a
+design call, not a tail-end commit.
+
+**Decision**: Ship the substrate first. One structural change that
+stands on its own merit and unblocks future loop-aware work:
+
+`src/loop_info.zig` is the single source of truth for the function's
+control-flow shape and per-vreg liveness. The two JIT backends used
+to maintain byte-for-byte identical `scanBranchTargets`
+implementations; both now consume `LoopInfo.analyse(allocator, ir,
+reg_count)`, which produces:
+
+- `branch_targets[]`, `loop_headers[]`, `loop_end[]` (drives JIT
+  cache eviction and the `known_consts` wipe at merge points).
+- `vreg_first_def[]`, `vreg_last_use[]` (one forward sweep,
+  conservative reads — over-approximation extends last_use later
+  than necessary, which only shrinks a future coalescer's window;
+  never breaks correctness). Phase 5+ consumers will read these.
+
+The opcode classification helpers `opWritesRd`, `opUsesRdAsSource`,
+`opUsesRs1AsSource`, `opUsesRs2AsSource` live in `loop_info.zig`
+for now (private) — they will be promoted to public regalloc API
+once the coalescer extension that needs them is debugged on x86_64
+(see W54-coalescer in checklist.md).
+
+**Effect (ARM64 Mac)**:
+
+- Both backends drop ~60 lines of duplicated `scanBranchTargets`
+  body in favour of a thin `LoopInfo.analyse` call. Behaviour is
+  byte-for-byte identical to main: `--dump-jit=24` of
+  `tgo_string_ops` func#24 (digitCount, 196 ARM64 instrs / 784
+  bytes) matches main bit-for-bit.
+- All other functions emit byte-identical machine code. No
+  performance change is expected or observed (Phase 0 + 1 are pure
+  refactoring; the data is computed but no codegen consumer reads
+  the new `vreg_first_def[]` / `vreg_last_use[]` arrays yet).
+
+**Rejected alternatives** (and why they didn't ride along):
+
+- **Liveness-driven mov coalescing in `regalloc.copyPropagate`**.
+  Implemented and shipped on the develop branch, then **reverted
+  on 2026-04-30** after the Mac gate passed but Linux x86_64 CI
+  failed on `go_math_big`: the new "stop at first redef of old_reg" scan
+  with branch-target / forward-branch / multi-source bail-outs
+  passes the Mac aarch64 realworld suite (50/50, including
+  `rust_regex` which the first attempt broke), but produces wrong
+  results on Linux x86_64's `go_math_big` (BigInt subtraction
+  mismatch — wasmtime returns
+  `864197532086419753208641975320`, zwasm returns
+  `864197532160206729503480181784`). The regalloc itself is
+  arch-agnostic, so the same `RegFunc` flows through both
+  backends; the divergence implies an x86_64-specific assumption
+  in `src/x86.zig`'s codegen that the new IR layout violates.
+  Diagnosis is bench/CI-bound and not a tail-end fix; tracked as
+  W54-coalescer for a focused follow-up.
+
+- **Magic-constant loop-invariant hoist** (`OP_CONST32 K →
+  OP_DIV_U` pattern, materialise the magic into a callee-saved
+  register in the prologue, short-circuit
+  `tryEmitDivByConstU32`). Implemented on
+  `develop/w54-loop-pass-redesign` (commits `1600397`, `c4b806e`)
+  and proved out: digitCount JIT 196 → 192 with hoist alone, 192
+  → 185 stacked with the (eventually reverted) coalescer. Held
+  back from this PR for three reasons:
+  1. The runtime gain is below the bench σ floor today; without
+     the W47 harness work the optimisation would land
+     evidence-free and any later regression would be argued as
+     noise rather than measured.
+  2. The hoist requires displacing `inst_ptr_cached` (x21) on
+     functions with reg_count >= 5 + has_self_call — an
+     ARM64-specific behaviour change with no measured benefit
+     today. Pushing it post-harness keeps the trade-off
+     reviewable.
+  3. x86_64 parity has different free-slot mechanics (no
+     `inst_ptr_cached` to displace). Bundling hoist with parity
+     makes one coherent change later.
+
+- **Loop-invariant `known_consts` survival across loop headers**.
+  Sketched as Phase 4 of the original plan; dropped after
+  inspection of digitCount's RegIR showed every `i32.div_u 10` site
+  has its CONST32 emitted *inside* the loop body (TinyGo reuses
+  the same vreg `r8` / `r9` / `r12` per div), so the optimisation
+  would never fire on the W54 target.
+
+**Affected files**: `src/loop_info.zig` (new), `src/jit.zig`
+(replace `scanBranchTargets` with `LoopInfo.analyse`), `src/x86.zig`
+(same).
+
+**Archive**: the magic-hoist + coalescer work is preserved on
+`develop/w54-loop-pass-redesign` (last commit `a56d442`) and tagged
+`archive/w54-magic-hoist-2026-04-30`. Cherry-pick path:
+- `1600397` + `c4b806e` for the ARM64 magic hoist (re-attempt
+  W54-hoist-revisit).
+- `ec8182f` for the redef-aware coalescer (re-attempt W54-coalescer
+  after diagnosing the x86_64 `go_math_big` regression).
+
+**Re-evaluation pre-conditions**:
+1. **W47** — bench harness with σ < 5% on tgo_strops (currently ~10%)
+2. **W54-x86** — symmetric `pickHoistPhys` for x86_64 reg layout
+3. **W54-coalescer** — diagnose and fix the x86_64 `go_math_big`
+   divergence; the diff is in the regalloc-stage IR shape, the
+   backend assumption that breaks is in `src/x86.zig`.
+4. `runtime_comparison.yaml` re-recorded with 5/3 hyperfine on a
+   thermally-stable rig.
+
+**Follow-ups** (open W## items in checklist.md):
+- `W54-coalescer`: diagnose the x86_64 `go_math_big` regression,
+  re-land the coalescer.
+- `W54-hoist-revisit`: revive the magic-hoist work once W47 +
+  W54-x86 are ready.
+- `W54-libm`: `rw_c_math` is dominated by libm `sin`/`cos`/`pow`
+  dispatch; intrinsic recognition + ARM64 FSQRT inline + soft-libm
+  fallback.
diff --git a/.dev/memo.md b/.dev/memo.md
@@ -23,7 +23,41 @@ Session handover document. Read at session start.
 
 ## Current Task
 
-**W53 done. C-g foundation + Mac/Ubuntu baselines done.** Ship-overnight
+**W54 substrate landed via PR #91 from `develop/w54-loop-info`** (2026-04-30).
+Single structural change: `src/loop_info.zig` is the single source of
+truth for branch / loop / vreg liveness. Both backends drop ~60 lines
+of byte-identical `scanBranchTargets` in favour of a thin
+`LoopInfo.analyse(...)` call. `vreg_first_def[]` /
+`vreg_last_use[]` are computed from the same forward sweep, ready
+for future consumers. JIT output is byte-identical to main on every
+function (verified via `--dump-jit` diff for tgo_string_ops func#24
+and fib func#2).
+
+### Held back (archive branch)
+
+`develop/w54-loop-pass-redesign` (tagged
+`archive/w54-magic-hoist-2026-04-30`) preserves two further pieces
+of work that were built and bench-validated, but held back:
+
+1. **Magic-constant loop-invariant hoist** (commits `1600397`,
+   `c4b806e`). digitCount JIT 196 → 192. Re-attempt prerequisites:
+   W47 (bench harness σ < 5%), W54-x86 (parity).
+2. **Liveness-driven mov coalescing** (commit `ec8182f`). digitCount
+   JIT 192 → 185 stacked on hoist; substrate-only branch JIT 196 →
+   189 with just the coalescer. **Reverted from PR #91 after
+   Linux x86_64 CI failed `go_math_big`** (BigInt subtraction
+   divergence — wasmtime returns `864197532086419753208641975320`,
+   zwasm returns `864197532160206729503480181784`). The regalloc
+   is arch-agnostic, so the divergence is in `src/x86.zig`'s
+   codegen interaction with the new IR layout. Reproducible on
+   OrbStack `my-ubuntu-amd64`. Tracked as W54-coalescer.
+
+Architecture rationale: D138 in decisions.md. Detailed session arc
+and branch names: `.dev/w54-redesign-postmortem.md`.
+
+### Previous (still on main)
+
+**C-g foundation + Mac/Ubuntu baselines done.** Ship-overnight
 session 2026-04-29 evening landed two PRs to main on top of the
 afternoon's six (#79..#84):