From 4c26046d7d6118789e788cf4528c6decbc22ca0b Mon Sep 17 00:00:00 2001
From: "Shota Kudo (chaploud)" <shota.508@studist.jp>
Date: Wed, 29 Apr 2026 22:18:07 +0900
Subject: [PATCH] =?UTF-8?q?docs(w54):=20investigation=20=E2=80=94=202.1?=
 =?UTF-8?q?=C3=97=20wasmtime=20gap=20is=20post-fold=20loop=20overhead?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The original W54 framing assumed zwasm did not constant-fold
`i32.div_u K`, leaving cranelift's multiply-high optimisation
unmatched. That is wrong: the fold is already implemented for
both ARM64 (`src/jit.zig:3582-3666`) and x86_64
(`src/x86.zig` `tryEmitDivByConstU32`). Dumping the JIT for
`bench/wasm/tgo_string_ops.wasm` confirms three identical
MOVZ+MOVK+UMULL+LSR sequences for the three `i32.div_u 10`
sites — zero `UDIV` instructions emitted.

The actual gap lives in two places:

1. The 2-instruction magic-constant load (MOVZ + MOVK for
   0xCCCCCCCD) is re-emitted inside the loop body on every
   iteration; cranelift hoists it via SSA + GVN so only
   UMULL+LSR stay in the hot path. With three div sites in
   `tgo_strops` that costs ~6 ARM64 instructions per loop
   iteration.
2. TinyGo's `local.set`s lower to a `mov rd = rs1` each; cranelift's
   SSA collapses those into register renames, whereas zwasm's
   linear-scan regalloc emits an LDR/STR pair against `regs[]`
   whenever the vreg is not held in a physical register.
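
As a sanity check on the fold itself, here is a minimal Python model of
the emitted sequence. It assumes the usual reading of the magic,
`0xCCCCCCCD = ceil(2**35 / 10)`; the function names are hypothetical
and exist only for this sketch.

```python
MAGIC_DIV10 = 0xCCCCCCCD  # ceil(2**35 / 10)
SHIFT = 35                # 32 (upper half of the product) + 3


def movz_movk(low16: int, high16: int) -> int:
    """Model MOVZ Xd, #low16 followed by MOVK Xd, #high16, lsl 16."""
    return (high16 << 16) | low16


def div10_via_magic(n: int) -> int:
    """Model UMULL + LSR #35 for an unsigned 32-bit dividend."""
    assert 0 <= n < 2**32
    product = n * MAGIC_DIV10  # UMULL: 32x32 -> 64-bit product
    return product >> SHIFT    # LSR: keep only the quotient bits


# The two halves in the JIT dump rebuild the magic constant.
assert movz_movk(0xCCCD, 0xCCCC) == MAGIC_DIV10

# The fold agrees with plain division at the edges of the u32 range.
for n in (0, 9, 10, 99, 100, 2**31, 2**32 - 1):
    assert div10_via_magic(n) == n // 10
```

Only `movz_movk` is loop-invariant, which is why it is the part worth
hoisting.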

Single-pass-compatible levers, ranked by leverage:

- **Loop-preheader magic hoist.** Extend `emitLoopPreHeader`
  (currently SIMD-only) to scan for `OP_CONST32 K` whose `rd`
  feeds `OP_DIV_U` / `OP_REM_U` later in the loop body, allocate
  a callee-saved register, and pre-load the magic.
  `tryEmitDivByConstU32` would then skip its MOVZ+MOVK when the
  magic is already live. Saves ~6 instructions per iteration on
  `tgo_strops`. Risk: medium. The hoist has to coexist with the
  existing physical-register layout: `vregToPhys` fills x20-x26
  and x9-x15 quickly, and in functions like func#24 (13 vregs)
  no callee-saved register is free unless the prologue reserves
  one.
- **`OP_CONST32` reuse across loop back-edges.** Today
  `known_consts` is wiped at every header. Skip unless the
  preheader hoist lands first — saves the 1-instr const itself
  but not the 2-instr magic that hangs off it.
- **`OP_MOV` coalescing in linear-scan regalloc.** Substantial
  surgery; warrants its own W## entry, not in scope for tonight.
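
A rough sketch of the scan the first lever needs, written in Python
rather than Zig. The `Instr` tuple and `hoistable_divisors` are
hypothetical stand-ins, not zwasm's RegIR types; the body is the
`digitCount` loop from the investigation notes.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    """Hypothetical stand-in for one RegIR instruction."""
    op: str
    rd: Optional[int] = None
    rs1: Optional[int] = None
    rs2: Optional[int] = None
    imm: Optional[int] = None


def hoistable_divisors(body):
    """Return divisor constants K that reach a div_u/rem_u in `body`."""
    known = {}  # vreg -> constant it currently holds
    found = []
    for ins in body:
        # Read operands before applying this instruction's write.
        if ins.op in ("div_u", "rem_u"):
            k = known.get(ins.rs2)
            # Only non-zero, non-power-of-two divisors take the magic path.
            if k and (k & (k - 1)) != 0 and k not in found:
                found.append(k)
        if ins.rd is not None:
            if ins.op == "const32":
                known[ins.rd] = ins.imm
            else:
                known.pop(ins.rd, None)  # overwritten by a non-constant
    return found


DIGIT_COUNT_LOOP = [
    Instr("add", rd=8, rs1=2, rs2=7),
    Instr("mov", rd=2, rs1=8),
    Instr("const32", rd=8, imm=9),
    Instr("gt_u", rd=8, rs1=0, rs2=8),
    Instr("mov", rd=6, rs1=8),
    Instr("const32", rd=8, imm=10),
    Instr("div_u", rd=8, rs1=0, rs2=8),
    Instr("mov", rd=0, rs1=8),
    Instr("br_if", rs1=6),
]

print(hoistable_divisors(DIGIT_COUNT_LOOP))  # [10]
```

The scan only finds candidates; which physical register holds the
magic is the open design question.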

Next step: open a focused PR that experiments with the
preheader hoist on a minimal JIT regression suite first, and
abort if `bench/run_bench.sh --quick` shows a regression
elsewhere. Re-record `bench/runtime_comparison.yaml` at 5 runs
+ 3 warmup before claiming a number — the existing values are
single-sample.

Captures the diagnosis tonight so the implementation pass can
start clean from a verified hypothesis rather than redo the
analysis.
---
 .dev/checklist.md         |  82 ++++++++++----------
 .dev/memo.md              |  63 ++++++++--------
 .dev/roadmap.md           |   2 +-
 .dev/w54-investigation.md | 152 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 229 insertions(+), 70 deletions(-)
 create mode 100644 .dev/w54-investigation.md

diff --git a/.dev/checklist.md b/.dev/checklist.md
index 00867732d..e40a8a74e 100644
--- a/.dev/checklist.md
+++ b/.dev/checklist.md
@@ -92,46 +92,54 @@ Prefix: W## (to distinguish from CW's F## items).
 - [ ] W54: `tgo_strops` (and other div-heavy TinyGo workloads) is
   ~2.1× slower than wasmtime/cranelift on M4 Pro per
   `bench/runtime_comparison.yaml` (2026-03-25, runs=1/warmup=0):
-  zwasm cached 63.2 ms vs wasmtime cached 30.0 ms. Current main
-  (`448f4c8`) is ~67 ms cached, so the gap is recurring, not a
-  measurement artefact, and it dwarfs the W47 +15 % post-0.16
-  regression. The hot loop is TinyGo's `digitCount` —
-  `i32.div_u 10 + br_if` in a `for v > 0` loop — and the lever is
-  the constant-divisor optimisation. Strategy stack, ordered by
-  leverage and single-pass compatibility:
-
-  1. **Constant-divisor → multiply-high (Hacker's Delight 10-9)**
-     in predecode. Detect `i32.const K; i32.div_u` (and `rem_u`,
-     plus `mul K` for power-of-two / shift-add) in a 2-instruction
-     window, rewrite to a synthetic `udiv_const K` RegIR op. JIT
-     emits `MOVZ m; UMULH tmp, n, m; LSR result, tmp, s` on
-     ARM64; `MOV m; MUL r/m32; SHR edx, s` on x86_64. Magic
-     numbers are pure constants of K, computed once. UDIV is
-     8–10 cycles vs UMULH+LSR ~3–4 cycles, so realistic gain on
-     `digitCount` is ~30–40 ms cached → close to wasmtime parity.
-     Pure peephole; preserves single-pass.
-  2. **Loop-header Q-cache persistence** (existing W45). Detect
-     back-edges in `scanBranchTargets` and skip the Q-cache
-     evict at the loop header so the induction var stays in a
-     register. Helps `tgo_strops`, `tgo_arith`, `tgo_fib_loop`,
-     `st_nestedloop`. Already designed; cheap to land.
-  3. **`br_if` fall-through ordering audit**. cranelift always
-     places the fall-through arm as the loop continuation so the
-     branch predictor wins. Confirm `regalloc.zig`'s terminator
-     emit does the same and mirror it if not. Cheap audit.
-  4. **Interpreter dispatch codegen diff** (also closes W47). asm
-     diff `vm.zig`'s hot dispatch loop between v1.9.1 and main
-     under Zig 0.16 / LLVM 19. The post-0.16 +15 % most likely
-     lives here, and a fix would lift every interpreter path,
-     not just `tgo_strops`.
+  zwasm cached 63.2 ms vs wasmtime cached 30.0 ms. The original
+  framing — "constant-divisor folding is missing" — was
+  **disproven** during the 2026-04-29 evening investigation. Both
+  ARM64 (`src/jit.zig:3582-3666` `tryEmitDivByConstU32`) and
+  x86_64 (`src/x86.zig`) already emit the Hacker's Delight magic
+  multiply for `i32.div_u K`; the JIT dump for
+  `bench/wasm/tgo_string_ops.wasm` shows three MOVZ+MOVK+UMULL+LSR
+  sequences for the three `i32.div_u 10` sites, with zero `UDIV`
+  instructions.
+
+  The remaining 2.1× lives in two places:
+
+  1. The 2-instruction magic-constant load (`MOVZ + MOVK` for
+     `0xCCCCCCCD`) is re-emitted inside the loop body on every
+     iteration; cranelift's SSA + GVN hoist it once so only
+     `UMULH/UMULL + LSR` stay hot. Three div sites in
+     `tgo_strops` cost ~6 ARM64 instructions per iteration that a
+     preheader hoist would eliminate.
+  2. TinyGo emits a `mov rd = rs1` per `local.set`; cranelift
+     collapses those into register renames whereas zwasm's
+     linear-scan regalloc still spills them to LDR/STR against
+     `regs[]`.
+
+  Single-pass-compatible levers, ranked:
+
+  1. **Loop-preheader magic hoist.** Extend `emitLoopPreHeader`
+     (today SIMD-only, `src/jit.zig:4604`) to scan for
+     `OP_CONST32 K → OP_DIV_U` patterns, allocate a callee-saved
+     register, and pre-load the magic. `tryEmitDivByConstU32`
+     would then skip the load when the magic is live. Risk:
+     medium — needs to coexist with the existing physical-
+     register layout (functions like `string_ops` (func#24, 13
+     vregs) saturate the callee-saved set; the prologue would
+     have to reserve a free slot up front).
+  2. **`OP_CONST32` reuse across loop back-edges.** Today
+     `known_consts` is wiped at every header. Skip unless (1)
+     lands — saves the 1-instr const itself but not the 2-instr
+     magic that hangs off it.
+  3. **`OP_MOV` coalescing in linear-scan regalloc.** Substantial
+     surgery; deserves a separate W## entry.
 
   Out of scope (would break single-pass): SSA + dataflow,
-  global register allocation beyond linear scan, automatic loop
-  unroll / vectorise.
+  global register allocation, automatic loop unroll /
+  vectorise. Re-record `runtime_comparison.yaml` at 5 runs / 3
+  warmup before claiming any number — the current values are
+  single-sample.
 
-  Measurement note: `runtime_comparison.yaml` is currently runs=1
-  / warmup=0 — useful for ordering, not for absolute targets.
-  Re-record at 5 runs / 3 warmup before claiming a win.
+  Full investigation log: `@./.dev/w54-investigation.md`.
 
 - [ ] W48 Phase 2: Linux binary size 1.56 MB → 1.50 MB (~62 KB more).
   W48 Phase 1 shipped (2026-04-25): `pub const panic = std.debug.simple_panic`
diff --git a/.dev/memo.md b/.dev/memo.md
index 309df0656..0f5a66bb0 100644
--- a/.dev/memo.md
+++ b/.dev/memo.md
@@ -58,38 +58,37 @@ Post-merge bench rows for the C-g merge (`e5766ee`):
 
 ## Open work
 
-### 0. **W54** — close the 2.1× wasmtime gap on `tgo_strops`
-
-`bench/runtime_comparison.yaml` (2026-03-25, commit 65db814,
-runs=1/warmup=0) shows zwasm cached 63.2 ms vs wasmtime cached
-30.0 ms on `tgo_strops`. Current main (`448f4c8`) is ~67 ms
-cached, so the gap is structural — and it dwarfs the W47 +15 %
-post-0.16 regression. Bigger leverage than W47.
-
-Hot loop is TinyGo's `digitCount` — `i32.div_u 10 + br_if` inside
-`for v > 0`. The lever is the constant-divisor optimisation that
-cranelift does by default. Strategy stack (single-pass-compatible,
-ordered by leverage):
-
-1. **Constant-divisor → multiply-high in predecode**. Two-op
-   window peephole for `i32.const K; i32.div_u` (and `rem_u`,
-   `mul K` power-of-two). Synthesise `udiv_const K`; JIT emits
-   `UMULH + LSR` on ARM64, `MUL + SHR` on x86_64. Magic
-   numbers pre-computed per K. UDIV ~10 cyc → UMULH ~3 cyc,
-   so realistic gain ≈ wasmtime parity for div-heavy workloads.
-2. **Loop-header Q-cache persistence (W45)**. Skip Q-cache
-   evict at loop headers so induction vars stay in registers.
-   Already designed.
-3. **`br_if` fall-through audit**. Confirm `regalloc.zig`
-   places the fall-through arm as the loop-continuation path.
-4. **Interpreter dispatch codegen diff** (also closes W47).
-   asm diff `vm.zig` hot dispatch v1.9.1 vs main under
-   Zig 0.16 / LLVM 19.
-
-Out of scope (would break single-pass): SSA, global regalloc,
-auto unroll / vectorise. Re-record `runtime_comparison.yaml` at
-5 runs / 3 warmup before claiming a win — current values are
-single-sample. Detailed strategy in `.dev/checklist.md` W54.
+### 0. **W54** — investigated; framing reset, implementation deferred
+
+The original W54 framing — "zwasm doesn't fold `i32.div_u K`,
+that's why wasmtime is 2× faster" — was **disproven** during
+the 2026-04-29 evening investigation. Both ARM64 and x86_64
+JITs already emit the Hacker's Delight multiply-high for
+constant divisors; the JIT dump for `tgo_string_ops` (func#24,
+3 div_u sites, magic 0xCCCCCCCD) shows zero UDIV instructions.
+
+The 2.1× gap actually lives in:
+
+1. **Magic-constant re-load every iteration** (~6 ARM64 instrs/
+   iter for 3 div sites). cranelift's SSA + GVN hoist the load;
+   zwasm's single-pass JIT cannot without a preheader pass.
+2. **mov-heavy RegIR from TinyGo's `local.set`** that the
+   linear-scan regalloc spills to LDR/STR pairs.
+
+Best next step (single-pass-compatible) is the loop-preheader
+magic hoist — extend `emitLoopPreHeader` (SIMD-only today,
+`src/jit.zig:4604`) to scan for `OP_CONST32 K → OP_DIV_U`,
+reserve a callee-saved register in the prologue, and pre-load
+the magic. Deferred from this session because reserving the
+extra callee-saved slot interacts with the physical-register
+layout in `vregToPhys` (functions like `string_ops` with 13
+vregs already saturate the callee-saved set, so the hoist
+needs the prologue spill machinery to make room) — the design
+is well-bounded but invasive enough to warrant its own focused
+PR rather than a tail-end commit on a long autonomous run.
+
+Full investigation: `.dev/w54-investigation.md` (now in main
+on PR #90).
 
 ### 1. **W47** — `tgo_strops_cached` regression with stable harness
 
diff --git a/.dev/roadmap.md b/.dev/roadmap.md
index 20bc0206a..901535429 100644
--- a/.dev/roadmap.md
+++ b/.dev/roadmap.md
@@ -32,7 +32,7 @@ Details: `roadmap-archive.md`.
 | Windows CI guard removal    | Done     | W49 (Plan C residuals) + W50 (CI Nix-ify) shipped 2026-04-29 PM. Only `benchmark` Ubuntu-only remains, sequenced behind C-g step 5. |
 | W53 install-tools.ps1 rust  | Done     | Root-cause: rustup-init stdout polluting `Install-Rustup`'s return; fix routes through `Out-Host`. CI dropped `-SkipRust`. |
 | C-g multi-arch bench schema | Done     | PR #86 (2026-04-29 eve). Step 5 (3-OS matrix flip + Windows hyperfine + native x86_64 baseline) tracked in `.dev/memo.md` open work. |
-| W54 close 2.1× wasmtime gap | Active   | Hot lever: constant-divisor → mul-high peephole in predecode. Single-pass-safe. See `.dev/checklist.md` W54 for full strategy. |
+| W54 close 2.1× wasmtime gap | Active   | Original framing disproven (const-divisor fold is already implemented). Real lever: magic-constant hoist in `emitLoopPreHeader`. See `.dev/w54-investigation.md`. |
 | Spec test auto-bump         | Active   | Weekly CI (spec-bump.yml). Review failures.        |
 | wasm-tools tracking         | Active   | Monthly CI (wasm-tools-bump.yml)                   |
 | SpecTec monitoring          | Active   | Weekly CI (spectec-monitor.yml)                    |
diff --git a/.dev/w54-investigation.md b/.dev/w54-investigation.md
new file mode 100644
index 000000000..c57ce90bf
--- /dev/null
+++ b/.dev/w54-investigation.md
@@ -0,0 +1,152 @@
+# W54 — `tgo_strops` 2.1× wasmtime gap investigation
+
+Captured: 2026-04-29 evening, ship-overnight session.
+Status: investigated, no fix shipped yet.
+
+## What zwasm already does, contrary to the initial hypothesis
+
+The constant-divisor → multiply-high (Hacker's Delight 10-9) peephole
+is **already implemented** for both ARM64 (`src/jit.zig:3582-3666`)
+and x86_64 (`src/x86.zig`, `tryEmitDivByConstU32`). `known_consts`
+tracking in the compile loop sets each vreg's value when an
+`OP_CONST32` lands; `emitDiv32` checks the rs2 vreg for a known
+constant and falls into `tryEmitDivByConstU32` for non-zero,
+non-power-of-two divisors.
+
+Confirmed empirically on `bench/wasm/tgo_string_ops.wasm`:
+
+```
+$ ./zig-out/bin/zwasm run --invoke string_ops bench/wasm/tgo_string_ops.wasm \
+    --dump-jit=24 10000 | python3 ...
+0x0f8: MOVZ X16, #0xCCCD                  ← magic for /10 (low half)
+0x0fc: MOVK X16, #0xCCCC, lsl 16          ← magic for /10 (high half)
+0x100: UMULL X8, W22, W16                 ← widening 32×32→64 multiply
+0x104: LSR  X8, X8, #35                   ← quotient = product >> (32+3)
+```
+
+Three identical 5-instruction sequences for the three `i32.div_u 10`
+sites in `string_ops`. **Zero `UDIV` instructions emitted.**
+
+So the original W54 hypothesis ("zwasm doesn't fold constant divisors,
+that's why wasmtime is 2× faster") is wrong — the fold is done.
+
+## Where the 2× gap actually lives
+
+### a. Magic constant re-loaded every iteration
+
+Each `i32.div_u 10` inside `digitCount`'s loop body re-emits the full
+2-instruction MOVZ+MOVK pair to materialise `0xCCCCCCCD`. Cranelift's
+SSA + GVN hoist that load to before the loop, leaving only UMULL+LSR
+inside the hot path. With three div sites in `tgo_strops`, the
+hoistable cost is **6 instructions per loop iteration**.
+
+### b. mov-heavy RegIR
+
+TinyGo's `digitCount` body in RegIR (function 24, PCs 22..30; back-edge to pc 21):
+
+```
+[022] add  r8 = r2, r7      ; counter + 1
+[023] mov  r2 = r8
+[024] const32 r8 = 9
+[025] gt_u r8 = r0, r8
+[026] mov  r6 = r8           ← cond temp
+[027] const32 r8 = 10
+[028] div_u r8 = r0, r8      ← 5 instrs after const-folding
+[029] mov  r0 = r8
+[030] br_if r6 -> pc=21
+```
+
+Roughly 9 RegIR instructions, but the JIT emits ~17 ARM64 instructions
+because each `mov` becomes an LDR/STR pair against the `regs[]`
+spill area when the vreg is not currently held in a physical
+register, and the const-divisor sequence is 5 instructions.
+
+Cranelift's SSA collapses every `mov rA = rB` plus the redundant
+counter/temp stores into pure register renames. zwasm's linear-scan
+regalloc cannot do that without an additional pass.
+
+### c. Single-pass constraint
+
+Both fixes (magic-constant hoist, mov coalescing) require either
+loop-aware analysis or a second pass over the RegIR — which the
+project's design constraint (single-pass JIT to keep cold start
+cheap) excludes by default.
+
+## Realistic single-pass-compatible wins
+
+Ordered by leverage and implementation risk:
+
+1. **Loop-preheader magic hoist.** Extend the existing
+   `emitLoopPreHeader` (currently SIMD-only,
+   `src/jit.zig:4604`) to scan the loop body for
+   `OP_CONST32 K` instructions whose `rd` is later consumed by an
+   `OP_DIV_U` / `OP_REM_U`. Allocate a callee-saved register, emit
+   the magic constant once before the loop, and have the in-loop
+   `emitDiv32` skip its MOVZ+MOVK if the magic is already live.
+   Saves ~6 instructions per iteration on `tgo_strops`. Risk:
+   medium — needs careful tracking of which scratch register holds
+   which magic across the loop body, and the back-edge logic must
+   not invalidate the cache.
+
+   **Register layout interaction (caught during the abandoned
+   experimental attempt on `develop/w54-magic-hoist-attempt`).**
+   The obvious magic register is `x21`: `vregToPhys` hands it out
+   only when `reg_count >= 14`, so it looks free below that. But
+   `x21` is *also* the dedicated `inst_ptr` cache slot whenever
+   `reg_count <= 13 && has_self_calls` (see `src/jit.zig:1129`,
+   field `inst_ptr_cached`). That case covers a real slice of the
+   corpus, so the hoist needs to either
+   (a) skip the optimisation whenever `inst_ptr_cached` is true
+       (smallest, safest patch — gives up the optimisation on
+       self-calling functions with ≤13 vregs);
+   (b) extend the prologue to reserve an additional callee-saved
+       slot (e.g. push `x23` or pick from the unused tail of the
+       STP-pair set when `reg_count` is small) and thread that
+       through the existing layout machinery (more invasive).
+   x86_64 has its own version of this dance (`r13`/`r14` are the
+   common candidates); a clean implementation should keep the
+   register choice in an arch-specific helper rather than a
+   shared constant.
+
+   Tonight's experimental branch was abandoned at this point
+   because picking the right safety boundary for the register
+   choice is itself a design call worth daylight, not a
+   tail-end commit.
+
+2. **`OP_CONST32` reuse across loop back-edges.** Today
+   `known_consts` is wiped at every loop header and back-edge.
+   For consts whose `rd` is rewritten consistently to the same
+   value on every iteration, we could keep the entry alive: emit
+   only the first iteration's const load, then `nop` (or branch
+   over) the const-load on subsequent iterations. Less leverage
+   than (1) because OP_CONST32 itself is only 1 instruction —
+   what (1) saves is the magic computation that hangs off the
+   const, not the const itself. Skip unless (1) lands.
+
+3. **`OP_MOV` coalescing in regalloc.** When a `mov rd = rs1` has
+   `rs1` dead-after-this-point, rewrite the producer of `rs1` to
+   write directly into `rd`. Needs liveness, which today is
+   computed only as `written_vregs`. Substantial regalloc surgery
+   — likely a separate W## item, not in scope for tonight.
+
+## What was attempted in this session
+
+- Confirmed the const-divisor fold triggers (above).
+- Decoded the JIT'd bytes for func#24 to see the actual emitted
+  ARM64. Identified MOVZ+MOVK+UMULL+LSR as the 5-instruction
+  per-div-site cost.
+- Started the preheader hoist on `develop/w54-magic-hoist-attempt`,
+  abandoned it at the register choice; did **not** try mov coalescing.
+  Both need a design pass with spec/regalloc tests as the safety net.
+
+## Recommended next step
+
+Open a separate `develop/w54-loop-magic-hoist` branch and
+prototype the preheader hoist (item 1 above) against a minimal
+JIT regression suite first. If `bench/run_bench.sh --quick` shows
+≥15 % improvement on `tgo_strops` with no regression elsewhere,
+land it. Otherwise revert and capture the dead-end here.
+
+Re-record `bench/runtime_comparison.yaml` at 5 runs / 3 warmup
+before claiming a number — the current single-sample values are
+useful for ordering but not for absolute targets.