From 4c26046d7d6118789e788cf4528c6decbc22ca0b Mon Sep 17 00:00:00 2001 From: "Shota Kudo (chaploud)" Date: Wed, 29 Apr 2026 22:18:07 +0900 Subject: [PATCH] =?UTF-8?q?docs(w54):=20investigation=20=E2=80=94=202.1?= =?UTF-8?q?=C3=97=20wasmtime=20gap=20is=20post-fold=20loop=20overhead?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The original W54 framing assumed zwasm did not constant-fold `i32.div_u K`, leaving cranelift's multiply-high optimisation unmatched. That is wrong: the fold is already implemented for both ARM64 (`src/jit.zig:3582-3666`) and x86_64 (`src/x86.zig` `tryEmitDivByConstU32`). Dumping the JIT for `bench/wasm/tgo_string_ops.wasm` confirms three identical MOVZ+MOVK+UMULL+LSR sequences for the three `i32.div_u 10` sites — zero `UDIV` instructions emitted. The actual gap lives in two places: 1. The 2-instruction magic-constant load (MOVZ + MOVK for 0xCCCCCCCD) is re-emitted inside the loop body on every iteration; cranelift hoists it via SSA + GVN so only UMULL+LSR stay in the hot path. With three div sites in `tgo_strops` that costs ~6 ARM64 instructions per loop iteration. 2. TinyGo emits a `mov rd = rs1` per `local.set`; cranelift's SSA collapses those into register renames whereas zwasm's linear-scan regalloc spills them to LDR/STR pairs against `regs[]`. Single-pass-compatible levers, ranked by leverage: - **Loop-preheader magic hoist.** Extend `emitLoopPreHeader` (currently SIMD-only) to scan for `OP_CONST32 K` whose `rd` feeds `OP_DIV_U` / `OP_REM_U` later in the loop body, allocate a callee-saved register, and pre-load the magic. `tryEmitDivByConstU32` short-circuits when the magic is already live. Saves ~6 instructions per iteration on `tgo_strops`. 
Risk: medium — needs to coexist with the existing physical-register layout (`vregToPhys` saturates x20-x26 + x9-x15 fast for high reg_count functions like func#24 with 13 vregs, where no callee-saved register is free without reserving one in the prologue). - **`OP_CONST32` reuse across loop back-edges.** Today `known_consts` is wiped at every header. Skip unless the preheader hoist lands first — saves the 1-instr const itself but not the 2-instr magic that hangs off it. - **`OP_MOV` coalescing in linear-scan regalloc.** Substantial surgery; warrants its own W## entry, not in scope for tonight. Next step: open a focused PR that experiments with the preheader hoist on a minimal JIT regression suite first, and abort if `bench/run_bench.sh --quick` shows a regression elsewhere. Re-record `bench/runtime_comparison.yaml` at 5 runs + 3 warmup before claiming a number — the existing values are single-sample. Captures the diagnosis tonight so the implementation pass can start clean from a verified hypothesis rather than redo the analysis. --- .dev/checklist.md | 82 ++++++++++---------- .dev/memo.md | 63 ++++++++-------- .dev/roadmap.md | 2 +- .dev/w54-investigation.md | 152 ++++++++++++++++++++++++++++++++++++++ 4 files changed, 229 insertions(+), 70 deletions(-) create mode 100644 .dev/w54-investigation.md diff --git a/.dev/checklist.md b/.dev/checklist.md index 00867732d..e40a8a74e 100644 --- a/.dev/checklist.md +++ b/.dev/checklist.md @@ -92,46 +92,54 @@ Prefix: W## (to distinguish from CW's F## items). - [ ] W54: `tgo_strops` (and other div-heavy TinyGo workloads) is ~2.1× slower than wasmtime/cranelift on M4 Pro per `bench/runtime_comparison.yaml` (2026-03-25, runs=1/warmup=0): - zwasm cached 63.2 ms vs wasmtime cached 30.0 ms. Current main - (`448f4c8`) is ~67 ms cached, so the gap is recurring, not a - measurement artefact, and it dwarfs the W47 +15 % post-0.16 - regression. 
The hot loop is TinyGo's `digitCount` — - `i32.div_u 10 + br_if` in a `for v > 0` loop — and the lever is - the constant-divisor optimisation. Strategy stack, ordered by - leverage and single-pass compatibility: - - 1. **Constant-divisor → multiply-high (Hacker's Delight 10-9)** - in predecode. Detect `i32.const K; i32.div_u` (and `rem_u`, - plus `mul K` for power-of-two / shift-add) in a 2-instruction - window, rewrite to a synthetic `udiv_const K` RegIR op. JIT - emits `MOVZ m; UMULH tmp, n, m; LSR result, tmp, s` on - ARM64; `MOV m; MUL r/m32; SHR edx, s` on x86_64. Magic - numbers are pure constants of K, computed once. UDIV is - 8–10 cycles vs UMULH+LSR ~3–4 cycles, so realistic gain on - `digitCount` is ~30–40 ms cached → close to wasmtime parity. - Pure peephole; preserves single-pass. - 2. **Loop-header Q-cache persistence** (existing W45). Detect - back-edges in `scanBranchTargets` and skip the Q-cache - evict at the loop header so the induction var stays in a - register. Helps `tgo_strops`, `tgo_arith`, `tgo_fib_loop`, - `st_nestedloop`. Already designed; cheap to land. - 3. **`br_if` fall-through ordering audit**. cranelift always - places the fall-through arm as the loop continuation so the - branch predictor wins. Confirm `regalloc.zig`'s terminator - emit does the same and mirror it if not. Cheap audit. - 4. **Interpreter dispatch codegen diff** (also closes W47). asm - diff `vm.zig`'s hot dispatch loop between v1.9.1 and main - under Zig 0.16 / LLVM 19. The post-0.16 +15 % most likely - lives here, and a fix would lift every interpreter path, - not just `tgo_strops`. + zwasm cached 63.2 ms vs wasmtime cached 30.0 ms. The original + framing — "constant-divisor folding is missing" — was + **disproven** during the 2026-04-29 evening investigation. 
Both + ARM64 (`src/jit.zig:3582-3666` `tryEmitDivByConstU32`) and + x86_64 (`src/x86.zig`) already emit the Hacker's Delight magic + multiply for `i32.div_u K`; the JIT dump for + `bench/wasm/tgo_string_ops.wasm` shows three MOVZ+MOVK+UMULL+LSR + sequences for the three `i32.div_u 10` sites, with zero `UDIV` + instructions. + + The remaining 2.1× lives in two places: + + 1. The 2-instruction magic-constant load (`MOVZ + MOVK` for + `0xCCCCCCCD`) is re-emitted inside the loop body on every + iteration; cranelift's SSA + GVN hoist it once so only + `UMULH/UMULL + LSR` stay hot. Three div sites in + `tgo_strops` cost ~6 ARM64 instructions per iteration that a + preheader hoist would eliminate. + 2. TinyGo emits a `mov rd = rs1` per `local.set`; cranelift + collapses those into register renames whereas zwasm's + linear-scan regalloc still spills them to LDR/STR against + `regs[]`. + + Single-pass-compatible levers, ranked: + + 1. **Loop-preheader magic hoist.** Extend `emitLoopPreHeader` + (today SIMD-only, `src/jit.zig:4604`) to scan for + `OP_CONST32 K → OP_DIV_U` patterns, allocate a callee-saved + register, and pre-load the magic. `tryEmitDivByConstU32` + short-circuits when the magic is already live. Risk: + medium — needs to coexist with the existing physical- + register layout (functions like `string_ops` (func#24, 13 + vregs) saturate the callee-saved set; the prologue would + have to reserve a free slot up front). + 2. **`OP_CONST32` reuse across loop back-edges.** Today + `known_consts` is wiped at every header. Skip unless (1) + lands — saves the 1-instr const itself but not the 2-instr + magic that hangs off it. + 3. **`OP_MOV` coalescing in linear-scan regalloc.** Substantial + surgery; deserves a separate W## entry. Out of scope (would break single-pass): SSA + dataflow, - global register allocation beyond linear scan, automatic loop - unroll / vectorise. + global register allocation, automatic loop unroll / + vectorise. 
Re-record `runtime_comparison.yaml` at 5 runs / 3 + warmup before claiming any number — the current values are + single-sample. - Measurement note: `runtime_comparison.yaml` is currently runs=1 - / warmup=0 — useful for ordering, not for absolute targets. - Re-record at 5 runs / 3 warmup before claiming a win. + Full investigation log: `@./.dev/w54-investigation.md`. - [ ] W48 Phase 2: Linux binary size 1.56 MB → 1.50 MB (~62 KB more). W48 Phase 1 shipped (2026-04-25): `pub const panic = std.debug.simple_panic` diff --git a/.dev/memo.md b/.dev/memo.md index 309df0656..0f5a66bb0 100644 --- a/.dev/memo.md +++ b/.dev/memo.md @@ -58,38 +58,37 @@ Post-merge bench rows for the C-g merge (`e5766ee`): ## Open work -### 0. **W54** — close the 2.1× wasmtime gap on `tgo_strops` - -`bench/runtime_comparison.yaml` (2026-03-25, commit 65db814, -runs=1/warmup=0) shows zwasm cached 63.2 ms vs wasmtime cached -30.0 ms on `tgo_strops`. Current main (`448f4c8`) is ~67 ms -cached, so the gap is structural — and it dwarfs the W47 +15 % -post-0.16 regression. Bigger leverage than W47. - -Hot loop is TinyGo's `digitCount` — `i32.div_u 10 + br_if` inside -`for v > 0`. The lever is the constant-divisor optimisation that -cranelift does by default. Strategy stack (single-pass-compatible, -ordered by leverage): - -1. **Constant-divisor → multiply-high in predecode**. Two-op - window peephole for `i32.const K; i32.div_u` (and `rem_u`, - `mul K` power-of-two). Synthesise `udiv_const K`; JIT emits - `UMULH + LSR` on ARM64, `MUL + SHR` on x86_64. Magic - numbers pre-computed per K. UDIV ~10 cyc → UMULH ~3 cyc, - so realistic gain ≈ wasmtime parity for div-heavy workloads. -2. **Loop-header Q-cache persistence (W45)**. Skip Q-cache - evict at loop headers so induction vars stay in registers. - Already designed. -3. **`br_if` fall-through audit**. Confirm `regalloc.zig` - places the fall-through arm as the loop-continuation path. -4. **Interpreter dispatch codegen diff** (also closes W47). 
- asm diff `vm.zig` hot dispatch v1.9.1 vs main under - Zig 0.16 / LLVM 19. - -Out of scope (would break single-pass): SSA, global regalloc, -auto unroll / vectorise. Re-record `runtime_comparison.yaml` at -5 runs / 3 warmup before claiming a win — current values are -single-sample. Detailed strategy in `.dev/checklist.md` W54. +### 0. **W54** — investigated; framing reset, implementation deferred + +The original W54 framing — "zwasm doesn't fold `i32.div_u K`, +that's why wasmtime is 2× faster" — was **disproven** during +the 2026-04-29 evening investigation. Both ARM64 and x86_64 +JITs already emit the Hacker's Delight multiply-high for +constant divisors; the JIT dump for `tgo_string_ops` (func#24, +3 div_u sites, magic 0xCCCCCCCD) shows zero UDIV instructions. + +The 2.1× gap actually lives in: + +1. **Magic-constant re-load every iteration** (~6 ARM64 instrs/ + iter for 3 div sites). cranelift's SSA + GVN hoist; zwasm + single-pass cannot without an explicit preheader pass. +2. **mov-heavy RegIR from TinyGo's `local.set`** that the + linear-scan regalloc spills to LDR/STR pairs. + +Best next step (single-pass-compatible) is the loop-preheader +magic hoist — extend `emitLoopPreHeader` (SIMD-only today, +`src/jit.zig:4604`) to scan for `OP_CONST32 K → OP_DIV_U`, +reserve a callee-saved register in the prologue, and pre-load +the magic. Deferred from this session because reserving the +extra callee-saved slot interacts with the physical-register +layout in `vregToPhys` (functions like `string_ops` with 13 +vregs already saturate the callee-saved set, so the hoist +needs the prologue spill machinery to make room) — the design +is well-bounded but invasive enough to warrant its own focused +PR rather than a tail-end commit on a long autonomous run. + +Full investigation: `.dev/w54-investigation.md` (now in main +on PR #90). ### 1. 
**W47** — `tgo_strops_cached` regression with stable harness diff --git a/.dev/roadmap.md b/.dev/roadmap.md index 20bc0206a..901535429 100644 --- a/.dev/roadmap.md +++ b/.dev/roadmap.md @@ -32,7 +32,7 @@ Details: `roadmap-archive.md`. | Windows CI guard removal | Done | W49 (Plan C residuals) + W50 (CI Nix-ify) shipped 2026-04-29 PM. Only `benchmark` Ubuntu-only remains, sequenced behind C-g step 5. | | W53 install-tools.ps1 rust | Done | Root-cause: rustup-init stdout polluting `Install-Rustup`'s return; fix routes through `Out-Host`. CI dropped `-SkipRust`. | | C-g multi-arch bench schema | Done | PR #86 (2026-04-29 eve). Step 5 (3-OS matrix flip + Windows hyperfine + native x86_64 baseline) tracked in `.dev/memo.md` open work. | -| W54 close 2.1× wasmtime gap | Active | Hot lever: constant-divisor → mul-high peephole in predecode. Single-pass-safe. See `.dev/checklist.md` W54 for full strategy. | +| W54 close 2.1× wasmtime gap | Active | Original framing disproven (const-divisor fold is already implemented). Real lever: magic-constant hoist in `emitLoopPreHeader`. See `.dev/w54-investigation.md`. | | Spec test auto-bump | Active | Weekly CI (spec-bump.yml). Review failures. | | wasm-tools tracking | Active | Monthly CI (wasm-tools-bump.yml) | | SpecTec monitoring | Active | Weekly CI (spectec-monitor.yml) | diff --git a/.dev/w54-investigation.md b/.dev/w54-investigation.md new file mode 100644 index 000000000..c57ce90bf --- /dev/null +++ b/.dev/w54-investigation.md @@ -0,0 +1,152 @@ +# W54 — `tgo_strops` 2.1× wasmtime gap investigation + +Captured: 2026-04-29 evening, ship-overnight session. +Status: investigated, no fix shipped yet. + +## What zwasm already does, contrary to the initial hypothesis + +The constant-divisor → multiply-high (Hacker's Delight 10-9) peephole +is **already implemented** for both ARM64 (`src/jit.zig:3582-3666`) +and x86_64 (`src/x86.zig`, `tryEmitDivByConstU32`). 
`known_consts` +tracking in the compile loop sets each vreg's value when an +`OP_CONST32` lands; `emitDiv32` checks the rs2 vreg for a known +constant and falls into `tryEmitDivByConstU32` for non-zero, +non-power-of-two divisors. + +Confirmed empirically on `bench/wasm/tgo_string_ops.wasm`: + +``` +$ ./zig-out/bin/zwasm run --invoke string_ops bench/wasm/tgo_string_ops.wasm \ + --dump-jit=24 10000 | python3 ... +0x0f8: MOVZ X16, #0xCCCD ← magic for /10 (low half) +0x0fc: MOVK X16, #0xCCCC, lsl 16 ← magic for /10 (high half) +0x100: UMULL X8, W22, W16 ← multiply-high +0x104: LSR X8, X8, #35 ← extract quotient +``` + +Three identical 5-instruction sequences for the three `i32.div_u 10` +sites in `string_ops`. **Zero `UDIV` instructions emitted.** + +So the original W54 hypothesis ("zwasm doesn't fold constant divisors, +that's why wasmtime is 2× faster") is wrong — the fold is done. + +## Where the 2× gap actually lives + +### a. Magic constant re-loaded every iteration + +Each `i32.div_u 10` inside `digitCount`'s loop body re-emits the full +2-instruction MOVZ+MOVK pair to materialise `0xCCCCCCCD`. Cranelift's +SSA + GVN hoist that load to before the loop, leaving only UMULL+LSR +inside the hot path. With three div sites in `tgo_strops`, the +hoistable cost is **6 instructions per loop iteration**. + +### b. mov-heavy RegIR + +TinyGo's `digitCount` body in RegIR (function 24, PCs 21..30): + +``` +[022] add r8 = r2, r7 ; counter + 1 +[023] mov r2 = r8 +[024] const32 r8 = 9 +[025] gt_u r8 = r0, r8 +[026] mov r6 = r8 ← cond temp +[027] const32 r8 = 10 +[028] div_u r8 = r0, r8 ← 5 instrs after const-folding +[029] mov r0 = r8 +[030] br_if r6 -> pc=21 +``` + +Roughly 9 RegIR instructions, but the JIT emits ~17 ARM64 instructions +because each `mov` becomes an LDR/STR pair against the `regs[]` +spill area when the vreg is not currently held in a physical +register, and the const-divisor sequence is 5 instructions. 
+ +Cranelift's SSA collapses every `mov rA = rB` plus the redundant +counter/temp stores into pure register renames. zwasm's linear-scan +regalloc cannot do that without an additional pass. + +### c. Single-pass constraint + +Both fixes (magic-constant hoist, mov coalescing) require either +loop-aware analysis or a second pass over the RegIR — which the +project's design constraint (single-pass JIT to keep cold start +cheap) excludes by default. + +## Realistic single-pass-compatible wins + +Ordered by leverage and implementation risk: + +1. **Loop-preheader magic hoist.** Extend the existing + `emitLoopPreHeader` (currently SIMD-only, + `src/jit.zig:4604`) to scan the loop body for + `OP_CONST32 K` instructions whose `rd` is later consumed by an + `OP_DIV_U` / `OP_REM_U`. Allocate a callee-saved register, emit + the magic constant once before the loop, and have the in-loop + `emitDiv32` skip its MOVZ+MOVK if the magic is already live. + Saves ~6 instructions per iteration on `tgo_strops`. Risk: + medium — needs careful tracking of which scratch register holds + which magic across the loop body, and the back-edge logic must + not invalidate the cache. + + **Register layout interaction (caught during the abandoned + experimental attempt on `develop/w54-magic-hoist-attempt`).** + The obvious choice for the magic register is `x21`, which is + only handed out by `vregToPhys` when `reg_count >= 14`. But + `x21` is *also* the dedicated `inst_ptr` cache slot whenever + `reg_count <= 13 && has_self_calls` (see `src/jit.zig:1129`, + field `inst_ptr_cached`). Both states overlap on a real slice + of the corpus, so the hoist needs to either + (a) skip the optimisation whenever `inst_ptr_cached` is true + (smallest, safest patch — gives up the optimisation on + self-calling functions with ≤13 vregs); + (b) extend the prologue to reserve an additional callee-saved + slot (e.g. 
push `x23` or pick from the unused tail of the + STP-pair set when `reg_count` is small) and thread that + through the existing layout machinery (more invasive). + x86_64 has its own version of this dance (`r13`/`r14` are the + common candidates); a clean implementation should keep the + register choice in an arch-specific helper rather than a + shared constant. + + Tonight's experimental branch was abandoned at this point + because picking the right safety boundary for the register + choice is itself a design call worth daylight, not a + tail-end commit. + +2. **`OP_CONST32` reuse across loop back-edges.** Today + `known_consts` is wiped at every loop header and back-edge. + For consts whose `rd` is rewritten consistently to the same + value on every iteration, we could keep the entry alive: emit + only the first iteration's MOVZ+MOVK, then `nop` (or branch + over) the const-load on subsequent iterations. Less leverage + than (1) because OP_CONST32 itself is only 1 instruction — + what (1) saves is the magic computation that hangs off the + const, not the const itself. Skip unless (1) lands. + +3. **`OP_MOV` coalescing in regalloc.** When a `mov rd = rs1` has + `rs1` dead-after-this-point, rewrite the producer of `rs1` to + write directly into `rd`. Needs liveness, which today is + computed only as `written_vregs`. Substantial regalloc surgery + — likely a separate W## item, not in scope for tonight. + +## What was attempted in this session + +- Confirmed the const-divisor fold triggers (above). +- Decoded the JIT'd bytes for func#24 to see the actual emitted + ARM64. Identified MOVZ+MOVK+UMULL+LSR as the 5-instruction + per-div-site cost. +- Did **not** attempt the loop-preheader magic hoist or mov + coalescing — both warrant their own design pass with the + spec/regalloc tests as the safety net, not an overnight commit. 
+ +## Recommended next step + +Open a separate `develop/w54-loop-magic-hoist` branch and +prototype the preheader hoist (item 1 above) against a minimal +JIT regression suite first. If `bench/run_bench.sh --quick` shows +≥15 % improvement on `tgo_strops` with no regression elsewhere, +land it. Otherwise revert and capture the dead-end here. + +Re-record `bench/runtime_comparison.yaml` at 5 runs / 3 warmup +before claiming a number — the current single-sample values are +useful for ordering but not for absolute targets.
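
A quick sanity check of the magic constant seen in the JIT dump can be sketched in Python. This is an illustrative sketch only, not zwasm code: the names `MAGIC_SHIFT`, `MAGIC_DIV10`, and `div10_via_magic` are hypothetical, and it assumes the emitted sequence computes a full 32×32→64-bit product (UMULL) followed by a logical shift right by 35 (LSR), as the dump shows.

```python
# Sketch: reproduce the /10 magic from the dump and check it against plain
# integer division. All names here are illustrative, not taken from zwasm.

MAGIC_SHIFT = 35  # matches `LSR X8, X8, #35` in the dump

# ceil(2**35 / 10), written with integer arithmetic only
MAGIC_DIV10 = -(-(1 << MAGIC_SHIFT) // 10)


def div10_via_magic(n: int) -> int:
    """Unsigned 32-bit n / 10 as a widening multiply plus a right shift."""
    if not 0 <= n < (1 << 32):
        raise ValueError("n must fit in an unsigned 32-bit integer")
    product = n * MAGIC_DIV10       # UMULL: 32x32 -> 64-bit product
    return product >> MAGIC_SHIFT   # LSR #35: extract the quotient


if __name__ == "__main__":
    print(hex(MAGIC_DIV10))
    for n in (0, 9, 10, 99, 12345, (1 << 32) - 1):
        print(n, div10_via_magic(n), n // 10)
```

The magic depends only on the divisor, not on the dividend, which is why it is a candidate for being loaded once before the loop rather than on every iteration.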