docs(w54): investigation - gap is post-fold loop overhead, not missing fold (#90)
Merged
The original W54 framing assumed zwasm did not constant-fold `i32.div_u K`, leaving cranelift's multiply-high optimisation unmatched. That is wrong: the fold is already implemented for both ARM64 (`src/jit.zig:3582-3666`) and x86_64 (`src/x86.zig` `tryEmitDivByConstU32`). Dumping the JIT for `bench/wasm/tgo_string_ops.wasm` confirms three identical MOVZ+MOVK+UMULL+LSR sequences for the three `i32.div_u 10` sites, with zero `UDIV` instructions emitted.

The actual gap lives in two places:

1. The 2-instruction magic-constant load (MOVZ + MOVK for 0xCCCCCCCD) is re-emitted inside the loop body on every iteration; cranelift hoists it via SSA + GVN so only UMULL+LSR stay in the hot path. With three div sites in `tgo_strops` that costs ~6 ARM64 instructions per loop iteration.
2. TinyGo emits a `mov rd = rs1` per `local.set`; cranelift's SSA collapses those into register renames whereas zwasm's linear-scan regalloc spills them to LDR/STR pairs against `regs[]`.

Single-pass-compatible levers, ranked by leverage:

- **Loop-preheader magic hoist.** Extend `emitLoopPreHeader` (currently SIMD-only) to scan for `OP_CONST32 K` whose `rd` feeds `OP_DIV_U` / `OP_REM_U` later in the loop body, allocate a callee-saved register, and pre-load the magic. `tryEmitDivByConstU32` short-circuits when the magic is already live. Saves ~6 instructions per iteration on `tgo_strops`. Risk: medium. It needs to coexist with the existing physical-register layout (`vregToPhys` saturates x20-x26 + x9-x15 fast for high reg_count functions like func#24 with 13 vregs, where no callee-saved register is free without reserving one in the prologue).
- **`OP_CONST32` reuse across loop back-edges.** Today `known_consts` is wiped at every header. Skip unless the preheader hoist lands first: it saves the 1-instr const itself but not the 2-instr magic that hangs off it.
- **`OP_MOV` coalescing in linear-scan regalloc.** Substantial surgery; warrants its own W## entry, not in scope for tonight.
Next step: open a focused PR that experiments with the preheader hoist on a minimal JIT regression suite first, and abort if `bench/run_bench.sh --quick` shows a regression elsewhere. Re-record `bench/runtime_comparison.yaml` at 5 runs + 3 warmup before claiming a number — the existing values are single-sample. Captures the diagnosis tonight so the implementation pass can start clean from a verified hypothesis rather than redo the analysis.
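The "5 runs + 3 warmup" protocol mentioned above can be sketched as a small harness. This is only an illustration of the sampling scheme: `measure` and its parameters are hypothetical names, and it does not reproduce what `bench/run_bench.sh` actually does.

```python
import statistics
import time
from typing import Callable


def measure(workload: Callable[[], None], warmup: int = 3, runs: int = 5) -> dict:
    """Time `workload` after discarding warmup runs; report several samples.

    Hypothetical helper: the defaults mirror the "5 runs + 3 warmup"
    wording, nothing more.
    """
    for _ in range(warmup):
        workload()  # warmup iterations are executed but not recorded

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)

    return {
        "samples": samples,
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    result = measure(lambda: sum(range(100_000)))
    print(f"median={result['median']:.6f}s over {len(result['samples'])} runs")
```

Reporting the median together with min and max makes it visible when a claimed speedup is smaller than the run-to-run spread, which a single-sample value cannot show.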
chaploud added a commit that referenced this pull request on Apr 29, 2026: …post-fold loop overhead (#90)
Summary
The original W54 framing assumed zwasm was missing the constant-divisor fold that cranelift uses (multiply-high reciprocal). That assumption was wrong: both ARM64 (`src/jit.zig:3582-3666` `tryEmitDivByConstU32`) and x86_64 (`src/x86.zig`) already implement the Hacker's Delight 10-9 magic-multiply, and dumping the JIT for `bench/wasm/tgo_string_ops.wasm` confirms three identical `MOVZ + MOVK + UMULL + LSR` sequences for the three `i32.div_u 10` sites, with zero `UDIV` instructions.

This PR captures that diagnosis so the next implementation pass can start clean from a verified hypothesis.
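For readers unfamiliar with the fold, the arithmetic behind the `MOVZ + MOVK + UMULL + LSR` sequence can be checked numerically. The sketch below is not zwasm code; it assumes the standard pairing of the magic `0xCCCCCCCD` with a right shift of 35 for an unsigned 32-bit divide by 10 (the shift amount is an assumption here, since the dump above names only the instructions).

```python
# Magic constant for unsigned 32-bit division by 10: ceil(2**35 / 10).
MAGIC_DIV10 = 0xCCCCCCCD
# Assumed shift: 32 (upper half of the 64-bit product) + 3.
SHIFT_DIV10 = 35


def div_u10(x: int) -> int:
    """Return x // 10 for a 32-bit unsigned x using multiply + shift."""
    assert 0 <= x < 2**32
    product = x * MAGIC_DIV10       # models UMULL: 32x32 -> 64-bit product
    return product >> SHIFT_DIV10   # models LSR #35


def check(samples) -> bool:
    """True if the multiply-shift result matches real division on all samples."""
    return all(div_u10(x) == x // 10 for x in samples)


if __name__ == "__main__":
    edge_cases = [0, 1, 9, 10, 11, 99, 100, 2**31 - 1, 2**31, 2**32 - 1]
    coarse_sweep = range(0, 2**32, 65_521)  # prime stride across the u32 range
    print(check(edge_cases) and check(coarse_sweep))
```

The two instructions that load `MAGIC_DIV10` are loop-invariant, which is why hoisting them is attractive: only the multiply and the shift depend on `x`.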
What the 2.1× gap actually is
1. The 2-instruction magic-constant load (`MOVZ` + `MOVK`) is re-emitted inside the loop body on every iteration (~6 ARM64 instructions per iteration for the three div sites in `tgo_strops`). Cranelift's SSA + GVN hoist it; zwasm's single-pass cannot without a preheader pass.
2. TinyGo's `local.set` → `mov rd = rs1` stays in the JIT'd code, where cranelift collapses it into a register rename via SSA.

Single-pass-compatible levers
1. Loop-preheader magic hoist: extend `emitLoopPreHeader` (today SIMD-only, `src/jit.zig:4604`) to scan for `OP_CONST32 K → OP_DIV_U`, allocate a callee-saved register (the prologue already spills x19–x28 unconditionally; functions with ≤13 vregs have `x21` free without prologue surgery), and pre-load the magic. `tryEmitDivByConstU32` short-circuits when the magic is already live. Saves ~6 instructions per iteration on `tgo_strops`.
2. `OP_CONST32` reuse across loop back-edges: saves the 1-instr const itself but not the magic that hangs off it. Skip unless (1) lands.
3. `OP_MOV` coalescing in linear-scan regalloc: substantial surgery; deserves its own W## entry.

Out of scope (would break single-pass): SSA + dataflow, global regalloc, auto unroll / vectorise.
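The scan described in lever (1) can be modelled on a toy IR. Everything in this sketch is hypothetical (the `Instr` layout, the opcode strings, `div_sites`, `hoist_plan`); it only illustrates the idea of one forward pass that tracks which vregs hold a constant and records the divide sites that would benefit from a hoisted magic. The estimate of two saved instructions per site assumes the magic is loaded once in the preheader into a dedicated register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    op: str       # toy opcode: "CONST32", "DIV_U", "REM_U", or anything else
    rd: int       # destination vreg
    rs1: int = 0  # first source vreg (the immediate K for CONST32)
    rs2: int = 0  # second source vreg (the divisor for DIV_U / REM_U)


def div_sites(body: list[Instr]) -> list[tuple[int, int]]:
    """Return (index, K) for each DIV_U/REM_U whose divisor is a known constant."""
    known: dict[int, int] = {}  # vreg -> constant it currently holds
    sites: list[tuple[int, int]] = []
    for i, ins in enumerate(body):
        if ins.op == "CONST32":
            known[ins.rd] = ins.rs1
            continue
        if ins.op in ("DIV_U", "REM_U") and ins.rs2 in known:
            k = known[ins.rs2]
            # Assumption: 0, 1 and powers of two need no magic constant.
            if k > 1 and (k & (k - 1)) != 0:
                sites.append((i, k))
        known.pop(ins.rd, None)  # any non-constant write invalidates the vreg
    return sites


def hoist_plan(body: list[Instr]) -> dict:
    """Summarise which divisors to pre-load and the per-iteration saving."""
    sites = div_sites(body)
    return {
        "divisors": sorted({k for _, k in sites}),
        # Two instructions (MOVZ + MOVK) no longer emitted at each site.
        "saved_per_iteration": 2 * len(sites),
    }


if __name__ == "__main__":
    loop_body = [
        Instr("CONST32", rd=5, rs1=10), Instr("DIV_U", rd=6, rs1=1, rs2=5),
        Instr("CONST32", rd=5, rs1=10), Instr("REM_U", rd=7, rs1=1, rs2=5),
        Instr("CONST32", rd=5, rs1=10), Instr("DIV_U", rd=8, rs1=6, rs2=5),
    ]
    print(hoist_plan(loop_body))
```

With three sites sharing the divisor 10, the plan needs a single hoisted register, which is the case where the register-pressure concern is smallest.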
Why no implementation in this PR
The hoist is well-bounded but invasive enough that it warrants a focused PR rather than a tail-end commit on a long autonomous run. Edge cases that need design-level attention before the first commit:
- Functions with `reg_count ≥ 14` (the obvious x21 slot is occupied; need to either skip the optimisation or pick a caller-saved register and verify the loop has no calls).

Test plan
- `.dev/w54-investigation.md`