Merged
82 changes: 45 additions & 37 deletions .dev/checklist.md
@@ -92,46 +92,54 @@ Prefix: W## (to distinguish from CW's F## items).
- [ ] W54: `tgo_strops` (and other div-heavy TinyGo workloads) is
~2.1× slower than wasmtime/cranelift on M4 Pro per
`bench/runtime_comparison.yaml` (2026-03-25, runs=1/warmup=0):
zwasm cached 63.2 ms vs wasmtime cached 30.0 ms. Current main
(`448f4c8`) is ~67 ms cached, so the gap is recurring, not a
measurement artefact, and it dwarfs the W47 +15 % post-0.16
regression. The hot loop is TinyGo's `digitCount` —
`i32.div_u 10 + br_if` in a `for v > 0` loop — and the lever is
the constant-divisor optimisation. Strategy stack, ordered by
leverage and single-pass compatibility:

1. **Constant-divisor → multiply-high (Hacker's Delight 10-9)**
in predecode. Detect `i32.const K; i32.div_u` (and `rem_u`,
plus `mul K` for power-of-two / shift-add) in a 2-instruction
window, rewrite to a synthetic `udiv_const K` RegIR op. JIT
emits `MOVZ m; UMULH tmp, n, m; LSR result, tmp, s` on
ARM64; `MOV m; MUL r/m32; SHR edx, s` on x86_64. Magic
numbers are pure constants of K, computed once. UDIV is
8–10 cycles vs UMULH+LSR ~3–4 cycles, so realistic gain on
`digitCount` is ~30–40 ms cached → close to wasmtime parity.
Pure peephole; preserves single-pass.
2. **Loop-header Q-cache persistence** (existing W45). Detect
back-edges in `scanBranchTargets` and skip the Q-cache
evict at the loop header so the induction var stays in a
register. Helps `tgo_strops`, `tgo_arith`, `tgo_fib_loop`,
`st_nestedloop`. Already designed; cheap to land.
3. **`br_if` fall-through ordering audit**. cranelift always
places the fall-through arm as the loop continuation so the
branch predictor wins. Confirm `regalloc.zig`'s terminator
emit does the same and mirror it if not. Cheap audit.
4. **Interpreter dispatch codegen diff** (also closes W47). asm
diff `vm.zig`'s hot dispatch loop between v1.9.1 and main
under Zig 0.16 / LLVM 19. The post-0.16 +15 % most likely
lives here, and a fix would lift every interpreter path,
not just `tgo_strops`.
zwasm cached 63.2 ms vs wasmtime cached 30.0 ms. The original
framing — "constant-divisor folding is missing" — was
**disproven** during the 2026-04-29 evening investigation. Both
ARM64 (`src/jit.zig:3582-3666` `tryEmitDivByConstU32`) and
x86_64 (`src/x86.zig`) already emit the Hacker's Delight magic
multiply for `i32.div_u K`; the JIT dump for
`bench/wasm/tgo_string_ops.wasm` shows three MOVZ+MOVK+UMULL+LSR
sequences for the three `i32.div_u 10` sites, with zero `UDIV`
instructions.

The remaining 2.1× lives in two places:

1. The 2-instruction magic-constant load (`MOVZ + MOVK` for
`0xCCCCCCCD`) is re-emitted inside the loop body on every
iteration; cranelift's SSA + GVN hoist it once so only
`UMULH/UMULL + LSR` stay hot. Three div sites in
`tgo_strops` cost ~6 ARM64 instructions per iteration that a
preheader hoist would eliminate.
2. TinyGo emits a `mov rd = rs1` per `local.set`; cranelift
collapses those into register renames whereas zwasm's
linear-scan regalloc still spills them to LDR/STR against
`regs[]`.

Single-pass-compatible levers, ranked:

1. **Loop-preheader magic hoist.** Extend `emitLoopPreHeader`
(today SIMD-only, `src/jit.zig:4604`) to scan for
`OP_CONST32 K → OP_DIV_U` patterns, allocate a callee-saved
register, and pre-load the magic. `tryEmitDivByConstU32`
short-circuits when the magic is already live. Risk:
medium — needs to coexist with the existing physical-
register layout (functions like `string_ops` (func#24, 13
vregs) saturate the callee-saved set; the prologue would
have to reserve a free slot up front).
2. **`OP_CONST32` reuse across loop back-edges.** Today
`known_consts` is wiped at every header. Skip unless (1)
lands — saves the 1-instr const itself but not the 2-instr
magic that hangs off it.
3. **`OP_MOV` coalescing in linear-scan regalloc.** Substantial
surgery; deserves a separate W## entry.

Out of scope (would break single-pass): SSA + dataflow,
global register allocation beyond linear scan, automatic loop
unroll / vectorise.
global register allocation, automatic loop unroll /
vectorise. Re-record `runtime_comparison.yaml` at 5 runs / 3
warmup before claiming any number — the current values are
single-sample.

Measurement note: `runtime_comparison.yaml` is currently runs=1
/ warmup=0 — useful for ordering, not for absolute targets.
Re-record at 5 runs / 3 warmup before claiming a win.
Full investigation log: `@./.dev/w54-investigation.md`.

- [ ] W48 Phase 2: Linux binary size 1.56 MB → 1.50 MB (~62 KB more).
W48 Phase 1 shipped (2026-04-25): `pub const panic = std.debug.simple_panic`
63 changes: 31 additions & 32 deletions .dev/memo.md
@@ -58,38 +58,37 @@ Post-merge bench rows for the C-g merge (`e5766ee`):

## Open work

### 0. **W54** — close the 2.1× wasmtime gap on `tgo_strops`

`bench/runtime_comparison.yaml` (2026-03-25, commit 65db814,
runs=1/warmup=0) shows zwasm cached 63.2 ms vs wasmtime cached
30.0 ms on `tgo_strops`. Current main (`448f4c8`) is ~67 ms
cached, so the gap is structural — and it dwarfs the W47 +15 %
post-0.16 regression. Bigger leverage than W47.

Hot loop is TinyGo's `digitCount` — `i32.div_u 10 + br_if` inside
`for v > 0`. The lever is the constant-divisor optimisation that
cranelift does by default. Strategy stack (single-pass-compatible,
ordered by leverage):

1. **Constant-divisor → multiply-high in predecode**. Two-op
window peephole for `i32.const K; i32.div_u` (and `rem_u`,
`mul K` power-of-two). Synthesise `udiv_const K`; JIT emits
`UMULH + LSR` on ARM64, `MUL + SHR` on x86_64. Magic
numbers pre-computed per K. UDIV ~10 cyc → UMULH ~3 cyc,
so realistic gain ≈ wasmtime parity for div-heavy workloads.
2. **Loop-header Q-cache persistence (W45)**. Skip Q-cache
evict at loop headers so induction vars stay in registers.
Already designed.
3. **`br_if` fall-through audit**. Confirm `regalloc.zig`
places the fall-through arm as the loop-continuation path.
4. **Interpreter dispatch codegen diff** (also closes W47).
asm diff `vm.zig` hot dispatch v1.9.1 vs main under
Zig 0.16 / LLVM 19.

Out of scope (would break single-pass): SSA, global regalloc,
auto unroll / vectorise. Re-record `runtime_comparison.yaml` at
5 runs / 3 warmup before claiming a win — current values are
single-sample. Detailed strategy in `.dev/checklist.md` W54.
### 0. **W54** — investigated; framing reset, implementation deferred

The original W54 framing — "zwasm doesn't fold `i32.div_u K`,
that's why wasmtime is 2× faster" — was **disproven** during
the 2026-04-29 evening investigation. Both ARM64 and x86_64
JITs already emit the Hacker's Delight multiply-high for
constant divisors; the JIT dump for `tgo_string_ops` (func#24,
3 div_u sites, magic 0xCCCCCCCD) shows zero UDIV instructions.

The 2.1× gap actually lives in:

1. **Magic-constant re-load every iteration** (~6 ARM64 instrs/
iter for 3 div sites). cranelift's SSA + GVN hoist; zwasm
single-pass cannot without an explicit preheader pass.
2. **mov-heavy RegIR from TinyGo's `local.set`** that the
linear-scan regalloc spills to LDR/STR pairs.

Best next step (single-pass-compatible) is the loop-preheader
magic hoist — extend `emitLoopPreHeader` (SIMD-only today,
`src/jit.zig:4604`) to scan for `OP_CONST32 K → OP_DIV_U`,
reserve a callee-saved register in the prologue, and pre-load
the magic. Deferred from this session because reserving the
extra callee-saved slot interacts with the physical-register
layout in `vregToPhys` (functions like `string_ops` with 13
vregs already saturate the callee-saved set, so the hoist
needs the prologue spill machinery to make room) — the design
is well-bounded but invasive enough to warrant its own focused
PR rather than a tail-end commit on a long autonomous run.

Full investigation: `.dev/w54-investigation.md` (now in main
on PR #90).

### 1. **W47** — `tgo_strops_cached` regression with stable harness

2 changes: 1 addition & 1 deletion .dev/roadmap.md
@@ -32,7 +32,7 @@ Details: `roadmap-archive.md`.
| Windows CI guard removal | Done | W49 (Plan C residuals) + W50 (CI Nix-ify) shipped 2026-04-29 PM. Only `benchmark` Ubuntu-only remains, sequenced behind C-g step 5. |
| W53 install-tools.ps1 rust | Done | Root-cause: rustup-init stdout polluting `Install-Rustup`'s return; fix routes through `Out-Host`. CI dropped `-SkipRust`. |
| C-g multi-arch bench schema | Done | PR #86 (2026-04-29 eve). Step 5 (3-OS matrix flip + Windows hyperfine + native x86_64 baseline) tracked in `.dev/memo.md` open work. |
| W54 close 2.1× wasmtime gap | Active | Hot lever: constant-divisor → mul-high peephole in predecode. Single-pass-safe. See `.dev/checklist.md` W54 for full strategy. |
| W54 close 2.1× wasmtime gap | Active | Original framing disproven (const-divisor fold is already implemented). Real lever: magic-constant hoist in `emitLoopPreHeader`. See `.dev/w54-investigation.md`. |
| Spec test auto-bump | Active | Weekly CI (spec-bump.yml). Review failures. |
| wasm-tools tracking | Active | Monthly CI (wasm-tools-bump.yml) |
| SpecTec monitoring | Active | Weekly CI (spectec-monitor.yml) |
152 changes: 152 additions & 0 deletions .dev/w54-investigation.md
@@ -0,0 +1,152 @@
# W54 — `tgo_strops` 2.1× wasmtime gap investigation

Captured: 2026-04-29 evening, ship-overnight session.
Status: investigated, no fix shipped yet.

## What zwasm already does, contrary to the initial hypothesis

The constant-divisor → multiply-high (Hacker's Delight 10-9) peephole
is **already implemented** for both ARM64 (`src/jit.zig:3582-3666`)
and x86_64 (`src/x86.zig`, `tryEmitDivByConstU32`). `known_consts`
tracking in the compile loop sets each vreg's value when an
`OP_CONST32` lands; `emitDiv32` checks the rs2 vreg for a known
constant and falls into `tryEmitDivByConstU32` for non-zero,
non-power-of-two divisors.
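
The eligibility rule in the paragraph above can be modelled in a few lines. This is a hedged Python sketch, not the Zig in `src/jit.zig`: the dict-based `known_consts` and the helper name `takes_magic_path` are illustrative assumptions, and only the rule itself (known, non-zero, non-power-of-two) comes from the text.

```python
def is_power_of_two(k: int) -> bool:
    """True for 1, 2, 4, 8, ... (k must be positive)."""
    return k > 0 and (k & (k - 1)) == 0


def takes_magic_path(known_consts: dict, rs2: int) -> bool:
    """Decide whether a div_u whose divisor lives in vreg `rs2` is eligible
    for the multiply-by-magic sequence.

    `known_consts` maps vreg index -> constant value and is a hypothetical
    stand-in for the tracking filled in when a const32 is compiled.
    """
    k = known_consts.get(rs2)
    if k is None:
        return False  # divisor is only known at run time
    if k == 0:
        return False  # division by zero still has to trap
    if is_power_of_two(k):
        return False  # excluded by the rule; a shift is the usual alternative (assumption)
    return True


known = {8: 10, 9: 16, 7: 0}
print(takes_magic_path(known, 8))  # divisor 10
print(takes_magic_path(known, 9))  # divisor 16, a power of two
print(takes_magic_path(known, 3))  # r3 has no known constant
```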

Confirmed empirically on `bench/wasm/tgo_string_ops.wasm`:

```
$ ./zig-out/bin/zwasm run --invoke string_ops bench/wasm/tgo_string_ops.wasm \
--dump-jit=24 10000 | python3 ...
0x0f8: MOVZ X16, #0xCCCD ← magic for /10 (low half)
0x0fc: MOVK X16, #0xCCCC, lsl 16 ← magic for /10 (high half)
0x100: UMULL X8, W22, W16 ← 32×32→64 multiply by the magic
0x104: LSR X8, X8, #35 ← shift by 32 + 3 extracts the quotient
```

Three identical 4-instruction sequences of this shape cover the three
`i32.div_u 10` sites in `string_ops`. **Zero `UDIV` instructions emitted.**

So the original W54 hypothesis ("zwasm doesn't fold constant divisors,
that's why wasmtime is 2× faster") is wrong — the fold is done.
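
The magic pair in the dump can be checked arithmetically. The sketch below is a minimal Python model (the function name `magic_u32` is hypothetical, not a zwasm symbol) that derives the multiplier and shift for an unsigned 32-bit divisor using the standard "smallest shift whose rounding error is small enough" condition, then confirms that `0xCCCCCCCD` with a shift of 35 reproduces `n // 10`.

```python
import random


def magic_u32(d: int) -> tuple:
    """Return (m, shift) with (n * m) >> shift == n // d for all 0 <= n < 2**32.

    Picks the smallest s such that m = ceil(2**(32 + s) / d) overshoots
    2**(32 + s) / d by at most 2**s in the product m * d, which is a
    sufficient condition for exactness. For some divisors (7, for example)
    m needs 33 bits; for 10 it fits in 32.
    """
    if d < 3 or d >= 2**32 or (d & (d - 1)) == 0:
        raise ValueError("expects a 32-bit divisor that is not 0, 1 or a power of two")
    for s in range(33):
        total = 1 << (32 + s)
        err = (-total) % d  # equals m * d - 2**(32 + s)
        if err <= (1 << s):
            return (total + err) // d, 32 + s
    raise AssertionError("unreachable: s = 32 always satisfies the condition")


m, shift = magic_u32(10)
print(hex(m), shift)  # 0xcccccccd 35

# Spot-check edge cases plus a seeded random sample of u32 values.
rng = random.Random(0)
samples = [0, 1, 9, 10, 11, 99, 100, 2**31 - 1, 2**31, 2**32 - 10, 2**32 - 1]
samples += [rng.randrange(2**32) for _ in range(10_000)]
assert all((n * m) >> shift == n // 10 for n in samples)
```

The 64-bit product and the 35-bit right shift correspond to the `UMULL` and `LSR #35` lines in the dump.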

## Where the 2× gap actually lives

### a. Magic constant re-loaded every iteration

Each `i32.div_u 10` inside `digitCount`'s loop body re-emits the full
2-instruction MOVZ+MOVK pair to materialise `0xCCCCCCCD`. Cranelift's
SSA + GVN hoist that load to before the loop, leaving only UMULL+LSR
inside the hot path. With three div sites in `tgo_strops`, the
hoistable cost is **6 instructions per loop iteration**.
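
The count can be reproduced with a toy model of the emitted sequence. This Python sketch only counts the divide-related instructions shown in the dump; it assumes (as the text implies) that all three sites share one magic and sit inside the same loop. The names `loop_body` and `preheader` are illustrative.

```python
MAGIC_LOAD = ["MOVZ", "MOVK"]  # materialise 0xCCCCCCCD
DIVIDE = ["UMULL", "LSR"]      # multiply, then shift out the quotient


def loop_body(div_sites: int, hoisted: bool) -> list:
    """Divide-related instructions executed per loop iteration."""
    body = []
    for _ in range(div_sites):
        if not hoisted:
            body += MAGIC_LOAD  # re-emitted at every site without a hoist
        body += DIVIDE
    return body


def preheader(hoisted: bool) -> list:
    """Instructions executed once, before the loop is entered."""
    return list(MAGIC_LOAD) if hoisted else []


sites = 3
before = len(loop_body(sites, hoisted=False))
after = len(loop_body(sites, hoisted=True))
print(before, after, before - after)  # 12 6 6
print(len(preheader(True)))           # 2
```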

### b. mov-heavy RegIR

TinyGo's `digitCount` body in RegIR (function 24; PCs 22..30 shown, looping back to pc=21):

```
[022] add r8 = r2, r7 ; counter + 1
[023] mov r2 = r8
[024] const32 r8 = 9
[025] gt_u r8 = r0, r8
[026] mov r6 = r8 ← cond temp
[027] const32 r8 = 10
[028] div_u r8 = r0, r8 ← 5 instrs after const-folding
[029] mov r0 = r8
[030] br_if r6 -> pc=21
```

Roughly 9 RegIR instructions, but the JIT emits ~17 ARM64 instructions
because each `mov` becomes an LDR/STR pair against the `regs[]`
spill area when the vreg is not currently held in a physical
register, and the const-divisor sequence is 5 instructions.

Cranelift's SSA collapses every `mov rA = rB` plus the redundant
counter/temp stores into pure register renames. zwasm's linear-scan
regalloc cannot do that without an additional pass.
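
To make the "register rename" effect concrete, here is a hedged Python sketch of mov coalescing applied to the listing above. It is not cranelift's algorithm or zwasm's regalloc: `Instr`, `dead_after` and `coalesce_movs` are hypothetical names, the block is treated as straight-line code, and `r8` is assumed not to be live across the back-edge.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    op: str
    rd: Optional[str]  # destination register, None for branches
    srcs: tuple        # register names (str) or immediates (int)


def dead_after(block, i, reg, live_out):
    """True if `reg` is overwritten before it is read again after index i."""
    for ins in block[i + 1:]:
        if reg in ins.srcs:
            return False
        if ins.rd == reg:
            return True
    return reg not in live_out


def coalesce_movs(block, live_out=frozenset()):
    """Fold `mov rd = rs` into the instruction that just produced rs,
    provided rs is dead once the mov has executed."""
    out = []
    for i, ins in enumerate(block):
        prev = out[-1] if out else None
        if (ins.op == "mov" and prev is not None and prev.rd == ins.srcs[0]
                and dead_after(block, i, ins.srcs[0], live_out)):
            out[-1] = prev._replace(rd=ins.rd)  # producer writes rd directly
            continue                            # the mov disappears
        out.append(ins)
    return out


BLOCK = [
    Instr("add", "r8", ("r2", "r7")),
    Instr("mov", "r2", ("r8",)),
    Instr("const32", "r8", (9,)),
    Instr("gt_u", "r8", ("r0", "r8")),
    Instr("mov", "r6", ("r8",)),
    Instr("const32", "r8", (10,)),
    Instr("div_u", "r8", ("r0", "r8")),
    Instr("mov", "r0", ("r8",)),
    Instr("br_if", None, ("r6",)),
]

result = coalesce_movs(BLOCK)
print(len(BLOCK), "->", len(result))  # 9 -> 6
for ins in result:
    print(ins.op, ins.rd, ins.srcs)
```

The `dead_after` check is the liveness question a single-pass compiler cannot answer without looking ahead.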

### c. Single-pass constraint

Both fixes (magic-constant hoist, mov coalescing) require either
loop-aware analysis or a second pass over the RegIR — which the
project's design constraint (single-pass JIT to keep cold start
cheap) excludes by default.

## Realistic single-pass-compatible wins

Ordered by leverage and implementation risk:

1. **Loop-preheader magic hoist.** Extend the existing
`emitLoopPreHeader` (currently SIMD-only,
`src/jit.zig:4604`) to scan the loop body for
`OP_CONST32 K` instructions whose `rd` is later consumed by an
`OP_DIV_U` / `OP_REM_U`. Allocate a callee-saved register, emit
the magic constant once before the loop, and have the in-loop
`emitDiv32` skip its MOVZ+MOVK if the magic is already live.
Saves ~6 instructions per iteration on `tgo_strops`. Risk:
medium — needs careful tracking of which scratch register holds
which magic across the loop body, and the back-edge logic must
not invalidate the cache.

**Register layout interaction (caught during the abandoned
experimental attempt on `develop/w54-magic-hoist-attempt`).**
The obvious choice for the magic register is `x21`, which is
only handed out by `vregToPhys` when `reg_count >= 14`. But
`x21` is *also* the dedicated `inst_ptr` cache slot whenever
`reg_count <= 13 && has_self_calls` (see `src/jit.zig:1129`,
field `inst_ptr_cached`). Both states overlap on a real slice
of the corpus, so the hoist needs to either
(a) skip the optimisation whenever `inst_ptr_cached` is true
(smallest, safest patch — gives up the optimisation on
self-calling functions with ≤13 vregs);
(b) extend the prologue to reserve an additional callee-saved
slot (e.g. push `x23` or pick from the unused tail of the
STP-pair set when `reg_count` is small) and thread that
through the existing layout machinery (more invasive).
x86_64 has its own version of this dance (`r13`/`r14` are the
common candidates); a clean implementation should keep the
register choice in an arch-specific helper rather than a
shared constant.

Tonight's experimental branch was abandoned at this point
because picking the right safety boundary for the register
choice is itself a design call worth daylight, not a
tail-end commit.

2. **`OP_CONST32` reuse across loop back-edges.** Today
`known_consts` is wiped at every loop header and back-edge.
For consts whose `rd` is rewritten consistently to the same
value on every iteration, we could keep the entry alive: emit
only the first iteration's MOVZ+MOVK, then `nop` (or branch
over) the const-load on subsequent iterations. Less leverage
than (1) because OP_CONST32 itself is only 1 instruction —
what (1) saves is the magic computation that hangs off the
const, not the const itself. Skip unless (1) lands.

3. **`OP_MOV` coalescing in regalloc.** When a `mov rd = rs1` has
`rs1` dead-after-this-point, rewrite the producer of `rs1` to
write directly into `rd`. Needs liveness, which today is
computed only as `written_vregs`. Substantial regalloc surgery
— likely a separate W## item, not in scope for tonight.
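
The scan that item 1 asks of the preheader can be sketched as follows. This is a minimal Python model under stated assumptions: `Instr`, `hoistable_divisors` and `plan_hoist` are hypothetical names, the loop body is taken as a flat instruction list, and the guard implements only option (a), skipping the hoist when `inst_ptr_cached` is true.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    op: str
    rd: Optional[str]  # destination register, None for branches
    srcs: tuple        # register names (str) or immediates (int)


def hoistable_divisors(body):
    """Collect constants K loaded by const32 and later consumed as the
    divisor of div_u / rem_u before their register is overwritten."""
    known = {}     # register -> constant currently held
    found = set()
    for ins in body:
        if ins.op in ("div_u", "rem_u"):
            k = known.get(ins.srcs[1])
            # Only non-zero, non-power-of-two divisors use a magic constant.
            if k is not None and k != 0 and (k & (k - 1)) != 0:
                found.add(k)
        # Update the tracking only after the sources have been read.
        if ins.op == "const32":
            known[ins.rd] = ins.srcs[0]
        elif ins.rd is not None:
            known.pop(ins.rd, None)
    return sorted(found)


def plan_hoist(body, inst_ptr_cached):
    """Option (a): give up when x21 is already the inst_ptr cache slot."""
    if inst_ptr_cached:
        return []
    return hoistable_divisors(body)


BODY = [
    Instr("const32", "r8", (9,)),
    Instr("gt_u", "r8", ("r0", "r8")),
    Instr("mov", "r6", ("r8",)),
    Instr("const32", "r8", (10,)),
    Instr("div_u", "r8", ("r0", "r8")),
    Instr("mov", "r0", ("r8",)),
    Instr("br_if", None, ("r6",)),
]

print(plan_hoist(BODY, inst_ptr_cached=False))  # [10]
print(plan_hoist(BODY, inst_ptr_cached=True))   # []
```

Each returned K would need one magic register in the preheader, which is where the register-layout question from item 1 comes in.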

## What was attempted in this session

- Confirmed the const-divisor fold triggers (above).
- Decoded the JIT'd bytes for func#24 to see the actual emitted
ARM64. Identified MOVZ+MOVK+UMULL+LSR (four instructions in the
dump above) as the per-div-site cost.
- Started the loop-preheader magic hoist on
`develop/w54-magic-hoist-attempt` and abandoned it at the
register-layout question described under item 1. Did **not**
attempt mov coalescing. Both warrant their own design pass with the
spec/regalloc tests as the safety net, not an overnight commit.

## Recommended next step

Open a separate `develop/w54-loop-magic-hoist` branch and
prototype the preheader hoist (item 1 above) against a minimal
JIT regression suite first. If `bench/run_bench.sh --quick` shows
≥15 % improvement on `tgo_strops` with no regression elsewhere,
land it. Otherwise revert and capture the dead-end here.

Re-record `bench/runtime_comparison.yaml` at 5 runs / 3 warmup
before claiming a number — the current single-sample values are
useful for ordering but not for absolute targets.
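
As an illustration of what "5 runs / 3 warmup" buys over a single sample, here is a small Python timing sketch. It is not `bench/run_bench.sh` or the harness behind `runtime_comparison.yaml`; `measure` and `summarise` are hypothetical helpers and the workload is a stand-in.

```python
import statistics
import time


def measure(workload, runs=5, warmup=3):
    """Run `workload` `warmup` times untimed, then `runs` times timed.
    Returns the timed samples in milliseconds."""
    for _ in range(warmup):
        workload()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarise(samples):
    """Report the median together with the spread, so that a single
    outlier cannot pass for the result."""
    return {
        "runs": len(samples),
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


# Stand-in workload; a real run would invoke the benchmark binary instead.
print(summarise(measure(lambda: sum(range(100_000)))))
```

Comparing medians with the min/max spread alongside makes it visible when a claimed improvement is smaller than run-to-run noise.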