116 changes: 65 additions & 51 deletions .dev/checklist.md
@@ -89,57 +89,71 @@ Prefix: W## (to distinguish from CW's F## items).
`@./.dev/w47-investigation.md`. Low priority since 20 other
benchmarks improved >10% (GC paths 40–76% faster).

- [ ] W54: `tgo_strops` (and other div-heavy TinyGo workloads) is
~2.1× slower than wasmtime/cranelift on M4 Pro per
`bench/runtime_comparison.yaml` (2026-03-25, runs=1/warmup=0):
zwasm cached 63.2 ms vs wasmtime cached 30.0 ms. The original
framing — "constant-divisor folding is missing" — was
**disproven** during the 2026-04-29 evening investigation. Both
ARM64 (`src/jit.zig:3582-3666` `tryEmitDivByConstU32`) and
x86_64 (`src/x86.zig`) already emit the Hacker's Delight magic
multiply for `i32.div_u K`; the JIT dump for
`bench/wasm/tgo_string_ops.wasm` shows three MOVZ+MOVK+UMULL+LSR
sequences for the three `i32.div_u 10` sites, with zero `UDIV`
instructions.
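
The magic-multiply fold described here can be checked with a few lines of arithmetic. The sketch below is plain Python for illustration, not code from `src/jit.zig`, and the helper names are hypothetical. It assumes the usual constant for unsigned division by 10, `0xCCCCCCCD` with a total right shift of 35, which is consistent with the MOVZ+MOVK+UMULL+LSR shape seen in the dump.

```python
MAGIC_DIV10 = 0xCCCCCCCD  # ceil(2**35 / 10), the constant named in this entry
SHIFT_DIV10 = 35          # assumed total shift applied to the 64-bit product


def movz_movk_halves(imm32):
    """Split a 32-bit immediate into the two 16-bit chunks that a
    MOVZ (low half) + MOVK (high half, lsl #16) pair would load."""
    return imm32 & 0xFFFF, (imm32 >> 16) & 0xFFFF


def div_u10(n):
    """Unsigned 32-bit n // 10 without a divide instruction: a widening
    multiply (the UMULL step) followed by a logical right shift (LSR)."""
    if not 0 <= n <= 0xFFFFFFFF:
        raise ValueError("n must fit in an unsigned 32-bit integer")
    product = n * MAGIC_DIV10  # always fits in 64 bits
    return product >> SHIFT_DIV10


if __name__ == "__main__":
    for n in (0, 9, 10, 99, 12345, 0x7FFFFFFF, 0xFFFFFFFF):
        assert div_u10(n) == n // 10
    print([hex(h) for h in movz_movk_halves(MAGIC_DIV10)])
```

The two halves returned by `movz_movk_halves` are what the two-instruction constant load has to materialise before the multiply can run, which is why that load is the part worth hoisting.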

The remaining 2.1× lives in two places:

1. The 2-instruction magic-constant load (`MOVZ + MOVK` for
`0xCCCCCCCD`) is re-emitted inside the loop body on every
iteration; cranelift's SSA + GVN hoist it once so only
`UMULH/UMULL + LSR` stay hot. Three div sites in
`tgo_strops` cost ~6 ARM64 instructions per iteration that a
preheader hoist would eliminate.
2. TinyGo emits a `mov rd = rs1` per `local.set`; cranelift
collapses those into register renames whereas zwasm's
linear-scan regalloc still spills them to LDR/STR against
`regs[]`.

Single-pass-compatible levers, ranked:

1. **Loop-preheader magic hoist.** Extend `emitLoopPreHeader`
(today SIMD-only, `src/jit.zig:4604`) to scan for
`OP_CONST32 K → OP_DIV_U` patterns, allocate a callee-saved
register, and pre-load the magic. `tryEmitDivByConstU32`
short-circuits when the magic is already live. Risk: medium,
because the hoist has to coexist with the existing
physical-register layout. Functions like `string_ops` (func#24,
13 vregs) saturate the callee-saved set, so the prologue would
have to reserve a free slot up front.
2. **`OP_CONST32` reuse across loop back-edges.** Today
`known_consts` is wiped at every header. Skip unless (1)
lands — saves the 1-instr const itself but not the 2-instr
magic that hangs off it.
3. **`OP_MOV` coalescing in linear-scan regalloc.** Substantial
surgery; deserves a separate W## entry.
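
The first lever can be sketched as a scan over a loop body. This is a Python toy under stated assumptions, not the `emitLoopPreHeader` implementation: the tuple-based IR, the opcode strings, and the function name are all hypothetical, and the "2 instructions per site" figure is the MOVZ + MOVK pair described above.

```python
def scan_loop_for_const_divisors(ir, loop_start, loop_end):
    """Find `const32 K -> div_u` patterns inside [loop_start, loop_end].

    `ir` is a list of (op, rd, rs1, rs2, imm) tuples (hypothetical layout).
    Returns (distinct constant divisors, number of div sites they feed).
    """
    const_of = {}   # vreg -> constant it currently holds
    divisors = []   # distinct K values, in first-seen order
    div_sites = 0
    for pc in range(loop_start, loop_end + 1):
        op, rd, _rs1, rs2, imm = ir[pc]
        if op == "div_u" and rs2 in const_of:
            div_sites += 1
            if const_of[rs2] not in divisors:
                divisors.append(const_of[rs2])
        if op == "const32":
            const_of[rd] = imm
        elif rd is not None:
            const_of.pop(rd, None)  # any other write invalidates the constant
    return divisors, div_sites


# Toy loop body with three `div_u 10` sites, each with its own const32.
TOY_LOOP = [
    ("const32", 8, None, None, 10),
    ("div_u", 1, 0, 8, None),
    ("const32", 9, None, None, 10),
    ("div_u", 2, 1, 9, None),
    ("const32", 12, None, None, 10),
    ("div_u", 3, 2, 12, None),
    ("br_if", None, 3, None, None),
]

if __name__ == "__main__":
    divisors, sites = scan_loop_for_const_divisors(TOY_LOOP, 0, len(TOY_LOOP) - 1)
    print("registers needed:", len(divisors))
    print("instructions saved per iteration:", sites * 2)
```

On the toy input the scan finds one distinct divisor feeding three sites, so a single callee-saved register would cover all of them. That matches the ~6 instructions per iteration estimated above, provided a register is actually free.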

Out of scope (would break single-pass): SSA + dataflow,
global register allocation, automatic loop unroll /
vectorise. Re-record `runtime_comparison.yaml` at 5 runs / 3
warmup before claiming any number — the current values are
single-sample.

Full investigation log: `@./.dev/w54-investigation.md`.
- [x] W54 (substrate): structural cleanup landed via PR #91 from
`develop/w54-loop-info`. Single change: `src/loop_info.zig` is
the single source of truth for branch / loop / vreg liveness.
The two JIT backends used to maintain byte-identical
`scanBranchTargets` implementations; both now consume
`LoopInfo.analyse(...)`. `vreg_first_def[]` / `vreg_last_use[]`
are computed in the same forward sweep, ready for future
consumers (Phase 5 / Phase 4). Behaviour byte-identical to
main (verified via `--dump-jit` diff on tgo_string_ops func#24,
fib func#2, and the realworld suite on Mac aarch64 + Ubuntu
x86_64). Architecture: D138 in `.dev/decisions.md`. Session arc:
`.dev/w54-redesign-postmortem.md`.

- [ ] W54-coalescer: liveness-driven mov coalescing extension to
`regalloc.copyPropagate`. Built and proven on Mac aarch64 (50/50
realworld), reverted from the substrate PR after Linux x86_64
CI flagged a `go_math_big` divergence — BigInt subtraction
result `864197532086419753208641975320` (wasmtime) vs
`864197532160206729503480181784` (zwasm). The regalloc itself
is arch-agnostic, so the same `RegFunc` flows through both
backends; the divergence implies an x86_64-specific assumption
in `src/x86.zig`'s codegen that the new IR layout (fewer MOVs,
shifted PCs) violates. Reproducible on OrbStack's
`my-ubuntu-amd64` with `develop/w54-loop-pass-redesign`
checked out and `zig build -Doptimize=ReleaseSafe`. The
coalescer commit (`ec8182f` on the archive branch) cherry-picks
cleanly; debugging is in the x86 backend's interaction with the
redef-stop pattern — likely a getOrLoad / SCRATCH contention
triggered by a specific opcode sequence that the coalesced IR
produces. Recommended bisect: dump `--dump-regir` for the
failing function, identify the first MOV the new coalescer
folds that the old version kept, then dump `--dump-jit` on x86
to find the codegen mismatch.
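
The first step of the recommended bisect can be sketched as a diff over two register-IR dumps. This Python sketch assumes a hypothetical dump format (one instruction per line, mnemonic first, PCs already stripped so that shifted PCs do not show up as noise); the real `--dump-regir` output may look different.

```python
import difflib


def first_folded_mov(old_dump, new_dump):
    """Return (index, line) of the first MOV that appears in `old_dump`
    but was dropped or rewritten in `new_dump`, or None if there is none.

    Both arguments are lists of instruction lines (hypothetical format).
    """
    matcher = difflib.SequenceMatcher(a=old_dump, b=new_dump, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag not in ("delete", "replace"):
            continue
        for i in range(i1, i2):
            parts = old_dump[i].split()
            if parts and parts[0].lower() == "mov":
                return i, old_dump[i]
    return None


if __name__ == "__main__":
    old = ["const32 r1, 10", "mov r2, r1", "div_u r3, r0, r2", "ret r3"]
    new = ["const32 r1, 10", "div_u r3, r0, r1", "ret r3"]
    print(first_folded_mov(old, new))
```

The returned index points at the instruction to look up in the x86 `--dump-jit` output for the same function.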

- [ ] W54-hoist-revisit: magic-constant loop-invariant hoist was
built and proven on `develop/w54-loop-pass-redesign` (archived
as `archive/w54-magic-hoist-2026-04-30`). The digitCount JIT output
shrinks from 196 to 192 ARM64 instructions with the hoist alone.
**Held back** from the substrate PR
pending three prerequisites:
1. **W47**: bench harness with σ < 5% on `tgo_strops`. The
hoist's effect is below the current 10% σ floor.
2. **W54-x86**: x86_64 `pickHoistPhys` parity. Landing the hoist
on ARM64 alone would force a reconciling follow-up; bundling
both backends makes one coherent change.
3. `runtime_comparison.yaml` re-recorded at 5/3 hyperfine on a
thermally-stable rig.
Re-attempt: cherry-pick `1600397` + `c4b806e` from the archive
branch onto a fresh redesign branch once 1+2+3 land.
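
The σ gate in the first prerequisite can be made concrete with a small check. This Python sketch assumes that σ here means the relative standard deviation of repeated wall-clock samples; the function names, the sample timings, and the 2% expected gain are hypothetical placeholders, not measurements.

```python
import statistics


def relative_sigma(samples_ms):
    """Sample standard deviation divided by the mean (needs >= 2 samples)."""
    return statistics.stdev(samples_ms) / statistics.fmean(samples_ms)


def effect_is_measurable(expected_gain, samples_ms):
    """True when the expected relative gain exceeds the run-to-run noise."""
    return expected_gain > relative_sigma(samples_ms)


if __name__ == "__main__":
    noisy = [57.0, 63.0, 69.0]   # hypothetical timings, roughly 10% sigma
    quiet = [62.5, 63.0, 63.5]   # hypothetical timings, under 1% sigma
    for label, samples in (("noisy", noisy), ("quiet", quiet)):
        print(label, round(relative_sigma(samples), 4),
              effect_is_measurable(0.02, samples))
```

With a noise floor near 10%, a gain of a few percent cannot be told apart from run-to-run variation, which is the reason for sequencing the harness work first.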

- [ ] W54-x86: x86_64 magic-constant hoist parity for
W54-hoist-revisit. The archive branch's `src/x86.zig` already
has the `hoist_phys` / `hoist_displaced_infra` field
scaffolding; `pickHoistPhys` body needs implementing for x86's
reg layout (RBX/RBP/R15 vreg-bound for any reg_count >= 1; no
`inst_ptr_cached` slot to displace; R13/R14 free only when
`!has_memory`). Bench-driven decision on whether the win is
worth the displacement cost on functions with `has_memory`.
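
The free-register constraints listed in this entry can be written down as a small decision helper. This is a Python sketch of those stated constraints only: the function name and return convention are hypothetical, the real `pickHoistPhys` body for x86_64 is still unwritten, and treating `reg_count < 1` as "nothing to hoist" is an assumption.

```python
from typing import Optional

# Described in this entry as vreg-bound for any reg_count >= 1.
VREG_BOUND = ("rbx", "rbp", "r15")
# Described in this entry as free only when the function has no memory.
FREE_WITHOUT_MEMORY = ("r13", "r14")


def pick_hoist_phys_x86(reg_count: int, has_memory: bool) -> Optional[str]:
    """Return a callee-saved register for the hoisted magic, or None when
    no register is free without displacing something."""
    if reg_count < 1:
        return None  # assumption: a function with no vregs has nothing to hoist
    if has_memory:
        return None  # R13/R14 are taken; displacement is the open bench question
    choice = FREE_WITHOUT_MEMORY[0]
    assert choice not in VREG_BOUND
    return choice


if __name__ == "__main__":
    print(pick_hoist_phys_x86(13, has_memory=False))
    print(pick_hoist_phys_x86(13, has_memory=True))
```

The `None` branch for `has_memory` is where the bench-driven displacement decision would go.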

- [ ] W54-libm: real-world `rw_c_math` (5.0× wasmtime cold,
8.7× cached) is dominated by BLR-heavy libm dispatch (`sin`,
`cos`, `pow`, `sqrt` per iteration). Out of scope for the
loop-pass redesign. Single-pass-compatible candidates: intrinsic
recognition for imported function names (`sqrt` → FSQRT inline),
software-libm fallback for sin/cos/pow. Needs imported-function
name resolution on the predecode side.
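
The two candidates named in this entry amount to a lookup on the imported function name. The Python sketch below shows only that classification step; the table contents come from this entry, while the function name and the returned tags are hypothetical.

```python
# Import names this entry proposes to lower to a single inline instruction.
INLINE_INTRINSICS = {"sqrt": "FSQRT"}
# Import names this entry proposes to route to a software-libm fallback.
SOFT_LIBM_CANDIDATES = frozenset({"sin", "cos", "pow"})


def classify_import(name):
    """Decide how a call to the imported function `name` could be lowered."""
    if name in INLINE_INTRINSICS:
        return ("inline", INLINE_INTRINSICS[name])
    if name in SOFT_LIBM_CANDIDATES:
        return ("soft_libm", name)
    return ("call", name)  # unknown import: keep the ordinary call path


if __name__ == "__main__":
    for name in ("sqrt", "sin", "memcpy"):
        print(name, classify_import(name))
```

The lookup presupposes that import names are available at predecode time, which is the prerequisite stated above.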

- [ ] W48 Phase 2: Linux binary size 1.56 MB → 1.50 MB (~62 KB more).
W48 Phase 1 shipped (2026-04-25): `pub const panic = std.debug.simple_panic`
128 changes: 128 additions & 0 deletions .dev/decisions.md
@@ -941,3 +941,131 @@ Mac binary another 60 KB → cap drops 1.30 → 1.25 MB). Loosening a
ceiling requires a CHANGELOG entry naming the regression source so
the slack is intentional and visible.


## D138: Shared `LoopInfo` analysis layer for the JIT pipeline

**Status**: Accepted — landed via `develop/w54-loop-info` (PR #91).

**Context**: For a long stretch we treated each JIT optimisation as a
self-contained patch — the SIMD `emitLoopPreHeader`, the
const-divisor magic-multiply fold, the adjacent-MOV `copyPropagate`,
the `vm_ptr_cached` / `inst_ptr_cached` slots in `vregToPhys`. Each
re-derived its own slice of "what's a loop" and "is this vreg dead"
inline at codegen time. That ad-hoc layout was the proximate cause
of the W54 magic-hoist abandonment on
`develop/w54-magic-hoist-attempt` (2026-04-29 evening, see
`.dev/w54-investigation.md`): x21 was simultaneously the inst_ptr
cache slot for `reg_count <= 13 && has_self_call` AND the natural
callee-saved candidate for the magic. Picking a safe boundary was a
design call, not a tail-end commit.

**Decision**: Ship the substrate first. One structural change that
stands on its own merit and unblocks future loop-aware work:

`src/loop_info.zig` is the single source of truth for the function's
control-flow shape and per-vreg liveness. The two JIT backends used
to maintain byte-for-byte identical `scanBranchTargets`
implementations; both now consume `LoopInfo.analyse(allocator, ir,
reg_count)`, which produces:

- `branch_targets[]`, `loop_headers[]`, `loop_end[]` (drives JIT
cache eviction and the `known_consts` wipe at merge points).
- `vreg_first_def[]`, `vreg_last_use[]` (one forward sweep,
conservative reads — over-approximation extends last_use later
than necessary, which only shrinks a future coalescer's window;
never breaks correctness). Phase 5+ consumers will read these.
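
The shape of the analysis described above can be sketched as a single forward sweep. This is a Python toy, not the Zig code in `src/loop_info.zig`: the `Instr` layout is hypothetical, backward branches are assumed to be the only way a loop header arises, and the conservative read classification done by the real opcode helpers is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instr:
    op: str
    rd: Optional[int] = None      # vreg written, if any
    rs1: Optional[int] = None     # first vreg read, if any
    rs2: Optional[int] = None     # second vreg read, if any
    target: Optional[int] = None  # branch target pc, if any


@dataclass
class LoopInfo:
    branch_targets: List[bool]
    loop_headers: List[bool]
    loop_end: List[Optional[int]]
    vreg_first_def: List[Optional[int]]
    vreg_last_use: List[Optional[int]]


def analyse(ir, reg_count):
    """One forward pass: branch/loop shape plus first-def / last-use per vreg."""
    n = len(ir)
    info = LoopInfo([False] * n, [False] * n, [None] * n,
                    [None] * reg_count, [None] * reg_count)
    for pc, ins in enumerate(ir):
        if ins.target is not None:
            info.branch_targets[ins.target] = True
            if ins.target <= pc:                  # backward edge closes a loop
                info.loop_headers[ins.target] = True
                info.loop_end[ins.target] = pc    # pc only grows, so last wins
        for src in (ins.rs1, ins.rs2):
            if src is not None:
                info.vreg_last_use[src] = pc
        if ins.rd is not None and info.vreg_first_def[ins.rd] is None:
            info.vreg_first_def[ins.rd] = pc
    return info


TOY_IR = [
    Instr("const32", rd=0),
    Instr("const32", rd=1),              # pc 1: loop header
    Instr("div_u", rd=2, rs1=0, rs2=1),
    Instr("br_if", rs1=2, target=1),     # pc 3: back-edge to pc 1
]

if __name__ == "__main__":
    print(analyse(TOY_IR, reg_count=3))
```

Because every array is filled in the same pass, a backend that only needs `branch_targets[]` and one that also wants liveness can share a single call.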

The opcode classification helpers `opWritesRd`, `opUsesRdAsSource`,
`opUsesRs1AsSource`, `opUsesRs2AsSource` live in `loop_info.zig`
for now (private) — they will be promoted to public regalloc API
once the coalescer extension that needs them is debugged on x86_64
(see W54-coalescer in checklist.md).

**Effect (ARM64 Mac)**:

- Both backends drop ~60 lines of duplicated `scanBranchTargets`
body in favour of a thin `LoopInfo.analyse` call. The emitted code
is byte-identical to main: `--dump-jit=24` of
`tgo_string_ops` func#24 (digitCount, 196 ARM64 instrs / 784
bytes) matches main bit-for-bit.
- All other functions emit byte-identical machine code. No
performance change is expected or observed (Phase 0 + 1 are pure
refactoring; the data is computed but no codegen consumer reads
the new `vreg_first_def[]` / `vreg_last_use[]` arrays yet).

**Rejected alternatives** (and why they didn't ride along):

- **Liveness-driven mov coalescing in `regalloc.copyPropagate`**.
Implemented and shipped on the develop branch, then **reverted
on 2026-04-30** because the Mac gate was green but Linux x86_64 CI
went red on `go_math_big`. The new "stop at first redef of
old_reg" scan, with its branch-target / forward-branch /
multi-source bail-outs, passes the Mac aarch64 realworld suite
(50/50, including `rust_regex`, which the first attempt broke)
but produces wrong results on Linux x86_64's `go_math_big`: a
BigInt subtraction mismatch where wasmtime returns
`864197532086419753208641975320` and zwasm returns
`864197532160206729503480181784`. The regalloc itself is
arch-agnostic, so the same `RegFunc` flows through both
backends; the divergence implies an x86_64-specific assumption
in `src/x86.zig`'s codegen that the new IR layout violates.
Diagnosis is bench/CI-bound and not a tail-end fix; tracked as
W54-coalescer for a focused follow-up.

- **Magic-constant loop-invariant hoist** (`OP_CONST32 K →
OP_DIV_U` pattern, materialise the magic into a callee-saved
register in the prologue, short-circuit
`tryEmitDivByConstU32`). Implemented on
`develop/w54-loop-pass-redesign` (commits `1600397`, `c4b806e`)
and proved out: digitCount JIT 196 → 192 with hoist alone, 192
→ 185 stacked with the (eventually reverted) coalescer. Held
back from this PR for three reasons:
1. The runtime gain is below the bench σ floor today; without
the W47 harness work the optimisation would land
evidence-free and any later regression would be argued as
noise rather than measured.
2. The hoist requires displacing `inst_ptr_cached` (x21) on
functions with reg_count >= 5 + has_self_call — an
ARM64-specific behaviour change with no measured benefit
today. Pushing it post-harness keeps the trade-off
reviewable.
3. x86_64 parity has different free-slot mechanics (no
`inst_ptr_cached` to displace). Bundling hoist with parity
makes one coherent change later.

- **Loop-invariant `known_consts` survival across loop headers**.
Sketched as Phase 4 of the original plan; dropped after
inspection of digitCount's RegIR showed every `i32.div_u 10` site
has its CONST32 emitted *inside* the loop body (TinyGo reuses
the same vreg `r8` / `r9` / `r12` per div), so the optimisation
would never fire on the W54 target.

**Affected files**: `src/loop_info.zig` (new), `src/jit.zig`
(replace `scanBranchTargets` with `LoopInfo.analyse`), `src/x86.zig`
(same).

**Archive**: the magic-hoist + coalescer work is preserved on
`develop/w54-loop-pass-redesign` (last commit `a56d442`) and tagged
`archive/w54-magic-hoist-2026-04-30`. Cherry-pick path:
- `1600397` + `c4b806e` for the ARM64 magic hoist (re-attempt
W54-hoist-revisit).
- `ec8182f` for the redef-aware coalescer (re-attempt W54-coalescer
after diagnosing the x86_64 `go_math_big` regression).

**Re-evaluation pre-conditions**:
1. **W47** — bench harness with σ < 5% on tgo_strops (currently ~10%)
2. **W54-x86** — symmetric `pickHoistPhys` for x86_64 reg layout
3. **W54-coalescer** — diagnose and fix the x86_64 `go_math_big`
divergence; the diff is in the regalloc-stage IR shape, the
backend assumption that breaks is in `src/x86.zig`.
4. `runtime_comparison.yaml` re-recorded with 5/3 hyperfine on a
thermally-stable rig.

**Follow-ups** (open W## items in checklist.md):
- `W54-coalescer`: diagnose the x86_64 `go_math_big` regression,
re-land the coalescer.
- `W54-hoist-revisit`: revive the magic-hoist work once W47 +
W54-x86 are ready.
- `W54-libm`: `rw_c_math` is dominated by libm `sin`/`cos`/`pow`
dispatch; intrinsic recognition + ARM64 FSQRT inline + soft-libm
fallback.
36 changes: 35 additions & 1 deletion .dev/memo.md
@@ -23,7 +23,41 @@ Session handover document. Read at session start.

## Current Task

**W53 done. C-g foundation + Mac/Ubuntu baselines done.** Ship-overnight
**W54 substrate landed via PR #91 from `develop/w54-loop-info`** (2026-04-30).
Single structural change: `src/loop_info.zig` is the single source of
truth for branch / loop / vreg liveness. Both backends drop ~60 lines
of byte-identical `scanBranchTargets` in favour of a thin
`LoopInfo.analyse(...)` call. `vreg_first_def[]` /
`vreg_last_use[]` are computed from the same forward sweep, ready
for future consumers. JIT output is byte-identical to main on every
function (verified via `--dump-jit` diff for tgo_string_ops func#24
and fib func#2).

### Held back (archive branch)

`develop/w54-loop-pass-redesign` (tagged
`archive/w54-magic-hoist-2026-04-30`) preserves two further pieces
of work that were built and bench-validated, but held back:

1. **Magic-constant loop-invariant hoist** (commits `1600397`,
`c4b806e`). digitCount JIT 196 → 192. Re-attempt prerequisites:
W47 (bench harness σ < 5%), W54-x86 (parity).
2. **Liveness-driven mov coalescing** (commit `ec8182f`). digitCount
JIT 192 → 185 stacked on hoist; substrate-only branch JIT 196 →
189 with just the coalescer. **Reverted from PR #91 after
Linux x86_64 CI failed `go_math_big`** (BigInt subtraction
divergence — wasmtime returns `864197532086419753208641975320`,
zwasm returns `864197532160206729503480181784`). The regalloc
is arch-agnostic, so the divergence is in `src/x86.zig`'s
codegen interaction with the new IR layout. Reproducible on
OrbStack `my-ubuntu-amd64`. Tracked as W54-coalescer.

Architecture rationale: D138 in decisions.md. Detailed session arc
+ branch names: `.dev/w54-redesign-postmortem.md`.

### Previous (still on main)

**C-g foundation + Mac/Ubuntu baselines done.** Ship-overnight
session 2026-04-29 evening landed two PRs to main on top of the
afternoon's six (#79..#84):
