Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
45 changes: 45 additions & 0 deletions .claude/AMX_GOTCHAS.md
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@ int8 2048³ = 169.7 GMAC/s, 600× scalar, single-thread
| runs fine, `correct=false` | operand index/sign convention mirrored | Gotcha 12 |
| compile error `unstable x86_amx_intrinsics` | used nightly intrinsics | Gotcha 1, 8 |
| compile error `rbx is used internally by LLVM` | inline-asm CPUID | Gotcha 3 |
| exact when idle, silently wrong under CPU load (VM) | tile state lost across host vCPU switch | Gotcha 14 |

---

Expand Down Expand Up @@ -222,6 +223,50 @@ once is correct even under rayon. `cpu_model()` is cached the same way.

---

## Gotcha 14: on oversubscribed VMs, tile state is silently corrupted under host CPU contention ⚑

Observed 2026-07-02 on this remote VM (4 vCPU, EMR-class Xeon, guest kernel
6.18.5) by `examples/onebrc_cascade_probe.rs`, reproduced on demand:

```
idle: 413/413 stations bit-exact (10M and 100M rows)
4 busy-loop competitors: 89/413, 152/413 exact — whole rows LOST, no fault
probe pinned to core 0,
load pinned to cores 1-3: 124/413 exact — pinning does NOT mitigate
idle control right after: 413/413 exact again
```

Signature: **no crash, no SIGSEGV/SIGILL — results are silently wrong, and
only under load.** An AVX-512 path in the same process, same run, stays
bit-exact, isolating the corruption to TMM tile state (the tmm0 accumulator
loses in-flight partial sums). Because guest-side pinning doesn't help, the
suspected mechanism is the **host** hypervisor's vCPU context switch failing
to save/restore guest `XTILEDATA` when the host multiplexes oversubscribed
pCPUs (idle guests keep their vCPUs resident → no corruption; loaded guests
get switched → corruption). Guest-side `arch_prctl` permission (Gotcha 4) is
correctly granted — this is a layer below the guest kernel.

Consequences:

- **Never certify AMX numerics from a shared/oversubscribed VM.** Bare metal
or a dedicated-CPU instance only. A "PASS on my cloud box" is worthless
under this gotcha unless the box was provably idle.
- **Extend Gotcha 9's discipline**: a parity test for a tile kernel must ALSO
run under deliberate CPU contention (a few busy loops are enough — see the
reproduction above). Exact-when-idle is necessary, not sufficient.
- **Keep tile residency short.** Long accumulation loops that live in tmm
across many iterations (the 16×16×K pattern holds tmm0 for K/32 iterations)
maximize the exposure window. Draining accumulators to memory more often
shrinks it but does NOT close it — treat it as harm reduction, not a fix.
- Production dispatch on virtualized hosts should either avoid AMX or pair it
with a checksum/parity channel (e.g. a redundant ones-row whose expected
value is known — the onebrc probe's count row doubles as exactly that).

Fault signature: `correct=true` in every quiet test, sporadic wrong results
in production under load, AVX-512 siblings unaffected.

---

## Hardware tiers

```
Expand Down
108 changes: 108 additions & 0 deletions .claude/blackboard.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,114 @@
> **Read this first.** The "Polyglot Notebook" architecture below is a
> separate/older program, not the current epoch.

## 2026-07-02 (later) — bf16 tile GEMM: VDPBF16PS middle tier + PackedBf16B (loose end closed)

Closed the [LOOSE END] from the 1BRC entry below. `hpc/bf16_tile_gemm.rs`
is now a three-tier ladder — **AMX TDPBF16PS → AVX-512 VDPBF16PS →
decode+FMA polyfill** — with the polyfill kernel (`simd_ops.rs`) untouched:

- **VDPBF16PS tier** (`avx512bf16_path`, private): bf16 pairs multiplied
natively per zmm (no bf16→f32 decode), f32 lane accumulators, SAME VNNI
operand layout as the AMX tile → one packed buffer serves both tile
tiers. `_mm512_dpbf16_ps` verified stable on Rust 1.94. Runtime
`is_x86_feature_detected!("avx512bf16")` (EMR box has it).
- **`PackedBf16B`** + **`bf16_tile_gemm_16x16_packed`**: VNNI pack (and
its per-call allocation) hoisted out of hot loops; `vnni_index(row,col)
= (row/2)·32 + 2·col + (row&1)` supports staging B DIRECTLY in VNNI
layout (zero pack cost — the right shape for one-hot/sparse staging).
- **`bf16_tile_gemm_tier()`**: names the tier that will run (Gotcha 9
reporting). Re-exports via `ndarray::simd::*` (W1a surface).
- **Exactness boundary preserved (operator condition):** bit-exact across
ALL tiers for bf16-exact integer operands with accumulation < 2^24 —
asserted with `assert_eq!` in the new parity tests (vnni_index vs
vnni_pack_bf16; packed==unpacked==i64 reference; VDPBF16PS exact +
tolerance-parity vs polyfill on floats; accumulate semantics). Gotcha-14
contention parity test included as `#[ignore]` (fails on oversubscribed
VMs BY DESIGN; run `--ignored` on dedicated silicon).

[MEASURED] onebrc probe GEMM leg with direct-VNNI staging: **3.6 → 21.3
Mrows/s (5.9×), 23.7 → 141.9 GMAC/s** (single thread — near the 169.7
GMAC/s int8 AMX anchor in AMX_GOTCHAS). 413/413 stations still EXACT;
8/8 lib tests + 2 doctests green; clippy/fmt clean.

[NOTE] Dispatch-behavior change signed off by operator: the row-major
entry `bf16_tile_gemm_16x16` now routes avx512bf16-without-AMX hosts
through VDPBF16PS instead of decode+FMA (bit-exact within the integer
boundary; BF16-precision-class accumulation-order differences on general
floats, same as any tier change).

[ADDED, same day] **LE byte contract on `PackedBf16B`** (operator "Go" —
first brick of the SoA-Morton batch-writer / write-hiding design):
`as_le_bytes()` (zero-cost reinterpret; LE by construction — the module
is x86_64-only) + `from_le_bytes()` (endian-correct anywhere, plain copy
on LE). This is the persistence/mailbox face per lance-graph's
SoaEnvelope discipline (envelope bytes LE from creation to tombstone).
Test `le_byte_view_roundtrips_and_is_truly_le` asserts byte 2i = low
byte of lane i AND that a GEMM over the roundtripped buffer stays
bit-exact. 9/9 lib tests green. Next bricks (lance-graph side): batch
writer flushing tile buffers as envelope tenants; write-hiding = stage
morsel N+1's VNNI writes while morsel N's tiles compute.

## 2026-07-02 — 1BRC-on-substrate probe (`examples/onebrc_cascade_probe.rs`)

1BRC workload (min/mean/max per station) restated on the substrate, as a
sibling of `morton_cascade_probe`. Branch `claude/1brc-lance-graph-xfx5tu`.
Three paths certified bit-for-bit against a scalar integer reference
(413 stations, integer tenths → exact in f32/f64 by construction):

- **Morton scatter**: stations minted as cells on a 64×64 Morton grid
(4×4 tile = one F32x16), morsel-batched (64K rows) scatter into
L1-resident SoA accumulators, (min,max,Σ,n) monoid fold.
- **AMX BF16 tile-GEMM group-by**: (Σ,n) as `C += A[16×K]·B[K×16]` via
the NEW `ndarray::simd::bf16_tile_gemm_16x16_amx` re-export (W1a: the
AMX-dispatching hpc wrapper surfaced through the canonical polyfill,
same pattern as `matmul_i8_to_i32`; the `_amx` suffix disambiguates
from the pure-FMA `simd::bf16_tile_gemm_16x16`) — B = per-row one-hot
station indicator (26 column-blocks of 16), A rows = {1, hi(t), lo(t),
bf16-RNE(t)} with the exactness split `hi=(t/256)·256, lo=t−hi` (both
bf16-exact; f32 tile accumulation exact for K=4096). Clear-by-undo
keeps B staging O(rows). AMX **actually ran** (amx_available()==true
printed per Gotcha 9 discipline; EMR-class Xeon, kernel 6.18.5).
- **Aggregate pyramid** over the tile grid: hierarchical (min,mean,max)
per tile/region/root in the same pass + band-prune queries
(Belichtungsmesser on the MIN channel).

[MEASURED] 10M rows, 4-core Xeon EMR VM, single thread:
reference 453 Mrows/s | morton scatter 443 Mrows/s (**substrate tax ≈ 2%**)
| tile-GEMM 3.6 Mrows/s = 23.7 GMAC/s (dense one-hot indicator = the
honest price of group-by-as-matmul; per-call `vnni_pack_bf16` alloc in
`bf16_tile_gemm_16x16` is a visible overhead) | pyramid fold 0.02 ms |
band query prune 90.2%. All 413 stations EXACT on both paths; PASS.
Also EXACT at 100M rows (idle). **"Is BF16 precise enough?" — measured:**
the naive bf16-RNE row through the same tile gives max per-station
|Δmean| = 0.0123 tenths (0.0012 °C, N≈24k/station — quantization bias
averages out); single readings off by ≤ 2 tenths (half-ulp of bf16 at
|t|∈[512,1024)). Verdict: bf16-direct fine for means, hi/lo split (free —
spare A rows) required for min/max + exactness certification.

[FINDING → **Gotcha 14**, `.claude/AMX_GOTCHAS.md`] On this oversubscribed
VM, **AMX tile state silently corrupts under host CPU contention**: idle
= 413/413 exact at 100M rows; with 4 busy-loop competitors = 89-152/413
(whole rows lost, no fault); guest-side core pinning does NOT mitigate
(124/413); AVX-512 scatter path in the same run stays exact → isolated
to TMM state; suspected host-vCPU-switch XTILEDATA loss. Consequences
written into the gotcha: never certify AMX numerics on shared VMs; parity
tests must also run under deliberate load (Gotcha 9 extension); short
tile residency = harm reduction only.

[CROSS-REPO] Algebraic certification (partition/regroup invariance of the
monoid fold, bf16 hi/lo decomposition exactness) lands as a diagnostic
probe in `lance-graph/crates/jc` (`onebrc_agg`) — kernels here, proof
there, per the architecture rule (ndarray = hardware, jc = proof).

[LOOSE END] AMX has no min/max tile op → min/max stay on the scatter
path by construction. `bf16_tile_gemm_16x16` allocates + VNNI-packs B on
every call — a pre-packed-B variant would lift the GEMM leg
substantially; file under W1-adjacent if the group-by-as-GEMM shape
recurs. Text-ingest leg (SWAR/SIMD parse of the 13 GB file) deliberately
NOT probed here — separate probe if pursued (would exercise
`byte_scan.rs`).

## 2026-06-28 — WASM SIMD128 backend filled in (`src/simd_wasm.rs`)

Replaced the commented-out scaffolding in `src/simd_wasm.rs` with a real
Expand Down
6 changes: 6 additions & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,12 @@ required-features = ["std"]
name = "morton_cascade_probe"
required-features = ["std"]

# 1BRC-on-substrate probe: Morton group-by + AMX BF16 tile-GEMM mean path
# (imports `ndarray::simd` + `ndarray::hpc`, both std-gated).
[[example]]
name = "onebrc_cascade_probe"
required-features = ["std"]

[[example]]
name = "golden_helix_probe"
required-features = ["std"]
Expand Down
Loading
Loading