Skip to content

Commit de72d15

Browse files
authored
Merge pull request #227 from AdaWorldAPI/claude/1brc-lance-graph-xfx5tu
1BRC-on-substrate probe + BF16 tile-GEMM tier ladder (VDPBF16PS, PackedBf16B, LE contract) + AMX Gotcha 14
2 parents 8c381a6 + 5f2c7fc commit de72d15

6 files changed

Lines changed: 1189 additions & 25 deletions

File tree

.claude/AMX_GOTCHAS.md

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,7 @@ int8 2048³ = 169.7 GMAC/s, 600× scalar, single-thread
4343
| runs fine, `correct=false` | operand index/sign convention mirrored | Gotcha 12 |
4444
| compile error `unstable x86_amx_intrinsics` | used nightly intrinsics | Gotcha 1, 8 |
4545
| compile error `rbx is used internally by LLVM` | inline-asm CPUID | Gotcha 3 |
46+
| exact when idle, silently wrong under CPU load (VM) | tile state lost across host vCPU switch | Gotcha 14 |
4647

4748
---
4849

@@ -222,6 +223,50 @@ once is correct even under rayon. `cpu_model()` is cached the same way.
222223

223224
---
224225

226+
## Gotcha 14: on oversubscribed VMs, tile state is silently corrupted under host CPU contention ⚑
227+
228+
Observed 2026-07-02 on this remote VM (4 vCPU, EMR-class Xeon, guest kernel
229+
6.18.5) by `examples/onebrc_cascade_probe.rs`, reproduced on demand:
230+
231+
```
232+
idle: 413/413 stations bit-exact (10M and 100M rows)
233+
4 busy-loop competitors: 89/413, 152/413 exact — whole rows LOST, no fault
234+
probe pinned to core 0,
235+
load pinned to cores 1-3: 124/413 exact — pinning does NOT mitigate
236+
idle control right after: 413/413 exact again
237+
```
238+
239+
Signature: **no crash, no SIGSEGV/SIGILL — results are silently wrong, and
240+
only under load.** An AVX-512 path in the same process, same run, stays
241+
bit-exact, isolating the corruption to TMM tile state (the tmm0 accumulator
242+
loses in-flight partial sums). Because guest-side pinning doesn't help, the
243+
suspected mechanism is the **host** hypervisor's vCPU context switch failing
244+
to save/restore guest `XTILEDATA` when the host multiplexes oversubscribed
245+
pCPUs (idle guests keep their vCPUs resident → no corruption; loaded guests
246+
get switched → corruption). Guest-side `arch_prctl` permission (Gotcha 4) is
247+
correctly granted — this is a layer below the guest kernel.
248+
249+
Consequences:
250+
251+
- **Never certify AMX numerics from a shared/oversubscribed VM.** Bare metal
252+
or a dedicated-CPU instance only. A "PASS on my cloud box" is worthless
253+
under this gotcha unless the box was provably idle.
254+
- **Extend Gotcha 9's discipline**: a parity test for a tile kernel must ALSO
255+
run under deliberate CPU contention (a few busy loops are enough — see the
256+
reproduction above). Exact-when-idle is necessary, not sufficient.
257+
- **Keep tile residency short.** Long accumulation loops that live in tmm
258+
across many iterations (the 16×16×K pattern holds tmm0 for K/32 iterations)
259+
maximize the exposure window. Draining accumulators to memory more often
260+
shrinks it but does NOT close it — treat it as harm reduction, not a fix.
261+
- Production dispatch on virtualized hosts should either avoid AMX or pair it
262+
with a checksum/parity channel (e.g. a redundant ones-row whose expected
263+
value is known — the onebrc probe's count row doubles as exactly that).
264+
265+
Fault signature: `correct=true` in every quiet test, sporadic wrong results
266+
in production under load, AVX-512 siblings unaffected.
267+
268+
---
269+
225270
## Hardware tiers
226271

227272
```

.claude/blackboard.md

Lines changed: 108 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,114 @@
33
> **Read this first.** The "Polyglot Notebook" architecture below is a
44
> separate/older program, not the current epoch.
55
6+
## 2026-07-02 (later) — bf16 tile GEMM: VDPBF16PS middle tier + PackedBf16B (loose end closed)
7+
8+
Closed the [LOOSE END] from the 1BRC entry below. `hpc/bf16_tile_gemm.rs`
9+
is now a three-tier ladder — **AMX TDPBF16PS → AVX-512 VDPBF16PS →
10+
decode+FMA polyfill** — with the polyfill kernel (`simd_ops.rs`) untouched:
11+
12+
- **VDPBF16PS tier** (`avx512bf16_path`, private): bf16 pairs multiplied
13+
natively per zmm (no bf16→f32 decode), f32 lane accumulators, SAME VNNI
14+
operand layout as the AMX tile → one packed buffer serves both tile
15+
tiers. `_mm512_dpbf16_ps` verified stable on Rust 1.94. Runtime
16+
`is_x86_feature_detected!("avx512bf16")` (EMR box has it).
17+
- **`PackedBf16B`** + **`bf16_tile_gemm_16x16_packed`**: VNNI pack (and
18+
its per-call allocation) hoisted out of hot loops; `vnni_index(row,col)
19+
= (row/2)·32 + 2·col + (row&1)` supports staging B DIRECTLY in VNNI
20+
layout (zero pack cost — the right shape for one-hot/sparse staging).
21+
- **`bf16_tile_gemm_tier()`**: names the tier that will run (Gotcha 9
22+
reporting). Re-exports via `ndarray::simd::*` (W1a surface).
23+
- **Exactness boundary preserved (operator condition):** bit-exact across
24+
ALL tiers for bf16-exact integer operands with accumulation < 2^24 —
25+
asserted with `assert_eq!` in the new parity tests (vnni_index vs
26+
vnni_pack_bf16; packed==unpacked==i64 reference; VDPBF16PS exact +
27+
tolerance-parity vs polyfill on floats; accumulate semantics). Gotcha-14
28+
contention parity test included as `#[ignore]` (fails on oversubscribed
29+
VMs BY DESIGN; run `--ignored` on dedicated silicon).
30+
31+
[MEASURED] onebrc probe GEMM leg with direct-VNNI staging: **3.6 → 21.3
32+
Mrows/s (5.9×), 23.7 → 141.9 GMAC/s** (single thread — near the 169.7
33+
GMAC/s int8 AMX anchor in AMX_GOTCHAS). 413/413 stations still EXACT;
34+
8/8 lib tests + 2 doctests green; clippy/fmt clean.
35+
36+
[NOTE] Dispatch-behavior change signed off by operator: the row-major
37+
entry `bf16_tile_gemm_16x16` now routes avx512bf16-without-AMX hosts
38+
through VDPBF16PS instead of decode+FMA (bit-exact within the integer
39+
boundary; BF16-precision-class accumulation-order differences on general
40+
floats, same as any tier change).
41+
42+
[ADDED, same day] **LE byte contract on `PackedBf16B`** (operator "Go" —
43+
first brick of the SoA-Morton batch-writer / write-hiding design):
44+
`as_le_bytes()` (zero-cost reinterpret; LE by construction — the module
45+
is x86_64-only) + `from_le_bytes()` (endian-correct anywhere, plain copy
46+
on LE). This is the persistence/mailbox face per lance-graph's
47+
SoaEnvelope discipline (envelope bytes LE from creation to tombstone).
48+
Test `le_byte_view_roundtrips_and_is_truly_le` asserts byte 2i = low
49+
byte of lane i AND that a GEMM over the roundtripped buffer stays
50+
bit-exact. 9/9 lib tests green. Next bricks (lance-graph side): batch
51+
writer flushing tile buffers as envelope tenants; write-hiding = stage
52+
morsel N+1's VNNI writes while morsel N's tiles compute.
53+
54+
## 2026-07-02 — 1BRC-on-substrate probe (`examples/onebrc_cascade_probe.rs`)
55+
56+
1BRC workload (min/mean/max per station) restated on the substrate, as a
57+
sibling of `morton_cascade_probe`. Branch `claude/1brc-lance-graph-xfx5tu`.
58+
Three paths certified bit-for-bit against a scalar integer reference
59+
(413 stations, integer tenths → exact in f32/f64 by construction):
60+
61+
- **Morton scatter**: stations minted as cells on a 64×64 Morton grid
62+
(4×4 tile = one F32x16), morsel-batched (64K rows) scatter into
63+
L1-resident SoA accumulators, (min,max,Σ,n) monoid fold.
64+
- **AMX BF16 tile-GEMM group-by**: (Σ,n) as `C += A[16×K]·B[K×16]` via
65+
the NEW `ndarray::simd::bf16_tile_gemm_16x16_amx` re-export (W1a: the
66+
AMX-dispatching hpc wrapper surfaced through the canonical polyfill,
67+
same pattern as `matmul_i8_to_i32`; the `_amx` suffix disambiguates
68+
from the pure-FMA `simd::bf16_tile_gemm_16x16`) — B = per-row one-hot
69+
station indicator (26 column-blocks of 16), A rows = {1, hi(t), lo(t),
70+
bf16-RNE(t)} with the exactness split `hi=(t/256)·256, lo=t−hi` (both
71+
bf16-exact; f32 tile accumulation exact for K=4096). Clear-by-undo
72+
keeps B staging O(rows). AMX **actually ran** (amx_available()==true
73+
printed per Gotcha 9 discipline; EMR-class Xeon, kernel 6.18.5).
74+
- **Aggregate pyramid** over the tile grid: hierarchical (min,mean,max)
75+
per tile/region/root in the same pass + band-prune queries
76+
(Belichtungsmesser on the MIN channel).
77+
78+
[MEASURED] 10M rows, 4-core Xeon EMR VM, single thread:
79+
reference 453 Mrows/s | morton scatter 443 Mrows/s (**substrate tax ≈ 2%**)
80+
| tile-GEMM 3.6 Mrows/s = 23.7 GMAC/s (dense one-hot indicator = the
81+
honest price of group-by-as-matmul; per-call `vnni_pack_bf16` alloc in
82+
`bf16_tile_gemm_16x16` is a visible overhead) | pyramid fold 0.02 ms |
83+
band query prune 90.2%. All 413 stations EXACT on both paths; PASS.
84+
Also EXACT at 100M rows (idle). **"Is BF16 precise enough?" — measured:**
85+
the naive bf16-RNE row through the same tile gives max per-station
86+
|Δmean| = 0.0123 tenths (0.0012 °C, N≈24k/station — quantization bias
87+
averages out); single readings off by ≤ 2 tenths (half-ulp of bf16 at
88+
|t|∈[512,1024)). Verdict: bf16-direct fine for means, hi/lo split (free —
89+
spare A rows) required for min/max + exactness certification.
90+
91+
[FINDING → **Gotcha 14**, `.claude/AMX_GOTCHAS.md`] On this oversubscribed
92+
VM, **AMX tile state silently corrupts under host CPU contention**: idle
93+
= 413/413 exact at 100M rows; with 4 busy-loop competitors = 89-152/413
94+
(whole rows lost, no fault); guest-side core pinning does NOT mitigate
95+
(124/413); AVX-512 scatter path in the same run stays exact → isolated
96+
to TMM state; suspected host-vCPU-switch XTILEDATA loss. Consequences
97+
written into the gotcha: never certify AMX numerics on shared VMs; parity
98+
tests must also run under deliberate load (Gotcha 9 extension); short
99+
tile residency = harm reduction only.
100+
101+
[CROSS-REPO] Algebraic certification (partition/regroup invariance of the
102+
monoid fold, bf16 hi/lo decomposition exactness) lands as a diagnostic
103+
probe in `lance-graph/crates/jc` (`onebrc_agg`) — kernels here, proof
104+
there, per the architecture rule (ndarray = hardware, jc = proof).
105+
106+
[LOOSE END] AMX has no min/max tile op → min/max stay on the scatter
107+
path by construction. `bf16_tile_gemm_16x16` allocates + VNNI-packs B on
108+
every call — a pre-packed-B variant would lift the GEMM leg
109+
substantially; file under W1-adjacent if the group-by-as-GEMM shape
110+
recurs. Text-ingest leg (SWAR/SIMD parse of the 13 GB file) deliberately
111+
NOT probed here — separate probe if pursued (would exercise
112+
`byte_scan.rs`).
113+
6114
## 2026-06-28 — WASM SIMD128 backend filled in (`src/simd_wasm.rs`)
7115

8116
Replaced the commented-out scaffolding in `src/simd_wasm.rs` with a real

Cargo.toml

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -61,6 +61,12 @@ required-features = ["std"]
6161
name = "morton_cascade_probe"
6262
required-features = ["std"]
6363

64+
# 1BRC-on-substrate probe: Morton group-by + AMX BF16 tile-GEMM mean path
65+
# (imports `ndarray::simd` + `ndarray::hpc`, both std-gated).
66+
[[example]]
67+
name = "onebrc_cascade_probe"
68+
required-features = ["std"]
69+
6470
[[example]]
6571
name = "golden_helix_probe"
6672
required-features = ["std"]

0 commit comments

Comments
 (0)