AdaWorldAPI · AdaWorldAPI · Jul 2, 2026 · Jul 2, 2026 · Jul 2, 2026 · Jul 2, 2026
diff --git a/.claude/AMX_GOTCHAS.md b/.claude/AMX_GOTCHAS.md
@@ -43,6 +43,7 @@ int8 2048³ = 169.7 GMAC/s, 600× scalar, single-thread
 | runs fine, `correct=false` | operand index/sign convention mirrored | Gotcha 12 |
 | compile error `unstable x86_amx_intrinsics` | used nightly intrinsics | Gotcha 1, 8 |
 | compile error `rbx is used internally by LLVM` | inline-asm CPUID | Gotcha 3 |
+| exact when idle, silently wrong under CPU load (VM) | tile state lost across host vCPU switch | Gotcha 14 |
 
 ---
 
@@ -222,6 +223,50 @@ once is correct even under rayon. `cpu_model()` is cached the same way.
 
 ---
 
+## Gotcha 14: on oversubscribed VMs, tile state is silently corrupted under host CPU contention  ⚑
+
+Observed 2026-07-02 on this remote VM (4 vCPU, EMR-class Xeon, guest kernel
+6.18.5) by `examples/onebrc_cascade_probe.rs`, reproduced on demand:
+
+```
+idle:                       413/413 stations bit-exact (10M and 100M rows)
+4 busy-loop competitors:    89/413, 152/413 exact — whole rows LOST, no fault
+probe pinned to core 0,
+load pinned to cores 1-3:   124/413 exact — pinning does NOT mitigate
+idle control right after:   413/413 exact again
+```
+
+Signature: **no crash, no SIGSEGV/SIGILL — results are silently wrong, and
+only under load.** An AVX-512 path in the same process, same run, stays
+bit-exact, isolating the corruption to TMM tile state (the tmm0 accumulator
+loses in-flight partial sums). Because guest-side pinning doesn't help, the
+suspected mechanism is the **host** hypervisor's vCPU context switch failing
+to save/restore guest `XTILEDATA` when the host multiplexes oversubscribed
+pCPUs (idle guests keep their vCPUs resident → no corruption; loaded guests
+get switched → corruption). Guest-side `arch_prctl` permission (Gotcha 4) is
+correctly granted — this is a layer below the guest kernel.
+
+Consequences:
+
+- **Never certify AMX numerics from a shared/oversubscribed VM.** Bare metal
+  or a dedicated-CPU instance only. A "PASS on my cloud box" is worthless
+  under this gotcha unless the box was provably idle.
+- **Extend Gotcha 9's discipline**: a parity test for a tile kernel must ALSO
+  run under deliberate CPU contention (a few busy loops are enough — see the
+  reproduction above). Exact-when-idle is necessary, not sufficient.
+- **Keep tile residency short.** Long accumulation loops that live in tmm
+  across many iterations (the 16×16×K pattern holds tmm0 for K/32 iterations)
+  maximize the exposure window. Draining accumulators to memory more often
+  shrinks it but does NOT close it — treat it as harm reduction, not a fix.
+- Production dispatch on virtualized hosts should either avoid AMX or pair it
+  with a checksum/parity channel (e.g. a redundant ones-row whose expected
+  value is known — the onebrc probe's count row doubles as exactly that).
+
+Fault signature: `correct=true` in every quiet test, sporadic wrong results
+in production under load, AVX-512 siblings unaffected.
+
+---
+
 ## Hardware tiers
 
 ```

diff --git a/.claude/blackboard.md b/.claude/blackboard.md
@@ -3,6 +3,114 @@
 > **Read this first.** The "Polyglot Notebook" architecture below is a
 > separate/older program, not the current epoch.
 
+## 2026-07-02 (later) — bf16 tile GEMM: VDPBF16PS middle tier + PackedBf16B (loose end closed)
+
+Closed the [LOOSE END] from the 1BRC entry below. `hpc/bf16_tile_gemm.rs`
+is now a three-tier ladder — **AMX TDPBF16PS → AVX-512 VDPBF16PS →
+decode+FMA polyfill** — with the polyfill kernel (`simd_ops.rs`) untouched:
+
+- **VDPBF16PS tier** (`avx512bf16_path`, private): bf16 pairs multiplied
+  natively per zmm (no bf16→f32 decode), f32 lane accumulators, SAME VNNI
+  operand layout as the AMX tile → one packed buffer serves both tile
+  tiers. `_mm512_dpbf16_ps` verified stable on Rust 1.94. Runtime
+  `is_x86_feature_detected!("avx512bf16")` (EMR box has it).
+- **`PackedBf16B`** + **`bf16_tile_gemm_16x16_packed`**: VNNI pack (and
+  its per-call allocation) hoisted out of hot loops; `vnni_index(row,col)
+  = (row/2)·32 + 2·col + (row&1)` supports staging B DIRECTLY in VNNI
+  layout (zero pack cost — the right shape for one-hot/sparse staging).
+- **`bf16_tile_gemm_tier()`**: names the tier that will run (Gotcha 9
+  reporting). Re-exports via `ndarray::simd::*` (W1a surface).
+- **Exactness boundary preserved (operator condition):** bit-exact across
+  ALL tiers for bf16-exact integer operands with accumulation < 2^24 —
+  asserted with `assert_eq!` in the new parity tests (vnni_index vs
+  vnni_pack_bf16; packed==unpacked==i64 reference; VDPBF16PS exact +
+  tolerance-parity vs polyfill on floats; accumulate semantics). Gotcha-14
+  contention parity test included as `#[ignore]` (fails on oversubscribed
+  VMs BY DESIGN; run `--ignored` on dedicated silicon).
+
+[MEASURED] onebrc probe GEMM leg with direct-VNNI staging: **3.6 → 21.3
+Mrows/s (5.9×), 23.7 → 141.9 GMAC/s** (single thread — near the 169.7
+GMAC/s int8 AMX anchor in AMX_GOTCHAS). 413/413 stations still EXACT;
+8/8 lib tests + 2 doctests green; clippy/fmt clean.
+
+[NOTE] Dispatch-behavior change signed off by operator: the row-major
+entry `bf16_tile_gemm_16x16` now routes avx512bf16-without-AMX hosts
+through VDPBF16PS instead of decode+FMA (bit-exact within the integer
+boundary; BF16-precision-class accumulation-order differences on general
+floats, same as any tier change).
+
+[ADDED, same day] **LE byte contract on `PackedBf16B`** (operator "Go" —
+first brick of the SoA-Morton batch-writer / write-hiding design):
+`as_le_bytes()` (zero-cost reinterpret; LE by construction — the module
+is x86_64-only) + `from_le_bytes()` (endian-correct anywhere, plain copy
+on LE). This is the persistence/mailbox face per lance-graph's
+SoaEnvelope discipline (envelope bytes LE from creation to tombstone).
+Test `le_byte_view_roundtrips_and_is_truly_le` asserts byte 2i = low
+byte of lane i AND that a GEMM over the roundtripped buffer stays
+bit-exact. 9/9 lib tests green. Next bricks (lance-graph side): batch
+writer flushing tile buffers as envelope tenants; write-hiding = stage
+morsel N+1's VNNI writes while morsel N's tiles compute.
+
+## 2026-07-02 — 1BRC-on-substrate probe (`examples/onebrc_cascade_probe.rs`)
+
+1BRC workload (min/mean/max per station) restated on the substrate, as a
+sibling of `morton_cascade_probe`. Branch `claude/1brc-lance-graph-xfx5tu`.
+Three paths certified bit-for-bit against a scalar integer reference
+(413 stations, integer tenths → exact in f32/f64 by construction):
+
+- **Morton scatter**: stations minted as cells on a 64×64 Morton grid
+  (4×4 tile = one F32x16), morsel-batched (64K rows) scatter into
+  L1-resident SoA accumulators, (min,max,Σ,n) monoid fold.
+- **AMX BF16 tile-GEMM group-by**: (Σ,n) as `C += A[16×K]·B[K×16]` via
+  the NEW `ndarray::simd::bf16_tile_gemm_16x16_amx` re-export (W1a: the
+  AMX-dispatching hpc wrapper surfaced through the canonical polyfill,
+  same pattern as `matmul_i8_to_i32`; the `_amx` suffix disambiguates
+  from the pure-FMA `simd::bf16_tile_gemm_16x16`) — B = per-row one-hot
+  station indicator (26 column-blocks of 16), A rows = {1, hi(t), lo(t),
+  bf16-RNE(t)} with the exactness split `hi=(t/256)·256, lo=t−hi` (both
+  bf16-exact; f32 tile accumulation exact for K=4096). Clear-by-undo
+  keeps B staging O(rows). AMX **actually ran** (amx_available()==true
+  printed per Gotcha 9 discipline; EMR-class Xeon, kernel 6.18.5).
+- **Aggregate pyramid** over the tile grid: hierarchical (min,mean,max)
+  per tile/region/root in the same pass + band-prune queries
+  (Belichtungsmesser on the MIN channel).
+
+[MEASURED] 10M rows, 4-core Xeon EMR VM, single thread:
+reference 453 Mrows/s | morton scatter 443 Mrows/s (**substrate tax ≈ 2%**)
+| tile-GEMM 3.6 Mrows/s = 23.7 GMAC/s (dense one-hot indicator = the
+honest price of group-by-as-matmul; per-call `vnni_pack_bf16` alloc in
+`bf16_tile_gemm_16x16` is a visible overhead) | pyramid fold 0.02 ms |
+band query prune 90.2%. All 413 stations EXACT on both paths; PASS.
+Also EXACT at 100M rows (idle). **"Is BF16 precise enough?" — measured:**
+the naive bf16-RNE row through the same tile gives max per-station
+|Δmean| = 0.0123 tenths (0.0012 °C, N≈24k/station — quantization bias
+averages out); single readings off by ≤ 2 tenths (half-ulp of bf16 at
+|t|∈[512,1024)). Verdict: bf16-direct fine for means, hi/lo split (free —
+spare A rows) required for min/max + exactness certification.
+
+[FINDING → **Gotcha 14**, `.claude/AMX_GOTCHAS.md`] On this oversubscribed
+VM, **AMX tile state silently corrupts under host CPU contention**: idle
+= 413/413 exact at 100M rows; with 4 busy-loop competitors = 89-152/413
+(whole rows lost, no fault); guest-side core pinning does NOT mitigate
+(124/413); AVX-512 scatter path in the same run stays exact → isolated
+to TMM state; suspected host-vCPU-switch XTILEDATA loss. Consequences
+written into the gotcha: never certify AMX numerics on shared VMs; parity
+tests must also run under deliberate load (Gotcha 9 extension); short
+tile residency = harm reduction only.
+
+[CROSS-REPO] Algebraic certification (partition/regroup invariance of the
+monoid fold, bf16 hi/lo decomposition exactness) lands as a diagnostic
+probe in `lance-graph/crates/jc` (`onebrc_agg`) — kernels here, proof
+there, per the architecture rule (ndarray = hardware, jc = proof).
+
+[LOOSE END] AMX has no min/max tile op → min/max stay on the scatter
+path by construction. `bf16_tile_gemm_16x16` allocates + VNNI-packs B on
+every call — a pre-packed-B variant would lift the GEMM leg
+substantially; file under W1-adjacent if the group-by-as-GEMM shape
+recurs. Text-ingest leg (SWAR/SIMD parse of the 13 GB file) deliberately
+NOT probed here — separate probe if pursued (would exercise
+`byte_scan.rs`).
+
 ## 2026-06-28 — WASM SIMD128 backend filled in (`src/simd_wasm.rs`)
 
 Replaced the commented-out scaffolding in `src/simd_wasm.rs` with a real

diff --git a/Cargo.toml b/Cargo.toml
@@ -61,6 +61,12 @@ required-features = ["std"]
 name = "morton_cascade_probe"
 required-features = ["std"]
 
+# 1BRC-on-substrate probe: Morton group-by + AMX BF16 tile-GEMM mean path
+# (imports `ndarray::simd` + `ndarray::hpc`, both std-gated).
+[[example]]
+name = "onebrc_cascade_probe"
+required-features = ["std"]
+
 [[example]]
 name = "golden_helix_probe"
 required-features = ["std"]