3 | 3 | > **Read this first.** The "Polyglot Notebook" architecture below is a |
4 | 4 | > separate/older program, not the current epoch. |
5 | 5 |
| 6 | +## 2026-07-02 (later) — bf16 tile GEMM: VDPBF16PS middle tier + PackedBf16B (loose end closed) |
| 7 | + |
| 8 | +Closed the [LOOSE END] from the 1BRC entry below. `hpc/bf16_tile_gemm.rs` |
| 9 | +is now a three-tier ladder — **AMX TDPBF16PS → AVX-512 VDPBF16PS → |
| 10 | +decode+FMA polyfill** — with the polyfill kernel (`simd_ops.rs`) untouched: |
| 11 | + |
| 12 | +- **VDPBF16PS tier** (`avx512bf16_path`, private): bf16 pairs multiplied |
| 13 | + natively per zmm (no bf16→f32 decode), f32 lane accumulators, SAME VNNI |
| 14 | + operand layout as the AMX tile → one packed buffer serves both tile |
| 15 | + tiers. `_mm512_dpbf16_ps` verified stable on Rust 1.94. Runtime |
| 16 | + `is_x86_feature_detected!("avx512bf16")` (EMR box has it). |
| 17 | +- **`PackedBf16B`** + **`bf16_tile_gemm_16x16_packed`**: VNNI pack (and |
| 18 | + its per-call allocation) hoisted out of hot loops; `vnni_index(row,col) |
| 19 | + = (row/2)·32 + 2·col + (row&1)` supports staging B DIRECTLY in VNNI |
| 20 | + layout (zero pack cost — the right shape for one-hot/sparse staging). |
| 21 | +- **`bf16_tile_gemm_tier()`**: names the tier that will run (Gotcha 9 |
| 22 | + reporting). Re-exports via `ndarray::simd::*` (W1a surface). |
| 23 | +- **Exactness boundary preserved (operator condition):** bit-exact across |
| 24 | + ALL tiers for bf16-exact integer operands with accumulation < 2^24 — |
| 25 | + asserted with `assert_eq!` in the new parity tests (vnni_index vs |
| 26 | + vnni_pack_bf16; packed==unpacked==i64 reference; VDPBF16PS exact + |
| 27 | + tolerance-parity vs polyfill on floats; accumulate semantics). Gotcha-14 |
| 28 | + contention parity test included as `#[ignore]` (fails on oversubscribed |
| 29 | + VMs BY DESIGN; run `--ignored` on dedicated silicon). |
| 30 | + |
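The direct-staging claim above can be checked with a small sketch. It assumes B is a K×16 matrix of raw bf16 bit patterns (`u16`) with K even. `vnni_index` copies the formula quoted in the entry; `vnni_pack_rowmajor` and `stage_direct` are hypothetical helpers written for this illustration, not the crate's `vnni_pack_bf16` or `PackedBf16B`.

```rust
// Sketch only: illustrates the index formula from the entry above.
// `vnni_pack_rowmajor` and `stage_direct` are hypothetical names.

/// Offset of element (row, col) of a K×16 matrix inside the VNNI buffer.
/// Rows are stored in pairs: each pair occupies 32 consecutive lanes,
/// with the two rows interleaved column by column.
fn vnni_index(row: usize, col: usize) -> usize {
    (row / 2) * 32 + 2 * col + (row & 1)
}

/// Reference pack: row-major K×16 input, pair-interleaved output.
fn vnni_pack_rowmajor(b: &[u16], k: usize) -> Vec<u16> {
    assert!(k % 2 == 0, "K must be even: rows are packed in pairs");
    assert_eq!(b.len(), k * 16);
    let mut out = vec![0u16; k * 16];
    for pair in 0..k / 2 {
        for col in 0..16 {
            out[pair * 32 + 2 * col] = b[(2 * pair) * 16 + col];
            out[pair * 32 + 2 * col + 1] = b[(2 * pair + 1) * 16 + col];
        }
    }
    out
}

/// Direct staging: write each element straight to its VNNI slot,
/// so no separate pack pass is needed afterwards.
fn stage_direct(b: &[u16], k: usize) -> Vec<u16> {
    let mut out = vec![0u16; k * 16];
    for row in 0..k {
        for col in 0..16 {
            out[vnni_index(row, col)] = b[row * 16 + col];
        }
    }
    out
}

fn main() {
    let k = 8;
    let b: Vec<u16> = (0..(k * 16) as u16).collect();
    assert_eq!(stage_direct(&b, k), vnni_pack_rowmajor(&b, k));
    println!("direct staging matches the reference pack");
}
```

For a one-hot B, direct staging reduces to a single store per row at `vnni_index(row, station_col)`, which is why the entry calls it the right shape for sparse staging.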
| 31 | +[MEASURED] onebrc probe GEMM leg with direct-VNNI staging: **3.6 → 21.3 |
| 32 | +Mrows/s (5.9×), 23.7 → 141.9 GMAC/s** (single thread — near the 169.7 |
| 33 | +GMAC/s int8 AMX anchor in AMX_GOTCHAS). 413/413 stations still EXACT; |
| 34 | +8/8 lib tests + 2 doctests green; clippy/fmt clean. |
| 35 | + |
| 36 | +[NOTE] Dispatch-behavior change signed off by operator: the row-major |
| 37 | +entry `bf16_tile_gemm_16x16` now routes avx512bf16-without-AMX hosts |
| 38 | +through VDPBF16PS instead of decode+FMA (bit-exact within the integer |
| 39 | +boundary; BF16-precision-class accumulation-order differences on general |
| 40 | +floats, same as any tier change). |
| 41 | + |
| 42 | +[ADDED, same day] **LE byte contract on `PackedBf16B`** (operator "Go" — |
| 43 | +first brick of the SoA-Morton batch-writer / write-hiding design): |
| 44 | +`as_le_bytes()` (zero-cost reinterpret; LE by construction — the module |
| 45 | +is x86_64-only) + `from_le_bytes()` (endian-correct anywhere, plain copy |
| 46 | +on LE). This is the persistence/mailbox face per lance-graph's |
| 47 | +SoaEnvelope discipline (envelope bytes LE from creation to tombstone). |
| 48 | +Test `le_byte_view_roundtrips_and_is_truly_le` asserts byte 2i = low |
| 49 | +byte of lane i AND that a GEMM over the roundtripped buffer stays |
| 50 | +bit-exact. 9/9 lib tests green. Next bricks (lance-graph side): batch |
| 51 | +writer flushing tile buffers as envelope tenants; write-hiding = stage |
| 52 | +morsel N+1's VNNI writes while morsel N's tiles compute. |
| 53 | + |
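A minimal sketch of the little-endian byte contract described above, assuming the packed buffer is a plain slice of bf16 bit patterns (`u16`). The function names here are hypothetical and stand in for `as_le_bytes()` / `from_le_bytes()`. Unlike the zero-cost reinterpret in the entry, this version copies, which keeps it endian-correct on any host.

```rust
// Sketch only: hypothetical free functions, not the PackedBf16B API.

/// Serialize lanes so that byte 2i is the low byte of lane i.
fn lanes_to_le_bytes(lanes: &[u16]) -> Vec<u8> {
    lanes.iter().flat_map(|lane| lane.to_le_bytes()).collect()
}

/// Rebuild lanes from little-endian bytes; None if the length is odd.
fn lanes_from_le_bytes(bytes: &[u8]) -> Option<Vec<u16>> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(2)
            .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
            .collect(),
    )
}

fn main() {
    // bf16 bit patterns for 1.0, 2.0 and -100.0.
    let lanes = [0x3F80u16, 0x4000, 0xC2C8];
    let bytes = lanes_to_le_bytes(&lanes);
    for (i, lane) in lanes.iter().enumerate() {
        assert_eq!(bytes[2 * i], (*lane & 0xFF) as u8); // low byte first
        assert_eq!(bytes[2 * i + 1], (*lane >> 8) as u8); // high byte second
    }
    assert_eq!(lanes_from_le_bytes(&bytes).unwrap(), lanes);
    println!("roundtrip ok, {} bytes", bytes.len());
}
```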
| 54 | +## 2026-07-02 — 1BRC-on-substrate probe (`examples/onebrc_cascade_probe.rs`) |
| 55 | + |
| 56 | +1BRC workload (min/mean/max per station) restated on the substrate, as a |
| 57 | +sibling of `morton_cascade_probe`. Branch `claude/1brc-lance-graph-xfx5tu`. |
| 58 | +Three paths certified bit-for-bit against a scalar integer reference |
| 59 | +(413 stations, integer tenths → exact in f32/f64 by construction): |
| 60 | + |
| 61 | +- **Morton scatter**: stations minted as cells on a 64×64 Morton grid |
| 62 | + (4×4 tile = one F32x16), morsel-batched (64K rows) scatter into |
| 63 | + L1-resident SoA accumulators, (min,max,Σ,n) monoid fold. |
| 64 | +- **AMX BF16 tile-GEMM group-by**: (Σ,n) as `C += A[16×K]·B[K×16]` via |
| 65 | + the NEW `ndarray::simd::bf16_tile_gemm_16x16_amx` re-export (W1a: the |
| 66 | + AMX-dispatching hpc wrapper surfaced through the canonical polyfill, |
| 67 | + same pattern as `matmul_i8_to_i32`; the `_amx` suffix disambiguates |
| 68 | + from the pure-FMA `simd::bf16_tile_gemm_16x16`) — B = per-row one-hot |
| 69 | + station indicator (26 column-blocks of 16), A rows = {1, hi(t), lo(t), |
| 70 | + bf16-RNE(t)} with the exactness split `hi=(t/256)·256, lo=t−hi` (both |
| 71 | + bf16-exact; f32 tile accumulation exact for K=4096). Clear-by-undo |
| 72 | + keeps B staging O(rows). AMX **actually ran** (amx_available()==true |
| 73 | + printed per Gotcha 9 discipline; EMR-class Xeon, kernel 6.18.5). |
| 74 | +- **Aggregate pyramid** over the tile grid: hierarchical (min,mean,max) |
| 75 | + per tile/region/root in the same pass + band-prune queries |
| 76 | + (Belichtungsmesser on the MIN channel). |
| 77 | + |
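The exactness split in the tile-GEMM bullet can be checked in isolation. The sketch below assumes readings are integer tenths in −999..=999 (the usual 1BRC range, an assumption here) and uses a hand-written bf16 round-to-nearest-even conversion that ignores NaN handling. All function names are hypothetical.

```rust
// Sketch only: software bf16 conversion, not the crate's kernels.

/// f32 -> bf16 bit pattern, round-to-nearest-even (NaN not handled).
fn f32_to_bf16_rne(x: f32) -> u16 {
    let bits = x.to_bits();
    let round = 0x7FFF + ((bits >> 16) & 1);
    ((bits + round) >> 16) as u16
}

fn bf16_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

fn is_bf16_exact(x: f32) -> bool {
    bf16_to_f32(f32_to_bf16_rne(x)) == x
}

/// hi = (t/256)*256, lo = t - hi, as in the entry above.
fn split_hi_lo(t: i32) -> (i32, i32) {
    let hi = (t / 256) * 256;
    (hi, t - hi)
}

fn main() {
    let mut max_err = 0.0f32;
    for t in -999..=999 {
        let (hi, lo) = split_hi_lo(t);
        assert_eq!(hi + lo, t);
        // Both halves survive the bf16 round trip unchanged.
        assert!(is_bf16_exact(hi as f32) && is_bf16_exact(lo as f32));
        // The naive path rounds t itself to bf16.
        let err = (bf16_to_f32(f32_to_bf16_rne(t as f32)) - t as f32).abs();
        max_err = max_err.max(err);
    }
    // Worst naive error is half an ulp at |t| in [512, 1024): 2 tenths.
    assert_eq!(max_err, 2.0);
    // K = 4096 products of magnitude <= 768 stay below 2^24, so f32 sums are exact.
    let (k, max_hi): (u32, u32) = (4096, 768);
    assert!(k * max_hi < (1u32 << 24));
    println!("hi/lo split exact; naive bf16 max error = {max_err} tenths");
}
```

The split works because `lo` always fits in 8 significant bits and `hi` is a small multiple of 256, so each needs no more than the 8-bit bf16 significand.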
| 78 | +[MEASURED] 10M rows, 4-core Xeon EMR VM, single thread: |
| 79 | +reference 453 Mrows/s | morton scatter 443 Mrows/s (**substrate tax ≈ 2%**) |
| 80 | +| tile-GEMM 3.6 Mrows/s = 23.7 GMAC/s (dense one-hot indicator = the |
| 81 | +honest price of group-by-as-matmul; per-call `vnni_pack_bf16` alloc in |
| 82 | +`bf16_tile_gemm_16x16` is a visible overhead) | pyramid fold 0.02 ms | |
| 83 | +band query prune 90.2%. All 413 stations EXACT on both paths; PASS. |
| 84 | +Also EXACT at 100M rows (idle). **"Is BF16 precise enough?" — measured:** |
| 85 | +the naive bf16-RNE row through the same tile gives max per-station |
| 86 | +|Δmean| = 0.0123 tenths (0.0012 °C, N≈24k/station — quantization bias |
| 87 | +averages out); single readings off by ≤ 2 tenths (half-ulp of bf16 at |
| 88 | +|t|∈[512,1024)). Verdict: bf16-direct fine for means, hi/lo split (free — |
| 89 | +spare A rows) required for min/max + exactness certification. |
| 90 | + |
| 91 | +[FINDING → **Gotcha 14**, `.claude/AMX_GOTCHAS.md`] On this oversubscribed |
| 92 | +VM, **AMX tile state silently corrupts under host CPU contention**: idle |
| 93 | += 413/413 exact at 100M rows; with 4 busy-loop competitors = 89-152/413 |
| 94 | +(whole rows lost, no fault); guest-side core pinning does NOT mitigate |
| 95 | +(124/413); AVX-512 scatter path in the same run stays exact → isolated |
| 96 | +to TMM state; suspected host-vCPU-switch XTILEDATA loss. Consequences |
| 97 | +written into the gotcha: never certify AMX numerics on shared VMs; parity |
| 98 | +tests must also run under deliberate load (Gotcha 9 extension); short |
| 99 | +tile residency = harm reduction only. |
| 100 | + |
| 101 | +[CROSS-REPO] Algebraic certification (partition/regroup invariance of the |
| 102 | +monoid fold, bf16 hi/lo decomposition exactness) lands as a diagnostic |
| 103 | +probe in `lance-graph/crates/jc` (`onebrc_agg`) — kernels here, proof |
| 104 | +there, per the architecture rule (ndarray = hardware, jc = proof). |
| 105 | + |
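The partition/regroup invariance mentioned above can be illustrated with a small (min, max, Σ, n) monoid over integer tenths. This is a sketch under assumptions: the `Agg` type and `fold_chunked` helper are hypothetical and are not the `onebrc_agg` probe or the SoA accumulators from the scatter path.

```rust
// Sketch only: a scalar monoid fold, hypothetical names throughout.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Agg {
    min: i64,
    max: i64,
    sum: i64,
    n: u64,
}

impl Agg {
    /// Identity element: combining with EMPTY changes nothing.
    const EMPTY: Agg = Agg { min: i64::MAX, max: i64::MIN, sum: 0, n: 0 };

    fn of(t: i64) -> Agg {
        Agg { min: t, max: t, sum: t, n: 1 }
    }

    /// Associative and commutative, so any grouping gives the same result.
    fn combine(self, other: Agg) -> Agg {
        Agg {
            min: self.min.min(other.min),
            max: self.max.max(other.max),
            sum: self.sum + other.sum,
            n: self.n + other.n,
        }
    }
}

/// Fold each chunk separately, then fold the per-chunk partials.
fn fold_chunked(xs: &[i64], chunk: usize) -> Agg {
    xs.chunks(chunk)
        .map(|c| c.iter().fold(Agg::EMPTY, |acc, &t| acc.combine(Agg::of(t))))
        .fold(Agg::EMPTY, Agg::combine)
}

fn main() {
    // Synthetic readings in tenths, spread over -999..=999.
    let readings: Vec<i64> = (0..1000).map(|i| (i * 37 % 1999) - 999).collect();
    let whole = fold_chunked(&readings, readings.len());
    for chunk in [1, 7, 64, 333] {
        assert_eq!(fold_chunked(&readings, chunk), whole);
    }
    println!("{whole:?}");
}
```

Because the sums are integers, the regrouped results match bit for bit. The mean is derived once at the end as `sum / n`, so it inherits the same invariance.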
| 106 | +[LOOSE END] AMX has no min/max tile op → min/max stay on the scatter |
| 107 | +path by construction. `bf16_tile_gemm_16x16` allocates + VNNI-packs B on |
| 108 | +every call — a pre-packed-B variant would lift the GEMM leg |
| 109 | +substantially; file under W1-adjacent if the group-by-as-GEMM shape |
| 110 | +recurs. Text-ingest leg (SWAR/SIMD parse of the 13 GB file) deliberately |
| 111 | +NOT probed here — separate probe if pursued (would exercise |
| 112 | +`byte_scan.rs`). |
| 113 | + |
6 | 114 | ## 2026-06-28 — WASM SIMD128 backend filled in (`src/simd_wasm.rs`) |
7 | 115 |
8 | 116 | Replaced the commented-out scaffolding in `src/simd_wasm.rs` with a real |