1BRC-on-substrate probe + BF16 tile-GEMM tier ladder (VDPBF16PS, PackedBf16B, LE contract) + AMX Gotcha 14 by AdaWorldAPI · Pull Request #227 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-07-02T17:38:16Z

What

Restates the One Billion Row Challenge workload (min/mean/max per station) on the Morton/gridlake substrate as a certified probe, and hardens the BF16 tile-GEMM path it exercises into a three-tier ladder with a pre-packed operand carrier and an explicit little-endian byte contract.

`examples/onebrc_cascade_probe.rs` — the probe

Three aggregation paths, all certified bit-exact against a scalar i16/i64 reference (413 stations, integer tenths):

Morton scatter — stations minted as cells on a 64×64 Z-order grid, morsel-batched (64K rows) into L1-resident SoA accumulators; (min, max, Σ, n) monoid fold. Measured substrate tax vs the raw reference: ~2% (448 vs 457 Mrows/s).
BF16 tile-GEMM group-by — (Σ, n) as C += A[16×K]·B[K×16] with per-row one-hot station indicators. The hi/lo split (hi = (t/256)·256, lo = t − hi) keeps every operand bf16-exact; A-row 3 carries naive bf16-RNE temps through the same tile as a measured answer to "is BF16 precise enough": per-station mean error 0.0123 tenths at 10M rows, single readings ≤ 2 tenths (half-ulp).
Aggregate pyramid — hierarchical (min, mean, max) per tile/region/root in the same pass; band-prune queries (90.2% prune on the demo band).

`src/hpc/bf16_tile_gemm.rs` — tier ladder + packed B

New middle tier: AVX-512 VDPBF16PS — native bf16-pair multiply per zmm, f32 lane accumulators, no bf16→f32 decode. Same VNNI operand layout as the AMX tile, so one packed buffer serves both tile tiers. _mm512_dpbf16_ps verified stable on Rust 1.94. Ladder: AMX TDPBF16PS → AVX-512 VDPBF16PS → decode + F32x16 FMA polyfill (polyfill kernel untouched).
PackedBf16B + bf16_tile_gemm_16x16_packed — hoists the per-call VNNI pack (and its allocation) out of hot loops; vnni_index(row, col) supports staging B directly in VNNI layout (zero pack cost for one-hot/sparse staging). Probe effect: GEMM leg 3.6 → 21.3 Mrows/s (5.9×), 23.7 → 141.9 GMAC/s single-thread.
LE byte contract — as_le_bytes() (zero-cost reinterpret; LE by construction, module is x86_64-only) / from_le_bytes() (endian-correct anywhere). The persistence/mailbox face for a downstream batch writer, per the lance-graph SoaEnvelope discipline; contract test asserts byte 2i = low byte of lane i and GEMM-over-roundtripped-bytes stays bit-exact.
bf16_tile_gemm_tier() — names the dispatch tier for run reports (Gotcha 9 discipline).
src/simd.rs re-exports the new surface through ndarray::simd::* (W1a); the _amx suffix keeps the pure-polyfill kernel and the tile-dispatching wrapper distinct.

`.claude/AMX_GOTCHAS.md` — new Gotcha 14 (discovered by the probe)

On an oversubscribed VM, AMX tile state silently corrupts under host CPU contention: idle = 413/413 exact at 100M rows; with 4 busy-loop competitors = 89–152/413 (whole rows lost, no fault); guest-side core pinning does not mitigate; the AVX-512 path in the same run stays exact, isolating the corruption to TMM state. Suspected host-vCPU-switch XTILEDATA loss. Consequences documented: never certify AMX numerics on shared VMs; parity tests must also run under deliberate load. The corresponding tile_parity_under_cpu_contention test ships #[ignore]d (fails on oversubscribed VMs by design; run --ignored on dedicated silicon).

Exactness boundary

All tiers are bit-exact for bf16-exact integer operands with accumulation < 2²⁴, asserted with assert_eq! (never tolerance) in the parity tests. Cross-repo: the algebra (partition/regroup invariance of the monoid fold, bf16 hi/lo decomposition over all 1999 tenth-values) is certified independently in lance-graph/crates/jc (onebrc_agg probe, same branch) — kernels here, proof there.

Testing

cargo test --release --lib bf16_tile_gemm: 9 passed, 1 ignored (Gotcha 14, by design) — includes VDPBF16PS exact-integer parity on real avx512bf16 silicon
2 doctests green; cargo clippy --release --lib -- -D warnings clean; cargo fmt --check clean
Probe: PASS at 10M and 100M rows (idle), [AMX TDPBF16PS] tier confirmed active

🤖 Generated with Claude Code

https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH

Generated by Claude Code

…+ AMX BF16 tile-GEMM leg Three certified paths for min/mean/max per station (413 stations, integer tenths, bit-exact against a scalar i16/i64 reference): - Morton scatter: stations minted as cells on a 64x64 Z-order grid, morsel-batched (64K rows) into L1-resident SoA accumulators; measured substrate tax vs raw hash-style reference ~2% (443 vs 453 Mrows/s). - BF16 tile-GEMM group-by: (sum, n) as C += A[16xK]*B[Kx16] with per-row one-hot station indicators and A rows {1, hi(t), lo(t), bf16(t)}; hi/lo split keeps every operand bf16-exact, f32 tile accumulation exact at K=4096. Routed through the new simd re-export (below). bf16-direct row measures the no-split cost: max |dmean| = 0.0123 tenths at 10M rows; single readings off by <= 4 tenths. - Aggregate pyramid over Morton tiles: hierarchical (min,mean,max) per tile/region/root in one pass, band-prune queries (90.2% prune). simd.rs: re-export hpc::bf16_tile_gemm::bf16_tile_gemm_16x16 as simd::bf16_tile_gemm_16x16_amx (W1a surface alignment, same pattern as matmul_i8_to_i32; _amx suffix disambiguates the pure-FMA polyfill kernel). AMX_GOTCHAS.md: new Gotcha 14, discovered by this probe - on an oversubscribed VM, AMX tile state silently corrupts under host CPU contention (idle: 413/413 exact at 100M rows; 4 busy loops: 89-152/413, rows lost without faulting; guest core pinning does not mitigate; AVX-512 path in the same run stays exact). Certification of AMX numerics requires bare metal or provably idle hosts, and parity tests must also run under deliberate load. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH

The jc onebrc_agg certification (lance-graph 406b3a0) measured the exact bound exhaustively: RNE errs by at most HALF an ulp, so the bf16-direct single-reading error is <= 2 tenths (0.2 C), not 4. Attained at the range extremes. Wording-only fix in the probe report and blackboard. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH

Three-tier runtime ladder, polyfill kernel untouched: AMX TDPBF16PS -> AVX-512 VDPBF16PS -> decode + F32x16 FMA polyfill. - avx512bf16_path (private): native bf16-pair multiply per zmm via _mm512_dpbf16_ps (stable Rust 1.94, verified), f32 lane accumulators, no bf16->f32 decode. Same VNNI operand layout as the AMX tile, so one packed buffer serves both tile tiers. - PackedBf16B + bf16_tile_gemm_16x16_packed: hoist the per-call VNNI pack and its allocation out of hot loops; vnni_index(row, col) lets consumers stage B directly in VNNI layout (zero pack cost for sparse / one-hot staging). - bf16_tile_gemm_tier(): names the dispatch tier for run reports (Gotcha 9 discipline). - simd.rs: re-export the new surface through ndarray::simd::* (W1a), polyfill name untouched. Exactness boundary preserved: all tiers bit-exact for bf16-exact integer operands with accumulation < 2^24, asserted by assert_eq! parity tests (vnni_index vs vnni_pack_bf16, packed == unpacked == i64 reference, VDPBF16PS exact-integer + float tolerance vs polyfill, accumulate semantics). Gotcha-14 contention parity test ships #[ignore]d - it fails on oversubscribed VMs by design; run --ignored on dedicated silicon. onebrc_cascade_probe measured effect (direct-VNNI one-hot staging): GEMM leg 3.6 -> 21.3 Mrows/s (5.9x), 23.7 -> 141.9 GMAC/s single-thread, 413/413 stations still bit-exact. 8 lib tests + 2 doctests green; clippy -D warnings clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH

…_le_bytes) The persistence/mailbox face of the packed tile buffer, per the lance-graph SoaEnvelope discipline (envelope bytes are LE from creation to tombstone): - as_le_bytes(): zero-cost &[u16] -> &[u8] reinterpret; LE by construction since this module is cfg(x86_64) and x86_64 is LE-only. - from_le_bytes(): endian-correct rebuild via u16::from_le_bytes (compiles to a plain copy on LE targets). - Contract test asserts byte 2i == low byte of lane i (true LE, not just native) and that a GEMM over the roundtripped buffer stays bit-exact. First brick of the SoA-Morton batch-writer / write-hiding design; the writer itself lands lance-graph-side. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH

coderabbitai · 2026-07-02T17:38:26Z

Warning

Review limit reached

@AdaWorldAPI, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 42 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4baad4c3-36d9-4af6-a0dd-89ec14c1833f

📥 Commits

Reviewing files that changed from the base of the PR and between 8c381a6 and 5f2c7fc.

📒 Files selected for processing (6)

.claude/AMX_GOTCHAS.md
.claude/blackboard.md
Cargo.toml
examples/onebrc_cascade_probe.rs
src/hpc/bf16_tile_gemm.rs
src/simd.rs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

…iLaneColumn Follow-up unblocking the gridlake wiring (lance-graph #635 COMMENTARY): lane J's GridBatch carries i32 min/max and i64 sum columns, but MultiLaneColumn only exposed f32/f64/u64/u8 lane views — #227's onebrc gridlake probe got away with f32 min/max columns. Add the signed integer lane widths so a batch SoA can be viewed through the gridlake carrier directly, no f32 recast. - `i32x16_from_chunk` / `i64x8_from_chunk` — LE decoders mirroring the existing `f32x16_from_chunk` / `u64x8_from_chunk` (scalar `from_le_bytes` loop, lowered to a single register-width load on LE targets; no pointer cast of the u8-aligned Arc<[u8]>). - `iter_i32x16` / `iter_i64x8` methods + `len_i32x16` / `len_i64x8`, routed through `crate::simd::{I32x16, I64x8}` per the W1a layering rule (never dipping into simd_avx512/simd_neon/scalar directly). - Parity tests: `iter_i32x16_le_round_trip` (incl. negatives, proves sign-extension survives the decode) + `iter_i64x8_le_round_trip`; extended the empty-count, 3-lane-count, and len asserts. These are layout-only zero-copy reinterpretations of the backing store (the same category as the existing typed iterators), not new compute kernels — no per-arch AVX/NEON/scalar backend needed beyond the lane types crate::simd already provides. simd_soa: 13/13 tests pass; clippy -D warnings clean; fmt clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01MLBnPuScZy6w9di2QEjsXM

claude added 4 commits July 2, 2026 17:00

AdaWorldAPI merged commit de72d15 into master Jul 2, 2026
17 checks passed

AdaWorldAPI mentioned this pull request Jul 2, 2026

feat(simd_soa): iter_i32x16 / iter_i64x8 typed lane iterators on MultiLaneColumn #228

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1BRC-on-substrate probe + BF16 tile-GEMM tier ladder (VDPBF16PS, PackedBf16B, LE contract) + AMX Gotcha 14#227

1BRC-on-substrate probe + BF16 tile-GEMM tier ladder (VDPBF16PS, PackedBf16B, LE contract) + AMX Gotcha 14#227
AdaWorldAPI merged 4 commits into
masterfrom
claude/1brc-lance-graph-xfx5tu

AdaWorldAPI commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026

Review limit reached

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AdaWorldAPI commented Jul 2, 2026

What

examples/onebrc_cascade_probe.rs — the probe

src/hpc/bf16_tile_gemm.rs — tier ladder + packed B

.claude/AMX_GOTCHAS.md — new Gotcha 14 (discovered by the probe)

Exactness boundary

Testing

Uh oh!

coderabbitai Bot commented Jul 2, 2026

Review limit reached

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`examples/onebrc_cascade_probe.rs` — the probe

`src/hpc/bf16_tile_gemm.rs` — tier ladder + packed B

`.claude/AMX_GOTCHAS.md` — new Gotcha 14 (discovered by the probe)