Skip to content

1BRC-on-substrate probe + BF16 tile-GEMM tier ladder (VDPBF16PS, PackedBf16B, LE contract) + AMX Gotcha 14#227

Merged
AdaWorldAPI merged 4 commits into
masterfrom
claude/1brc-lance-graph-xfx5tu
Jul 2, 2026
Merged

1BRC-on-substrate probe + BF16 tile-GEMM tier ladder (VDPBF16PS, PackedBf16B, LE contract) + AMX Gotcha 14#227
AdaWorldAPI merged 4 commits into
masterfrom
claude/1brc-lance-graph-xfx5tu

Conversation

@AdaWorldAPI

Copy link
Copy Markdown
Owner

What

Restates the One Billion Row Challenge workload (min/mean/max per station) on the Morton/gridlake substrate as a certified probe, and hardens the BF16 tile-GEMM path it exercises into a three-tier ladder with a pre-packed operand carrier and an explicit little-endian byte contract.

examples/onebrc_cascade_probe.rs — the probe

Three aggregation paths, all certified bit-exact against a scalar i16/i64 reference (413 stations, integer tenths):

  • Morton scatter — stations minted as cells on a 64×64 Z-order grid, morsel-batched (64K rows) into L1-resident SoA accumulators; (min, max, Σ, n) monoid fold. Measured substrate tax vs the raw reference: ~2% (448 vs 457 Mrows/s).
  • BF16 tile-GEMM group-by(Σ, n) as C += A[16×K]·B[K×16] with per-row one-hot station indicators. The hi/lo split (hi = (t/256)·256, lo = t − hi) keeps every operand bf16-exact; A-row 3 carries naive bf16-RNE temps through the same tile as a measured answer to "is BF16 precise enough": per-station mean error 0.0123 tenths at 10M rows, single readings ≤ 2 tenths (half-ulp).
  • Aggregate pyramid — hierarchical (min, mean, max) per tile/region/root in the same pass; band-prune queries (90.2% prune on the demo band).

src/hpc/bf16_tile_gemm.rs — tier ladder + packed B

  • New middle tier: AVX-512 VDPBF16PS — native bf16-pair multiply per zmm, f32 lane accumulators, no bf16→f32 decode. Same VNNI operand layout as the AMX tile, so one packed buffer serves both tile tiers. _mm512_dpbf16_ps verified stable on Rust 1.94. Ladder: AMX TDPBF16PS → AVX-512 VDPBF16PS → decode + F32x16 FMA polyfill (polyfill kernel untouched).
  • PackedBf16B + bf16_tile_gemm_16x16_packed — hoists the per-call VNNI pack (and its allocation) out of hot loops; vnni_index(row, col) supports staging B directly in VNNI layout (zero pack cost for one-hot/sparse staging). Probe effect: GEMM leg 3.6 → 21.3 Mrows/s (5.9×), 23.7 → 141.9 GMAC/s single-thread.
  • LE byte contractas_le_bytes() (zero-cost reinterpret; LE by construction, module is x86_64-only) / from_le_bytes() (endian-correct anywhere). The persistence/mailbox face for a downstream batch writer, per the lance-graph SoaEnvelope discipline; contract test asserts byte 2i = low byte of lane i and GEMM-over-roundtripped-bytes stays bit-exact.
  • bf16_tile_gemm_tier() — names the dispatch tier for run reports (Gotcha 9 discipline).
  • src/simd.rs re-exports the new surface through ndarray::simd::* (W1a); the _amx suffix keeps the pure-polyfill kernel and the tile-dispatching wrapper distinct.

.claude/AMX_GOTCHAS.md — new Gotcha 14 (discovered by the probe)

On an oversubscribed VM, AMX tile state silently corrupts under host CPU contention: idle = 413/413 exact at 100M rows; with 4 busy-loop competitors = 89–152/413 (whole rows lost, no fault); guest-side core pinning does not mitigate; the AVX-512 path in the same run stays exact, isolating the corruption to TMM state. Suspected host-vCPU-switch XTILEDATA loss. Consequences documented: never certify AMX numerics on shared VMs; parity tests must also run under deliberate load. The corresponding tile_parity_under_cpu_contention test ships #[ignore]d (fails on oversubscribed VMs by design; run --ignored on dedicated silicon).

Exactness boundary

All tiers are bit-exact for bf16-exact integer operands with accumulation < 2²⁴, asserted with assert_eq! (never tolerance) in the parity tests. Cross-repo: the algebra (partition/regroup invariance of the monoid fold, bf16 hi/lo decomposition over all 1999 tenth-values) is certified independently in lance-graph/crates/jc (onebrc_agg probe, same branch) — kernels here, proof there.

Testing

  • cargo test --release --lib bf16_tile_gemm: 9 passed, 1 ignored (Gotcha 14, by design) — includes VDPBF16PS exact-integer parity on real avx512bf16 silicon
  • 2 doctests green; cargo clippy --release --lib -- -D warnings clean; cargo fmt --check clean
  • Probe: PASS at 10M and 100M rows (idle), [AMX TDPBF16PS] tier confirmed active

🤖 Generated with Claude Code

https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH


Generated by Claude Code

claude added 4 commits July 2, 2026 17:00
…+ AMX BF16 tile-GEMM leg

Three certified paths for min/mean/max per station (413 stations, integer
tenths, bit-exact against a scalar i16/i64 reference):

- Morton scatter: stations minted as cells on a 64x64 Z-order grid,
  morsel-batched (64K rows) into L1-resident SoA accumulators; measured
  substrate tax vs raw hash-style reference ~2% (443 vs 453 Mrows/s).
- BF16 tile-GEMM group-by: (sum, n) as C += A[16xK]*B[Kx16] with per-row
  one-hot station indicators and A rows {1, hi(t), lo(t), bf16(t)};
  hi/lo split keeps every operand bf16-exact, f32 tile accumulation
  exact at K=4096. Routed through the new simd re-export (below).
  bf16-direct row measures the no-split cost: max |dmean| = 0.0123
  tenths at 10M rows; single readings off by <= 4 tenths.
- Aggregate pyramid over Morton tiles: hierarchical (min,mean,max) per
  tile/region/root in one pass, band-prune queries (90.2% prune).

simd.rs: re-export hpc::bf16_tile_gemm::bf16_tile_gemm_16x16 as
simd::bf16_tile_gemm_16x16_amx (W1a surface alignment, same pattern as
matmul_i8_to_i32; _amx suffix disambiguates the pure-FMA polyfill kernel).

AMX_GOTCHAS.md: new Gotcha 14, discovered by this probe - on an
oversubscribed VM, AMX tile state silently corrupts under host CPU
contention (idle: 413/413 exact at 100M rows; 4 busy loops: 89-152/413,
rows lost without faulting; guest core pinning does not mitigate;
AVX-512 path in the same run stays exact). Certification of AMX numerics
requires bare metal or provably idle hosts, and parity tests must also
run under deliberate load.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH
The jc onebrc_agg certification (lance-graph 406b3a0) measured the exact
bound exhaustively: RNE errs by at most HALF an ulp, so the bf16-direct
single-reading error is <= 2 tenths (0.2 C), not 4. Attained at the
range extremes. Wording-only fix in the probe report and blackboard.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH
Three-tier runtime ladder, polyfill kernel untouched:
AMX TDPBF16PS -> AVX-512 VDPBF16PS -> decode + F32x16 FMA polyfill.

- avx512bf16_path (private): native bf16-pair multiply per zmm via
  _mm512_dpbf16_ps (stable Rust 1.94, verified), f32 lane accumulators,
  no bf16->f32 decode. Same VNNI operand layout as the AMX tile, so one
  packed buffer serves both tile tiers.
- PackedBf16B + bf16_tile_gemm_16x16_packed: hoist the per-call VNNI
  pack and its allocation out of hot loops; vnni_index(row, col) lets
  consumers stage B directly in VNNI layout (zero pack cost for sparse /
  one-hot staging).
- bf16_tile_gemm_tier(): names the dispatch tier for run reports
  (Gotcha 9 discipline).
- simd.rs: re-export the new surface through ndarray::simd::* (W1a),
  polyfill name untouched.

Exactness boundary preserved: all tiers bit-exact for bf16-exact integer
operands with accumulation < 2^24, asserted by assert_eq! parity tests
(vnni_index vs vnni_pack_bf16, packed == unpacked == i64 reference,
VDPBF16PS exact-integer + float tolerance vs polyfill, accumulate
semantics). Gotcha-14 contention parity test ships #[ignore]d - it fails
on oversubscribed VMs by design; run --ignored on dedicated silicon.

onebrc_cascade_probe measured effect (direct-VNNI one-hot staging):
GEMM leg 3.6 -> 21.3 Mrows/s (5.9x), 23.7 -> 141.9 GMAC/s single-thread,
413/413 stations still bit-exact. 8 lib tests + 2 doctests green;
clippy -D warnings clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH
…_le_bytes)

The persistence/mailbox face of the packed tile buffer, per the
lance-graph SoaEnvelope discipline (envelope bytes are LE from creation
to tombstone):

- as_le_bytes(): zero-cost &[u16] -> &[u8] reinterpret; LE by
  construction since this module is cfg(x86_64) and x86_64 is LE-only.
- from_le_bytes(): endian-correct rebuild via u16::from_le_bytes
  (compiles to a plain copy on LE targets).
- Contract test asserts byte 2i == low byte of lane i (true LE, not
  just native) and that a GEMM over the roundtripped buffer stays
  bit-exact.

First brick of the SoA-Morton batch-writer / write-hiding design; the
writer itself lands lance-graph-side.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Warning

Review limit reached

@AdaWorldAPI, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 42 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4baad4c3-36d9-4af6-a0dd-89ec14c1833f

📥 Commits

Reviewing files that changed from the base of the PR and between 8c381a6 and 5f2c7fc.

📒 Files selected for processing (6)
  • .claude/AMX_GOTCHAS.md
  • .claude/blackboard.md
  • Cargo.toml
  • examples/onebrc_cascade_probe.rs
  • src/hpc/bf16_tile_gemm.rs
  • src/simd.rs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@AdaWorldAPI AdaWorldAPI merged commit de72d15 into master Jul 2, 2026
17 checks passed
AdaWorldAPI pushed a commit that referenced this pull request Jul 2, 2026
…iLaneColumn

Follow-up unblocking the gridlake wiring (lance-graph #635 COMMENTARY):
lane J's GridBatch carries i32 min/max and i64 sum columns, but
MultiLaneColumn only exposed f32/f64/u64/u8 lane views — #227's onebrc
gridlake probe got away with f32 min/max columns. Add the signed integer
lane widths so a batch SoA can be viewed through the gridlake carrier
directly, no f32 recast.

- `i32x16_from_chunk` / `i64x8_from_chunk` — LE decoders mirroring the
  existing `f32x16_from_chunk` / `u64x8_from_chunk` (scalar `from_le_bytes`
  loop, lowered to a single register-width load on LE targets; no pointer
  cast of the u8-aligned Arc<[u8]>).
- `iter_i32x16` / `iter_i64x8` methods + `len_i32x16` / `len_i64x8`,
  routed through `crate::simd::{I32x16, I64x8}` per the W1a layering rule
  (never dipping into simd_avx512/simd_neon/scalar directly).
- Parity tests: `iter_i32x16_le_round_trip` (incl. negatives, proves
  sign-extension survives the decode) + `iter_i64x8_le_round_trip`;
  extended the empty-count, 3-lane-count, and len asserts.

These are layout-only zero-copy reinterpretations of the backing store
(the same category as the existing typed iterators), not new compute
kernels — no per-arch AVX/NEON/scalar backend needed beyond the lane
types crate::simd already provides.

simd_soa: 13/13 tests pass; clippy -D warnings clean; fmt clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01MLBnPuScZy6w9di2QEjsXM
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants