From 0776626328abebdbfaf95d6ba2237a9667a3c325 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 06:44:03 +0000 Subject: [PATCH 1/3] =?UTF-8?q?onebrc-probe:=20lane=20T=20=E2=80=94=20HHTL?= =?UTF-8?q?=20trie=20group-by=20+=20trie-vs-RAM=20measurement=20report?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds the HHTL trie group-by lane (name-as-prefix-descent instead of hash+slot) in two variants — 16-ary nibble trie (contract::hhtl NiblePath faithful) and 256-ary byte trie — plus a measurement report documenting the full RAM-table-vs-trie ladder. Measured, honest NEGATIVE result (10M rows, 4 workers, n=11 median, mrows/s): the trie is SLOWER than the flat table at ~400-station cardinality — T(16-ary)=54.2, T8(256-ary)=58.3 vs F(flat Morton)=84.6, R(flat radix)=87.7. At this cardinality a single hash + linear probe into a contiguous SoA table beats the trie's dependent-load descent. Parity: both tries produce aggregates identical to lane A (test both_tries_agree_with_lane_a_on_generated_corpus). The report (RESULTS_TRIE_VS_RAM.md) was corrected after adversarial review to report median/spread (n=11), not best-of-3; it flags the trie result as confounded by this arena implementation (runtime-field fanout + in-loop realloc), NOT the trie idea falsified, and marks the "trie wins at routing" claim as unmeasured CONJECTURE. 
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01MLBnPuScZy6w9di2QEjsXM --- crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md | 155 +++++++++++++ crates/onebrc-probe/src/lane_t.rs | 245 +++++++++++++++++++++ crates/onebrc-probe/src/lib.rs | 2 + crates/onebrc-probe/src/main.rs | 2 + 4 files changed, 404 insertions(+) create mode 100644 crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md create mode 100644 crates/onebrc-probe/src/lane_t.rs diff --git a/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md b/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md new file mode 100644 index 00000000..27cc6772 --- /dev/null +++ b/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md @@ -0,0 +1,155 @@ +# onebrc-probe — trie vs RAM-table methods and outcomes + +> Measurement report. All numbers are **measured** on this machine, not +> projected. Corpus: `/tmp/brc10m.txt`, 10,000,000 rows, seed 42, +> sha256 `f1853caa30a765883aa655be1c304d956ad8b03e19b3557df2af431d9a955691`. +> Metric: `throughput_mrows_s` (rows / compute-time). **Compute-only** — +> `main.rs` reads the file (`fs::read`, line 91) BEFORE `Instant::now()` +> (line 94), so file I/O and any mmap lever are OUTSIDE the timer. +> Build: `.cargo/config.toml` pins `target-cpu=x86-64-v3` (AVX2) unless a +> row says `native`. 4 workers. +> +> **Statistical note (corrected after adversarial review).** An earlier +> draft of this report headlined single **best-of-3** numbers. Three review +> agents (truth-architect, overclaim-auditor, brutally-honest-tester) +> correctly flagged that best-of-N reports the luckiest run and hides the +> spread — at run-to-run variance of ~8–13%, that is a real reporting sin. +> This version reports **median / min / max / sd over n=11** (v3) and n=7 +> (native) per lane. Cardinality is ~400 stations (`gen.rs STATION_COUNT`); +> results are for THIS workload, THIS machine, ONE corpus — see "Scope" at +> the end. 
+ +## The methods (group-by-aggregate, min/max/sum/count per station) + +Every lane runs the SAME workload and the SAME newline-aligned +`chunk_bounds` split + commutative merge. What varies is (1) how a record's +delimiters are found/parsed and (2) how the station identity becomes an +accumulator slot — the "trie vs RAM-table" axis. + +| Lane | Scan / parse | Group-by structure | Family | +|---|---|---|---| +| **A** scalar | byte-wise `;`/`\n`, int parse | `BTreeMap` | baseline, 1 thread | +| **C** threads | byte-wise, int parse | per-worker `BTreeMap`, merge | baseline, N threads | +| **R** radix | byte-wise, int parse | flat SoA table, slot = `hash & 0xFFFF` | RAM flat table (control) | +| **F** Morton | byte-wise, int parse | flat SoA table, slot = FNV-1a → nibble-interleaved 16-bit Morton tile | RAM flat table (substrate-native) | +| **T8** byte-trie | byte-wise, int parse | 256-ary arena trie, one level per name byte | trie | +| **T** nibble-trie | byte-wise, int parse | 16-ary arena trie (HHTL `NiblePath`-faithful), 2 levels per byte | trie | +| **S** SWAR | **SWAR** `;`/`\n` (haszero u64 trick) + **branchless** int parse | flat SoA table (reuses F verbatim) | RAM flat table + SWAR | + +All lanes are parity-checked: **every lane produces aggregates identical to +lane A** on a generated corpus (unit tests `lane_a_and_lane_c_agree…`, +`lane_f_and_lane_r_agree_with_lane_a…`, `both_tries_agree_with_lane_a…`, +`lane_s_agrees_with_lane_a`, plus a forced-collision probe on the shared +table). Verified in-code, not asserted. 
+ +## The outcomes (10M rows, 4 workers, mrows/s) + +**v3 (x86-64-v3), n=11 per lane:** + +| Lane | median | min | max | sd | vs C (median) | +|---|---:|---:|---:|---:|---:| +| C threads + BTreeMap | 31.2 | 29.4 | 32.0 | 0.7 | 1.0× (ref) | +| T 16-ary nibble trie | 54.2 | 50.4 | 54.7 | 1.4 | 1.7× | +| T8 256-ary byte trie | 58.3 | 54.5 | 66.5 | 3.8 | 1.9× | +| F flat Morton table | 84.6 | 75.1 | 86.3 | 3.6 | 2.7× | +| R flat radix table | 87.7 | 61.2 | 89.1 | 8.6 | 2.8× | +| **S SWAR + flat table** | **103.9** | 76.6 | 105.5 | 7.9 | **3.3×** | + +**native (target-cpu=native), n=7, controlled same-session (F, S only):** + +| Lane | median | min | max | sd | +|---|---:|---:|---:|---:| +| F flat Morton table | 74.0 | 65.8 | 84.0 | 6.3 | +| **S SWAR + flat table** | **96.9** | 90.6 | 106.2 | 5.3 | + +## What the numbers actually say (each claim scoped to its evidence) + +1. **SWAR (S) is the one real, robust win. [supported]** At the median, S + beats F by **+23% on v3** (103.9 vs 84.6) and **+31% on native** (96.9 vs + 74.0). The gap (≈19 mrows/s) is ~2.4× S's own sd and clears F's max + (86.3) at the median. Caveat kept honest: S is the noisier lane — its + *worst* run (76.6) dips below F's median, so the guarantee is "typically + +~25%, occasionally ties F," not a hard floor. The earlier best-of-3 + draft happened to draw an unlucky S run (77.4) that made the number look + cherry-picked; the n=11 median vindicates the SWAR win but only with the + spread disclosed. + +2. **The trie is slower than the flat table here — but this is the arena-trie + IMPLEMENTATION, not "the trie idea," and the distinction matters. + [supported, confounded]** T (54.2) and T8 (58.3) medians both sit far + below F (84.6) and R (87.7); the gap is large and robust across n=11. 
But + two confounds are uncontrolled and inflate the trie's cost: (a) `Trie` + carries `fanout` as a **runtime struct field** (`descend` computes + `node*self.fanout+sym`), losing the strength-reduction/monomorphization + the flat table gets from its `const SLOTS`; (b) `descend` does an + **in-loop arena realloc** (`children.extend(...)` per new node, 256×u32 = + 1 KB/node for T8) *inside the timed scan*, while the flat table allocates + once up front. So the honest claim is: **this arena-trie is not + competitive with the flat table on dense small-cardinality group-by** — + NOT "the trie is falsified." The direction (a trie chases ~10–20 + dependent loads/record vs the table's ~1 hash + 1 near-L1 slot) is + plausible, but a const-fanout, pre-sized-arena trie was not built, so the + idea itself is untested. This does contradict the earlier-session + hypothesis that the HHTL trie is what reached ~90 — no trie variant here + reaches the flat table's throughput. + +3. **Morton (F) vs plain radix (R): no measurable difference. [supported — + corrected from the prior draft]** R actually medians *slightly above* F + (87.7 vs 84.6), and that 3.1-mrows/s difference is well inside R's sd + (8.6). The nibble-interleave is a **no-op on throughput** (possibly a + marginal negative). The prior draft's "F beats R by a hair" was wrong and + is retracted. The big structural win is flat-SoA-table-vs-BTreeMap + (R−C ≈ +56 median), not the addressing scheme. + +4. **`target-cpu=native` gives no benefit for these lanes — and this table + contains NO SIMD lane, so it says nothing about AVX-512. [narrow claim + supported; the broad one retracted]** Controlled same-session, native F + (74.0) and S (96.9) medians are *below* their v3 counterparts (84.6, + 103.9) — native did not help and if anything ran slightly slower (likely + codegen/thermal, within the noise band). The defensible statement is + "the compiler's `native` flag does not speed up these SCALAR/SWAR lanes." 
+ The prior draft's "native SIMD is noise" overreached: lane S is *SWAR* + (scalar u64 tricks), not vector SIMD, and the actual SIMD lane (B) is + feature-gated and absent from this table. This probe cannot adjudicate + any AVX-512 claim — it runs no AVX-512. + +5. **mmap (lever a) is not measurable in this harness and was not faked. + [verified in code]** The timer starts after `fs::read` (main.rs:91 → :94); + mmap is a wall-clock / 13 GB-allocation lever for the full 1B file, on an + axis this metric does not observe. Measuring it needs an end-to-end + wall-clock mode + a memmap2 dep, which breaks the std-only, zero-dep + contract of lanes A/C/F/R/T/S. + +## The honest bottom line + +For a dense, ~400-cardinality group-by at 10M rows on this machine: **flat +SoA table + SWAR scan/parse is the fastest method measured (~104 mrows/s +median, +~25% over the plain-scalar flat table); the arena-trie lanes are +the slowest of the non-baseline group.** The Morton interleave buys nothing +over plain radix. Native codegen buys nothing over v3. + +**What this does NOT establish (explicit conjecture, unmeasured here):** +- That a trie is the wrong structure *in general* — only that THIS arena + trie loses on THIS workload; a const-fanout/pre-sized variant is untested. +- That the trie "wins at prefix routing" — no prefix-routing / ancestor-query + benchmark exists in this crate. That is the HHTL cascade's claimed job, + but it is a CONJECTURE here, not a result. +- That ~400-cardinality dense group-by is "the substrate's own aggregation + shape" — unmeasured; `lane_f.rs` itself flags high-cardinality as a + different regime. + +## Scope / how to reproduce + +Single machine, single corpus, one cardinality (~400 stations), one row +count (10M). Not a claim about other CPUs, other cardinalities, or the full +1B-row file. 
+ +```bash +cd crates/onebrc-probe +cargo build --release +target/release/onebrc-probe gen /tmp/brc10m.txt 10000000 42 # if absent +for lane in a c r f t8 t s; do + for i in $(seq 1 11); do target/release/onebrc-probe run /tmp/brc10m.txt $lane 4; done +done # take median + min/max/sd per lane, not best-of-N +# native: RUSTFLAGS="-C target-cpu=native" cargo build --release --target-dir target-native +``` diff --git a/crates/onebrc-probe/src/lane_t.rs b/crates/onebrc-probe/src/lane_t.rs new file mode 100644 index 00000000..375245e2 --- /dev/null +++ b/crates/onebrc-probe/src/lane_t.rs @@ -0,0 +1,245 @@ +//! Lane T — the HHTL **trie** lane: group-by as a prefix DESCENT, not a hash. +//! +//! Where lane F/R hash the station name (FNV-1a) into a flat table slot and +//! linear-probe on collision, lane T makes the **name itself the path**: it +//! descends an arena-backed trie one symbol per level to a terminal node that +//! holds the station's accumulator. Collision-free by construction — distinct +//! names reach distinct terminal nodes, shared prefixes share internal nodes. +//! No hash pass, no probe chain, no tag/name re-verification. +//! +//! This is the operational form of the canon's `panCAKES ≡ radix trie ≡ HHTL` +//! (`contract::hhtl::NiblePath`): the keys ARE the tree, routing is pure +//! index arithmetic on the key, and the descent never touches a value until +//! the terminal fold. Two variants, measured side by side: +//! +//! - [`lane_t_trie`] — **16-ary nibble trie** (HHTL-faithful: `FAN_OUT = 16`, +//! one nibble per level, high-then-low per byte → 2 levels per name byte). +//! - [`lane_t_byte`] — **256-ary byte trie** (one level per byte → half the +//! descent depth, larger nodes). The honest "is the 16-ary descent depth or +//! the trie idea itself the cost?" control. +//! +//! Same scalar `;`/`\n` byte scan and same `chunk_bounds`/`merge_maps` threaded +//! driver as lanes A/C/F/R — the ONLY variable vs lane F is the accumulator +//! 
(trie descent instead of hash+slot+probe). std-only; keeps the crate's +//! zero-dep contract. + +use crate::{chunk_bounds, merge_maps, parse_temp_tenths, Stats}; +use std::collections::BTreeMap; + +/// HHTL fan-out: 16 children per level (one nibble). Matches `contract::hhtl`. +const FANOUT16: usize = 16; +/// Byte-trie fan-out: 256 children per level (one byte, half the depth). +const FANOUT256: usize = 256; + +/// Arena-backed trie over the station-name bytes, generic in fan-out via the +/// two `observe_*` descents. `children[node * fanout + sym] = child index` +/// (`0` = empty; node `0` is the root and is never a child, so `0` is a safe +/// empty sentinel). SoA accumulators are one slot per node; only terminal +/// nodes are folded. +struct Trie { + fanout: usize, + children: Vec, + mins: Vec, + maxs: Vec, + sums: Vec, + counts: Vec, + names: Vec>, +} + +impl Trie { + fn new(fanout: usize) -> Self { + // node 0 = root, with `fanout` empty children. + Self { + fanout, + children: vec![0u32; fanout], + mins: vec![i32::MAX], + maxs: vec![i32::MIN], + sums: vec![0], + counts: vec![0], + names: vec![Vec::new()], + } + } + + /// Follow (or create) the child of `node` at symbol `sym`, returning the + /// child node index. + #[inline(always)] + fn descend(&mut self, node: usize, sym: usize) -> usize { + let idx = node * self.fanout + sym; + let child = self.children[idx]; + if child != 0 { + child as usize + } else { + let new = self.counts.len(); + self.children.extend(std::iter::repeat(0u32).take(self.fanout)); + self.mins.push(i32::MAX); + self.maxs.push(i32::MIN); + self.sums.push(0); + self.counts.push(0); + self.names.push(Vec::new()); + self.children[idx] = new as u32; + new + } + } + + /// Fold one observation into the terminal node reached for `name`. 
+ #[inline(always)] + fn fold(&mut self, node: usize, name: &[u8], tenths: i32) { + if self.counts[node] == 0 { + self.names[node] = name.to_vec(); + } + if tenths < self.mins[node] { + self.mins[node] = tenths; + } + if tenths > self.maxs[node] { + self.maxs[node] = tenths; + } + self.sums[node] += tenths as i64; + self.counts[node] += 1; + } + + /// 16-ary descent: high nibble then low nibble of each byte. + #[inline(always)] + fn observe_nibble(&mut self, name: &[u8], tenths: i32) { + let mut node = 0usize; + for &b in name { + node = self.descend(node, (b >> 4) as usize); + node = self.descend(node, (b & 0x0F) as usize); + } + self.fold(node, name, tenths); + } + + /// 256-ary descent: one byte per level. + #[inline(always)] + fn observe_byte(&mut self, name: &[u8], tenths: i32) { + let mut node = 0usize; + for &b in name { + node = self.descend(node, b as usize); + } + self.fold(node, name, tenths); + } + + fn into_map(self) -> BTreeMap<String, Stats> { + let mut out = BTreeMap::new(); + for node in 0..self.counts.len() { + if self.counts[node] > 0 { + let name = String::from_utf8(self.names[node].clone()).expect("station name utf8"); + out.insert( + name, + Stats { + min: self.mins[node], + max: self.maxs[node], + sum: self.sums[node], + count: self.counts[node], + }, + ); + } + } + out + } +} + +/// Scan `data` (the same scalar `;`/`\n` byte scan as lane F), routing every +/// record through `observe_nibble` or `observe_byte` into an owned [`Trie`]. 
+#[inline] +fn accumulate_trie(data: &[u8], fanout: usize, nibble: bool) -> Trie { + let mut trie = Trie::new(fanout); + let len = data.len(); + let mut i = 0usize; + while i < len { + let name_start = i; + while data[i] != b';' { + i += 1; + } + let name = &data[name_start..i]; + i += 1; // skip ';' + let temp_start = i; + while data[i] != b'\n' { + i += 1; + } + let tenths = parse_temp_tenths(&data[temp_start..i]); + i += 1; // skip '\n' + if nibble { + trie.observe_nibble(name, tenths); + } else { + trie.observe_byte(name, tenths); + } + } + trie +} + +fn lane_trie_threads(data: &[u8], workers: usize, fanout: usize, nibble: bool) -> BTreeMap<String, Stats> { + let workers = workers.max(1); + let bounds = chunk_bounds(data, workers); + let results: Vec<BTreeMap<String, Stats>> = std::thread::scope(|scope| { + let handles: Vec<_> = bounds + .iter() + .map(|&(start, end)| { + let slice = &data[start..end]; + scope.spawn(move || accumulate_trie(slice, fanout, nibble).into_map()) + }) + .collect(); + handles + .into_iter() + .map(|h| h.join().expect("lane T worker panicked")) + .collect() + }); + merge_maps(results) +} + +/// Lane T, the HHTL **16-ary nibble trie**: the name descends one nibble per +/// level (high-then-low per byte); the terminal node IS the accumulator. +pub fn lane_t_trie(data: &[u8], workers: usize) -> BTreeMap<String, Stats> { + lane_trie_threads(data, workers, FANOUT16, true) +} + +/// Lane T (byte), the **256-ary byte trie**: one level per byte (half the descent +/// depth). The control for "is the 16-ary depth or the trie itself the cost?". +pub fn lane_t_byte(data: &[u8], workers: usize) -> BTreeMap<String, Stats> { + lane_trie_threads(data, workers, FANOUT256, false) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn prefix_sharing_and_prefix_of_another_name() { + // "ab" is a prefix of "abc": the "ab" terminal is ALSO an internal + // node on the "abc" path. Both must accumulate independently. 
+ let corpus = b"ab;1.0\nabc;2.0\nab;3.0\nabc;-4.0\nz;0.5\n"; + for nibble in [true, false] { + let fanout = if nibble { FANOUT16 } else { FANOUT256 }; + let trie = accumulate_trie(corpus, fanout, nibble); + let map = trie.into_map(); + assert_eq!(map.len(), 3, "three stations (nibble={nibble})"); + assert_eq!( + map["ab"], + Stats { min: 10, max: 30, sum: 40, count: 2 }, + "nibble={nibble}" + ); + assert_eq!( + map["abc"], + Stats { min: -40, max: 20, sum: -20, count: 2 }, + "nibble={nibble}" + ); + assert_eq!(map["z"], Stats { min: 5, max: 5, sum: 5, count: 1 }); + } + } + + #[test] + fn both_tries_agree_with_lane_a_on_generated_corpus() { + let dir = std::env::temp_dir(); + let path = dir.join(format!("onebrc_probe_test_t_{}.txt", std::process::id())); + let result = crate::gen::gen(&path, 50_000, 71).expect("gen"); + assert_eq!(result.rows, 50_000); + let data = std::fs::read(&path).expect("read generated corpus"); + std::fs::remove_file(&path).ok(); + + let a = crate::lane_a_scalar(&data); + let t16 = lane_t_trie(&data, 3); + let t256 = lane_t_byte(&data, 3); + assert_eq!(a, t16, "16-ary nibble trie must equal lane A"); + assert_eq!(a, t256, "256-ary byte trie must equal lane A"); + assert!(!a.is_empty()); + } +} diff --git a/crates/onebrc-probe/src/lib.rs b/crates/onebrc-probe/src/lib.rs index 301a98ac..390290e6 100644 --- a/crates/onebrc-probe/src/lib.rs +++ b/crates/onebrc-probe/src/lib.rs @@ -68,6 +68,7 @@ pub mod lane_j; pub mod presets; pub mod sha256; pub mod lane_s; +pub mod lane_t; #[cfg(feature = "lane-b")] pub use lane_b::lane_b_simd; @@ -77,6 +78,7 @@ pub use lane_d::lane_d_ractor; pub use lane_e::lane_e_kanban; pub use lane_f::{lane_f_morton, lane_r_radix}; pub use lane_s::lane_s_swar; +pub use lane_t::{lane_t_byte, lane_t_trie}; #[cfg(feature = "lane-g")] pub use lane_g::{lane_g_kanban_soa, lane_g_kanban_soa_with_morsel}; #[cfg(feature = "lane-h")] diff --git a/crates/onebrc-probe/src/main.rs b/crates/onebrc-probe/src/main.rs index 
c818d3f4..a4cc19f9 100644 --- a/crates/onebrc-probe/src/main.rs +++ b/crates/onebrc-probe/src/main.rs @@ -133,6 +133,8 @@ fn cmd_run(args: &[String]) { "f" => onebrc_probe::lane_f_morton(&data, workers), "r" => onebrc_probe::lane_r_radix(&data, workers), "s" => onebrc_probe::lane_s_swar(&data, workers), + "t" => onebrc_probe::lane_t_trie(&data, workers), + "t8" => onebrc_probe::lane_t_byte(&data, workers), "g" => { #[cfg(feature = "lane-g")] { From 837d744168e7f8cf1823dcaea745b19d0ec40850 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 07:08:45 +0000 Subject: [PATCH 2/3] onebrc-probe(report): mark lane S as companion-PR-provided, fix reproduce loop MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses codex P2 (#638): on this T-only branch lane S is not defined, so the report published results + a parity test for a lane users cannot run here. Adds a 'Lane S provenance' note (S ships in the companion lane-S PR; runnable lanes on this branch are a c r f t8 t) and drops `s` from the reproduce loop with a note to re-add it once the companion PR merges. S rows are kept in the ladder — the report's purpose is the full RAM-vs-trie comparison and S is the fastest RAM-table method. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01MLBnPuScZy6w9di2QEjsXM --- crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md b/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md index 27cc6772..eca1ad1e 100644 --- a/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md +++ b/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md @@ -18,6 +18,13 @@ > (native) per lane. Cardinality is ~400 stations (`gen.rs STATION_COUNT`); > results are for THIS workload, THIS machine, ONE corpus — see "Scope" at > the end. 
+> +> **Lane S provenance.** Lane S (SWAR) is NOT in this PR's diff — it ships in +> the companion PR (`onebrc-probe: lane S`). Its rows below were measured with +> that PR applied; on **this branch alone**, the runnable lanes are +> `a c r f t8 t` and the `lane_s_agrees_with_lane_a` parity test lives in the +> companion PR. S is kept in the ladder because the report's whole point is the +> full RAM-table-vs-trie comparison, and S is the fastest RAM-table method. ## The methods (group-by-aggregate, min/max/sum/count per station) @@ -148,7 +155,7 @@ count (10M). Not a claim about other CPUs, other cardinalities, or the full cd crates/onebrc-probe cargo build --release target/release/onebrc-probe gen /tmp/brc10m.txt 10000000 42 # if absent -for lane in a c r f t8 t s; do +for lane in a c r f t8 t; do # add `s` once the companion lane-S PR is merged for i in $(seq 1 11); do target/release/onebrc-probe run /tmp/brc10m.txt $lane 4; done done # take median + min/max/sd per lane, not best-of-N # native: RUSTFLAGS="-C target-cpu=native" cargo build --release --target-dir target-native From 6d624ca503f2509e94d2d7ec03f9756e95d60b00 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 4 Jul 2026 07:14:32 +0000 Subject: [PATCH 3/3] onebrc-probe(report): undo lane-S-companion workaround post-#637 merge #637 (lane S) merged to main; this branch is rebased on top, so lane_s and its parity test are present and all ladder lanes (a c r f t8 t s) run here. Restores `s` to the reproduce loop and rewrites the provenance note. 
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01MLBnPuScZy6w9di2QEjsXM --- crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md b/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md index eca1ad1e..0f6893c7 100644 --- a/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md +++ b/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md @@ -19,12 +19,12 @@ > results are for THIS workload, THIS machine, ONE corpus — see "Scope" at > the end. > -> **Lane S provenance.** Lane S (SWAR) is NOT in this PR's diff — it ships in -> the companion PR (`onebrc-probe: lane S`). Its rows below were measured with -> that PR applied; on **this branch alone**, the runnable lanes are -> `a c r f t8 t` and the `lane_s_agrees_with_lane_a` parity test lives in the -> companion PR. S is kept in the ladder because the report's whole point is the -> full RAM-table-vs-trie comparison, and S is the fastest RAM-table method. +> **Lane S provenance.** Lane S (SWAR) shipped separately (PR #637, merged to +> `main`); this branch is rebased on top of it, so `lane_s` and its +> `lane_s_agrees_with_lane_a` parity test are present here and all lanes in the +> ladder — `a c r f t8 t s` — are runnable. S is kept in the ladder because the +> report's whole point is the full RAM-table-vs-trie comparison, and S is the +> fastest RAM-table method. ## The methods (group-by-aggregate, min/max/sum/count per station) @@ -155,7 +155,7 @@ count (10M). 
Not a claim about other CPUs, other cardinalities, or the full cd crates/onebrc-probe cargo build --release target/release/onebrc-probe gen /tmp/brc10m.txt 10000000 42 # if absent -for lane in a c r f t8 t; do # add `s` once the companion lane-S PR is merged +for lane in a c r f t8 t s; do for i in $(seq 1 11); do target/release/onebrc-probe run /tmp/brc10m.txt $lane 4; done done # take median + min/max/sd per lane, not best-of-N # native: RUSTFLAGS="-C target-cpu=native" cargo build --release --target-dir target-native
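
Reviewer sketch (not part of the series). Finding 2 of the report names two uncontrolled confounds in `lane_t.rs`: `fanout` is a runtime struct field, and `descend` grows the arena inside the timed scan. The report also says a const-fanout, pre-sized-arena trie was not built. The following is a minimal sketch of what such a variant could look like, assuming a const-generic fan-out and an up-front node budget. `ConstTrie`, `with_node_capacity`, and the `u64` count type are hypothetical names and choices made for illustration, not the crate's API, and the sketch drops the min/max/name bookkeeping for brevity. It has not been benchmarked, so it supports no throughput claim either way.

```rust
// Hypothetical sketch: `ConstTrie` and `with_node_capacity` are illustrative
// names, not part of onebrc-probe. Unbenchmarked.

/// Arena trie whose fan-out is a compile-time constant.
/// `children[node * FANOUT + sym]` holds the child index; 0 means "empty"
/// (node 0 is the root and is never anyone's child).
pub struct ConstTrie<const FANOUT: usize> {
    pub children: Vec<u32>,
    pub sums: Vec<i64>,
    pub counts: Vec<u64>, // assumption: the real `Stats.count` type may differ
}

impl<const FANOUT: usize> ConstTrie<FANOUT> {
    /// Reserve room for `nodes` nodes up front, so creating a node during
    /// the scan does not reallocate until that budget is exceeded.
    pub fn with_node_capacity(nodes: usize) -> Self {
        let nodes = nodes.max(1);
        let mut children = Vec::with_capacity(nodes * FANOUT);
        children.resize(FANOUT, 0u32); // the root's empty child row
        let mut sums = Vec::with_capacity(nodes);
        sums.push(0i64);
        let mut counts = Vec::with_capacity(nodes);
        counts.push(0u64);
        Self { children, sums, counts }
    }

    /// Follow (or create) the child of `node` at symbol `sym`.
    #[inline(always)]
    fn descend(&mut self, node: usize, sym: usize) -> usize {
        let idx = node * FANOUT + sym; // FANOUT is a constant, not a field load
        let child = self.children[idx];
        if child != 0 {
            return child as usize;
        }
        let new = self.counts.len();
        let grown = self.children.len() + FANOUT;
        self.children.resize(grown, 0u32); // no realloc while within the budget
        self.sums.push(0);
        self.counts.push(0);
        self.children[idx] = new as u32;
        new
    }

    /// Fold one observation into the terminal node.
    #[inline(always)]
    fn fold(&mut self, node: usize, tenths: i32) {
        self.sums[node] += tenths as i64;
        self.counts[node] += 1;
    }
}

impl ConstTrie<256> {
    /// Byte descent: one level per name byte. Returns the terminal node.
    pub fn observe(&mut self, name: &[u8], tenths: i32) -> usize {
        let mut node = 0usize;
        for &b in name {
            node = self.descend(node, b as usize);
        }
        self.fold(node, tenths);
        node
    }
}

impl ConstTrie<16> {
    /// Nibble descent: high nibble then low nibble of each byte.
    pub fn observe(&mut self, name: &[u8], tenths: i32) -> usize {
        let mut node = 0usize;
        for &b in name {
            node = self.descend(node, (b >> 4) as usize);
            node = self.descend(node, (b & 0x0F) as usize);
        }
        self.fold(node, tenths);
        node
    }
}
```

A fair rerun would size the budget from the corpus (station count times maximum name length is a loose upper bound on byte-trie nodes) and compare against lanes F and R with the same n=11 median/min/max/sd protocol the report uses.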