From 0776626328abebdbfaf95d6ba2237a9667a3c325 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Sat, 4 Jul 2026 06:44:03 +0000
Subject: [PATCH 1/3] =?UTF-8?q?onebrc-probe:=20lane=20T=20=E2=80=94=20HHTL?=
 =?UTF-8?q?=20trie=20group-by=20+=20trie-vs-RAM=20measurement=20report?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds the HHTL trie group-by lane (name-as-prefix-descent instead of
hash+slot) in two variants — 16-ary nibble trie (contract::hhtl NiblePath
faithful) and 256-ary byte trie — plus a measurement report documenting the
full RAM-table-vs-trie ladder.

Measured, honest NEGATIVE result (10M rows, 4 workers, n=11 median, mrows/s):
the trie is SLOWER than the flat table at ~400-station cardinality —
T(16-ary)=54.2, T8(256-ary)=58.3 vs F(flat Morton)=84.6, R(flat radix)=87.7.
At this cardinality a single hash + linear probe into a contiguous SoA table
beats the trie's dependent-load descent. Parity: both tries produce aggregates
identical to lane A (test both_tries_agree_with_lane_a_on_generated_corpus).

The report (RESULTS_TRIE_VS_RAM.md) was corrected after adversarial review
to report median/spread (n=11), not best-of-3; it flags the trie result as
confounded by this arena implementation (runtime-field fanout + in-loop
realloc), NOT the trie idea falsified, and marks the "trie wins at routing"
claim as unmeasured CONJECTURE.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01MLBnPuScZy6w9di2QEjsXM
---
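Reviewer note (illustrative, not part of the patch). The report names two
uncontrolled confounds in `Trie`: the runtime `fanout` field and the in-loop
arena growth. A minimal sketch of the variant the report calls untested, with
a const-generic fan-out and an up-front reservation, might look like the
following. `ConstTrie`, `with_capacity`, and the count-only accumulator are
hypothetical names chosen for this note; nothing here has been benchmarked.

```rust
/// Hypothetical const-fanout arena trie (NOT the `Trie` in lane_t.rs).
/// Only a per-node count is kept, to keep the sketch short.
struct ConstTrie<const FANOUT: usize> {
    /// children[node * FANOUT + sym] = child index; 0 = empty (root is node 0).
    children: Vec<u32>,
    /// One accumulator slot per node; only terminal nodes are incremented.
    counts: Vec<u32>,
}

impl<const FANOUT: usize> ConstTrie<FANOUT> {
    /// Reserve room for `max_nodes` nodes before the scan starts, so that
    /// `descend` does not reallocate while the node count stays within it.
    fn with_capacity(max_nodes: usize) -> Self {
        let mut children = Vec::with_capacity(max_nodes * FANOUT);
        children.resize(FANOUT, 0); // node 0 = root, all children empty
        let mut counts = Vec::with_capacity(max_nodes);
        counts.push(0);
        Self { children, counts }
    }

    /// Follow (or create) the child of `node` at `sym`. FANOUT is a
    /// compile-time constant here, unlike the runtime field in the patch.
    #[inline(always)]
    fn descend(&mut self, node: usize, sym: usize) -> usize {
        let idx = node * FANOUT + sym;
        let child = self.children[idx];
        if child != 0 {
            return child as usize;
        }
        let new = self.counts.len();
        // Stays inside the reserved capacity unless max_nodes is exceeded.
        self.children.resize(self.children.len() + FANOUT, 0);
        self.counts.push(0);
        self.children[idx] = new as u32;
        new
    }
}

impl ConstTrie<256> {
    /// One level per name byte; returns the terminal node index.
    fn observe_byte(&mut self, name: &[u8]) -> usize {
        let mut node = 0;
        for &b in name {
            node = self.descend(node, b as usize);
        }
        self.counts[node] += 1;
        node
    }
}

impl ConstTrie<16> {
    /// Two levels per name byte: high nibble, then low nibble.
    fn observe_nibble(&mut self, name: &[u8]) -> usize {
        let mut node = 0;
        for &b in name {
            node = self.descend(node, (b >> 4) as usize);
            node = self.descend(node, (b & 0x0F) as usize);
        }
        self.counts[node] += 1;
        node
    }
}
```

If the station set needs more than `max_nodes` nodes, the vectors still grow
as usual; the reservation only removes reallocation while it holds. Whether
this closes any of the measured gap to the flat table is an open question.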
 crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md | 155 +++++++++++++
 crates/onebrc-probe/src/lane_t.rs          | 245 +++++++++++++++++++++
 crates/onebrc-probe/src/lib.rs             |   2 +
 crates/onebrc-probe/src/main.rs            |   2 +
 4 files changed, 404 insertions(+)
 create mode 100644 crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md
 create mode 100644 crates/onebrc-probe/src/lane_t.rs

diff --git a/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md b/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md
new file mode 100644
index 00000000..27cc6772
--- /dev/null
+++ b/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md
@@ -0,0 +1,155 @@
+# onebrc-probe — trie vs RAM-table methods and outcomes
+
+> Measurement report. All numbers are **measured** on this machine, not
+> projected. Corpus: `/tmp/brc10m.txt`, 10,000,000 rows, seed 42,
+> sha256 `f1853caa30a765883aa655be1c304d956ad8b03e19b3557df2af431d9a955691`.
+> Metric: `throughput_mrows_s` (rows / compute-time). **Compute-only** —
+> `main.rs` reads the file (`fs::read`, line 91) BEFORE `Instant::now()`
+> (line 94), so file I/O and any mmap lever are OUTSIDE the timer.
+> Build: `.cargo/config.toml` pins `target-cpu=x86-64-v3` (AVX2) unless a
+> row says `native`. 4 workers.
+>
+> **Statistical note (corrected after adversarial review).** An earlier
+> draft of this report headlined single **best-of-3** numbers. Three review
+> agents (truth-architect, overclaim-auditor, brutally-honest-tester)
+> correctly flagged that best-of-N reports the luckiest run and hides the
+> spread — at run-to-run variance of ~8–13%, that is a real reporting sin.
+> This version reports **median / min / max / sd over n=11** (v3) and n=7
+> (native) per lane. Cardinality is ~400 stations (`gen.rs STATION_COUNT`);
+> results are for THIS workload, THIS machine, ONE corpus — see "Scope" at
+> the end.
+
+## The methods (group-by-aggregate, min/max/sum/count per station)
+
+Every lane runs the SAME workload and the SAME newline-aligned
+`chunk_bounds` split + commutative merge. What varies is (1) how a record's
+delimiters are found/parsed and (2) how the station identity becomes an
+accumulator slot — the "trie vs RAM-table" axis.
+
+| Lane | Scan / parse | Group-by structure | Family |
+|---|---|---|---|
+| **A** scalar | byte-wise `;`/`\n`, int parse | `BTreeMap<String,Stats>` | baseline, 1 thread |
+| **C** threads | byte-wise, int parse | per-worker `BTreeMap`, merge | baseline, N threads |
+| **R** radix | byte-wise, int parse | flat SoA table, slot = `hash & 0xFFFF` | RAM flat table (control) |
+| **F** Morton | byte-wise, int parse | flat SoA table, slot = FNV-1a → nibble-interleaved 16-bit Morton tile | RAM flat table (substrate-native) |
+| **T8** byte-trie | byte-wise, int parse | 256-ary arena trie, one level per name byte | trie |
+| **T** nibble-trie | byte-wise, int parse | 16-ary arena trie (HHTL `NiblePath`-faithful), 2 levels per byte | trie |
+| **S** SWAR | **SWAR** `;`/`\n` (haszero u64 trick) + **branchless** int parse | flat SoA table (reuses F verbatim) | RAM flat table + SWAR |
+
+All lanes are parity-checked: **every lane produces aggregates identical to
+lane A** on a generated corpus (unit tests `lane_a_and_lane_c_agree…`,
+`lane_f_and_lane_r_agree_with_lane_a…`, `both_tries_agree_with_lane_a…`,
+`lane_s_agrees_with_lane_a`, plus a forced-collision probe on the shared
+table). Verified in-code, not asserted.
+
+## The outcomes (10M rows, 4 workers, mrows/s)
+
+**v3 (x86-64-v3), n=11 per lane:**
+
+| Lane | median | min | max | sd | vs C (median) |
+|---|---:|---:|---:|---:|---:|
+| C threads + BTreeMap | 31.2 | 29.4 | 32.0 | 0.7 | 1.0× (ref) |
+| T 16-ary nibble trie | 54.2 | 50.4 | 54.7 | 1.4 | 1.7× |
+| T8 256-ary byte trie | 58.3 | 54.5 | 66.5 | 3.8 | 1.9× |
+| F flat Morton table | 84.6 | 75.1 | 86.3 | 3.6 | 2.7× |
+| R flat radix table | 87.7 | 61.2 | 89.1 | 8.6 | 2.8× |
+| **S SWAR + flat table** | **103.9** | 76.6 | 105.5 | 7.9 | **3.3×** |
+
+**native (target-cpu=native), n=7, controlled same-session (F, S only):**
+
+| Lane | median | min | max | sd |
+|---|---:|---:|---:|---:|
+| F flat Morton table | 74.0 | 65.8 | 84.0 | 6.3 |
+| **S SWAR + flat table** | **96.9** | 90.6 | 106.2 | 5.3 |
+
+## What the numbers actually say (each claim scoped to its evidence)
+
+1. **SWAR (S) is the one real, robust win. [supported]** At the median, S
+   beats F by **+23% on v3** (103.9 vs 84.6) and **+31% on native** (96.9 vs
+   74.0). The gap (≈19 mrows/s) is ~2.4× S's own sd and clears F's max
+   (86.3) at the median. Caveat kept honest: S is the noisier lane — its
+   *worst* run (76.6) dips below F's median, so the guarantee is "typically
+   +~25%, occasionally ties F," not a hard floor. The earlier best-of-3
+   draft happened to draw an unlucky S run (77.4) that made the number look
+   cherry-picked; the n=11 median vindicates the SWAR win but only with the
+   spread disclosed.
+
+2. **The trie is slower than the flat table here — but this is the arena-trie
+   IMPLEMENTATION, not "the trie idea," and the distinction matters.
+   [supported, confounded]** T (54.2) and T8 (58.3) medians both sit far
+   below F (84.6) and R (87.7); the gap is large and robust across n=11. But
+   two confounds are uncontrolled and inflate the trie's cost: (a) `Trie`
+   carries `fanout` as a **runtime struct field** (`descend` computes
+   `node*self.fanout+sym`), losing the strength-reduction/monomorphization
+   the flat table gets from its `const SLOTS`; (b) `descend` does an
+   **in-loop arena realloc** (`children.extend(...)` per new node, 256×u32 =
+   1 KB/node for T8) *inside the timed scan*, while the flat table allocates
+   once up front. So the honest claim is: **this arena-trie is not
+   competitive with the flat table on dense small-cardinality group-by** —
+   NOT "the trie is falsified." The direction (a trie chases ~10–20
+   dependent loads/record vs the table's ~1 hash + 1 near-L1 slot) is
+   plausible, but a const-fanout, pre-sized-arena trie was not built, so the
+   idea itself is untested. This does contradict the earlier-session
+   hypothesis that the HHTL trie is what reached ~90 — no trie variant here
+   reaches the flat table's throughput.
+
+3. **Morton (F) vs plain radix (R): no measurable difference. [supported —
+   corrected from the prior draft]** R actually medians *slightly above* F
+   (87.7 vs 84.6), and that 3.1-mrows/s difference is well inside R's sd
+   (8.6). The nibble-interleave is a **no-op on throughput** (possibly a
+   marginal negative). The prior draft's "F beats R by a hair" was wrong and
+   is retracted. The big structural win is flat-SoA-table-vs-BTreeMap
+   (R−C ≈ +56 median), not the addressing scheme.
+
+4. **`target-cpu=native` gives no benefit for these lanes — and this table
+   contains NO SIMD lane, so it says nothing about AVX-512. [narrow claim
+   supported; the broad one retracted]** Controlled same-session, native F
+   (74.0) and S (96.9) medians are *below* their v3 counterparts (84.6,
+   103.9) — native did not help and if anything ran slightly slower (likely
+   codegen/thermal, within the noise band). The defensible statement is
+   "the compiler's `native` flag does not speed up these SCALAR/SWAR lanes."
+   The prior draft's "native SIMD is noise" overreached: lane S is *SWAR*
+   (scalar u64 tricks), not vector SIMD, and the actual SIMD lane (B) is
+   feature-gated and absent from this table. This probe cannot adjudicate
+   any AVX-512 claim — it runs no AVX-512.
+
+5. **mmap (lever a) is not measurable in this harness and was not faked.
+   [verified in code]** The timer starts after `fs::read` (main.rs:91 → :94);
+   mmap is a wall-clock / 13 GB-allocation lever for the full 1B file, on an
+   axis this metric does not observe. Measuring it needs an end-to-end
+   wall-clock mode + a memmap2 dep, which breaks the std-only, zero-dep
+   contract of lanes A/C/F/R/T/S.
+
+## The honest bottom line
+
+For a dense, ~400-cardinality group-by at 10M rows on this machine: **flat
+SoA table + SWAR scan/parse is the fastest method measured (~104 mrows/s
+median, +~25% over the plain-scalar flat table); the arena-trie lanes are
+the slowest of the non-baseline group.** The Morton interleave buys nothing
+over plain radix. Native codegen buys nothing over v3.
+
+**What this does NOT establish (explicit conjecture, unmeasured here):**
+- That a trie is the wrong structure *in general* — only that THIS arena
+  trie loses on THIS workload; a const-fanout/pre-sized variant is untested.
+- That the trie "wins at prefix routing" — no prefix-routing / ancestor-query
+  benchmark exists in this crate. That is the HHTL cascade's claimed job,
+  but it is a CONJECTURE here, not a result.
+- That ~400-cardinality dense group-by is "the substrate's own aggregation
+  shape" — unmeasured; `lane_f.rs` itself flags high-cardinality as a
+  different regime.
+
+## Scope / how to reproduce
+
+Single machine, single corpus, one cardinality (~400 stations), one row
+count (10M). Not a claim about other CPUs, other cardinalities, or the full
+1B-row file.
+
+```bash
+cd crates/onebrc-probe
+cargo build --release
+target/release/onebrc-probe gen /tmp/brc10m.txt 10000000 42   # if absent
+for lane in a c r f t8 t s; do
+  for i in $(seq 1 11); do target/release/onebrc-probe run /tmp/brc10m.txt $lane 4; done
+done   # take median + min/max/sd per lane, not best-of-N
+# native: RUSTFLAGS="-C target-cpu=native" cargo build --release --target-dir target-native
+```
diff --git a/crates/onebrc-probe/src/lane_t.rs b/crates/onebrc-probe/src/lane_t.rs
new file mode 100644
index 00000000..375245e2
--- /dev/null
+++ b/crates/onebrc-probe/src/lane_t.rs
@@ -0,0 +1,245 @@
+//! Lane T — the HHTL **trie** lane: group-by as a prefix DESCENT, not a hash.
+//!
+//! Where lane F/R hash the station name (FNV-1a) into a flat table slot and
+//! linear-probe on collision, lane T makes the **name itself the path**: it
+//! descends an arena-backed trie one symbol per level to a terminal node that
+//! holds the station's accumulator. Collision-free by construction — distinct
+//! names reach distinct terminal nodes, shared prefixes share internal nodes.
+//! No hash pass, no probe chain, no tag/name re-verification.
+//!
+//! This is the operational form of the canon's `panCAKES ≡ radix trie ≡ HHTL`
+//! (`contract::hhtl::NiblePath`): the keys ARE the tree, routing is pure
+//! index arithmetic on the key, and the descent never touches a value until
+//! the terminal fold. Two variants, measured side by side:
+//!
+//! - [`lane_t_trie`] — **16-ary nibble trie** (HHTL-faithful: `FAN_OUT = 16`,
+//!   one nibble per level, high-then-low per byte → 2 levels per name byte).
+//! - [`lane_t_byte`] — **256-ary byte trie** (one level per byte → half the
+//!   descent depth, larger nodes). The honest "is the 16-ary descent depth or
+//!   the trie idea itself the cost?" control.
+//!
+//! Same scalar `;`/`\n` byte scan and same `chunk_bounds`/`merge_maps` threaded
+//! driver as lanes A/C/F/R — the ONLY variable vs lane F is the accumulator
+//! (trie descent instead of hash+slot+probe). std-only; keeps the crate's
+//! zero-dep contract.
+
+use crate::{chunk_bounds, merge_maps, parse_temp_tenths, Stats};
+use std::collections::BTreeMap;
+
+/// HHTL fan-out: 16 children per level (one nibble). Matches `contract::hhtl`.
+const FANOUT16: usize = 16;
+/// Byte-trie fan-out: 256 children per level (one byte, half the depth).
+const FANOUT256: usize = 256;
+
+/// Arena-backed trie over the station-name bytes. Fan-out is a runtime field
+/// shared by the two `observe_*` descents. `children[node * fanout + sym]` is
+/// the child index (`0` = empty; node `0` is the root and is never a child, so
+/// `0` is a safe empty sentinel). SoA accumulators are one slot per node; only
+/// terminal nodes are folded.
+struct Trie {
+    fanout: usize,
+    children: Vec<u32>,
+    mins: Vec<i32>,
+    maxs: Vec<i32>,
+    sums: Vec<i64>,
+    counts: Vec<u32>,
+    names: Vec<Vec<u8>>,
+}
+
+impl Trie {
+    fn new(fanout: usize) -> Self {
+        // node 0 = root, with `fanout` empty children.
+        Self {
+            fanout,
+            children: vec![0u32; fanout],
+            mins: vec![i32::MAX],
+            maxs: vec![i32::MIN],
+            sums: vec![0],
+            counts: vec![0],
+            names: vec![Vec::new()],
+        }
+    }
+
+    /// Follow (or create) the child of `node` at symbol `sym`, returning the
+    /// child node index.
+    #[inline(always)]
+    fn descend(&mut self, node: usize, sym: usize) -> usize {
+        let idx = node * self.fanout + sym;
+        let child = self.children[idx];
+        if child != 0 {
+            child as usize
+        } else {
+            let new = self.counts.len();
+            self.children.extend(std::iter::repeat(0u32).take(self.fanout));
+            self.mins.push(i32::MAX);
+            self.maxs.push(i32::MIN);
+            self.sums.push(0);
+            self.counts.push(0);
+            self.names.push(Vec::new());
+            self.children[idx] = new as u32;
+            new
+        }
+    }
+
+    /// Fold one observation into the terminal node reached for `name`.
+    #[inline(always)]
+    fn fold(&mut self, node: usize, name: &[u8], tenths: i32) {
+        if self.counts[node] == 0 {
+            self.names[node] = name.to_vec();
+        }
+        if tenths < self.mins[node] {
+            self.mins[node] = tenths;
+        }
+        if tenths > self.maxs[node] {
+            self.maxs[node] = tenths;
+        }
+        self.sums[node] += tenths as i64;
+        self.counts[node] += 1;
+    }
+
+    /// 16-ary descent: high nibble then low nibble of each byte.
+    #[inline(always)]
+    fn observe_nibble(&mut self, name: &[u8], tenths: i32) {
+        let mut node = 0usize;
+        for &b in name {
+            node = self.descend(node, (b >> 4) as usize);
+            node = self.descend(node, (b & 0x0F) as usize);
+        }
+        self.fold(node, name, tenths);
+    }
+
+    /// 256-ary descent: one byte per level.
+    #[inline(always)]
+    fn observe_byte(&mut self, name: &[u8], tenths: i32) {
+        let mut node = 0usize;
+        for &b in name {
+            node = self.descend(node, b as usize);
+        }
+        self.fold(node, name, tenths);
+    }
+
+    fn into_map(self) -> BTreeMap<String, Stats> {
+        let mut out = BTreeMap::new();
+        for node in 0..self.counts.len() {
+            if self.counts[node] > 0 {
+                let name = String::from_utf8(self.names[node].clone()).expect("station name utf8");
+                out.insert(
+                    name,
+                    Stats {
+                        min: self.mins[node],
+                        max: self.maxs[node],
+                        sum: self.sums[node],
+                        count: self.counts[node],
+                    },
+                );
+            }
+        }
+        out
+    }
+}
+
+/// Scan `data` with the same scalar `;`/`\n` byte scan as lane F (every record
+/// must end in `\n`), feeding each record to `observe_nibble` or `observe_byte`.
+#[inline]
+fn accumulate_trie(data: &[u8], fanout: usize, nibble: bool) -> Trie {
+    let mut trie = Trie::new(fanout);
+    let len = data.len();
+    let mut i = 0usize;
+    while i < len {
+        let name_start = i;
+        while data[i] != b';' {
+            i += 1;
+        }
+        let name = &data[name_start..i];
+        i += 1; // skip ';'
+        let temp_start = i;
+        while data[i] != b'\n' {
+            i += 1;
+        }
+        let tenths = parse_temp_tenths(&data[temp_start..i]);
+        i += 1; // skip '\n'
+        if nibble {
+            trie.observe_nibble(name, tenths);
+        } else {
+            trie.observe_byte(name, tenths);
+        }
+    }
+    trie
+}
+
+fn lane_trie_threads(data: &[u8], workers: usize, fanout: usize, nibble: bool) -> BTreeMap<String, Stats> {
+    let workers = workers.max(1);
+    let bounds = chunk_bounds(data, workers);
+    let results: Vec<BTreeMap<String, Stats>> = std::thread::scope(|scope| {
+        let handles: Vec<_> = bounds
+            .iter()
+            .map(|&(start, end)| {
+                let slice = &data[start..end];
+                scope.spawn(move || accumulate_trie(slice, fanout, nibble).into_map())
+            })
+            .collect();
+        handles
+            .into_iter()
+            .map(|h| h.join().expect("lane T worker panicked"))
+            .collect()
+    });
+    merge_maps(results)
+}
+
+/// Lane T — HHTL **16-ary nibble trie**: the name descends one nibble per
+/// level (high-then-low per byte), the terminal node IS the accumulator.
+pub fn lane_t_trie(data: &[u8], workers: usize) -> BTreeMap<String, Stats> {
+    lane_trie_threads(data, workers, FANOUT16, true)
+}
+
+/// Lane T8 (CLI `t8`): **256-ary byte trie**, one level per byte (half the
+/// descent depth). The control for "is the 16-ary depth or the trie itself the cost?".
+pub fn lane_t_byte(data: &[u8], workers: usize) -> BTreeMap<String, Stats> {
+    lane_trie_threads(data, workers, FANOUT256, false)
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn prefix_sharing_and_prefix_of_another_name() {
+        // "ab" is a prefix of "abc" — the "ab" terminal is ALSO an internal
+        // node on the "abc" path. Both must accumulate independently.
+        let corpus = b"ab;1.0\nabc;2.0\nab;3.0\nabc;-4.0\nz;0.5\n";
+        for nibble in [true, false] {
+            let fanout = if nibble { FANOUT16 } else { FANOUT256 };
+            let trie = accumulate_trie(corpus, fanout, nibble);
+            let map = trie.into_map();
+            assert_eq!(map.len(), 3, "three stations (nibble={nibble})");
+            assert_eq!(
+                map["ab"],
+                Stats { min: 10, max: 30, sum: 40, count: 2 },
+                "nibble={nibble}"
+            );
+            assert_eq!(
+                map["abc"],
+                Stats { min: -40, max: 20, sum: -20, count: 2 },
+                "nibble={nibble}"
+            );
+            assert_eq!(map["z"], Stats { min: 5, max: 5, sum: 5, count: 1 });
+        }
+    }
+
+    #[test]
+    fn both_tries_agree_with_lane_a_on_generated_corpus() {
+        let dir = std::env::temp_dir();
+        let path = dir.join(format!("onebrc_probe_test_t_{}.txt", std::process::id()));
+        let result = crate::gen::gen(&path, 50_000, 71).expect("gen");
+        assert_eq!(result.rows, 50_000);
+        let data = std::fs::read(&path).expect("read generated corpus");
+        std::fs::remove_file(&path).ok();
+
+        let a = crate::lane_a_scalar(&data);
+        let t16 = lane_t_trie(&data, 3);
+        let t256 = lane_t_byte(&data, 3);
+        assert_eq!(a, t16, "16-ary nibble trie must equal lane A");
+        assert_eq!(a, t256, "256-ary byte trie must equal lane A");
+        assert!(!a.is_empty());
+    }
+}
diff --git a/crates/onebrc-probe/src/lib.rs b/crates/onebrc-probe/src/lib.rs
index 301a98ac..390290e6 100644
--- a/crates/onebrc-probe/src/lib.rs
+++ b/crates/onebrc-probe/src/lib.rs
@@ -68,6 +68,7 @@ pub mod lane_j;
 pub mod presets;
 pub mod sha256;
 pub mod lane_s;
+pub mod lane_t;
 
 #[cfg(feature = "lane-b")]
 pub use lane_b::lane_b_simd;
@@ -77,6 +78,7 @@ pub use lane_d::lane_d_ractor;
 pub use lane_e::lane_e_kanban;
 pub use lane_f::{lane_f_morton, lane_r_radix};
 pub use lane_s::lane_s_swar;
+pub use lane_t::{lane_t_byte, lane_t_trie};
 #[cfg(feature = "lane-g")]
 pub use lane_g::{lane_g_kanban_soa, lane_g_kanban_soa_with_morsel};
 #[cfg(feature = "lane-h")]
diff --git a/crates/onebrc-probe/src/main.rs b/crates/onebrc-probe/src/main.rs
index c818d3f4..a4cc19f9 100644
--- a/crates/onebrc-probe/src/main.rs
+++ b/crates/onebrc-probe/src/main.rs
@@ -133,6 +133,8 @@ fn cmd_run(args: &[String]) {
         "f" => onebrc_probe::lane_f_morton(&data, workers),
         "r" => onebrc_probe::lane_r_radix(&data, workers),
         "s" => onebrc_probe::lane_s_swar(&data, workers),
+        "t" => onebrc_probe::lane_t_trie(&data, workers),
+        "t8" => onebrc_probe::lane_t_byte(&data, workers),
         "g" => {
             #[cfg(feature = "lane-g")]
             {

From 837d744168e7f8cf1823dcaea745b19d0ec40850 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Sat, 4 Jul 2026 07:08:45 +0000
Subject: [PATCH 2/3] onebrc-probe(report): mark lane S as
 companion-PR-provided, fix reproduce loop
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Addresses codex P2 (#638): on this T-only branch lane S is not defined, so
the report published results + a parity test for a lane users cannot run
here. Adds a 'Lane S provenance' note (S ships in the companion lane-S PR;
runnable lanes on this branch are a c r f t8 t) and drops `s` from the
reproduce loop with a note to re-add it once the companion PR merges. S rows
are kept in the ladder — the report's purpose is the full RAM-vs-trie
comparison and S is the fastest RAM-table method.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01MLBnPuScZy6w9di2QEjsXM
---
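Reviewer note (illustrative, not part of the patch). Since lane S is not
runnable on this branch, the "haszero u64 trick" that the methods table
attributes to it can be illustrated in isolation. This is a generic SWAR
byte search written for this note, not the lane S source; `find_byte_swar`
is a hypothetical name, and the word is loaded little-endian so that the
lowest set mask bit marks the first match.

```rust
use std::convert::TryInto;

const LO: u64 = 0x0101_0101_0101_0101;
const HI: u64 = 0x8080_8080_8080_8080;

/// Return the index of the first `needle` byte in `haystack`, testing eight
/// bytes per step with the classic "has zero byte" bit trick.
fn find_byte_swar(haystack: &[u8], needle: u8) -> Option<usize> {
    // Broadcast the needle into every byte lane of a u64.
    let pattern = LO * needle as u64;
    let mut chunks = haystack.chunks_exact(8);
    let mut base = 0usize;
    for chunk in chunks.by_ref() {
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        // Lanes equal to the needle become 0x00 after the XOR.
        let x = word ^ pattern;
        // haszero: sets the high bit of the lowest zero lane. Higher lanes
        // may carry false positives, so only the lowest set bit is used.
        let mask = x.wrapping_sub(LO) & !x & HI;
        if mask != 0 {
            return Some(base + (mask.trailing_zeros() / 8) as usize);
        }
        base += 8;
    }
    // Scalar tail for the final < 8 bytes.
    chunks
        .remainder()
        .iter()
        .position(|&b| b == needle)
        .map(|p| base + p)
}
```

For a record such as `Hamburg;12.3\n`, the `;` is found in the first 8-byte
word and the `\n` falls to the scalar tail.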
 crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md b/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md
index 27cc6772..eca1ad1e 100644
--- a/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md
+++ b/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md
@@ -18,6 +18,13 @@
 > (native) per lane. Cardinality is ~400 stations (`gen.rs STATION_COUNT`);
 > results are for THIS workload, THIS machine, ONE corpus — see "Scope" at
 > the end.
+>
+> **Lane S provenance.** Lane S (SWAR) is NOT in this PR's diff — it ships in
+> the companion PR (`onebrc-probe: lane S`). Its rows below were measured with
+> that PR applied; on **this branch alone**, the runnable lanes are
+> `a c r f t8 t` and the `lane_s_agrees_with_lane_a` parity test lives in the
+> companion PR. S is kept in the ladder because the report's whole point is the
+> full RAM-table-vs-trie comparison, and S is the fastest RAM-table method.
 
 ## The methods (group-by-aggregate, min/max/sum/count per station)
 
@@ -148,7 +155,7 @@ count (10M). Not a claim about other CPUs, other cardinalities, or the full
 cd crates/onebrc-probe
 cargo build --release
 target/release/onebrc-probe gen /tmp/brc10m.txt 10000000 42   # if absent
-for lane in a c r f t8 t s; do
+for lane in a c r f t8 t; do   # add `s` once the companion lane-S PR is merged
   for i in $(seq 1 11); do target/release/onebrc-probe run /tmp/brc10m.txt $lane 4; done
 done   # take median + min/max/sd per lane, not best-of-N
 # native: RUSTFLAGS="-C target-cpu=native" cargo build --release --target-dir target-native

From 6d624ca503f2509e94d2d7ec03f9756e95d60b00 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Sat, 4 Jul 2026 07:14:32 +0000
Subject: [PATCH 3/3] onebrc-probe(report): undo lane-S-companion workaround
 post-#637 merge

#637 (lane S) merged to main; this branch is rebased on top, so lane_s and
its parity test are present and all ladder lanes (a c r f t8 t s) run here.
Restores `s` to the reproduce loop and rewrites the provenance note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01MLBnPuScZy6w9di2QEjsXM
---
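Reviewer note (illustrative, not part of the patch). The reproduce loop ends
with "take median + min/max/sd per lane, not best-of-N". A minimal sketch of
that reduction over the per-run `throughput_mrows_s` values is shown below.
`Summary` and `summarize` are hypothetical names, and the sample standard
deviation (n - 1 denominator) is an assumption: the report does not say which
estimator it used.

```rust
/// Per-lane summary over repeated runs (hypothetical helper, not in the crate).
#[derive(Debug, PartialEq)]
struct Summary {
    median: f64,
    min: f64,
    max: f64,
    sd: f64,
}

/// Reduce a lane's run results to median / min / max / sd.
/// Returns `None` for an empty slice.
fn summarize(samples: &[f64]) -> Option<Summary> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    // Middle element for odd n, mean of the two middle elements for even n.
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    let mean = sorted.iter().sum::<f64>() / n as f64;
    // ASSUMPTION: sample standard deviation (n - 1); 0.0 for a single run.
    let sd = if n > 1 {
        let ss: f64 = sorted.iter().map(|x| (x - mean).powi(2)).sum();
        (ss / (n - 1) as f64).sqrt()
    } else {
        0.0
    };
    Some(Summary {
        median,
        min: sorted[0],
        max: sorted[n - 1],
        sd,
    })
}
```

Feeding it the 11 values printed for one lane yields the four columns of the
outcome tables; the inputs in any quick check should be treated as made-up
samples, not as measurements from this report.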
 crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md b/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md
index eca1ad1e..0f6893c7 100644
--- a/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md
+++ b/crates/onebrc-probe/RESULTS_TRIE_VS_RAM.md
@@ -19,12 +19,12 @@
 > results are for THIS workload, THIS machine, ONE corpus — see "Scope" at
 > the end.
 >
-> **Lane S provenance.** Lane S (SWAR) is NOT in this PR's diff — it ships in
-> the companion PR (`onebrc-probe: lane S`). Its rows below were measured with
-> that PR applied; on **this branch alone**, the runnable lanes are
-> `a c r f t8 t` and the `lane_s_agrees_with_lane_a` parity test lives in the
-> companion PR. S is kept in the ladder because the report's whole point is the
-> full RAM-table-vs-trie comparison, and S is the fastest RAM-table method.
+> **Lane S provenance.** Lane S (SWAR) shipped separately (PR #637, merged to
+> `main`); this branch is rebased on top of it, so `lane_s` and its
+> `lane_s_agrees_with_lane_a` parity test are present here and all lanes in the
+> ladder — `a c r f t8 t s` — are runnable. S is kept in the ladder because the
+> report's whole point is the full RAM-table-vs-trie comparison, and S is the
+> fastest RAM-table method.
 
 ## The methods (group-by-aggregate, min/max/sum/count per station)
 
@@ -155,7 +155,7 @@ count (10M). Not a claim about other CPUs, other cardinalities, or the full
 cd crates/onebrc-probe
 cargo build --release
 target/release/onebrc-probe gen /tmp/brc10m.txt 10000000 42   # if absent
-for lane in a c r f t8 t; do   # add `s` once the companion lane-S PR is merged
+for lane in a c r f t8 t s; do
   for i in $(seq 1 11); do target/release/onebrc-probe run /tmp/brc10m.txt $lane 4; done
 done   # take median + min/max/sd per lane, not best-of-N
 # native: RUSTFLAGS="-C target-cpu=native" cargo build --release --target-dir target-native