A reproducible CPU benchmark that asks a deceptively simple question:
For a 2D grid of tiles, does the physical memory layout — row-major, chunked, space-filling-curve, hex — actually change how fast a colony-sim / city-builder spends its CPU? And if so, by how much, on which workloads, and why?
A 2D game operation paints the same shape on the map, but each layout paints a
different pattern in memory. The benchmark measures which patterns run fastest —
and finds the answer depends on the workload and the chip more than on the
layout's reputation. 33
layouts (21 square + 12 hex; a LAYOUT_BENCH_CURVE_B_SWEEP flag adds 9 more)
are measured across 26 colony-sim workload families — 23 on both grids plus 3
square-only — most with sequential and parallel variants
(radius/rect queries, cellular-sim stencils, BFS / A* / hierarchical
pathfinding, line-of-sight, flow fields, broad-phase, bit-packing) on two
architectures (Apple M3 Max, Intel Sapphire Rapids), under a strict
cross-layout bit-exact-equivalence gate.
The same 16-tile world, stored three ways. Every layout keeps each tile's data identical and changes only the order in memory — row-major (row by row), chunked (small 2D blocks kept contiguous), or a space-filling curve (the example shows a Hilbert order within each chunk). That memory order is the only thing the benchmark varies.
The headline is not which layout "won" — it is that grid choice (square vs hex) barely matters, while the within-grid layout choice is the real lever, and the best layout flips with the CPU. Both are quantified below.
Grid choice (square vs hex) is a near-tie on both architectures. Across five colony-sim presets, the cheaper grid is the same on M3 and x86 (bar two dead-tie cells), and every margin is ≤1.16×. If you have a square map already, hex buys you almost nothing; if you have hex, square costs almost nothing.
The lever is the within-grid layout choice — and it depends on the workload and the CPU. Under parallel load, chunked/halo layouts beat plain row-major by ~15–38% on x86 for the stencil/diffusion-heavy mix these presets model, but on M3 they barely help (≤~9%) and plain row-major usually wins. So the default differs by CPU:
| Apple M3 Max | Intel Sapphire Rapids | |
|---|---|---|
| fastest square layout (parallel mix) | plain row-major (row_major_dense_simd) |
chunked/halo (chunked_row_major_64, sierpinski_chunked_halo_64, …) |
| cost of shipping the other CPU's choice | — | +15–38% |
| cross-grid margin | near-tie (≤1.06×) | near-tie (≤1.16×) |
A layout is never chosen on performance alone. Memory budget (halo storage is
(B+2)²per chunk), code complexity, and your actual workload mix all weigh in — the numbers here are one input, measured for a specific colony-sim call mix. They tell you where the layout matters and by how much, not that you should swap layouts mechanically.
Percentage convention: README/RESULTS quote row-major's extra cost,
(ratio − 1).docs/decision_views/quotes the same gap as the best layout's savings,(1 − 1/ratio)(so x86 square reads ~13–27% there). Same data, reciprocal framing — not a discrepancy.
The benchmark's product is a set of generated decision views. Three of them
summarise the trade-offs below; the full set (per architecture, per framing,
plus the optimization ladder) is in
docs/decision_views/.
Each cell is a layout family's best variant divided by the fastest family for
that workload (x86, W=1024); 1.00 (pale) is the family to use, dark is slow.
Plain row-major wins or ties most workloads. Chunked families pull ahead
on the span/scan-heavy ones — rect, find_first, stencil8, diffusion.
The space-filling-curve-within-a-chunk family (full-block) collapses on
exactly those workloads (find_first up to ~1500× slower) because it loses the
interior stride-math fast path. The practical read: default to row-major, reach
for chunking when your hot workloads are span/stencil queries, and never order
tiles by a curve inside a chunk. (Shown at W=1024; world size shifts the
chunk-LUT pressure — see RESULTS § World-size sensitivity.)
For each colony-sim preset, how much the cheaper grid beats the other (1.00 = dead tie) — x86 left, M3 right. Bars hug 1.0 on both CPUs: the grid is a coin-flip. Colour marks which grid is cheaper; the hatched bars are the parallel (engine-threads) framing. The verdict is arch-stable — the same grid wins each preset on both machines, bar two dead-tie cells. (Genre scores weight workloads by an author-estimated call mix; this is a relative-CPU model, not a frame budget.)
On 96 physical cores (x86 bare metal), the scalable shape is independent work with per-thread state: A* queries where each thread owns its buffers scale to ~34×, knee at the physical-core count, dipping past it onto SMT siblings. Splitting one grid pass over a shared buffer (stencil/diffusion band-split) anti-scales — slower than serial at every thread count, because the buffer is first-touched on one NUMA node and most threads then read remote memory. This is a lesson about how you parallelize, not which layout you pick.
How to read the full set: V1 leaderboard (above), V2 optimization ladder (which optimizations pay off), V3 single-layout cost per genre, V4 best-choice summary (above), V5 thread scaling — each generated per architecture by
analysis/decision_views.py/plot_scaling.py. Seedocs/decision_views/.
Requires CMake ≥ 3.24, a C++23 compiler (AppleClang 16+, or Clang 16+ / GCC 13+ on Linux — prefer Clang), and Python 3.11+ for analysis.
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure # all 9 must pass
./scripts/run_local.sh # build + bench + plots, < 1 minThe first
cmakeconfigure fetches Google Highway 1.3.0 via CMakeFetchContent(used by the SIMD kernel), so the initial build needs network access. Subsequent builds are offline.
ctest is the trust gate: every layout must produce bit-exact results
for the same query (test_equivalence, test_hex_equivalence), and the
parallel workloads must be thread-count-invariant. Numbers from a build
with failing tests are not trustworthy.
docs/decision_views/— the generated decision views (×2 architectures). Start here.RESULTS.md— the current verdict + per-workload ratio tables, parallel-scaling detail, robustness sweeps, and the compiler/ methodology caveats.docs/archive/results-history.md— the full chronological lab notebook (optimization iterations, rejected approaches, superseded runs). The interesting process; secondary to the verdict.- Raw per-run CSVs live in the GCS bucket (large); small provenance
(fingerprints, scores, ctest logs) is committed under
results/.
A game reasons about a meaningful 2D world — scan an area, pathfind, find the nearest tile. The CPU sees flat 1D memory, cache lines, and prefetchers. The layout is the mapping between the two, so it decides which game operations stay cache-friendly and which thrash. The benchmark times those operations across every layout.
docs/layouts.md— every layout's index formula, chunk/storage model, halo semantics, and world-size constraints.docs/workloads.md— every workload's algorithm, the colony-sim operation it models, CSV keys, swept parameters.docs/relationships.md— why a layout wins a workload, by memory-access pattern (the meta-map).- Genre scoring weights each layout by
Σ median_ns × calls_per_secondfor a colony-sim call mix (analysis/genres.py). The Hz are author-estimated, not measured — this is a relative-CPU model, not a frame budget. The scorer warns loudly on any (grid, workload) that matches zero rows, so a coverage gap can't masquerade as a zero cost.
Fairness: one world + seed + query-sample-set is shared across all layouts
in a run, so the only variable is layout.index() and its cache behaviour.
- Locally:
./scripts/run_local.sh(square) and./build/hex_layout_bench --dev(hex).analysis/aggregate.py→score_genre.py→plot.py→decision_views.py. - Cross-architecture on GCP (x86, ~$0.15/run, auto-cleanup):
./scripts/run_gcp_bench.sh --bucket=gs://… --cxx=clang-20— seedocs/gcp/README.md. - Profiling / the "why" (samply, disassembly, opt-remarks, static
cache-line analysis, HW counters):
docs/perf/README.md.
- Relative-CPU model, not frame time. Rankings are conditional on the
author-estimated call mix in
analysis/genres.py; adjust for your game. - Synthetic kernels, single-process. No multi-layer/3-D worlds, fluids,
or GPU. See
docs/roadmap.mdfor what's deliberately out of scope. - x86 HW-counter evidence required a bare-metal host (virtualized C3
exposes no PMU). That run is done (
tlb-20260531-054123-2ccf,c3-standard-192-metal, counters 100% populated); it produced the bare-metal parallel-scaling result and the counter-based attribution of the arch-split — the halo win is instruction-count-bound, not a cache effect (docs/perf/measurements/metal_counter_attribution.md). M3 "why" evidence (static analysis, disassembly, profiles) is indocs/perf/measurements/.
| Path | What |
|---|---|
RESULTS.md |
current verdict + per-workload data |
docs/decision_views/ |
the generated decision plots, ×2 arches |
docs/layouts.md · docs/workloads.md · docs/relationships.md |
reference: layouts, workloads, the why-map |
docs/roadmap.md |
source-of-truth pipeline (layouts/workloads/optimizations) |
docs/gcp/README.md · docs/perf/README.md |
cloud runs · profiling |
docs/perf/guides/ |
generic deep-profiling guides for your own agents (diagnosis flowchart · interpretation traps · M3 & x86-Linux capture recipes) |
docs/archive/results-history.md |
the full chronological story |
results/README.md |
benchmark output index: runs of record, tracked-vs-ignored policy, naming |
src/ · bench/ · tests/ · analysis/ · scripts/ |
harness, drivers, tests, analysis, runners |
AGENTS.md |
agent operating guide (this is an agent-maintained repo; CLAUDE.md imports it) |
MIT — see LICENSE. © 2026 Owen Wiggins.





