tile-layout-bench

A reproducible CPU benchmark that asks a deceptively simple question:

For a 2D grid of tiles, does the physical memory layout — row-major, chunked, space-filling-curve, hex — actually change how fast a colony-sim / city-builder spends its CPU? And if so, by how much, on which workloads, and why?

A 2D game operation paints the same shape on the map, but each layout paints a different pattern in memory. The benchmark measures which patterns run fastest — and finds the answer depends on the workload and the chip more than on the layout's reputation. 33 layouts (21 square + 12 hex; a LAYOUT_BENCH_CURVE_B_SWEEP flag adds 9 more) are measured across 26 colony-sim workload families — 23 on both grids plus 3 square-only — most with sequential and parallel variants (radius/rect queries, cellular-sim stencils, BFS / A* / hierarchical pathfinding, line-of-sight, flow fields, broad-phase, bit-packing) on two architectures (Apple M3 Max, Intel Sapphire Rapids), under a strict cross-layout bit-exact-equivalence gate.

The same 16-tile world, stored three ways. Every layout keeps each tile's data identical and changes only the order in memory — row-major (row by row), chunked (small 2D blocks kept contiguous), or a space-filling curve (the example shows a Hilbert order within each chunk). That memory order is the only thing the benchmark varies.

The headline is not which layout "won" — it is that grid choice (square vs hex) barely matters, while the within-grid layout choice is the real lever, and the best layout flips with the CPU. Both are quantified below.

TL;DR — the verdict

Grid choice (square vs hex) is a near-tie on both architectures. Across five colony-sim presets, the cheaper grid is the same on M3 and x86 (bar two dead-tie cells), and every margin is ≤1.16×. If you have a square map already, hex buys you almost nothing; if you have hex, square costs almost nothing.

The lever is the within-grid layout choice — and it depends on the workload and the CPU. Under parallel load, chunked/halo layouts beat plain row-major by ~15–38% on x86 for the stencil/diffusion-heavy mix these presets model, but on M3 they barely help (≤~9%) and plain row-major usually wins. So the default differs by CPU:

	Apple M3 Max	Intel Sapphire Rapids
fastest square layout (parallel mix)	plain row-major (`row_major_dense_simd`)	chunked/halo (`chunked_row_major_64`, `sierpinski_chunked_halo_64`, …)
cost of shipping the other CPU's choice	—	+15–38%
cross-grid margin	near-tie (≤1.06×)	near-tie (≤1.16×)

A layout is never chosen on performance alone. Memory budget (halo storage is (B+2)² per chunk), code complexity, and your actual workload mix all weigh in — the numbers here are one input, measured for a specific colony-sim call mix. They tell you where the layout matters and by how much, not that you should swap layouts mechanically.

Percentage convention: README/RESULTS quote row-major's extra cost, (ratio − 1). docs/decision_views/ quotes the same gap as the best layout's savings, (1 − 1/ratio) (so x86 square reads ~13–27% there). Same data, reciprocal framing — not a discrepancy.

The data at a glance

The benchmark's product is a set of generated decision views. Three of them summarise the trade-offs below; the full set (per architecture, per framing, plus the optimization ladder) is in docs/decision_views/.

1. By workload — which layout family to use

Each cell is a layout family's best variant divided by the fastest family for that workload (x86, W=1024); 1.00 (pale) is the family to use, dark is slow. Plain row-major wins or ties most workloads. Chunked families pull ahead on the span/scan-heavy ones — rect, find_first, stencil8, diffusion. The space-filling-curve-within-a-chunk family (full-block) collapses on exactly those workloads (find_first up to ~1500× slower) because it loses the interior stride-math fast path. The practical read: default to row-major, reach for chunking when your hot workloads are span/stencil queries, and never order tiles by a curve inside a chunk. (Shown at W=1024; world size shifts the chunk-LUT pressure — see RESULTS § World-size sensitivity.)

2. By situation — does the grid choice change the answer?

For each colony-sim preset, how much the cheaper grid beats the other (1.00 = dead tie) — x86 left, M3 right. Bars hug 1.0 on both CPUs: the grid is a coin-flip. Colour marks which grid is cheaper; the hatched bars are the parallel (engine-threads) framing. The verdict is arch-stable — the same grid wins each preset on both machines, bar two dead-tie cells. (Genre scores weight workloads by an author-estimated call mix; this is a relative-CPU model, not a frame budget.)

3. By thread count — how layouts scale in parallel

On 96 physical cores (x86 bare metal), the scalable shape is independent work with per-thread state: A* queries where each thread owns its buffers scale to ~34×, knee at the physical-core count, dipping past it onto SMT siblings. Splitting one grid pass over a shared buffer (stencil/diffusion band-split) anti-scales — slower than serial at every thread count, because the buffer is first-touched on one NUMA node and most threads then read remote memory. This is a lesson about how you parallelize, not which layout you pick.

How to read the full set: V1 leaderboard (above), V2 optimization ladder (which optimizations pay off), V3 single-layout cost per genre, V4 best-choice summary (above), V5 thread scaling — each generated per architecture by analysis/decision_views.py / plot_scaling.py. See docs/decision_views/.

Quickstart

Requires CMake ≥ 3.24, a C++23 compiler (AppleClang 16+, or Clang 16+ / GCC 13+ on Linux — prefer Clang), and Python 3.11+ for analysis.

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure        # all 9 must pass
./scripts/run_local.sh                              # build + bench + plots, < 1 min

The first cmake configure fetches Google Highway 1.3.0 via CMake FetchContent (used by the SIMD kernel), so the initial build needs network access. Subsequent builds are offline.

ctest is the trust gate: every layout must produce bit-exact results for the same query (test_equivalence, test_hex_equivalence), and the parallel workloads must be thread-count-invariant. Numbers from a build with failing tests are not trustworthy.

Results

docs/decision_views/ — the generated decision views (×2 architectures). Start here.
RESULTS.md — the current verdict + per-workload ratio tables, parallel-scaling detail, robustness sweeps, and the compiler/ methodology caveats.
docs/archive/results-history.md — the full chronological lab notebook (optimization iterations, rejected approaches, superseded runs). The interesting process; secondary to the verdict.
Raw per-run CSVs live in the GCS bucket (large); small provenance (fingerprints, scores, ctest logs) is committed under results/.

How it works

A game reasons about a meaningful 2D world — scan an area, pathfind, find the nearest tile. The CPU sees flat 1D memory, cache lines, and prefetchers. The layout is the mapping between the two, so it decides which game operations stay cache-friendly and which thrash. The benchmark times those operations across every layout.

docs/layouts.md — every layout's index formula, chunk/storage model, halo semantics, and world-size constraints.
docs/workloads.md — every workload's algorithm, the colony-sim operation it models, CSV keys, swept parameters.
docs/relationships.md — why a layout wins a workload, by memory-access pattern (the meta-map).
Genre scoring weights each layout by Σ median_ns × calls_per_second for a colony-sim call mix (analysis/genres.py). The Hz are author-estimated, not measured — this is a relative-CPU model, not a frame budget. The scorer warns loudly on any (grid, workload) that matches zero rows, so a coverage gap can't masquerade as a zero cost.

Fairness: one world + seed + query-sample-set is shared across all layouts in a run, so the only variable is layout.index() and its cache behaviour.

Reproduce

Locally: ./scripts/run_local.sh (square) and ./build/hex_layout_bench --dev (hex). analysis/aggregate.py → score_genre.py → plot.py → decision_views.py.
Cross-architecture on GCP (x86, ~$0.15/run, auto-cleanup): ./scripts/run_gcp_bench.sh --bucket=gs://… --cxx=clang-20 — see docs/gcp/README.md.
Profiling / the "why" (samply, disassembly, opt-remarks, static cache-line analysis, HW counters): docs/perf/README.md.

Caveats

Relative-CPU model, not frame time. Rankings are conditional on the author-estimated call mix in analysis/genres.py; adjust for your game.
Synthetic kernels, single-process. No multi-layer/3-D worlds, fluids, or GPU. See docs/roadmap.md for what's deliberately out of scope.
x86 HW-counter evidence required a bare-metal host (virtualized C3 exposes no PMU). That run is done (tlb-20260531-054123-2ccf, c3-standard-192-metal, counters 100% populated); it produced the bare-metal parallel-scaling result and the counter-based attribution of the arch-split — the halo win is instruction-count-bound, not a cache effect (docs/perf/measurements/metal_counter_attribution.md). M3 "why" evidence (static analysis, disassembly, profiles) is in docs/perf/measurements/.

Repository map

Path	What
`RESULTS.md`	current verdict + per-workload data
`docs/decision_views/`	the generated decision plots, ×2 arches
`docs/layouts.md` · `docs/workloads.md` · `docs/relationships.md`	reference: layouts, workloads, the why-map
`docs/roadmap.md`	source-of-truth pipeline (layouts/workloads/optimizations)
`docs/gcp/README.md` · `docs/perf/README.md`	cloud runs · profiling
`docs/perf/guides/`	generic deep-profiling guides for your own agents (diagnosis flowchart · interpretation traps · M3 & x86-Linux capture recipes)
`docs/archive/results-history.md`	the full chronological story
`results/README.md`	benchmark output index: runs of record, tracked-vs-ignored policy, naming
`src/` · `bench/` · `tests/` · `analysis/` · `scripts/`	harness, drivers, tests, analysis, runners
`AGENTS.md`	agent operating guide (this is an agent-maintained repo; `CLAUDE.md` imports it)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

tile-layout-bench

TL;DR — the verdict

The data at a glance

1. By workload — which layout family to use

2. By situation — does the grid choice change the answer?

3. By thread count — how layouts scale in parallel

Quickstart

Results

How it works

Reproduce

Caveats

Repository map

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 173 Commits
.github/workflows		.github/workflows
analysis		analysis
bench		bench
docs		docs
results		results
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
RESULTS.md		RESULTS.md

Folders and files

Latest commit

History

Repository files navigation

tile-layout-bench

TL;DR — the verdict

The data at a glance

1. By workload — which layout family to use

2. By situation — does the grid choice change the answer?

3. By thread count — how layouts scale in parallel

Quickstart

Results

How it works

Reproduce

Caveats

Repository map

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages