Skip to content

kindjie/tile-layout-bench

Repository files navigation

tile-layout-bench

CI

A reproducible CPU benchmark that asks a deceptively simple question:

For a 2D grid of tiles, does the physical memory layout — row-major, chunked, space-filling-curve, hex — actually change how fast a colony-sim / city-builder spends its CPU? And if so, by how much, on which workloads, and why?

A 2D game operation paints the same shape on the map, but each layout paints a different pattern in memory. The benchmark measures which patterns run fastest — and finds the answer depends on the workload and the chip more than on the layout's reputation. 33 layouts (21 square + 12 hex; a LAYOUT_BENCH_CURVE_B_SWEEP flag adds 9 more) are measured across 26 colony-sim workload families — 23 on both grids plus 3 square-only — most with sequential and parallel variants (radius/rect queries, cellular-sim stencils, BFS / A* / hierarchical pathfinding, line-of-sight, flow fields, broad-phase, bit-packing) on two architectures (Apple M3 Max, Intel Sapphire Rapids), under a strict cross-layout bit-exact-equivalence gate.

The same 4x4 tile world stored three ways in 1D memory — row-major, chunked, and a space-filling curve — each preserving every tile's data but producing a different memory access pattern.

The same 16-tile world, stored three ways. Every layout keeps each tile's data identical and changes only the order in memory — row-major (row by row), chunked (small 2D blocks kept contiguous), or a space-filling curve (the example shows a Hilbert order within each chunk). That memory order is the only thing the benchmark varies.

The headline is not which layout "won" — it is that grid choice (square vs hex) barely matters, while the within-grid layout choice is the real lever, and the best layout flips with the CPU. Both are quantified below.


TL;DR — the verdict

Grid choice (square vs hex) is a near-tie on both architectures. Across five colony-sim presets, the cheaper grid is the same on M3 and x86 (bar two dead-tie cells), and every margin is ≤1.16×. If you have a square map already, hex buys you almost nothing; if you have hex, square costs almost nothing.

The lever is the within-grid layout choice — and it depends on the workload and the CPU. Under parallel load, chunked/halo layouts beat plain row-major by ~15–38% on x86 for the stencil/diffusion-heavy mix these presets model, but on M3 they barely help (≤~9%) and plain row-major usually wins. So the default differs by CPU:

Apple M3 Max Intel Sapphire Rapids
fastest square layout (parallel mix) plain row-major (row_major_dense_simd) chunked/halo (chunked_row_major_64, sierpinski_chunked_halo_64, …)
cost of shipping the other CPU's choice +15–38%
cross-grid margin near-tie (≤1.06×) near-tie (≤1.16×)

A layout is never chosen on performance alone. Memory budget (halo storage is (B+2)² per chunk), code complexity, and your actual workload mix all weigh in — the numbers here are one input, measured for a specific colony-sim call mix. They tell you where the layout matters and by how much, not that you should swap layouts mechanically.

Percentage convention: README/RESULTS quote row-major's extra cost, (ratio − 1). docs/decision_views/ quotes the same gap as the best layout's savings, (1 − 1/ratio) (so x86 square reads ~13–27% there). Same data, reciprocal framing — not a discrepancy.


The data at a glance

The benchmark's product is a set of generated decision views. Three of them summarise the trade-offs below; the full set (per architecture, per framing, plus the optimization ladder) is in docs/decision_views/.

1. By workload — which layout family to use

Layout-type leaderboard: each layout family vs the fastest family per workload, x86, square grid

Each cell is a layout family's best variant divided by the fastest family for that workload (x86, W=1024); 1.00 (pale) is the family to use, dark is slow. Plain row-major wins or ties most workloads. Chunked families pull ahead on the span/scan-heavy ones — rect, find_first, stencil8, diffusion. The space-filling-curve-within-a-chunk family (full-block) collapses on exactly those workloads (find_first up to ~1500× slower) because it loses the interior stride-math fast path. The practical read: default to row-major, reach for chunking when your hot workloads are span/stencil queries, and never order tiles by a curve inside a chunk. (Shown at W=1024; world size shifts the chunk-LUT pressure — see RESULTS § World-size sensitivity.)

2. By situation — does the grid choice change the answer?

Cross-grid margin per colony-sim preset on x86 Cross-grid margin per colony-sim preset on M3

For each colony-sim preset, how much the cheaper grid beats the other (1.00 = dead tie) — x86 left, M3 right. Bars hug 1.0 on both CPUs: the grid is a coin-flip. Colour marks which grid is cheaper; the hatched bars are the parallel (engine-threads) framing. The verdict is arch-stable — the same grid wins each preset on both machines, bar two dead-tie cells. (Genre scores weight workloads by an author-estimated call mix; this is a relative-CPU model, not a frame budget.)

3. By thread count — how layouts scale in parallel

Thread scaling on 96 physical cores: independent A* queries scale to ~34x while shared band-split stencil/diffusion anti-scale

On 96 physical cores (x86 bare metal), the scalable shape is independent work with per-thread state: A* queries where each thread owns its buffers scale to ~34×, knee at the physical-core count, dipping past it onto SMT siblings. Splitting one grid pass over a shared buffer (stencil/diffusion band-split) anti-scales — slower than serial at every thread count, because the buffer is first-touched on one NUMA node and most threads then read remote memory. This is a lesson about how you parallelize, not which layout you pick.

How to read the full set: V1 leaderboard (above), V2 optimization ladder (which optimizations pay off), V3 single-layout cost per genre, V4 best-choice summary (above), V5 thread scaling — each generated per architecture by analysis/decision_views.py / plot_scaling.py. See docs/decision_views/.


Quickstart

Requires CMake ≥ 3.24, a C++23 compiler (AppleClang 16+, or Clang 16+ / GCC 13+ on Linux — prefer Clang), and Python 3.11+ for analysis.

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure        # all 9 must pass
./scripts/run_local.sh                              # build + bench + plots, < 1 min

The first cmake configure fetches Google Highway 1.3.0 via CMake FetchContent (used by the SIMD kernel), so the initial build needs network access. Subsequent builds are offline.

ctest is the trust gate: every layout must produce bit-exact results for the same query (test_equivalence, test_hex_equivalence), and the parallel workloads must be thread-count-invariant. Numbers from a build with failing tests are not trustworthy.


Results

  • docs/decision_views/ — the generated decision views (×2 architectures). Start here.
  • RESULTS.md — the current verdict + per-workload ratio tables, parallel-scaling detail, robustness sweeps, and the compiler/ methodology caveats.
  • docs/archive/results-history.md — the full chronological lab notebook (optimization iterations, rejected approaches, superseded runs). The interesting process; secondary to the verdict.
  • Raw per-run CSVs live in the GCS bucket (large); small provenance (fingerprints, scores, ctest logs) is committed under results/.

How it works

The game/human view of a 2D world — coordinates, neighbors, operations like scan-area, pathfind, flood-fill — contrasted with the CPU/hardware view of flat 1D memory, cache lines, prefetchers, TLB, and branches, connected by the layout mapping.

A game reasons about a meaningful 2D world — scan an area, pathfind, find the nearest tile. The CPU sees flat 1D memory, cache lines, and prefetchers. The layout is the mapping between the two, so it decides which game operations stay cache-friendly and which thrash. The benchmark times those operations across every layout.

  • docs/layouts.md — every layout's index formula, chunk/storage model, halo semantics, and world-size constraints.
  • docs/workloads.md — every workload's algorithm, the colony-sim operation it models, CSV keys, swept parameters.
  • docs/relationships.mdwhy a layout wins a workload, by memory-access pattern (the meta-map).
  • Genre scoring weights each layout by Σ median_ns × calls_per_second for a colony-sim call mix (analysis/genres.py). The Hz are author-estimated, not measured — this is a relative-CPU model, not a frame budget. The scorer warns loudly on any (grid, workload) that matches zero rows, so a coverage gap can't masquerade as a zero cost.

Fairness: one world + seed + query-sample-set is shared across all layouts in a run, so the only variable is layout.index() and its cache behaviour.


Reproduce

  • Locally: ./scripts/run_local.sh (square) and ./build/hex_layout_bench --dev (hex). analysis/aggregate.pyscore_genre.pyplot.pydecision_views.py.
  • Cross-architecture on GCP (x86, ~$0.15/run, auto-cleanup): ./scripts/run_gcp_bench.sh --bucket=gs://… --cxx=clang-20 — see docs/gcp/README.md.
  • Profiling / the "why" (samply, disassembly, opt-remarks, static cache-line analysis, HW counters): docs/perf/README.md.

Caveats

  • Relative-CPU model, not frame time. Rankings are conditional on the author-estimated call mix in analysis/genres.py; adjust for your game.
  • Synthetic kernels, single-process. No multi-layer/3-D worlds, fluids, or GPU. See docs/roadmap.md for what's deliberately out of scope.
  • x86 HW-counter evidence required a bare-metal host (virtualized C3 exposes no PMU). That run is done (tlb-20260531-054123-2ccf, c3-standard-192-metal, counters 100% populated); it produced the bare-metal parallel-scaling result and the counter-based attribution of the arch-split — the halo win is instruction-count-bound, not a cache effect (docs/perf/measurements/metal_counter_attribution.md). M3 "why" evidence (static analysis, disassembly, profiles) is in docs/perf/measurements/.

Repository map

Path What
RESULTS.md current verdict + per-workload data
docs/decision_views/ the generated decision plots, ×2 arches
docs/layouts.md · docs/workloads.md · docs/relationships.md reference: layouts, workloads, the why-map
docs/roadmap.md source-of-truth pipeline (layouts/workloads/optimizations)
docs/gcp/README.md · docs/perf/README.md cloud runs · profiling
docs/perf/guides/ generic deep-profiling guides for your own agents (diagnosis flowchart · interpretation traps · M3 & x86-Linux capture recipes)
docs/archive/results-history.md the full chronological story
results/README.md benchmark output index: runs of record, tracked-vs-ignored policy, naming
src/ · bench/ · tests/ · analysis/ · scripts/ harness, drivers, tests, analysis, runners
AGENTS.md agent operating guide (this is an agent-maintained repo; CLAUDE.md imports it)

License

MIT — see LICENSE. © 2026 Owen Wiggins.

About

A reproducible CPU benchmark: does a 2D tile grid's memory layout change a colony-sim's CPU cost? Across 26 workloads on Apple M3 + Intel Sapphire Rapids — the grid barely matters, the within-grid layout does, and it depends on the workload and the CPU.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors