Get benchmarking in 5 minutes on multi-core NUMA systems.
- Python 3.7+
- `numactl` (for NUMA pinning)
- At least one llama.cpp build with a `llama-bench` binary
- Multi-core CPU with NUMA (AMD Threadripper, EPYC, Intel Xeon, etc.)
```bash
cd /var/www/hft-cpu-test
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Start with the minimal template:

```bash
cp configs/minimal.yaml configs/mytest.yaml
```

Edit `configs/mytest.yaml`:
- Set the model path:

  ```yaml
  model:
    path: /data/models/Qwen2.5-14B-Q4_K_M.gguf
    name: Qwen2.5-14B-Q4_K_M
  ```
- Set the build path:

  ```yaml
  builds:
    - name: my-build
      path: /opt/llama.cpp/build/bin/llama-bench
      provider: OpenBLAS  # or BLIS-OpenMP, MKL, none
      env:
        OPENBLAS_NUM_THREADS: "1"
        OMP_NUM_THREADS: "1"

  builds_select:
    - my-build
  ```
- Configure NUMA pinning (adjust for your CPU):

  First, check your CPU topology:

  ```bash
  lscpu --parse=CPU,Core,Node
  numactl --hardware
  ```

  Then update the `physcpubind` values with your physical core IDs (not SMT/HT siblings):

  ```yaml
  pinning:
    presets:
      all-cores:
        numactl: "-N 0,1 -m 0,1 --physcpubind=<your_physical_core_ids>"
  ```
Check that your system is ready:

```bash
./scripts/check_system.sh
```

Ensure the CPU governor is set to "performance":

```bash
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

Disable NUMA balancing for reproducible results:

```bash
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
```

Test your config without executing benchmarks:

```bash
./run_bench.sh configs/mytest.yaml --dry-run
```

This shows the test matrix that will be executed.
```bash
./run_bench.sh configs/mytest.yaml
```

Results:

- Runs all test combinations (2-3 reps each)
- Takes ~10-30 minutes depending on matrix size
- Outputs to `reports/<timestamp>-exploratory/`
```bash
cat reports/latest/summary.md
```

Look for:
- Top performers per metric (pp512, tg128, mixed)
- Consistency across repetitions
- Any obvious outliers or anomalies
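To eyeball consistency across repetitions, a small script can flag configurations whose run-to-run spread is large. This is a sketch with inline sample data; the harness's actual result format under `reports/` may differ, so adapt the loading step:

```python
import statistics

# Hypothetical rows: (config_name, metric, tokens_per_second), one per repetition.
# In practice, load these from the files under reports/<timestamp>-exploratory/.
rows = [
    ("my-build/all-cores", "pp512", 144.9),
    ("my-build/all-cores", "pp512", 145.8),
    ("my-build/all-cores", "pp512", 145.2),
    ("my-build/all-cores", "tg128", 47.1),
    ("my-build/all-cores", "tg128", 48.4),
    ("my-build/all-cores", "tg128", 39.9),  # suspicious outlier
]

def flag_inconsistent(rows, cv_threshold=0.05):
    """Return (config, metric) pairs whose coefficient of variation
    (stdev / mean across repetitions) exceeds the threshold."""
    groups = {}
    for config, metric, value in rows:
        groups.setdefault((config, metric), []).append(value)
    flagged = []
    for key, values in groups.items():
        mean = statistics.mean(values)
        cv = statistics.stdev(values) / mean if len(values) > 1 else 0.0
        if cv > cv_threshold:
            flagged.append(key)
    return flagged

print(flag_inconsistent(rows))  # only tg128 is flagged here
```

Anything flagged this way is worth re-running before you trust it in the exploratory ranking.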
After the exploratory run identifies winners, create a deep config to test parameter variations:
```bash
# Copy starter template
cp configs/example-deep.yaml configs/mytest-deep.yaml

# Edit to use your winning builds and add parameter sweeps
nano configs/mytest-deep.yaml
```

Deep testing provides:
- Parameter sweeps: KV cache types, MLA variants, batch sizes
- Combination testing to find optimal settings
- The same repetitions (3), but many more parameter variations
- Production-ready recommendations with optimal parameters
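The size of a sweep matrix is simply the product of the per-parameter value counts, so it grows fast. A quick sketch with hypothetical sweep values (take the real ones from your deep config):

```python
import itertools

# Hypothetical sweep values for illustration only; your deep config
# defines the actual KV cache types and batch/ubatch sizes.
kv_types = ["f16", "q8_0"]
batch_sizes = [128, 256, 512]
ubatch_sizes = [64, 96]
repetitions = 3

# Every combination of the swept parameters is one test case.
matrix = list(itertools.product(kv_types, batch_sizes, ubatch_sizes))
total_runs = len(matrix) * repetitions
print(f"{len(matrix)} combinations x {repetitions} reps = {total_runs} runs")
for kv, b, ub in matrix[:3]:
    print(f"  kv={kv} b={b} ub={ub}")
```

This is why the `--dry-run` step matters: it shows the full matrix before you commit the machine time.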
From `reports/latest/summary.md`, identify the winning configuration:

```
Winner: blis-omp-znver1 / all-16-cores / b=256,ub=96 / kv=q8_0 / mla3-fa
- pp512: 145.3 t/s ±2.1
- tg128: 47.8 t/s ±0.8
- mixed: 52.1 t/s ±1.2
```
```bash
sudo apt-get install numactl   # Debian/Ubuntu
sudo yum install numactl       # RHEL/CentOS
```

The llama-bench output format may have changed. Check:

```bash
/path/to/llama-bench --help
/path/to/llama-bench -m model.gguf -p 10 -n 0   # Quick test
```

Check:
```bash
# CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should be "performance"

# NUMA balancing
cat /proc/sys/kernel/numa_balancing
# Should be "0"

# Verify the binary is optimized
file /path/to/llama-bench
ldd /path/to/llama-bench | grep -E 'blas|omp'
```

- Read `docs/CONFIG_SCHEMA.md` for the full YAML reference
- Read `docs/PROVENANCE.md` to understand the captured data
- Customize scenarios for your workload
- Use `scripts/setup_builds.sh` to create multiple builds
```bash
# 1. One-time setup
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# 2. System prep
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 0 | sudo tee /proc/sys/kernel/numa_balancing

# 3. Create config
cp configs/minimal.yaml configs/prod-test.yaml
nano configs/prod-test.yaml  # Edit model/build paths and CPU bindings

# 4. Verify setup
./scripts/check_system.sh
./run_bench.sh configs/prod-test.yaml --dry-run

# 5. Exploratory (fast sweep)
./run_bench.sh configs/prod-test.yaml
# ~15-20 minutes

# 6. Review winners
cat reports/latest/summary.md
cat reports/latest/promote.yaml

# 7. Deep test winners
./run_bench.sh reports/latest/promote.yaml
# ~30-40 minutes

# 8. Final decision
cat reports/latest/summary.md
# Apply the winning config to production!
```

Every CPU is different. Here's how to configure pinning correctly:
```bash
lscpu --parse=CPU,Core,Node
```

Look for the pattern:
- Physical cores: Unique "Core" numbers (use these!)
- SMT/HT siblings: Duplicate "Core" numbers (skip these)
AMD Threadripper/EPYC (sequential):

- Physical cores: 0, 1, 2, 3, ... N
- SMT siblings: N+1, N+2, ... 2N
- Use: `--physcpubind=0,1,2,3,...,N`

Some Intel Xeon (even-numbered):

- Physical cores: 0, 2, 4, 6, ... 2N
- SMT siblings: 1, 3, 5, 7, ... 2N+1
- Use: `--physcpubind=0,2,4,6,...,2N`
Replace the `--physcpubind` values in your YAML with your physical core IDs.
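If you'd rather not read the `lscpu` table by hand, a short script can derive the list. This sketch assumes the first CPU listed for each physical core is the one to pin (true for both numbering schemes above, since SMT siblings repeat an already-seen Core number later in the output); verify the result against `numactl --hardware` on your machine:

```python
def physical_core_ids(lscpu_parse_output):
    """From `lscpu --parse=CPU,Core,Node` output, keep the first CPU
    listed for each physical core; later duplicates of a Core number
    are SMT/HT siblings and are skipped."""
    seen_cores = set()
    physical = []
    for line in lscpu_parse_output.splitlines():
        if not line or line.startswith("#"):
            continue  # skip lscpu's comment/header lines
        cpu, core, node = (int(x) for x in line.split(",")[:3])
        if core not in seen_cores:
            seen_cores.add(core)
            physical.append(cpu)
    return physical

# Example: a 4-core/8-thread AMD-style layout (siblings numbered after
# the physical cores).
sample = """# CPU,Core,Node
0,0,0
1,1,0
2,2,0
3,3,0
4,0,0
5,1,0
6,2,0
7,3,0"""
ids = physical_core_ids(sample)
print("--physcpubind=" + ",".join(str(i) for i in ids))
```

Paste the printed value into the `pinning` preset in your YAML.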