Version: 0.1.0
Date: 2026-05-25
Platform: Apple M4 Max, 16 cores (macOS 26.4) / Linux x86_64
Compiler: Apple clang 17.0.0 / GCC 15, C++20, Release mode (-O3 via Meson buildtype=release)
SIMD: NEON enabled (macOS), AVX disabled (Linux — no compile_commands.json flags)
Library: Google Benchmark 1.9.5
Note: These benchmark settings are intentionally tuned for local performance measurement. They are not the recommended defaults for packaging or portable distribution builds.
Audit update (2026-05-25): Added M-parameter sweep and ef_construction sweep benchmarks to validate default HNSW parameters against research findings. See AUDIT-2026-05-25.md and benchmarks/hnsw_m_sweep_benchmark.cpp.
The Jan 2026 baseline ran on Apple M3 Max with DotProd enabled. This run uses Apple M4 Max without DotProd. Latency comparisons reflect both hardware and code changes; recall comparisons are hardware-independent.
Previous benchmark runs are archived in benchmarks/archive/.
This benchmark is implemented by benchmarks/hnsw_engine_comparison_benchmark.cpp and built as
hnsw_engine_comparison_benchmark.
Current status for this report: YAMS-only run (zvec was not linked in this run).
- Index build time
- Search latency and QPS at ef_search values 50, 100, 200
- Recall@K against brute-force ground truth
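As an illustration of the last metric, the sketch below computes recall@K for one query by comparing approximate result ids against the exact brute-force top-K. The function name `recall_at_k` and the id type are assumptions made for this example; the benchmark's own implementation may differ. It assumes the ids within each list are unique.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fraction of the exact top-k ids that also appear in the approximate top-k.
// `truth` and `found` hold result ids for a single query; only the first k
// entries of each are considered. (Hypothetical helper, not the benchmark's API.)
double recall_at_k(const std::vector<std::uint32_t>& truth,
                   const std::vector<std::uint32_t>& found,
                   std::size_t k) {
    const std::size_t kt = std::min(k, truth.size());
    const std::size_t kf = std::min(k, found.size());
    if (kt == 0) return 0.0;

    const auto truth_end = truth.begin() + static_cast<std::ptrdiff_t>(kt);
    std::size_t hits = 0;
    for (std::size_t i = 0; i < kf; ++i) {
        // Linear scan is fine for small k such as 10.
        if (std::find(truth.begin(), truth_end, found[i]) != truth_end) ++hits;
    }
    return static_cast<double>(hits) / static_cast<double>(kt);
}
```

Averaging this value over all queries gives a corpus-level Recall@K figure of the kind reported in the tables.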
From third_party/sqlite-vec-cpp/:
meson setup builddir
meson compile -C builddir
# YAMS baseline (default)
./builddir/benchmarks/hnsw_engine_comparison_benchmark --corpus=10000 --dim=768
Optional zvec-enabled comparison:
# build zvec separately
git clone https://github.com/alibaba/zvec.git /opt/zvec
cd /opt/zvec && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
# reconfigure sqlite-vec-cpp with zvec headers
cd /path/to/yams/third_party/sqlite-vec-cpp
meson setup builddir -Dzvec-root=/opt/zvec --reconfigure
meson compile -C builddir
./builddir/benchmarks/hnsw_engine_comparison_benchmark --corpus=10000 --dim=768
Run: ./hnsw_engine_comparison_benchmark --corpus=10000 --dim=768
| Engine | M | ef_search | Build (ms) | Latency (us) | QPS | Recall@10 |
|---|---|---|---|---|---|---|
| yams-hnsw | 16 | 50 | 38,863 | 502 | 1,991 | 54.1% |
| yams-hnsw | 16 | 100 | 38,863 | 843 | 1,186 | 74.7% |
| yams-hnsw | 16 | 200 | 38,863 | 1,308 | 765 | 92.9% |
| yams-hnsw | 24 | 50 | 83,918 | 721 | 1,386 | 68.7% |
| yams-hnsw | 24 | 100 | 83,918 | 1,116 | 896 | 86.9% |
| yams-hnsw | 24 | 200 | 83,918 | 1,643 | 609 | 98.4% |
| yams-hnsw | 32 | 50 | 144,767 | 949 | 1,054 | 77.4% |
| yams-hnsw | 32 | 100 | 144,767 | 1,347 | 743 | 93.6% |
| yams-hnsw | 32 | 200 | 144,767 | 1,873 | 534 | 99.5% |
Read-only mode (yams-hnsw-ro) produces equivalent latency and recall.
Parallel build (M=24, ef_c=200): 6,785 ms (11.2x speedup over sequential, 16 threads).
Hardware note: Apple M4 Max, NEON (no DotProd), FP32, single-threaded query loop.
The Jan baseline ran on Apple M3 Max with DotProd enabled. This run uses Apple M4 Max without DotProd. Latency deltas reflect both hardware differences and code optimizations (prefetch hints, zero-copy neighbor traversal, flat dense_id lookup, visited pool bitmap). Recall does not depend on the hardware; it is unchanged or within 0.5 percentage points in every row below, which indicates no algorithmic regression.
| M | ef_search | Jan Latency (M3) | Apr Latency (M4) | Delta | Recall |
|---|---|---|---|---|---|
| 16 | 50 | 693 us | 502 us | -27.6% | 54.1% (unchanged) |
| 16 | 100 | 1,129 us | 843 us | -25.3% | 74.7% (unchanged) |
| 16 | 200 | 1,819 us | 1,308 us | -28.1% | 92.9% (unchanged) |
| 24 | 50 | 1,028 us | 721 us | -29.9% | 68.2% -> 68.7% |
| 24 | 100 | 1,529 us | 1,116 us | -27.0% | 86.9% (unchanged) |
| 24 | 200 | 2,219 us | 1,643 us | -26.0% | 98.5% -> 98.4% |
| 32 | 100 | 1,811 us | 1,347 us | -25.6% | 93.1% -> 93.6% |
| 32 | 200 | 2,405 us | 1,873 us | -22.1% | 99.5% (unchanged) |
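The Delta column is the relative change in latency between the two runs. The sketch below shows that arithmetic, along with the reciprocal relationship that the QPS and Latency columns of the results table appear to follow for a single-threaded query loop (to within rounding). Both function names are assumptions made for this example.

```cpp
// Relative change from `before_us` to `after_us`, in percent.
// A negative value means the newer run is faster.
double percent_delta(double before_us, double after_us) {
    return (after_us - before_us) / before_us * 100.0;
}

// Throughput implied by a mean per-query latency in microseconds,
// assuming queries are issued one at a time on a single thread.
double qps_from_latency_us(double latency_us) {
    return 1.0e6 / latency_us;
}
```

For the M=16, ef_search=50 row, `percent_delta(693.0, 502.0)` gives about -27.6, and `qps_from_latency_us(502.0)` gives about 1,992, close to the reported 1,991.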
From zvec published benchmarks (https://zvec.org/en/docs/benchmarks/):
| Dataset | Config | QPS | Recall |
|---|---|---|---|
| Cohere 1M (768d) | INT8, M=15, ef_search=180 | ~16,000 | 95%+ |
| Cohere 10M (768d) | INT8, M=50, ef_search=118, refiner | ~8,000 | 95%+ |
Key differences vs this run:
- zvec numbers use INT8 + refiner and multithreaded query load
- this run uses FP32 and single-threaded query loop
- hardware differs (cloud ECS vs local Apple Silicon)
This benchmark is implemented by benchmarks/quantized_search_benchmark.cpp and built as
quantized_search_benchmark. It measures two-stage quantized search: approximate distance
computation using quantized codes for HNSW traversal, followed by exact FP32 reranking.
- Build time for quantization codes
- Search latency and QPS at various ef_search values
- Recall@K against brute-force ground truth
- Memory usage of quantized stores vs FP32 vectors
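The two-stage idea can be sketched as follows. This is a simplified illustration, not the benchmark's code: it scans all codes by brute force instead of traversing an HNSW graph, and it uses a plain per-vector int8 scalar quantizer as a stand-in for LVQ or RaBitQ. All type and function names (`Code`, `quantize`, `two_stage_search`) are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

// Stand-in quantized code: int8 values plus one scale per vector.
struct Code {
    std::vector<std::int8_t> q;
    float scale;  // dequantized value = q[i] * scale
};

Code quantize(const Vec& v) {
    float max_abs = 0.0f;
    for (float x : v) max_abs = std::max(max_abs, std::fabs(x));
    const float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
    Code c{{}, scale};
    c.q.reserve(v.size());
    for (float x : v) c.q.push_back(static_cast<std::int8_t>(std::lround(x / scale)));
    return c;
}

// Exact squared L2 distance between two FP32 vectors.
float exact_l2(const Vec& a, const Vec& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// Approximate squared L2 between an FP32 query and a quantized vector.
float approx_l2(const Vec& query, const Code& c) {
    float s = 0.0f;
    for (std::size_t i = 0; i < query.size(); ++i) {
        const float d = query[i] - static_cast<float>(c.q[i]) * c.scale;
        s += d * d;
    }
    return s;
}

// Stage 1: rank codes by approximate distance, keep rerank_factor * k candidates.
// Stage 2: rescore those candidates with exact FP32 distance, return the best k ids.
std::vector<std::size_t> two_stage_search(const std::vector<Vec>& corpus,
                                          const std::vector<Code>& codes,
                                          const Vec& query, std::size_t k,
                                          std::size_t rerank_factor) {
    std::vector<std::pair<float, std::size_t>> cand;
    cand.reserve(codes.size());
    for (std::size_t i = 0; i < codes.size(); ++i)
        cand.emplace_back(approx_l2(query, codes[i]), i);

    const std::size_t keep = std::min(cand.size(), k * rerank_factor);
    std::partial_sort(cand.begin(), cand.begin() + static_cast<std::ptrdiff_t>(keep), cand.end());
    cand.resize(keep);

    for (auto& [dist, id] : cand) dist = exact_l2(query, corpus[id]);
    std::sort(cand.begin(), cand.end());

    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < std::min(k, cand.size()); ++i) out.push_back(cand[i].second);
    return out;
}
```

The `rerank_factor` parameter plays the role of the "2x", "3x", and "5x" rerank settings in the result tables: a larger factor rescans more candidates exactly, which tends to raise recall at the cost of latency.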
./builddir/benchmarks/quantized_search_benchmark --corpus 5000 --dim 384 --queries 100
Corpus: 5000 vectors, 384d, 100 queries, k=10
| Method | Build (ms) | Latency (us) | QPS | Recall@10 | Quant Memory |
|---|---|---|---|---|---|
| FP32 baseline | 0.8 | 139.0 | 7,192 | 91.0% | 0 B |
| LVQ-8 (2x rerank) | 1.9 | 246.8 | 4,052 | 98.1% | 1,960,000 B |
| LVQ-8 (3x rerank) | 1.9 | 314.1 | 3,184 | 99.5% | 1,960,000 B |
| LVQ-4 (3x rerank) | 3.1 | 305.2 | 3,276 | 99.4% | 1,000,000 B |
| RaBitQ (3x rerank) | 9.3 | 171.4 | 5,834 | 75.5% | 261,536 B |
| RaBitQ (5x rerank) | 9.4 | 239.9 | 4,169 | 86.3% | 261,536 B |
| Method | Build (ms) | Latency (us) | QPS | Recall@10 | Quant Memory |
|---|---|---|---|---|---|
| FP32 baseline | 0.5 | 240.7 | 4,155 | 98.2% | 0 B |
| LVQ-8 (2x rerank) | 1.7 | 361.2 | 2,769 | 99.9% | 1,960,000 B |
| LVQ-8 (3x rerank) | 1.7 | 446.4 | 2,240 | 100.0% | 1,960,000 B |
| LVQ-4 (3x rerank) | 3.3 | 430.9 | 2,321 | 100.0% | 1,000,000 B |
| RaBitQ (3x rerank) | 9.4 | 249.6 | 4,006 | 88.5% | 261,536 B |
| RaBitQ (5x rerank) | 9.6 | 393.8 | 2,539 | 93.7% | 261,536 B |
| Method | Build (ms) | Latency (us) | QPS | Recall@10 | Quant Memory |
|---|---|---|---|---|---|
| FP32 baseline | 0.5 | 321.4 | 3,111 | 99.9% | 0 B |
| LVQ-8 (2x rerank) | 1.8 | 530.6 | 1,885 | 100.0% | 1,960,000 B |
| LVQ-8 (3x rerank) | 1.8 | 712.3 | 1,404 | 100.0% | 1,960,000 B |
| LVQ-4 (3x rerank) | 3.1 | 749.5 | 1,334 | 100.0% | 1,000,000 B |
| RaBitQ (3x rerank) | 9.5 | 526.2 | 1,900 | 94.8% | 261,536 B |
| RaBitQ (5x rerank) | 9.4 | 758.7 | 1,318 | 98.4% | 261,536 B |
FP32 vector memory: 7,680,000 bytes (7.68 MB, or 7.3 MiB).
| Method | Memory | Compression vs FP32 |
|---|---|---|
| LVQ-8 | 1.96 MB | 3.9x |
| LVQ-4 | 1.00 MB | 7.7x |
| RaBitQ | 0.26 MB | 29.4x |
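The compression figures follow directly from the byte counts reported in the result tables. A minimal sketch of the arithmetic, with hypothetical helper names:

```cpp
#include <cstddef>

// Bytes needed to store `count` FP32 vectors of dimension `dim`.
constexpr std::size_t fp32_bytes(std::size_t count, std::size_t dim) {
    return count * dim * sizeof(float);
}

// Compression ratio of a quantized store relative to the FP32 vectors.
constexpr double compression_ratio(std::size_t fp32, std::size_t quantized) {
    return static_cast<double>(fp32) / static_cast<double>(quantized);
}

// 5000 vectors x 384 dimensions x 4 bytes, matching the figure above.
static_assert(fp32_bytes(5000, 384) == 7'680'000);
```

With the reported store sizes, `compression_ratio(7'680'000, 1'960'000)` is about 3.9, `compression_ratio(7'680'000, 1'000'000)` is about 7.7, and `compression_ratio(7'680'000, 261'536)` is about 29.4.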
- LVQ-8 (2x rerank) at ef_search=50: 98.1% recall at 247 us (7.1 percentage points more recall than the FP32 baseline, at 1.8x the latency)
- LVQ-4 (3x rerank): roughly matches LVQ-8 (3x rerank) latency (305 us vs 314 us at ef_search=50) with 7.7x compression (NEON-optimized nibble unpacking)
- RaBitQ: lowest memory (29.4x compression) but lower recall; best for memory-constrained deployments
- At ef_search=100, both LVQ-8 and LVQ-4 achieve 100% recall@10 with 3x rerank
| Scenario | Time | Throughput |
|---|---|---|
| 100x384d (Sequential) | 2.222 us | 45.01 M/s |
| 100x384d (Batch) | 2.201 us | 45.42 M/s |
| 1Kx384d (Sequential) | 24.04 us | 41.60 M/s |
| 1Kx384d (Batch) | 24.03 us | 41.61 M/s |
| Layout | Time | Throughput |
|---|---|---|
| Contiguous (1Kx384d) | 21.32 us | 46.90 M/s |
- Latency: 25.37 us
- Throughput: 39.41 M/s
| Type | Time | Throughput |
|---|---|---|
| int8 | 16.24 us | 61.57 M/s |
- Latency: 96.63 us
- Throughput: 10.35 M/s
| Corpus | Latency | Throughput |
|---|---|---|
| 1K | 26.6 us | 37.64 M/s |
| 10K | 238 us | 41.95 M/s |
| 100K | 5.68 ms | 17.60 M/s |
| K | Latency | Throughput |
|---|---|---|
| 1 | 376 us | 26.61 M/s |
| 5 | 238 us | 41.95 M/s |
| 10 | 313 us | 31.91 M/s |
| 50 | 240 us | 41.61 M/s |
| Dimensions | Latency | Throughput | Scaling Factor |
|---|---|---|---|
| 384d | 238 us | 41.95 M/s | 1.00x |
| 768d | 810 us | 12.35 M/s | 3.40x |
| 1536d | 1088 us | 9.19 M/s | 4.57x |
| Type | Latency | Throughput |
|---|---|---|
| float | 238 us | 41.95 M/s |
| int8 | 159 us | 62.81 M/s |
- Total time: 2.32 ms
- Throughput: 43.15 M/s
| Scenario | Time | Throughput |
|---|---|---|
| No filter | 10.75 ms | 9.30 k/s |
| Bitset filter 10% | 54.72 ms | 1.83 k/s |
| Bitset filter 50% | 19.96 ms | 5.01 k/s |
| Bitset filter 90% | 11.35 ms | 8.81 k/s |
| Set filter 10% | 66.86 ms | 1.50 k/s |
| Set filter 50% | 24.12 ms | 4.15 k/s |
| Set filter 90% | 11.29 ms | 8.86 k/s |
Full HNSW benchmark run was stopped due to long runtime. Partial results are logged in
benchmarks/logs/2026-04-12_post-quantization/hnsw_benchmark.log. We will update this section after
optimizing the long-running benchmark and re-running.
Release build with NEON:
meson setup builddir-release -Dbuildtype=release -Denable_benchmarks=true -Denable_simd_neon=true
meson compile -C builddir-release
./builddir-release/benchmarks/batch_distance_benchmark
./builddir-release/benchmarks/rag_pipeline_benchmark
./builddir-release/benchmarks/filtered_search_benchmark
./builddir-release/benchmarks/quantized_search_benchmark --corpus 5000 --dim 384 --queries 100
Logs are stored under benchmarks/logs/2026-04-12_post-quantization/.
For the HNSW engine comparison benchmark:
./builddir-release/benchmarks/hnsw_engine_comparison_benchmark --corpus=10000 --dim=768
These benchmarks validate default HNSW parameters against research findings
(see AUDIT-2026-05-25.md). They are standalone executables with no external
dependencies.
Benchmark: hnsw_m_sweep_benchmark
Purpose: Systematically test M values from 8 to 32 to find the recall/cost Pareto frontier.
Identifies the knee where diminishing returns begin (expected around M=20-24 for 768d embeddings).
meson setup build-sweep -Dbuildtype=release -Denable_benchmarks=true
meson compile -C build-sweep hnsw_m_sweep_benchmark
# Default: 768d, 10K corpus, 100 queries, M=8..32
./build-sweep/benchmarks/hnsw_m_sweep_benchmark
# Custom: 384d, 5K corpus, explicit ef_construction
./build-sweep/benchmarks/hnsw_m_sweep_benchmark --corpus=5000 --dim=384 --ef-construction=200
# Test across different dimensions
for dim in 128 384 768 1536; do
./build-sweep/benchmarks/hnsw_m_sweep_benchmark --corpus=10000 --dim=$dim --queries=100
done
Output format: CSV with columns M,ef_search,Build(ms),Latency(us),QPS,Recall@10,TotalEdges,EdgesPerNode.
Suitable for plotting recall-vs-build-time Pareto frontier.
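For readers who want to extract the frontier rather than eyeball a plot, the sketch below keeps only the sweep points that are not dominated, meaning no other point has both a lower-or-equal build time and a higher recall. The `SweepPoint` struct and `pareto_frontier` function are assumptions made for this example; CSV parsing is left out.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// One sweep row reduced to the two axes being traded off (hypothetical type).
struct SweepPoint {
    int m;
    double build_ms;
    double recall;  // any consistent unit, e.g. percent
};

// Returns the non-dominated points, ordered by increasing build time.
std::vector<SweepPoint> pareto_frontier(std::vector<SweepPoint> pts) {
    // Sort by build time; on ties, put the higher-recall point first.
    std::sort(pts.begin(), pts.end(), [](const SweepPoint& a, const SweepPoint& b) {
        if (a.build_ms != b.build_ms) return a.build_ms < b.build_ms;
        return a.recall > b.recall;
    });

    std::vector<SweepPoint> front;
    double best_recall = -std::numeric_limits<double>::infinity();
    for (const auto& p : pts) {
        // A point joins the frontier only if it improves on every cheaper point.
        if (p.recall > best_recall) {
            front.push_back(p);
            best_recall = p.recall;
        }
    }
    return front;
}
```

Applied to the ef_search=200 rows of the engine comparison table (M=16, 24, 32), all three points lie on the frontier, since each increase in M costs more build time and yields higher recall. The knee is then the point after which the recall gain per added build time drops sharply.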
Benchmark: hnsw_ef_sweep_benchmark
Purpose: Validate that ef_construction=200 is the correct minimum for for_corpus()
(was 100 before audit). Tests ef_construction from 50 to 500.
meson compile -C build-sweep hnsw_ef_sweep_benchmark
# Default: M=16, 768d, 10K corpus, ef_construction=50..500
./build-sweep/benchmarks/hnsw_ef_sweep_benchmark
# Test with higher M to see if ef_construction benefit compounds
./build-sweep/benchmarks/hnsw_ef_sweep_benchmark --M=24 --corpus=10000 --dim=768
Output format: Table with columns ef_construction, ef_search, Build(ms), Latency(us), QPS, Recall@10.
Based on the April 2026 engine comparison benchmark:
- M=16 vs M=24: At 768d/10K, M=16 at ef_search=200 achieves 92.9% recall@10. M=24 achieves 98.4% (+5.5pp) at 2.2x build cost but only 1.26x query latency. M=24 is the likely sweet spot for high-dimensional embeddings.
- ef_construction=100 vs 200: The for_corpus() factory previously used 100 for <100K corpora. The ef_construction sweep will quantify the recall gap and confirm whether 200 should be the floor.
- Research alignment: Elliott & Clark (2024) found that real embedding vectors benefit from higher connectivity than SIFT1M-calibrated defaults. The M sweep directly measures this effect.
# Build all benchmarks
meson setup build-audit -Dbuildtype=release -Denable_benchmarks=true
meson compile -C build-audit
# M sweep (core finding)
./build-audit/benchmarks/hnsw_m_sweep_benchmark --corpus=10000 --dim=768 --queries=200
# ef_construction sweep (threshold validation)
./build-audit/benchmarks/hnsw_ef_sweep_benchmark --corpus=10000 --dim=768 --M=16
./build-audit/benchmarks/hnsw_ef_sweep_benchmark --corpus=10000 --dim=768 --M=24
# Engine comparison (existing, with zvec if available)
./build-audit/benchmarks/hnsw_engine_comparison_benchmark --corpus=10000 --dim=768