SQLite-Vec C++ Benchmark Results

Version: 0.1.0 Date: 2026-05-25 Platform: Apple M4 Max, 16 cores (macOS 26.4) / Linux x86_64 Compiler: Apple clang 17.0.0 / GCC 15, C++20, Release mode (-O3 via Meson buildtype=release) SIMD: NEON enabled (macOS), AVX disabled (Linux — no compile_commands.json flags) Library: Google Benchmark 1.9.5

Note: These benchmark settings are intentionally tuned for local performance measurement. They are not the recommended defaults for packaging or portable distribution builds.

Audit update (2026-05-25): Added M-parameter sweep and ef_construction sweep benchmarks to validate default HNSW parameters against research findings. See AUDIT-2026-05-25.md and benchmarks/hnsw_m_sweep_benchmark.cpp. They are not the recommended defaults for packaging or portable distribution builds.

The Jan 2026 baseline ran on Apple M3 Max with DotProd enabled. This run uses Apple M4 Max without DotProd. Latency comparisons reflect both hardware and code changes; recall comparisons are hardware-independent.

HNSW Engine Comparison Benchmark (YAMS baseline + optional zvec)

This benchmark is implemented by benchmarks/hnsw_engine_comparison_benchmark.cpp and built as hnsw_engine_comparison_benchmark.

Current status for this report: YAMS-only run (zvec was not linked in this run).

What it measures

Index build time
Search latency and QPS at ef_search values 50, 100, 200
Recall@K against brute-force ground truth

Build and run

From third_party/sqlite-vec-cpp/:

meson setup builddir
meson compile -C builddir

# YAMS baseline (default)
./builddir/benchmarks/hnsw_engine_comparison_benchmark --corpus=10000 --dim=768

Optional zvec-enabled comparison:

# build zvec separately
git clone https://github.com/alibaba/zvec.git /opt/zvec
cd /opt/zvec && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# reconfigure sqlite-vec-cpp with zvec headers
cd /path/to/yams/third_party/sqlite-vec-cpp
meson setup builddir -Dzvec-root=/opt/zvec --reconfigure
meson compile -C builddir
./builddir/benchmarks/hnsw_engine_comparison_benchmark --corpus=10000 --dim=768

Current run results (2026-04-12, Apple M4 Max)

Run: ./hnsw_engine_comparison_benchmark --corpus=10000 --dim=768

Engine	M	ef_search	Build (ms)	Latency (us)	QPS	Recall@10
yams-hnsw	16	50	38,863	502	1,991	54.1%
yams-hnsw	16	100	38,863	843	1,186	74.7%
yams-hnsw	16	200	38,863	1,308	765	92.9%
yams-hnsw	24	50	83,918	721	1,386	68.7%
yams-hnsw	24	100	83,918	1,116	896	86.9%
yams-hnsw	24	200	83,918	1,643	609	98.4%
yams-hnsw	32	50	144,767	949	1,054	77.4%
yams-hnsw	32	100	144,767	1,347	743	93.6%
yams-hnsw	32	200	144,767	1,873	534	99.5%

Read-only mode (yams-hnsw-ro) produces equivalent latency and recall.

Parallel build (M=24, ef_c=200): 6,785 ms (11.2x speedup over sequential, 16 threads).

Hardware note: Apple M4 Max, NEON (no DotProd), FP32, single-threaded query loop.

Comparison vs Jan 2026 baseline (M3 Max)

The Jan baseline ran on Apple M3 Max with DotProd enabled. This run uses Apple M4 Max without DotProd. Latency deltas reflect both hardware differences and code optimizations (prefetch hints, zero-copy neighbor traversal, flat dense_id lookup, visited pool bitmap). Recall is deterministic and hardware-independent — identical recall confirms no algorithmic regressions.

M	ef_search	Jan Latency (M3)	Apr Latency (M4)	Delta	Recall
16	50	693 us	502 us	-27.6%	54.1% (unchanged)
16	100	1,129 us	843 us	-25.3%	74.7% (unchanged)
16	200	1,819 us	1,308 us	-28.1%	92.9% (unchanged)
24	50	1,028 us	721 us	-29.9%	68.2% -> 68.7%
24	100	1,529 us	1,116 us	-27.0%	86.9% (unchanged)
24	200	2,219 us	1,643 us	-26.0%	98.5% -> 98.4%
32	100	1,811 us	1,347 us	-25.6%	93.1% -> 93.6%
32	200	2,405 us	1,873 us	-22.1%	99.5% (unchanged)

External zvec reference numbers (not measured in this run)

From zvec published benchmarks (https://zvec.org/en/docs/benchmarks/):

Dataset	Config	QPS	Recall
Cohere 1M (768d)	INT8, M=15, ef_search=180	~16,000	95%+
Cohere 10M (768d)	INT8, M=50, ef_search=118, refiner	~8,000	95%+

Key differences vs this run:

zvec numbers use INT8 + refiner and multithreaded query load
this run uses FP32 and single-threaded query loop
hardware differs (cloud ECS vs local Apple Silicon)

Quantized HNSW Search Benchmark (NEW)

This benchmark is implemented by benchmarks/quantized_search_benchmark.cpp and built as quantized_search_benchmark. It measures two-stage quantized search: approximate distance computation using quantized codes for HNSW traversal, followed by exact FP32 reranking.

What it measures

Build time for quantization codes
Search latency and QPS at various ef_search values
Recall@K against brute-force ground truth
Memory usage of quantized stores vs FP32 vectors

Build and run

./builddir/benchmarks/quantized_search_benchmark --corpus 5000 --dim 384 --queries 100

Current run results (2026-04-12, Apple M4 Max)

Corpus: 5000 vectors, 384d, 100 queries, k=10

ef_search = 50

Method	Build (ms)	Latency (us)	QPS	Recall@10	Quant Memory
FP32 baseline	0.8	139.0	7,192	91.0%	0 B
LVQ-8 (2x rerank)	1.9	246.8	4,052	98.1%	1,960,000 B
LVQ-8 (3x rerank)	1.9	314.1	3,184	99.5%	1,960,000 B
LVQ-4 (3x rerank)	3.1	305.2	3,276	99.4%	1,000,000 B
RaBitQ (3x rerank)	9.3	171.4	5,834	75.5%	261,536 B
RaBitQ (5x rerank)	9.4	239.9	4,169	86.3%	261,536 B

ef_search = 100

Method	Build (ms)	Latency (us)	QPS	Recall@10	Quant Memory
FP32 baseline	0.5	240.7	4,155	98.2%	0 B
LVQ-8 (2x rerank)	1.7	361.2	2,769	99.9%	1,960,000 B
LVQ-8 (3x rerank)	1.7	446.4	2,240	100.0%	1,960,000 B
LVQ-4 (3x rerank)	3.3	430.9	2,321	100.0%	1,000,000 B
RaBitQ (3x rerank)	9.4	249.6	4,006	88.5%	261,536 B
RaBitQ (5x rerank)	9.6	393.8	2,539	93.7%	261,536 B

ef_search = 200

Method	Build (ms)	Latency (us)	QPS	Recall@10	Quant Memory
FP32 baseline	0.5	321.4	3,111	99.9%	0 B
LVQ-8 (2x rerank)	1.8	530.6	1,885	100.0%	1,960,000 B
LVQ-8 (3x rerank)	1.8	712.3	1,404	100.0%	1,960,000 B
LVQ-4 (3x rerank)	3.1	749.5	1,334	100.0%	1,000,000 B
RaBitQ (3x rerank)	9.5	526.2	1,900	94.8%	261,536 B
RaBitQ (5x rerank)	9.4	758.7	1,318	98.4%	261,536 B

FP32 vector memory: 7,680,000 bytes (7.3 MB).

Compression ratios

Method	Memory	Compression vs FP32
LVQ-8	1.96 MB	3.9x
LVQ-4	1.00 MB	7.7x
RaBitQ	0.26 MB	29.4x

Key findings

LVQ-8 (2x rerank) at ef_search=50: 98.1% recall at 247 us (7.1% more recall than FP32 baseline at 1.8x latency cost)
LVQ-4 (3x rerank): matches LVQ-8 latency (305 us) with 7.7x compression (NEON-optimized nibble unpacking)
RaBitQ: lowest memory (29.4x compression) but lower recall; best for memory-constrained deployments
At ef_search=100, both LVQ-8 and LVQ-4 achieve 100% recall@10

Batch Distance Benchmark

1. Sequential vs Batch Comparison

Scenario	Time	Throughput
100x384d (Sequential)	2.222 us	45.01 M/s
100x384d (Batch)	2.201 us	45.42 M/s
1Kx384d (Sequential)	24.04 us	41.60 M/s
1Kx384d (Batch)	24.03 us	41.61 M/s

2. Memory Layout Optimization

Layout	Time	Throughput
Contiguous (1Kx384d)	21.32 us	46.90 M/s

3. Top-K Performance (1Kx384d, K=10)

Latency: 25.37 us
Throughput: 39.41 M/s

4. Quantization (1Kx384d)

Type	Time	Throughput
int8	16.24 us	61.57 M/s

5. Large Embeddings (1Kx1536d)

Latency: 96.63 us
Throughput: 10.35 M/s

RAG Pipeline Benchmark

1. Corpus Size Scaling (384d, K=5)

Corpus	Latency	Throughput
1K	26.6 us	37.64 M/s
10K	238 us	41.95 M/s
100K	5.68 ms	17.60 M/s

2. K-Value Scaling (10K docs, 384d)

K	Latency	Throughput
1	376 us	26.61 M/s
5	238 us	41.95 M/s
10	313 us	31.91 M/s
50	240 us	41.61 M/s

3. Embedding Dimension Scaling (10K docs, K=5)

Dimensions	Latency	Throughput	Scaling Factor
384d	238 us	41.95 M/s	1.00x
768d	810 us	12.35 M/s	3.40x
1536d	1088 us	9.19 M/s	4.57x

4. Quantization (10K docs, 384d, K=5)

Type	Latency	Throughput
float	238 us	41.95 M/s
int8	159 us	62.81 M/s

5. Multi-Query Throughput (10K docs, 384d, 10 queries)

Total time: 2.32 ms
Throughput: 43.15 M/s

Filtered Search Benchmark (HNSW, 10K corpus)

Scenario	Time	Throughput
No filter	10.75 ms	9.30 k/s
Bitset filter 10%	54.72 ms	1.83 k/s
Bitset filter 50%	19.96 ms	5.01 k/s
Bitset filter 90%	11.35 ms	8.81 k/s
Set filter 10%	66.86 ms	1.50 k/s
Set filter 50%	24.12 ms	4.15 k/s
Set filter 90%	11.29 ms	8.86 k/s

HNSW Index Performance

Full HNSW benchmark run was stopped due to long runtime. Partial results are logged in benchmarks/logs/2026-04-12_post-quantization/hnsw_benchmark.log. We will update this section after optimizing the long-running benchmark and re-running.

Reproducibility

Release build with NEON:

meson setup builddir-release -Dbuildtype=release -Denable_benchmarks=true -Denable_simd_neon=true
meson compile -C builddir-release
./builddir-release/benchmarks/batch_distance_benchmark
./builddir-release/benchmarks/rag_pipeline_benchmark
./builddir-release/benchmarks/filtered_search_benchmark
./builddir-release/benchmarks/quantized_search_benchmark --corpus 5000 --dim 384 --queries 100

Logs are stored under benchmarks/logs/2026-04-12_post-quantization/.

For the HNSW engine comparison benchmark:

./builddir-release/benchmarks/hnsw_engine_comparison_benchmark --corpus=10000 --dim=768

Parameter Sweep Benchmarks (NEW — 2026-05-25)

These benchmarks validate default HNSW parameters against research findings (see AUDIT-2026-05-25.md). They are standalone executables with no external dependencies.

HNSW M-Parameter Sweep

Benchmark: hnsw_m_sweep_benchmark Purpose: Systematically test M values from 8 to 32 to find the recall/cost Pareto frontier. Identifies the knee where diminishing returns begin (expected around M=20-24 for 768d embeddings).

meson setup build-sweep -Dbuildtype=release -Denable_benchmarks=true
meson compile -C build-sweep hnsw_m_sweep_benchmark

# Default: 768d, 10K corpus, 100 queries, M=8..32
./build-sweep/benchmarks/hnsw_m_sweep_benchmark

# Custom: 384d, 5K corpus, explicit ef_construction
./build-sweep/benchmarks/hnsw_m_sweep_benchmark --corpus=5000 --dim=384 --ef-construction=200

# Test across different dimensions
for dim in 128 384 768 1536; do
  ./build-sweep/benchmarks/hnsw_m_sweep_benchmark --corpus=10000 --dim=$dim --queries=100
done

Output format: CSV with columns M,ef_search,Build(ms),Latency(us),QPS,Recall@10,TotalEdges,EdgesPerNode. Suitable for plotting recall-vs-build-time Pareto frontier.

HNSW ef_construction Sweep

Benchmark: hnsw_ef_sweep_benchmark Purpose: Validate that ef_construction=200 is the correct minimum for for_corpus() (was 100 before audit). Tests ef_construction from 50 to 500.

meson compile -C build-sweep hnsw_ef_sweep_benchmark

# Default: M=16, 768d, 10K corpus, ef_construction=50..500
./build-sweep/benchmarks/hnsw_ef_sweep_benchmark

# Test with higher M to see if ef_construction benefit compounds
./build-sweep/benchmarks/hnsw_ef_sweep_benchmark --M=24 --corpus=10000 --dim=768

Output format: Table with columns ef_construction, ef_search, Build(ms), Latency(us), QPS, Recall@10.

Expected Findings (from audit)

Based on the April 2026 engine comparison benchmark:

M=16 vs M=24: At 768d/10K, M=16 at ef_search=200 achieves 92.9% recall@10. M=24 achieves 98.4% (+5.5pp) at 2.2x build cost but only 1.26x query latency. M=24 is the likely sweet spot for high-dimensional embeddings.
ef_construction=100 vs 200: The for_corpus() factory previously used 100 for <100K corpora. The ef_construction sweep will quantify the recall gap and confirm whether 200 should be the floor.
Research alignment: Elliott & Clark (2024) found that real embedding vectors benefit from higher connectivity than SIFT1M-calibrated defaults. The M sweep directly measures this effect.

Running the Full Audit Suite

# Build all benchmarks
meson setup build-audit -Dbuildtype=release -Denable_benchmarks=true
meson compile -C build-audit

# M sweep (core finding)
./build-audit/benchmarks/hnsw_m_sweep_benchmark --corpus=10000 --dim=768 --queries=200

# ef_construction sweep (threshold validation)
./build-audit/benchmarks/hnsw_ef_sweep_benchmark --corpus=10000 --dim=768 --M=16
./build-audit/benchmarks/hnsw_ef_sweep_benchmark --corpus=10000 --dim=768 --M=24

# Engine comparison (existing, with zvec if available)
./build-audit/benchmarks/hnsw_engine_comparison_benchmark --corpus=10000 --dim=768

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SQLite-Vec C++ Benchmark Results

Archive

HNSW Engine Comparison Benchmark (YAMS baseline + optional zvec)

What it measures

Build and run

Current run results (2026-04-12, Apple M4 Max)

Comparison vs Jan 2026 baseline (M3 Max)

External zvec reference numbers (not measured in this run)

Quantized HNSW Search Benchmark (NEW)

What it measures

Build and run

Current run results (2026-04-12, Apple M4 Max)

ef_search = 50

ef_search = 100

ef_search = 200

Compression ratios

Key findings

Batch Distance Benchmark

1. Sequential vs Batch Comparison

2. Memory Layout Optimization

3. Top-K Performance (1Kx384d, K=10)

4. Quantization (1Kx384d)

5. Large Embeddings (1Kx1536d)

RAG Pipeline Benchmark

1. Corpus Size Scaling (384d, K=5)

2. K-Value Scaling (10K docs, 384d)

3. Embedding Dimension Scaling (10K docs, K=5)

4. Quantization (10K docs, 384d, K=5)

5. Multi-Query Throughput (10K docs, 384d, 10 queries)

Filtered Search Benchmark (HNSW, 10K corpus)

HNSW Index Performance

Reproducibility

Parameter Sweep Benchmarks (NEW — 2026-05-25)

HNSW M-Parameter Sweep

HNSW ef_construction Sweep

Expected Findings (from audit)

Running the Full Audit Suite

FilesExpand file tree

BENCHMARKS.md

Latest commit

History

BENCHMARKS.md

File metadata and controls

SQLite-Vec C++ Benchmark Results

Archive

HNSW Engine Comparison Benchmark (YAMS baseline + optional zvec)

What it measures

Build and run

Current run results (2026-04-12, Apple M4 Max)

Comparison vs Jan 2026 baseline (M3 Max)

External zvec reference numbers (not measured in this run)

Quantized HNSW Search Benchmark (NEW)

What it measures

Build and run

Current run results (2026-04-12, Apple M4 Max)

ef_search = 50

ef_search = 100

ef_search = 200

Compression ratios

Key findings

Batch Distance Benchmark

1. Sequential vs Batch Comparison

2. Memory Layout Optimization

3. Top-K Performance (1Kx384d, K=10)

4. Quantization (1Kx384d)

5. Large Embeddings (1Kx1536d)

RAG Pipeline Benchmark

1. Corpus Size Scaling (384d, K=5)

2. K-Value Scaling (10K docs, 384d)

3. Embedding Dimension Scaling (10K docs, K=5)

4. Quantization (10K docs, 384d, K=5)

5. Multi-Query Throughput (10K docs, 384d, 10 queries)

Filtered Search Benchmark (HNSW, 10K corpus)

HNSW Index Performance

Reproducibility

Parameter Sweep Benchmarks (NEW — 2026-05-25)

HNSW M-Parameter Sweep

HNSW ef_construction Sweep

Expected Findings (from audit)

Running the Full Audit Suite