Python implementation of the TurboQuant and QJL vector quantization algorithms.
turboquant-py implements the TurboQuant and QJL vector quantization algorithms from Google Research (ICLR 2026 / AISTATS 2026). It compresses high-dimensional floating-point vectors to 1-4 bits per coordinate while preserving inner products and distances with provably near-optimal distortion. The library offers two quantization modes — MSE mode for reconstruction fidelity and inner-product mode for unbiased similarity search — plus a standalone 1-bit QJL quantizer, all built on a NumPy-first core with optional PyTorch acceleration.
```shell
pip install turboquant-py

# With PyTorch acceleration (optional — supports CUDA and Apple Silicon MPS)
pip install "turboquant-py[torch]"

# From source
pip install -e .
pip install -e ".[torch]"
```

Quantize and reconstruct (MSE mode):

```python
import numpy as np
from turboquant import TurboQuant

vectors = np.random.randn(1000, 384)  # 1000 vectors, dim=384

tq = TurboQuant(dim=384, bit_width=2, mode="mse", seed=42)
compressed = tq.quantize(vectors)
reconstructed = tq.dequantize(compressed)

mse = float(np.mean((vectors - reconstructed) ** 2))
print(f"Reconstruction MSE: {mse:.6f}")

# Save and reload
compressed.save("my_index")

from turboquant import CompressedVectors
reloaded = CompressedVectors.load("my_index")
```

Similarity search (inner-product mode):

```python
import numpy as np
from turboquant import TurboQuant

db = np.random.randn(10000, 768)
query = np.random.randn(768)

tq = TurboQuant(dim=768, bit_width=3, seed=42)  # inner_product mode is the default
compressed = tq.quantize(db)

# Estimate inner products against all compressed database vectors
scores = tq.inner_product(query, compressed)
top10 = np.argsort(scores)[::-1][:10]
print(f"Top-10 indices: {top10}")
```

Extreme 1-bit compression with QJL:

```python
import numpy as np
from turboquant import QJL

db = np.random.randn(10000, 1536)
query = np.random.randn(1536)

qjl = QJL(dim=1536, seed=42)
compressed = qjl.quantize(db)

scores = qjl.inner_product(query, compressed)
top10 = np.argsort(scores)[::-1][:10]
print(f"Top-10 indices: {top10}")
```

Multi-bit vector quantizer supporting MSE and inner-product modes.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dim` | `int` | required | Input vector dimensionality |
| `bit_width` | `int` | required | Bits per coordinate (1-4; inner-product mode requires 2+) |
| `mode` | `str` | `"inner_product"` | `"mse"` for reconstruction quality; `"inner_product"` for unbiased similarity search |
| `seed` | `int \| None` | `None` | Random seed for the rotation matrix and QJL projection |
| `outlier_channels` | `int` | `0` | Number of high-magnitude channels to quantize at higher precision |
| `outlier_bit_width` | `int \| None` | `None` | Bit-width for outlier channels |
Methods:

- `quantize(vectors)` — compress an `(n, dim)` array; returns `CompressedVectors`
- `dequantize(compressed)` — reconstruct approximate originals; returns `(n, dim)` array
- `inner_product(query, compressed)` — estimate inner products; returns `(n,)` scores
- `quantize_batched(vectors, batch_size, output_path, entropy_encode)` — stream-quantize large collections to disk
1-bit quantizer using the Quantized Johnson-Lindenstrauss transform. Applies a random Gaussian projection followed by sign-bit quantization to produce an unbiased inner-product estimator with no reconstruction capability.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dim` | `int` | required | Input vector dimensionality |
| `projection_dim` | `int \| None` | `dim` | Projection dimension (must be <= `dim`); higher values improve accuracy at the cost of storage |
| `seed` | `int \| None` | `None` | Random seed for the projection matrix |
Methods:

- `quantize(vectors)` — compress to 1-bit sign representation; returns `CompressedVectors`
- `inner_product(query, compressed)` — estimate inner products; returns `(n,)` scores
In-memory container for quantized vectors. Holds the bit-packed quantization indices, per-vector L2 norms, and any auxiliary arrays (QJL signs, residual norms for inner-product mode). Supports slicing with `[start:end]` and merging with `CompressedVectors.concatenate(parts)`.
Save/load:

```python
compressed.save("path/to/dir")                                 # bit-packed on disk
compressed.save("path/to/dir", entropy_encode=True)            # Huffman-encoded (~5% smaller at 4-bit)

loaded = CompressedVectors.load("path/to/dir")
loaded = CompressedVectors.load("path/to/dir", mmap_mode="r")  # memory-mapped
```

Key attributes: `indices`, `norms`, `dim`, `bit_width`, `num_vectors`, `metadata`, `extra_arrays`
On-disk vector store backed by memory-mapped arrays. Reconstructs the original quantizer from saved metadata (using the stored seed to regenerate rotation/projection matrices) and supports brute-force top-k search without loading all vectors into RAM.
```python
from turboquant import CompressedStore

store = CompressedStore.load("path/to/dir")
results = store.search(query, k=10)  # returns list[tuple[int, float]], sorted descending by score
```

Supports all modes: MSE, inner-product, QJL, and outlier configurations are all reconstructed automatically from the saved metadata.

Properties: `dim`, `num_vectors`, `bit_width`, `mode`, `vectors`
`compute_codebook` runs Lloyd-Max optimization on the Beta distribution that describes coordinates of randomly rotated unit vectors, returning `(centroids, boundaries)` arrays of sizes 2^bit_width and 2^bit_width + 1.

`get_codebook` is an `lru_cache`-wrapped convenience wrapper around `compute_codebook`. Precomputed codebooks are shipped for common dimensions (64-4096); unsupported dimensions are computed on the fly.
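The Lloyd-Max iteration behind these codebooks can be sketched from samples instead of the analytic Beta density: alternate between placing cell boundaries at centroid midpoints and moving each centroid to the conditional mean of its cell. This is an illustrative stand-in, not the library's implementation, and `lloyd_max` is a hypothetical helper name:

```python
import numpy as np

def lloyd_max(samples: np.ndarray, bit_width: int, iters: int = 50):
    """Estimate Lloyd-Max centroids and cell boundaries from samples."""
    k = 2 ** bit_width
    # Initialize centroids at evenly spaced quantiles of the data.
    centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Boundaries sit halfway between adjacent centroids.
        inner = (centroids[:-1] + centroids[1:]) / 2
        cells = np.searchsorted(inner, samples)
        # Each centroid moves to the mean of the samples in its cell.
        for i in range(k):
            in_cell = samples[cells == i]
            if in_cell.size:
                centroids[i] = in_cell.mean()
    boundaries = np.concatenate(([-np.inf], (centroids[:-1] + centroids[1:]) / 2, [np.inf]))
    return centroids, boundaries  # sizes 2^bit_width and 2^bit_width + 1

# Coordinates of random unit vectors, scaled by sqrt(dim): approximately the
# distribution the rotated coordinates follow for large dim.
rng = np.random.default_rng(0)
dim = 256
g = rng.standard_normal((20000, dim))
coords = (g / np.linalg.norm(g, axis=1, keepdims=True))[:, 0] * np.sqrt(dim)

centroids, boundaries = lloyd_max(coords, bit_width=2)
print(centroids)  # four centroids, roughly symmetric about zero
```

The centroids come out symmetric because the coordinate distribution is symmetric; the shipped codebooks exploit the same structure analytically.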
Compute the theoretical entropy and Huffman coding savings for a given configuration, as described in Section 3.1 of the TurboQuant paper.
```python
from turboquant import compute_theoretical_savings

savings = compute_theoretical_savings(dim=256, bit_width=4)
# {'entropy': 3.742, 'avg_bits_huffman': 3.779, 'savings_pct': 5.5}
```

On-disk format uses a directory containing:
| File | Contents |
|---|---|
| `meta.json` | Dimensions, bit-width, mode, seed, outlier config, encoding flags |
| `indices.npy` | Bit-packed quantization indices (or `indices.huffman` if entropy-encoded) |
| `norms.npy` | Per-vector L2 norms |
| `huffman_table.json` | Huffman coding table (only when entropy-encoded) |
| Additional `.npy` files | QJL sign vectors, residual norms, outlier indices (mode-dependent) |
Rotation and projection matrices are not stored — they are reconstructed deterministically from the saved seed, reducing storage overhead.
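Seed-based reconstruction can be mimicked in a few lines: seed a generator, draw a Gaussian matrix, and orthogonalize it with QR. This sketches the general technique, not necessarily the library's exact construction; `rotation_from_seed` is a hypothetical name:

```python
import numpy as np

def rotation_from_seed(dim: int, seed: int) -> np.ndarray:
    """Deterministic random orthogonal matrix via QR of a seeded Gaussian draw."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Fix column signs so the factorization (and thus the matrix) is unique.
    return q * np.sign(np.diag(r))

R1 = rotation_from_seed(64, seed=42)
R2 = rotation_from_seed(64, seed=42)
assert np.allclose(R1, R2)                             # same seed, same matrix
assert np.allclose(R1 @ R1.T, np.eye(64), atol=1e-10)  # orthogonal
```

Because the matrix is a pure function of the seed, only the seed needs to survive a save/load round trip.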
Indices can optionally be Huffman-encoded when saving, providing lossless compression. Savings are measured vs fixed-width encoding (not vs bit-packing):
| Bit-width | Shannon entropy | Huffman avg bits | Savings |
|---|---|---|---|
| 1 | 1.000 | 1.000 | 0.0% |
| 2 | 1.911 | 1.989 | 0.5% |
| 3 | 2.819 | 2.876 | 4.1% |
| 4 | 3.742 | 3.779 | 5.5% |
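The 2-bit row can be reproduced in miniature. The sketch below quantizes Gaussian samples with the standard 2-bit Lloyd-Max boundaries for N(0, 1) (a stand-in for the library's Beta codebook; the two are nearly identical at large dim), builds a Huffman code over the resulting index histogram with the usual heap construction, and compares average code length to the fixed 2-bit cost. All names here are illustrative, not library API:

```python
import heapq
import numpy as np

def huffman_lengths(probs):
    """Code lengths of a Huffman code for the given symbol probabilities."""
    heap = [(p, i, (i,)) for i, p in enumerate(probs)]  # (prob, tiebreak, symbols)
    heapq.heapify(heap)
    lengths = np.zeros(len(probs))
    tiebreak = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # every symbol under the merged node gets one bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths

# Index histogram from quantizing Gaussian samples with 2-bit Lloyd-Max boundaries.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
boundaries = np.array([-0.9816, 0.0, 0.9816])
idx = np.searchsorted(boundaries, x)
probs = np.bincount(idx, minlength=4) / x.size

entropy = -np.sum(probs * np.log2(probs))
avg_huffman = np.sum(probs * huffman_lengths(probs))
print(f"entropy={entropy:.3f}, huffman avg={avg_huffman:.3f}, fixed=2.000 bits")
```

The printed values land near the table's 2-bit row (entropy ~1.911, Huffman ~1.989 bits): the outer cells are rarer than the inner ones, so a variable-length code shaves a little off the fixed width.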
```python
compressed.save("my_index", entropy_encode=True)

# Loading is automatic — detects encoding from metadata
loaded = CompressedVectors.load("my_index")
```

Results on synthetic unit vectors (dim=768, n=1000):
| Bit-width | Compression ratio | Reconstruction MSE | Recall@1 (MSE mode) | Recall@10 (MSE mode) |
|---|---|---|---|---|
| 1 | 32x | 0.000473 | 0.19 | 0.65 |
| 2 | 16x | 0.000153 | 0.46 | 0.91 |
| 3 | 10.7x | 0.000045 | 0.67 | 0.99 |
| 4 | 8x | 0.000013 | 0.73 | 1.00 |
The results below use all-MiniLM-L6-v2 embeddings (dim=384, n=2000 database, n=200 queries, seed=42).
| Method | Bit-width | MSE | Recall@1 | Recall@10 |
|---|---|---|---|---|
| NaiveUniform | 2 | 0.001079 | 0.675 | 0.721 |
| NaiveUniform | 3 | 0.000195 | 0.830 | 0.850 |
| NaiveUniform | 4 | 0.000042 | 0.895 | 0.913 |
| TurboQuant-mse | 2 | 0.000305 | 0.755 | 0.817 |
| TurboQuant-mse | 3 | 0.000090 | 0.895 | 0.878 |
| TurboQuant-mse | 4 | 0.000025 | 0.895 | 0.918 |
| TurboQuant-inner_product | 2 | 0.000946 | 0.605 | 0.713 |
| TurboQuant-inner_product | 3 | 0.000305 | 0.760 | 0.820 |
| TurboQuant-inner_product | 4 | 0.000090 | 0.895 | 0.881 |
| QJL (1-bit) | 1-4 | — | 0.580 | 0.703 |
TurboQuant-mse achieves 3.5x lower MSE than naive uniform quantization at 2 bits (0.000305 vs 0.001079). At 4 bits, TurboQuant-inner_product far exceeds 1-bit QJL on recall (0.895 vs 0.580 Recall@1): spending the extra bits on an MSE stage whose residual is then sign-quantized yields a much better inner-product approximation.
All methods below use 2 bits per coordinate:

| Method | MSE | Recall@1 | Recall@10 |
|---|---|---|---|
| NaiveUniform | 0.000856 | 0.30 | 0.80 |
| PerChannelUniform | 0.001032 | 0.29 | 0.82 |
| RandProj+Uniform | 0.001027 | 0.28 | 0.76 |
| TurboQuant-mse | 0.000304 | 0.44 | 0.95 |
TurboQuant-mse reduces MSE by 2.8x over NaiveUniform and improves Recall@10 from 80% to 95% at the same 2-bit budget.
All core operations are implemented in NumPy. When PyTorch is installed, matrix operations (rotation, projection, batch inner products) dispatch to PyTorch tensors automatically.
| Backend | Device detection | Notes |
|---|---|---|
| NumPy (default) | Always available | Fastest on CPU for most workloads |
| PyTorch + CUDA | Auto-detected | Best for large-scale GPU workloads |
| PyTorch + MPS | Auto-detected | Apple Silicon GPU; uses float32 internally (MPS does not support float64) |
On CPU, NumPy and PyTorch perform similarly. GPU acceleration benefits primarily the matrix operations (rotation, projection); the scalar quantization is already vectorized in NumPy.
TurboQuant MSE mode applies a random orthogonal rotation to each input vector before scalar quantization. Because coordinates of a randomly rotated unit vector follow a known Beta distribution, a precomputed Lloyd-Max codebook can be derived analytically for that distribution rather than estimated from data. Lloyd-Max quantization is provably optimal for a fixed scalar quantizer, and the rotation ensures coordinates match the distribution the codebook was designed for. Rotating back after dequantization reconstructs the original vector with near-optimal MSE.
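The effect of the rotation is easy to check numerically: even a one-hot vector, which any fixed scalar codebook handles poorly, becomes a generic-looking unit vector after a random orthogonal rotation, with coordinates in the variance-1/dim regime the codebook targets. A minimal sketch (QR is one standard way to draw a random orthogonal matrix; the library's rotation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128

# Random orthogonal matrix via QR of a Gaussian draw.
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

x = np.zeros(dim)
x[0] = 1.0               # one-hot unit vector: a worst case for a fixed codebook
y = Q @ x                # rotated copy

print(np.linalg.norm(y))  # 1.0: rotation preserves norms exactly
print(y.var() * dim)      # close to 1: coordinates now have variance ~ 1/dim
```

Since the rotation is orthogonal, applying its transpose after dequantization maps the quantized coordinates back to the original basis with no extra distortion.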
TurboQuant inner-product mode extends MSE quantization to produce an unbiased inner-product estimator. It quantizes each vector at (b-1) bits using the MSE codebook, computes the residual between the original and the MSE reconstruction, and then applies the QJL transform (random Gaussian projection followed by sign extraction) to that residual. The stored representation consists of the MSE quantization indices plus the sign bits and L2 norm of the residual. At query time, the inner product estimate combines the MSE dot product and a QJL correction term scaled by sqrt(pi/2) / d, which exactly cancels the bias introduced by the MSE quantizer. Using one bit for QJL on the residual and (b-1) bits for MSE thus achieves better inner-product accuracy than spending all b bits on MSE quantization alone.
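The bias cancellation can be checked numerically: for any reconstruction k_hat, the correction term satisfies E[sqrt(pi/2)/m * ||r|| * <Sq, sign(Sr)>] = <q, r> with r = k - k_hat, so the combined estimate is unbiased for <q, k>. The sketch below uses a crude sign quantizer as a stand-in for the MSE stage (the cancellation holds no matter how k_hat was produced):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
q = rng.standard_normal(d)
k = rng.standard_normal(d)

# Stand-in for the MSE stage: a crude (biased) 1-bit reconstruction of k.
k_hat = np.sign(k) * np.mean(np.abs(k))
r = k - k_hat                         # residual handed to the sign-quantization stage

m = d                                 # projection dimension
trials = 5000
ests = np.empty(trials)
for t in range(trials):
    S = rng.standard_normal((m, d))   # fresh Gaussian projection per trial
    corr = np.sqrt(np.pi / 2) / m * np.linalg.norm(r) * ((S @ q) @ np.sign(S @ r))
    ests[t] = q @ k_hat + corr        # MSE dot product + sign-based correction

# Averaged over projections, the estimate converges to the exact inner product.
print(q @ k, ests.mean())
```

Averaging over projections is only for the demonstration; in the library a single stored sign vector per key gives one (unbiased) sample of this estimator.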
QJL (Quantized Johnson-Lindenstrauss) is a 1-bit scheme that projects each key vector through a random Gaussian matrix S and stores only the sign vector sign(S·k) along with the vector norm. For a query q, the inner product is estimated as sqrt(pi/2) / m * ||k|| * <S·q, sign(S·k)>, where m is the projection dimension. This estimator is unbiased and requires only 1 bit per projected coordinate, making it suitable for extreme compression where reconstruction is not needed.
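A minimal numerical check of this estimator (illustrative, not the library's code): a single sketch is noisy, but averaging the estimate over independent projections converges to the true inner product, confirming unbiasedness.

```python
import numpy as np

rng = np.random.default_rng(1)
d = m = 128
q = rng.standard_normal(d)
k = rng.standard_normal(d)
norm_k = np.linalg.norm(k)

def qjl_estimate(S):
    """sqrt(pi/2)/m * ||k|| * <S q, sign(S k)>: only sign(S k) and ||k|| are stored."""
    return np.sqrt(np.pi / 2) / m * norm_k * ((S @ q) @ np.sign(S @ k))

single = qjl_estimate(rng.standard_normal((m, d)))
avg = np.mean([qjl_estimate(rng.standard_normal((m, d))) for _ in range(3000)])
print(q @ k, single, avg)  # the average tracks the true inner product; single is noisier
```

Note the query side is never quantized: S @ q is computed in full precision at query time, which is what keeps the estimator unbiased despite the 1-bit keys.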
- TurboQuant: "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" — arXiv:2504.19874
- QJL: Zandieh et al., "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead" (AISTATS 2026) — arXiv:2406.03482
- PolarQuant: arXiv:2502.02617