msilverblatt/turboquant-py

turboquant-py

Python implementation of the TurboQuant and QJL vector quantization algorithms.

turboquant-py implements the TurboQuant and QJL vector quantization algorithms from Google Research (ICLR 2026 / AISTATS 2026). It compresses high-dimensional floating-point vectors to 1-4 bits per coordinate while preserving inner products and distances with provably near-optimal distortion. The library offers two quantization modes — MSE mode for reconstruction fidelity and inner-product mode for unbiased similarity search — plus a standalone 1-bit QJL quantizer, all built on a NumPy-first core with optional PyTorch acceleration.

Installation

pip install turboquant-py
# With PyTorch acceleration (optional — supports CUDA and Apple Silicon MPS)
pip install "turboquant-py[torch]"

# From source
pip install -e .
pip install -e ".[torch]"

Quick Start

TurboQuant MSE mode

import numpy as np
from turboquant import TurboQuant

vectors = np.random.randn(1000, 384)  # 1000 vectors, dim=384
tq = TurboQuant(dim=384, bit_width=2, mode="mse", seed=42)

compressed = tq.quantize(vectors)
reconstructed = tq.dequantize(compressed)

mse = float(np.mean((vectors - reconstructed) ** 2))
print(f"Reconstruction MSE: {mse:.6f}")

# Save and reload
compressed.save("my_index")
from turboquant import CompressedVectors
reloaded = CompressedVectors.load("my_index")

TurboQuant inner product mode

import numpy as np
from turboquant import TurboQuant

db = np.random.randn(10000, 768)
query = np.random.randn(768)

tq = TurboQuant(dim=768, bit_width=3, seed=42)  # inner_product mode is the default
compressed = tq.quantize(db)

# Estimate inner products against all compressed database vectors
scores = tq.inner_product(query, compressed)
top10 = np.argsort(scores)[::-1][:10]
print(f"Top-10 indices: {top10}")

QJL 1-bit quantization

import numpy as np
from turboquant import QJL

db = np.random.randn(10000, 1536)
query = np.random.randn(1536)

qjl = QJL(dim=1536, seed=42)
compressed = qjl.quantize(db)

scores = qjl.inner_product(query, compressed)
top10 = np.argsort(scores)[::-1][:10]
print(f"Top-10 indices: {top10}")

API Reference

TurboQuant(dim, bit_width, mode, seed, outlier_channels, outlier_bit_width)

Multi-bit vector quantizer supporting MSE and inner-product modes.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `dim` | `int` | required | Input vector dimensionality |
| `bit_width` | `int` | required | Bits per coordinate (1-4; inner-product mode requires 2+) |
| `mode` | `str` | `"inner_product"` | `"mse"` for reconstruction quality; `"inner_product"` for unbiased similarity search |
| `seed` | `int \| None` | `None` | Random seed for the rotation matrix and QJL projection |
| `outlier_channels` | `int` | `0` | Number of high-magnitude channels to quantize at higher precision |
| `outlier_bit_width` | `int \| None` | `None` | Bit-width for outlier channels |

Methods:

  • quantize(vectors) — compress an (n, dim) array; returns CompressedVectors
  • dequantize(compressed) — reconstruct approximate originals; returns (n, dim) array
  • inner_product(query, compressed) — estimate inner products; returns (n,) scores
  • quantize_batched(vectors, batch_size, output_path, entropy_encode) — stream-quantize large collections to disk

QJL(dim, projection_dim, seed)

1-bit quantizer using the Quantized Johnson-Lindenstrauss transform. Applies a random Gaussian projection followed by sign-bit quantization to produce an unbiased inner-product estimator with no reconstruction capability.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `dim` | `int` | required | Input vector dimensionality |
| `projection_dim` | `int \| None` | `dim` | Projection dimension (must be <= `dim`); higher values improve accuracy at the cost of storage |
| `seed` | `int \| None` | `None` | Random seed for the projection matrix |

Methods:

  • quantize(vectors) — compress to 1-bit sign representation; returns CompressedVectors
  • inner_product(query, compressed) — estimate inner products; returns (n,) scores

CompressedVectors

In-memory container for quantized vectors. Holds the bit-packed quantization indices, per-vector L2 norms, and any auxiliary arrays (QJL signs, residual norms for inner-product mode). Supports slicing with [start:end] and merging with CompressedVectors.concatenate(parts).

Save/load:

compressed.save("path/to/dir")                          # bit-packed on disk
compressed.save("path/to/dir", entropy_encode=True)     # Huffman-encoded (~5% smaller at 4-bit)
loaded = CompressedVectors.load("path/to/dir")
loaded = CompressedVectors.load("path/to/dir", mmap_mode="r")  # memory-mapped

Key attributes: indices, norms, dim, bit_width, num_vectors, metadata, extra_arrays
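
The `indices` array is bit-packed. As a rough illustration of what bit-packing buys (this is not necessarily the library's actual byte layout), the following NumPy round-trip packs 2-bit codes four to a byte:

```python
import numpy as np

rng = np.random.default_rng(3)
bits = 2
idx = rng.integers(0, 2 ** bits, size=(4, 12), dtype=np.uint8)  # 2-bit codes

# Expand each code into its `bits` binary digits (MSB first), then
# pack the digit stream into bytes.
bit_planes = (idx[..., None] >> np.arange(bits - 1, -1, -1)) & 1
packed = np.packbits(bit_planes.reshape(idx.shape[0], -1), axis=1)

# Round-trip: unpack the bytes and reassemble the original codes.
unpacked = np.unpackbits(packed, axis=1)[:, : idx.shape[1] * bits]
restored = unpacked.reshape(idx.shape[0], idx.shape[1], bits)
restored = (restored * (1 << np.arange(bits - 1, -1, -1))).sum(axis=2)
```

Packing 2-bit codes this way cuts index storage 4x relative to one byte per code, which is where the headline compression ratios come from.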


CompressedStore

On-disk vector store backed by memory-mapped arrays. Reconstructs the original quantizer from saved metadata (using the stored seed to regenerate rotation/projection matrices) and supports brute-force top-k search without loading all vectors into RAM.

store = CompressedStore.load("path/to/dir")
results = store.search(query, k=10)  # returns list[tuple[int, float]], sorted descending by score

All modes are supported: MSE, inner-product, QJL, and outlier configurations are each reconstructed automatically from the saved metadata.

Properties: dim, num_vectors, bit_width, mode, vectors


compute_codebook(dim, bit_width) / get_codebook(dim, bit_width)

compute_codebook runs Lloyd-Max optimization on the Beta distribution that describes coordinates of randomly rotated unit vectors, returning (centroids, boundaries) arrays of sizes 2^bit_width and 2^bit_width + 1.

get_codebook is an lru_cache-backed convenience wrapper around compute_codebook. Precomputed codebooks ship for common dimensions (64-4096); unsupported dimensions are computed on the fly.
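
A toy empirical version of the same construction, running plain Lloyd iterations on sampled rotated-unit-vector coordinates (the library instead solves this analytically against the Beta density):

```python
import numpy as np

rng = np.random.default_rng(4)
d, bits = 256, 2
K = 2 ** bits

# Sample coordinates of random unit vectors; each coordinate follows
# the Beta-derived law that the analytic codebook targets.
x = rng.standard_normal((2000, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)
samples = x.ravel()

# Plain Lloyd iterations: assign each sample to its nearest centroid,
# then move each centroid to the mean of its cell.
centroids = np.quantile(samples, (np.arange(K) + 0.5) / K)
for _ in range(30):
    boundaries = (centroids[:-1] + centroids[1:]) / 2
    cells = np.digitize(samples, boundaries)
    centroids = np.array([samples[cells == j].mean() for j in range(K)])

print("2-bit centroids:", np.round(centroids, 4))
```

For a near-Gaussian coordinate distribution, the 2-bit centroids converge roughly to ±0.45σ and ±1.51σ with σ = 1/√dim, matching the classical Lloyd-Max solution.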


compute_theoretical_savings(dim, bit_width)

Compute the theoretical entropy and Huffman coding savings for a given configuration, as described in Section 3.1 of the TurboQuant paper.

from turboquant import compute_theoretical_savings
savings = compute_theoretical_savings(dim=256, bit_width=4)
# {'entropy': 3.742, 'avg_bits_huffman': 3.779, 'savings_pct': 5.5}

Storage

The on-disk format is a directory containing:

| File | Contents |
|---|---|
| `meta.json` | Dimensions, bit-width, mode, seed, outlier config, encoding flags |
| `indices.npy` | Bit-packed quantization indices (or `indices.huffman` if entropy-encoded) |
| `norms.npy` | Per-vector L2 norms |
| `huffman_table.json` | Huffman coding table (only when entropy-encoded) |
| Additional `.npy` files | QJL sign vectors, residual norms, outlier indices (mode-dependent) |

Rotation and projection matrices are not stored — they are reconstructed deterministically from the saved seed, reducing storage overhead.
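
A minimal sketch of this design using NumPy's seeded generator (the library's exact matrix construction may differ; `projection_matrix` is a hypothetical helper):

```python
import numpy as np

def projection_matrix(dim, proj_dim, seed):
    """Regenerate a Gaussian projection purely from a seed."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((proj_dim, dim))

# The same seed reproduces the matrix bit-for-bit, so only the seed
# (a few bytes in meta.json) needs to be persisted — not the matrix.
S1 = projection_matrix(768, 256, seed=42)
S2 = projection_matrix(768, 256, seed=42)
print(np.array_equal(S1, S2))
```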

Entropy encoding

Indices can optionally be Huffman-encoded when saving, providing lossless compression. Savings are measured against fixed-width b-bit (bit-packed) encoding:

| Bit-width | Shannon entropy | Huffman avg bits | Savings |
|---|---|---|---|
| 1 | 1.000 | 1.000 | 0.0% |
| 2 | 1.911 | 1.989 | 0.5% |
| 3 | 2.819 | 2.876 | 4.1% |
| 4 | 3.742 | 3.779 | 5.5% |
compressed.save("my_index", entropy_encode=True)
# Loading is automatic — detects encoding from metadata
loaded = CompressedVectors.load("my_index")
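
For reference, the savings column above is computed relative to fixed-width b-bit codes:

```python
# savings_pct = (b - avg_bits_huffman) / b * 100, using the Huffman
# average bits from the table above.
for b, avg in [(1, 1.000), (2, 1.989), (3, 2.876), (4, 3.779)]:
    print(f"{b}-bit: {(b - avg) / b * 100:.1f}% saved")
```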

Supported Bit-Widths

Results on synthetic unit vectors (dim=768, n=1000).

| Bit-width | Compression ratio | Reconstruction MSE | Recall@1 (MSE mode) | Recall@10 (MSE mode) |
|---|---|---|---|---|
| 1 | 32x | 0.000473 | 0.19 | 0.65 |
| 2 | 16x | 0.000153 | 0.46 | 0.91 |
| 3 | 10.7x | 0.000045 | 0.67 | 0.99 |
| 4 | 8x | 0.000013 | 0.73 | 1.00 |

Benchmarks

All results use all-MiniLM-L6-v2 embeddings (dim=384, n=2000 database, n=200 queries, seed=42).

MSE distortion and recall on neural embeddings

| Method | Bit-width | MSE | Recall@1 | Recall@10 |
|---|---|---|---|---|
| NaiveUniform | 2 | 0.001079 | 0.675 | 0.721 |
| NaiveUniform | 3 | 0.000195 | 0.830 | 0.850 |
| NaiveUniform | 4 | 0.000042 | 0.895 | 0.913 |
| TurboQuant-mse | 2 | 0.000305 | 0.755 | 0.817 |
| TurboQuant-mse | 3 | 0.000090 | 0.895 | 0.878 |
| TurboQuant-mse | 4 | 0.000025 | 0.895 | 0.918 |
| TurboQuant-inner_product | 2 | 0.000946 | 0.605 | 0.713 |
| TurboQuant-inner_product | 3 | 0.000305 | 0.760 | 0.820 |
| TurboQuant-inner_product | 4 | 0.000090 | 0.895 | 0.881 |
| QJL (1-bit) | 1 | — | 0.580 | 0.703 |

(QJL stores only sign bits and norms, so reconstruction MSE is not applicable.)

TurboQuant-mse achieves 3.5x lower MSE than naive uniform quantization at 2 bits. Already at 2 bits, TurboQuant-inner_product matches the recall of 1-bit QJL, and its higher bit-widths push recall well beyond it.

Comparison vs. uniform quantization at 2 bits (dim=384)

| Method | MSE | Recall@1 | Recall@10 |
|---|---|---|---|
| NaiveUniform | 0.000856 | 0.30 | 0.80 |
| PerChannelUniform | 0.001032 | 0.29 | 0.82 |
| RandProj+Uniform | 0.001027 | 0.28 | 0.76 |
| TurboQuant-mse | 0.000304 | 0.44 | 0.95 |

TurboQuant-mse reduces MSE by 2.8x over NaiveUniform and improves Recall@10 from 80% to 95% at the same 2-bit budget.

Acceleration

All core operations are implemented in NumPy. When PyTorch is installed, matrix operations (rotation, projection, batch inner products) dispatch to PyTorch tensors automatically.

| Backend | Device detection | Notes |
|---|---|---|
| NumPy (default) | Always available | Fastest on CPU for most workloads |
| PyTorch + CUDA | Auto-detected | Best for large-scale GPU workloads |
| PyTorch + MPS | Auto-detected | Apple Silicon GPU; uses float32 internally (MPS does not support float64) |

On CPU, NumPy and PyTorch perform similarly. GPU acceleration benefits primarily the matrix operations (rotation, projection); the scalar quantization is already vectorized in NumPy.
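
An illustrative device probe along these lines (not the library's internal API; shown only to make the fallback order concrete):

```python
import importlib.util

def pick_backend():
    """Prefer CUDA, then Apple MPS, else stay on NumPy."""
    if importlib.util.find_spec("torch") is None:
        return "numpy"                       # PyTorch not installed
    import torch
    if torch.cuda.is_available():
        return "torch-cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "torch-mps"
    return "numpy"                           # plain CPU: NumPy is comparable

print(pick_backend())
```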

How It Works

TurboQuant MSE mode applies a random orthogonal rotation to each input vector before scalar quantization. Because the coordinates of a randomly rotated unit vector follow a known Beta distribution, a Lloyd-Max codebook can be derived analytically for that distribution rather than estimated from data. Lloyd-Max quantization is the MSE-optimal scalar quantizer for a known source distribution, and the rotation guarantees the coordinates actually follow the distribution the codebook was designed for. Rotating back after dequantization reconstructs the original vector with near-optimal MSE.
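
A toy NumPy round trip of rotate → quantize → dequantize → rotate back, using a simple uniform codebook in place of the Beta-optimal Lloyd-Max one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, bits = 64, 200, 4

# Random orthogonal rotation via QR (stand-in for the library's
# seeded rotation; illustrative only).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # unit vectors

rotated = x @ Q.T

# Toy uniform codebook over +/- 3 standard deviations of a coordinate.
sigma = 1.0 / np.sqrt(d)
centroids = np.linspace(-3 * sigma, 3 * sigma, 2 ** bits)
boundaries = (centroids[:-1] + centroids[1:]) / 2

idx = np.digitize(rotated, boundaries)          # quantize
approx = centroids[idx] @ Q                     # dequantize + rotate back

rel_err = np.linalg.norm(x - approx) / np.linalg.norm(x)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Even with this crude uniform codebook the relative error at 4 bits lands on the order of 0.1; the Beta-matched Lloyd-Max codebook does strictly better.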

TurboQuant inner-product mode extends MSE quantization to produce an unbiased inner-product estimator. It quantizes each vector at (b-1) bits using the MSE codebook, computes the residual between the original and the MSE reconstruction, and then applies the QJL transform (random Gaussian projection followed by sign extraction) to that residual. The stored representation consists of the MSE quantization indices plus the sign bits and L2 norm of the residual. At query time, the inner-product estimate adds to the MSE dot product a QJL correction term scaled by sqrt(pi/2) / m (with m the projection dimension) and by the residual norm; the correction is an unbiased estimate of the query-residual inner product, which exactly cancels the bias introduced by the MSE quantizer. Using one bit for QJL on the residual and (b-1) bits for MSE thus achieves better inner-product accuracy than spending all b bits on MSE quantization alone.
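
The two-stage construction can be checked numerically. The sketch below stands in for the (b-1)-bit MSE stage with a toy 1-bit sign codebook (and skips the rotation), applies QJL to the residual, and averages over projections to show the combined estimate is unbiased:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, trials = 64, 64, 3000

k = rng.standard_normal(d); k /= np.linalg.norm(k)
q = rng.standard_normal(d)
true_ip = float(q @ k)

# Stage 1: crude 1-bit quantization of k. Toy codebook: the
# conditional mean of a N(0, 1/d) coordinate given its sign.
sigma = 1.0 / np.sqrt(d)
k_hat = np.sign(k) * sigma * np.sqrt(2 / np.pi)
r = k - k_hat                                   # residual left by stage 1

# Stage 2: QJL on the residual. Each trial stores only sign(S @ r)
# and ||r||; the correction term estimates <q, r> without bias.
ests = []
for _ in range(trials):
    S = rng.standard_normal((m, d))
    corr = np.sqrt(np.pi / 2) / m * np.linalg.norm(r) * float((S @ q) @ np.sign(S @ r))
    ests.append(float(q @ k_hat) + corr)

print(f"true = {true_ip:.3f}, mean combined estimate = {np.mean(ests):.3f}")
```

Since k_hat + r = k exactly, the expectation of the combined estimate is the true inner product; only the variance depends on how much energy stage 1 leaves in the residual.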

QJL (Quantized Johnson-Lindenstrauss) is a 1-bit scheme that projects each key vector through a random Gaussian matrix S and stores only the sign vector sign(S·k) along with the vector norm. For a query q, the inner product is estimated as sqrt(pi/2) / m * ||k|| * <S·q, sign(S·k)>, where m is the projection dimension. This estimator is unbiased and requires only 1 bit per projected coordinate, making it suitable for extreme compression where reconstruction is not needed.
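
The estimator's unbiasedness is easy to verify numerically. Averaging the sketch below over many independent projections S recovers the true inner product:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, trials = 128, 128, 2000

k = rng.standard_normal(d)
q = rng.standard_normal(d)
true_ip = float(q @ k)

ests = []
for _ in range(trials):
    S = rng.standard_normal((m, d))
    signs = np.sign(S @ k)                  # the only stored bits (plus ||k||)
    est = np.sqrt(np.pi / 2) / m * np.linalg.norm(k) * float((S @ q) @ signs)
    ests.append(est)

print(f"true <q,k> = {true_ip:.3f}, mean estimate = {np.mean(ests):.3f}")
```

Any single estimate is noisy (variance shrinks like 1/m), but the mean over projections converges to the exact inner product, which is what "unbiased" buys in practice.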

References

  • TurboQuant: "TurboQuant: Redefining AI Efficiency with Extreme Compression" — arXiv:2504.19874
  • QJL: Zandieh et al., "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead" (AISTATS 2026) — arXiv:2406.03482
  • PolarQuant: arXiv:2502.02617
