256,000 msg/s on 8x A100. Up to 3.6x faster than Hugging Face TEI on same hardware.
357,893 msg/s sustained in production with workload-specific tuning.
IgniteMS is a batch text embedding engine. Rust, native TensorRT, no Python at runtime. You give it text, it gives you embeddings.
Use it for workloads where millions of texts need embeddings quickly: vector DB reindexing, search rebuilds after model swaps, corpus-scale processing.
p4d.24xlarge (8x A100 40GB), 1M MSMARCO passages, TensorRT 11 mixed precision:
| Model | GPUs | msg/s | tok/s | TEI msg/s | Speedup |
|---|---|---|---|---|---|
| e5-small-v2 | 1 | 56,002 | 2,860,377 | 16,412 | 3.4x |
| e5-small-v2 | 8 | 254,979 | 12,988,479 | 88,912 | 2.9x |
| e5-small | 1 | 55,958 | 3,178,595 | 15,378 | 3.6x |
| e5-small | 8 | 255,958 | 14,539,275 | 76,480 | 3.3x |
| e5-base | 1 | 18,626 | 1,058,018 | 8,843 | 2.1x |
| e5-base | 8 | 126,614 | 7,192,032 | 57,423 | 2.2x |
| e5-large | 1 | 5,861 | 332,982 | 4,029 | 1.5x |
| e5-large | 8 | 40,445 | 2,297,994 | 28,664 | 1.4x |

Single-GPU comparison on e5-small-v2 against other embedding tools:

| Tool | msg/s | Relative |
|---|---|---|
| IgniteMS | 56,002 | 1.0x |
| TEI | 16,412 | 0.29x |
| Fastembed (ORT+CUDA) | 8,907 | 0.16x |
| SentenceTransformers | 2,468 | 0.04x |
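
The Speedup and Relative columns are plain ratios of the msg/s figures. As a quick sanity check, the sketch below recomputes a few of them from the table values and also derives 1-to-8 GPU scaling efficiency, a figure the tables do not report directly. The function names are illustrative only.

```python
# Throughput figures (msg/s) copied from the benchmark table above.
# Layout: model -> gpu count -> (IgniteMS msg/s, TEI msg/s)
rows = {
    "e5-small-v2": {1: (56_002, 16_412), 8: (254_979, 88_912)},
    "e5-base": {1: (18_626, 8_843), 8: (126_614, 57_423)},
    "e5-large": {1: (5_861, 4_029), 8: (40_445, 28_664)},
}


def speedup(ignite: int, tei: int) -> float:
    """IgniteMS throughput divided by TEI throughput."""
    return ignite / tei


def scaling_efficiency(one_gpu: int, n_gpu: int, n: int) -> float:
    """Fraction of ideal linear scaling achieved on n GPUs."""
    return n_gpu / (one_gpu * n)


for model, by_gpus in rows.items():
    s1 = speedup(*by_gpus[1])
    s8 = speedup(*by_gpus[8])
    eff = scaling_efficiency(by_gpus[1][0], by_gpus[8][0], 8)
    print(f"{model}: {s1:.1f}x on 1 GPU, {s8:.1f}x on 8 GPUs, "
          f"{eff:.0%} of linear scaling")
```
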

60 models supported out of the box: E5, BGE, GTE, MiniLM, MPNet, Nomic, Jina, mxbai, Snowflake Arctic, LaBSE, stella, plus language-specific models for Chinese, French, Russian, Korean, Indonesian, and domain models for scientific/biomedical text. Supports both encoder (BERT-style) and decoder (LLM-based) architectures with mean-pool or last-token pooling.

Works with any Hugging Face model that exports to ONNX and compiles to TensorRT. Models are downloaded and compiled on first run. See MODELS.md for the full list with verified throughput and correctness results.
Real production pipeline, not a controlled benchmark:
| Metric | Value | Note |
|---|---|---|
| Messages embedded | 685,520,494 | |
| Sustained throughput | 357,893 msg/s | average across full run |
| Peak throughput | 506,589 msg/s | short text, GPUs saturated |
| Low throughput | 196,676 msg/s | dense/long text files, reader-bound |
| Wall clock | 1,915s (31.9 min) | |
| Hardware | 1x p4d.24xlarge | 8x A100 40GB, spot |
Full pipeline: read zstd-compressed social media events (Reddit, Hacker News), extract and normalize text, tokenize, infer on 8 GPUs, write aggregated parquet output. Not a GPU microbenchmark.
For cost context: at ~$12.68/hr p4d spot pricing, this production run cost about $0.01 per 1M messages embedded. On the same 68-token/message dataset, OpenAI text-embedding-3-small would be about $1.36 per 1M messages at current API pricing.
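
The cost comparison can be reproduced from the figures in the production table. This is a worked calculation, not a pricing tool: the spot rate, wall clock, and message count come from above, while the API price of $0.02 per 1M tokens is an assumption that should be checked against current pricing.

```python
# Figures from the production run above.
SPOT_USD_PER_HOUR = 12.68
WALL_CLOCK_S = 1_915
MESSAGES = 685_520_494
TOKENS_PER_MESSAGE = 68

# Assumption: text-embedding-3-small list price in USD per 1M tokens.
OPENAI_USD_PER_1M_TOKENS = 0.02

# Cost of the whole run, then normalized to 1M messages.
run_cost = SPOT_USD_PER_HOUR * WALL_CLOCK_S / 3600
self_hosted_per_1m = run_cost / (MESSAGES / 1_000_000)

# 1M messages at 68 tokens each is 68M tokens.
api_per_1m = TOKENS_PER_MESSAGE * OPENAI_USD_PER_1M_TOKENS

print(f"run cost:    ${run_cost:.2f}")
print(f"self-hosted: ${self_hosted_per_1m:.4f} per 1M messages")
print(f"API:         ${api_per_1m:.2f} per 1M messages")
```
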
No single trick. Just removing waste everywhere:
- TensorRT compiles kernels specific to the GPU architecture and batch shape. Not generic ONNX or PyTorch.
- Bucketed batching groups texts by token length so you're not padding a 6-token string to 512.
- CPU-side pipeline keeps tokenization, batching, and GPU dispatch moving together without waiting on each other.
- Rust end-to-end. No GIL, no Python request path, no HTTP serialization at runtime.
- Multi-GPU in one process. Lock-free work stealing across GPUs. Most serving stacks run one container per GPU and glue them together with HTTP — we just don't.
- Engine caching. TRT engines compile once and get reused until something actually changes (model, runtime version, or batch profile).
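
To make the bucketed batching point concrete, here is a minimal sketch of the idea in Python. It is not the IgniteMS implementation (which is Rust); the bucket boundaries and function names are hypothetical and chosen only to show how much padding bucketing can avoid.

```python
from collections import defaultdict

# Hypothetical bucket boundaries (max token length per bucket).
BUCKETS = (16, 32, 64, 128, 256, 512)


def bucket_for(length: int) -> int:
    """Smallest bucket boundary that fits a sequence of this length."""
    for bound in BUCKETS:
        if length <= bound:
            return bound
    # Assumption: longer inputs are truncated to the largest bucket.
    return BUCKETS[-1]


def padded_tokens_bucketed(lengths: list[int]) -> int:
    """Total tokens processed when each text is padded only to its bucket."""
    per_bucket = defaultdict(int)
    for n in lengths:
        per_bucket[bucket_for(n)] += 1
    return sum(bound * count for bound, count in per_bucket.items())


def padded_tokens_fixed(lengths: list[int], max_len: int = 512) -> int:
    """Total tokens processed when every text is padded to max_len."""
    return max_len * len(lengths)


lengths = [6, 12, 40, 55, 70, 300]  # token counts for six example texts
print("real tokens:      ", sum(lengths))
print("padded, fixed 512:", padded_tokens_fixed(lengths))
print("padded, bucketed: ", padded_tokens_bucketed(lengths))
```

For these six example texts, padding everything to 512 processes 3,072 token slots, while bucketing processes 800 for 483 real tokens.
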
Docker (just needs Docker + NVIDIA runtime):

    python3 quickstart.py

Native (needs Rust, CUDA 12+, TensorRT 11+):

    python3 quickstart.py --native

Downloads a public dataset, embeds it, writes output. First run takes ~5 minutes for TensorRT engine compilation. After that, engines are cached and startup is instant.

    docker run --rm --gpus all \
      -v "$PWD/data:/data" \
      -v ignite-ms-cache:/cache \
      ghcr.io/artain-ai/ignite-ms:v1.1.0 \
      embed \
      --model intfloat/e5-small-v2 \
      --input /data/input.jsonl \
      --output /data/embeddings.npy \
      --cache-dir /cache \
      --gpus all

Use the versioned image for reproducible deployments. The current v1.1.0 release targets TensorRT 11 mixed-precision engines. The latest tag may move and is intended for quick experiments, not production pinning.
Docker hosts need an NVIDIA driver, Docker, and the NVIDIA container runtime. They do not need the CUDA toolkit or TensorRT installed on the host. The image has the production CLI (ignite-ms), benchmark CLI (ignite-ms-bench), and all dependencies for model prep.
Reproduce the numbers:

    python3 benchmark.py                                     # Docker, defaults
    python3 benchmark.py --mode native --model e5-small-v2   # native
    python3 benchmark.py --gpu-counts 1,8 --skip-tei         # IgniteMS only

Downloads data, prepares models, runs both IgniteMS and TEI, reports results. See BENCHMARKING.md for full results, methodology, and caveats.
The benchmark reports messages/sec plus token-oriented metrics such as tokens/sec, padded tokens/sec, average sequence length, batch fill, and estimated TFLOP/s. Messages/sec is useful for corpus throughput; token metrics are better for comparing runs with different text lengths.
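
As a rough illustration of how these metrics relate, the sketch below derives them from per-batch token lengths. The definitions are assumptions made for this example (in particular, batch fill is taken as real tokens divided by padded tokens, with each batch padded to its longest message); BENCHMARKING.md is the authority on how the benchmark actually computes them.

```python
def throughput_metrics(batches: list[list[int]], elapsed_s: float) -> dict[str, float]:
    """Derive throughput metrics from per-batch token lengths.

    Each inner list holds the real token count of every message in one
    non-empty batch. Assumes each batch is padded to its longest message.
    """
    messages = sum(len(b) for b in batches)
    real_tokens = sum(sum(b) for b in batches)
    padded_tokens = sum(len(b) * max(b) for b in batches)
    return {
        "msg_per_s": messages / elapsed_s,
        "tok_per_s": real_tokens / elapsed_s,
        "padded_tok_per_s": padded_tokens / elapsed_s,
        "avg_seq_len": real_tokens / messages,
        "batch_fill": real_tokens / padded_tokens,
    }


# Two made-up runs: the second has 10x longer texts and takes 10x longer.
short_run = throughput_metrics([[8, 10, 12], [14, 16, 16]], elapsed_s=2.0)
long_run = throughput_metrics([[80, 100, 120], [140, 160, 160]], elapsed_s=20.0)

for name, m in (("short", short_run), ("long", long_run)):
    print(name, {k: round(v, 2) for k, v in m.items()})
```

In this toy example the two runs differ by 10x in messages/sec but have identical tokens/sec, which is why token metrics are the fairer comparison when text lengths differ.
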
Input: JSONL ({"text": "..."}) or plain text, one per line. Handles .zst and .gz compression.
Output: .npy (NumPy array) or .parquet (with IDs). Row order preserved.
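
A minimal sketch of preparing input in the JSONL shape described above, using only the Python standard library. The file name and helper function are hypothetical; gzip is used here because it ships with Python, whereas writing .zst would need a third-party package.

```python
import gzip
import json
from pathlib import Path


def write_jsonl_gz(texts: list[str], path: Path) -> int:
    """Write one {"text": ...} object per line, gzip-compressed.

    Returns the number of rows written.
    """
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return len(texts)


texts = ["first passage", "second passage\nwith a line break", "third passage"]
out = Path("corpus.jsonl.gz")  # hypothetical file name
rows = write_jsonl_gz(texts, out)
print(f"wrote {rows} rows to {out}")
```

Because row order is preserved, row i of the output corresponds to line i of the input. A .npy output file can be read back with numpy.load.
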

    ignite-ms embed \
      --model intfloat/e5-small-v2 \
      --input corpus.jsonl.zst \
      --output embeddings.npy \
      --gpus all

Repository layout:

    crates/ignite-ms/         core engine
    crates/ignite-ms-embed/   production CLI (ignite-ms)
    crates/ignite-ms-bench/   benchmark CLI (ignite-ms-bench)
    native/                   TensorRT C++ bridge
    examples/                 library usage
    benchmark.py              IgniteMS vs TEI benchmark
    quickstart.py             one-command demo

Build from source:

    cargo build --release -p ignite-ms-embed
    cargo build --release -p ignite-ms-bench

Needs CUDA 12+ and TensorRT 11+ headers on the host.
Docker mode: NVIDIA GPU, NVIDIA driver, Docker, NVIDIA container runtime. CUDA and TensorRT are included in the image.
Native mode: NVIDIA GPU, CUDA 12+, TensorRT 11+, Rust 1.85+, Python 3.10+.
Report vulnerabilities privately. See SECURITY.md.
Contributions require CLA. See CONTRIBUTING.md.
Apache 2.0.
Artain may offer future versions under different terms. Versions released under Apache 2.0 stay Apache 2.0.