IgniteMS

About 256,000 msg/s on 8x A100. Up to 3.6x faster than Hugging Face TEI on the same hardware.

357,893 msg/s sustained in production with workload-specific tuning.

IgniteMS is a batch text embedding engine. Rust, native TensorRT, no Python at runtime. You give it text, it gives you embeddings.

Use it for workloads where millions of texts need embeddings quickly: vector DB reindexing, search rebuilds after model swaps, corpus-scale processing.

Numbers

p4d.24xlarge (8x A100 80GB), 1M MSMARCO passages, TensorRT 11 mixed precision:

Model	GPUs	msg/s	tok/s	TEI msg/s	Speedup
e5-small-v2	1	56,002	2,860,377	16,412	3.4x
e5-small-v2	8	254,979	12,988,479	88,912	2.9x
e5-small	1	55,958	3,178,595	15,378	3.6x
e5-small	8	255,958	14,539,275	76,480	3.3x
e5-base	1	18,626	1,058,018	8,843	2.1x
e5-base	8	126,614	7,192,032	57,423	2.2x
e5-large	1	5,861	332,982	4,029	1.5x
e5-large	8	40,445	2,297,994	28,664	1.4x
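The Speedup column is the IgniteMS msg/s figure divided by the TEI msg/s figure for the same row. A minimal sketch that recomputes it from a subset of the rows above, and also derives a 1-to-8 GPU scaling ratio, is below. The scaling ratio is not reported in the table; it is a derived figure added here for illustration, and the function names are hypothetical.

```python
# (model, GPUs, IgniteMS msg/s, TEI msg/s), copied from the table above.
ROWS = [
    ("e5-small-v2", 1, 56_002, 16_412),
    ("e5-small-v2", 8, 254_979, 88_912),
    ("e5-base", 1, 18_626, 8_843),
    ("e5-base", 8, 126_614, 57_423),
    ("e5-large", 1, 5_861, 4_029),
    ("e5-large", 8, 40_445, 28_664),
]


def speedup(ignite_mps, tei_mps):
    """Speedup column: IgniteMS throughput relative to TEI."""
    return ignite_mps / tei_mps


def scaling_efficiency(one_gpu_mps, n_gpu_mps, n_gpus):
    """Derived metric (not in the table): fraction of ideal linear scaling."""
    return n_gpu_mps / (one_gpu_mps * n_gpus)


# Single-GPU IgniteMS throughput per model, used as the scaling baseline.
single = {model: ignite for model, gpus, ignite, _ in ROWS if gpus == 1}

for model, gpus, ignite, tei in ROWS:
    line = f"{model:12s} {gpus} GPU  speedup {speedup(ignite, tei):.1f}x"
    if gpus > 1:
        eff = scaling_efficiency(single[model], ignite, gpus)
        line += f"  scaling {eff:.0%} of linear"
    print(line)
```

On these figures the larger models land closer to linear scaling than e5-small-v2 does. The table does not say why, so treat any explanation as a hypothesis to check against BENCHMARKING.md.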

Baselines (1 GPU, e5-small-v2)

Tool	msg/s	Relative
IgniteMS	56,002	1.0x
TEI	16,412	0.29x
Fastembed (ORT+CUDA)	8,907	0.16x
SentenceTransformers	2,468	0.04x

60 models are supported out of the box: E5, BGE, GTE, MiniLM, MPNet, Nomic, Jina, mxbai, Snowflake Arctic, LaBSE, stella, plus language-specific models for Chinese, French, Russian, Korean, and Indonesian, and domain models for scientific/biomedical text. Both encoder (BERT-style) and decoder (LLM-based) architectures are supported, with mean or last-token pooling.

IgniteMS works with any Hugging Face model that exports to ONNX and compiles to TensorRT. Models are downloaded and compiled on first run. See MODELS.md for the full list with verified throughput and correctness results.

Production run

Real production pipeline, not a controlled benchmark:

Metric	Value	Note
Messages embedded	685,520,494
Sustained throughput	357,893 msg/s	average across full run
Peak throughput	506,589 msg/s	short text, GPUs saturated
Low throughput	196,676 msg/s	dense/long text files, reader-bound
Wall clock	1,915s (31.9 min)
Hardware	1x p4d.24xlarge	8x A100 40GB, spot

Full pipeline: read zstd-compressed social media events (Reddit, Hacker News), extract and normalize text, tokenize, infer on 8 GPUs, write aggregated parquet output. Not a GPU microbenchmark.

For cost context: at ~$12.68/hr p4d spot pricing, this production run cost about $0.01 per 1M messages embedded. On the same 68-token/message dataset, OpenAI text-embedding-3-small would be about $1.36 per 1M messages at current API pricing.
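The cost comparison can be reproduced from the figures above. This is a worked calculation, not a pricing reference: the $0.02 per 1M tokens API rate is an assumption inferred from the $1.36 figure (1.36 / 68), so check it against current pricing before relying on it.

```python
# Figures copied from the production-run table and the cost paragraph.
SPOT_USD_PER_HOUR = 12.68
WALL_CLOCK_S = 1_915
MESSAGES = 685_520_494
TOKENS_PER_MESSAGE = 68

# Assumption: API rate implied by the $1.36 figure (1.36 / 68 tokens).
API_USD_PER_1M_TOKENS = 0.02

# Self-hosted: pay for the instance only for the duration of the run.
run_cost_usd = SPOT_USD_PER_HOUR * WALL_CLOCK_S / 3600
self_hosted_per_1m = run_cost_usd / (MESSAGES / 1e6)

# API: 1M messages at 68 tokens each is 68M tokens.
api_per_1m = TOKENS_PER_MESSAGE * API_USD_PER_1M_TOKENS

print(f"run cost:                ${run_cost_usd:.2f}")
print(f"self-hosted per 1M msgs: ${self_hosted_per_1m:.4f}")
print(f"API per 1M msgs:         ${api_per_1m:.2f}")
```

The whole run comes to roughly $6.75 of spot time, which is where the "about $0.01 per 1M messages" figure comes from.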

Why it's fast

No single trick. Just removing waste everywhere:

TensorRT compiles kernels specific to the GPU architecture and batch shape. Not generic ONNX or PyTorch.
Bucketed batching groups texts by token length so you're not padding a 6-token string to 512.
CPU-side pipeline keeps tokenization, batching, and GPU dispatch moving together without waiting on each other.
Rust end-to-end. No GIL, no Python request path, no HTTP serialization at runtime.
Multi-GPU in one process. Lock-free work stealing across GPUs. Most serving stacks run one container per GPU and glue them together with HTTP. IgniteMS does not.
Engine caching. TRT engines compile once and get reused until something actually changes (model, runtime version, or batch profile).
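To make the bucketed batching point concrete, here is a minimal sketch of the idea. It is not IgniteMS's implementation (which is Rust): the bucket boundaries, function names, and batch size are hypothetical, chosen only to show how grouping by token length cuts padding.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket boundaries (max token length per bucket).
BUCKETS = [16, 32, 64, 128, 256, 512]


def bucket_for(n_tokens):
    """Smallest bucket that fits n_tokens, clamped to the largest bucket."""
    i = bisect_left(BUCKETS, n_tokens)
    return BUCKETS[min(i, len(BUCKETS) - 1)]


def make_batches(token_lengths, batch_size):
    """Group item indices by bucket, then cut each bucket into batches."""
    by_bucket = defaultdict(list)
    for idx, n in enumerate(token_lengths):
        by_bucket[bucket_for(n)].append(idx)
    batches = []
    for pad_to in sorted(by_bucket):
        ids = by_bucket[pad_to]
        for start in range(0, len(ids), batch_size):
            batches.append((pad_to, ids[start:start + batch_size]))
    return batches


def padded_tokens(batches):
    """Total tokens the GPU processes, counting padding up to each bucket."""
    return sum(pad_to * len(ids) for pad_to, ids in batches)


lengths = [6, 9, 14, 40, 55, 300, 480, 7]  # token counts of eight texts
batches = make_batches(lengths, batch_size=4)
naive = 512 * len(lengths)  # every text padded to 512
bucketed = padded_tokens(batches)

print(batches)
print(f"padded tokens: naive={naive}, bucketed={bucketed}")
```

In this toy input the short texts share a 16-token bucket instead of being padded to 512, so the padded token count drops from 4096 to 1216. A real engine would also need a policy for flushing partially filled buckets, which this sketch leaves out.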

Quickstart

Docker (just needs Docker + NVIDIA runtime):

python3 quickstart.py

Native (needs Rust, CUDA 12+, TensorRT 11+):

python3 quickstart.py --native

Downloads a public dataset, embeds it, writes output. First run takes ~5 minutes for TensorRT engine compilation. After that, engines are cached and startup is instant.

Docker

docker run --rm --gpus all \
  -v "$PWD/data:/data" \
  -v ignite-ms-cache:/cache \
  ghcr.io/artain-ai/ignite-ms:v1.1.0 \
  embed \
  --model intfloat/e5-small-v2 \
  --input /data/input.jsonl \
  --output /data/embeddings.npy \
  --cache-dir /cache \
  --gpus all

Use the versioned image for reproducible deployments. The current v1.1.0 release targets TensorRT 11 mixed-precision engines. The latest tag may move and is intended for quick experiments, not production pinning.

Docker hosts need an NVIDIA driver, Docker, and the NVIDIA container runtime. They do not need the CUDA toolkit or TensorRT installed on the host. The image has the production CLI (ignite-ms), benchmark CLI (ignite-ms-bench), and all dependencies for model prep.

Benchmark

Reproduce the numbers:

python3 benchmark.py                                          # Docker, defaults
python3 benchmark.py --mode native --model e5-small-v2        # native
python3 benchmark.py --gpu-counts 1,8 --skip-tei              # IgniteMS only

Downloads data, prepares models, runs both IgniteMS and TEI, reports results. See BENCHMARKING.md for full results, methodology, and caveats.

The benchmark reports messages/sec plus token-oriented metrics such as tokens/sec, padded tokens/sec, average sequence length, batch fill, and estimated TFLOP/s. Messages/sec is useful for corpus throughput; token metrics are better for comparing runs with different text lengths.
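As an illustration of why token metrics travel better across datasets, the sketch below relates the two columns of the Numbers table. It assumes tok/s counts real (unpadded) tokens, which the table does not state explicitly, and the 200-token workload is a hypothetical input, not a measured result.

```python
# (model, GPUs, msg/s, tok/s) for three single-GPU rows of the Numbers table.
ROWS = [
    ("e5-small-v2", 1, 56_002, 2_860_377),
    ("e5-small", 1, 55_958, 3_178_595),
    ("e5-large", 1, 5_861, 332_982),
]


def avg_tokens_per_message(msgs_per_s, toks_per_s):
    """Average sequence length implied by the two throughput figures."""
    return toks_per_s / msgs_per_s


def msgs_per_s_at(toks_per_s, tokens_per_message):
    """Message rate the same token rate would give at another text length."""
    return toks_per_s / tokens_per_message


for model, gpus, mps, tps in ROWS:
    avg = avg_tokens_per_message(mps, tps)
    print(f"{model:12s} {avg:.1f} tokens/message")

# Hypothetical: same token throughput, but 200-token messages.
print(round(msgs_per_s_at(2_860_377, 200)), "msg/s at 200 tokens/message")
```

At a constant token rate, messages/sec falls in proportion to text length, so a msg/s figure only transfers to corpora with similar length distributions. Real throughput on longer texts will not follow this proportionality exactly, since attention cost grows faster than linearly with sequence length.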

Input / Output

Input: JSONL ({"text": "..."}) or plain text, one record per line. Handles .zst and .gz compression.

Output: .npy (NumPy array) or .parquet (with IDs). Row order preserved.

ignite-ms embed \
  --model intfloat/e5-small-v2 \
  --input corpus.jsonl.zst \
  --output embeddings.npy \
  --gpus all
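For preparing input, a small sketch that writes texts in the JSONL shape described above, gzip-compressed, using only the Python standard library. The helper names and the corpus.jsonl.gz file name are illustrative assumptions, not part of the IgniteMS tooling.

```python
import gzip
import json
import os
import tempfile


def write_jsonl_gz(texts, path):
    """Write one {"text": ...} object per line, gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines, so each record
            # stays on a single line.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def read_jsonl_gz(path):
    """Read the texts back, in file order."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f]


texts = ["first passage", "second passage\nwith a newline", "troisième"]
path = os.path.join(tempfile.mkdtemp(), "corpus.jsonl.gz")
write_jsonl_gz(texts, path)

# Round-trip check: same texts, same order.
assert read_jsonl_gz(path) == texts
print(f"wrote {len(texts)} records to {path}")
```

Because output row order is preserved, row i of the embeddings file corresponds to line i of this input, so the input file doubles as the ID mapping when writing .npy.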

Layout

crates/ignite-ms/          core engine
crates/ignite-ms-embed/    production CLI (ignite-ms)
crates/ignite-ms-bench/    benchmark CLI (ignite-ms-bench)
native/                    TensorRT C++ bridge
examples/                  library usage
benchmark.py               IgniteMS vs TEI benchmark
quickstart.py              one-command demo

Building from source

cargo build --release -p ignite-ms-embed
cargo build --release -p ignite-ms-bench

Needs CUDA 12+ and TensorRT 11+ headers on the host.

Requirements

Docker mode: NVIDIA GPU, NVIDIA driver, Docker, NVIDIA container runtime. CUDA and TensorRT are included in the image.

Native mode: NVIDIA GPU, CUDA 12+, TensorRT 11+, Rust 1.85+, Python 3.10+.

Security

Report vulnerabilities privately. See SECURITY.md.

Contributing

Contributions require a CLA. See CONTRIBUTING.md.

License

Apache 2.0.

Artain may offer future versions under different terms. Versions released under Apache 2.0 stay Apache 2.0.
