Artain-AI/ignite-ms

IgniteMS

About 256,000 msg/s on 8x A100 (e5-small). Up to 3.6x faster than Hugging Face TEI on the same hardware.

357,893 msg/s sustained in production with workload-specific tuning.

IgniteMS is a batch text embedding engine. Rust, native TensorRT, no Python at runtime. You give it text, it gives you embeddings.

Use it for workloads where millions of texts need embeddings quickly: vector DB reindexing, search rebuilds after model swaps, corpus-scale processing.

Numbers

p4d.24xlarge (8x A100 40GB), 1M MSMARCO passages, TensorRT 11 mixed precision:

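| Model       | GPUs | msg/s   | tok/s      | TEI msg/s | Speedup |
|-------------|------|---------|------------|-----------|---------|
| e5-small-v2 | 1    | 56,002  | 2,860,377  | 16,412    | 3.4x    |
| e5-small-v2 | 8    | 254,979 | 12,988,479 | 88,912    | 2.9x    |
| e5-small    | 1    | 55,958  | 3,178,595  | 15,378    | 3.6x    |
| e5-small    | 8    | 255,958 | 14,539,275 | 76,480    | 3.3x    |
| e5-base     | 1    | 18,626  | 1,058,018  | 8,843     | 2.1x    |
| e5-base     | 8    | 126,614 | 7,192,032  | 57,423    | 2.2x    |
| e5-large    | 1    | 5,861   | 332,982    | 4,029     | 1.5x    |
| e5-large    | 8    | 40,445  | 2,297,994  | 28,664    | 1.4x    |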
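
The Speedup column is the IgniteMS msg/s figure divided by the TEI msg/s figure at the same GPU count. The same rows also imply a 1-to-8 GPU scaling factor that the table does not list. The sketch below derives both from the published numbers; the function names are illustrative and not part of IgniteMS.

```python
# Rows copied from the table above: (model, gpus, IgniteMS msg/s, TEI msg/s).
ROWS = [
    ("e5-small-v2", 1, 56_002, 16_412),
    ("e5-small-v2", 8, 254_979, 88_912),
    ("e5-small", 1, 55_958, 15_378),
    ("e5-small", 8, 255_958, 76_480),
    ("e5-base", 1, 18_626, 8_843),
    ("e5-base", 8, 126_614, 57_423),
    ("e5-large", 1, 5_861, 4_029),
    ("e5-large", 8, 40_445, 28_664),
]


def speedup(ignite_msg_s, tei_msg_s):
    """IgniteMS throughput relative to TEI at the same GPU count."""
    return ignite_msg_s / tei_msg_s


def scaling(rows, model):
    """8-GPU throughput divided by 1-GPU throughput for one model."""
    by_gpus = {gpus: ignite for m, gpus, ignite, _ in rows if m == model}
    return by_gpus[8] / by_gpus[1]


if __name__ == "__main__":
    for model, gpus, ignite, tei in ROWS:
        print(f"{model:12s} {gpus} GPU  vs TEI: {speedup(ignite, tei):.1f}x")
    for model in ("e5-small-v2", "e5-small", "e5-base", "e5-large"):
        print(f"{model:12s} 1 -> 8 GPU scaling: {scaling(ROWS, model):.2f}x")
```

On these rows the two small models scale by roughly 4.6x from 1 to 8 GPUs, while e5-base and e5-large scale by roughly 6.8x to 6.9x.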

Baselines (1 GPU, e5-small-v2)

| Tool                 | msg/s  | Relative |
|----------------------|--------|----------|
| IgniteMS             | 56,002 | 1.0x     |
| TEI                  | 16,412 | 0.29x    |
| Fastembed (ORT+CUDA) | 8,907  | 0.16x    |
| SentenceTransformers | 2,468  | 0.04x    |

60 models are supported out of the box: E5, BGE, GTE, MiniLM, MPNet, Nomic, Jina, mxbai, Snowflake Arctic, LaBSE, stella, plus language-specific models for Chinese, French, Russian, Korean, and Indonesian, and domain models for scientific/biomedical text. Both encoder (BERT-style) and decoder (LLM-based) architectures are supported, with mean-pool or last-token pooling.

IgniteMS also works with any Hugging Face model that exports to ONNX and compiles to TensorRT. Models are downloaded and compiled on first run. See MODELS.md for the full list with verified throughput and correctness results.

Production run

Real production pipeline, not a controlled benchmark:

| Metric               | Value             | Note                               |
|----------------------|-------------------|------------------------------------|
| Messages embedded    | 685,520,494       |                                    |
| Sustained throughput | 357,893 msg/s     | average across full run            |
| Peak throughput      | 506,589 msg/s     | short text, GPUs saturated         |
| Low throughput       | 196,676 msg/s     | dense/long text files, reader-bound |
| Wall clock           | 1,915s (31.9 min) |                                    |
| Hardware             | 1x p4d.24xlarge   | 8x A100 40GB, spot                 |

Full pipeline: read zstd-compressed social media events (Reddit, Hacker News), extract and normalize text, tokenize, infer on 8 GPUs, write aggregated parquet output. Not a GPU microbenchmark.

For cost context: at ~$12.68/hr p4d spot pricing, this production run cost about $0.01 per 1M messages embedded. On the same 68-token/message dataset, OpenAI text-embedding-3-small would be about $1.36 per 1M messages at current API pricing.
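
The arithmetic behind those two figures can be reproduced as a quick check. This is a sketch using only the numbers quoted above. The one outside assumption is the $0.02 per 1M tokens price for text-embedding-3-small, which is the rate implied by the $1.36 figure and may have changed.

```python
SPOT_USD_PER_HOUR = 12.68     # p4d spot price quoted above
WALL_CLOCK_S = 1_915          # production run wall clock
MESSAGES = 685_520_494        # messages embedded in that run
TOKENS_PER_MESSAGE = 68       # dataset average quoted above

# Assumption: API list price implied by the $1.36 figure; verify before use.
OPENAI_USD_PER_M_TOKENS = 0.02


def self_hosted_usd_per_m_messages():
    """Spot cost of the whole run, divided by millions of messages embedded."""
    run_cost = SPOT_USD_PER_HOUR * WALL_CLOCK_S / 3600
    return run_cost / (MESSAGES / 1_000_000)


def api_usd_per_m_messages():
    """1M messages at 68 tokens each is 68M tokens, billed per 1M tokens."""
    return TOKENS_PER_MESSAGE * OPENAI_USD_PER_M_TOKENS


if __name__ == "__main__":
    print(f"self-hosted: ${self_hosted_usd_per_m_messages():.4f} per 1M messages")
    print(f"API:         ${api_usd_per_m_messages():.2f} per 1M messages")
```

The self-hosted figure covers GPU instance time for this run only; storage, data transfer, and engineering time are outside the calculation.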

Why it's fast

No single trick. Just removing waste everywhere:

  • TensorRT compiles kernels specific to the GPU architecture and batch shape. Not generic ONNX or PyTorch.
  • Bucketed batching groups texts by token length so you're not padding a 6-token string to 512.
  • CPU-side pipeline keeps tokenization, batching, and GPU dispatch moving together without waiting on each other.
  • Rust end-to-end. No GIL, no Python request path, no HTTP serialization at runtime.
  • Multi-GPU in one process. Lock-free work stealing across GPUs. Most serving stacks run one container per GPU and glue them together with HTTP. IgniteMS doesn't.
  • Engine caching. TRT engines compile once and get reused until something actually changes (model, runtime version, or batch profile).
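
To make the bucketed-batching point concrete, here is a minimal sketch of grouping texts by token length and comparing the padded token count against padding everything to 512. The bucket boundaries and function names are hypothetical. IgniteMS's actual batch profiles and Rust implementation are not shown in this README.

```python
from collections import defaultdict

# Hypothetical bucket boundaries (max tokens per bucket), for illustration only.
BUCKETS = (16, 32, 64, 128, 256, 512)


def bucket_for(n_tokens, buckets=BUCKETS):
    """Smallest bucket that fits n_tokens; longer inputs fall into the last one."""
    for limit in buckets:
        if n_tokens <= limit:
            return limit
    return buckets[-1]


def bucketize(token_lengths, buckets=BUCKETS):
    """Group item indices by bucket so each batch pads only to its bucket size."""
    groups = defaultdict(list)
    for idx, n in enumerate(token_lengths):
        groups[bucket_for(n, buckets)].append(idx)
    return dict(groups)


def padded_tokens(token_lengths, buckets=BUCKETS):
    """Padded token totals: (with bucketing, padding everything to the max)."""
    bucketed = sum(bucket_for(n, buckets) for n in token_lengths)
    naive = len(token_lengths) * buckets[-1]
    return bucketed, naive


if __name__ == "__main__":
    lengths = [6, 12, 40, 300, 7, 64]
    print(bucketize(lengths))      # indices grouped by bucket size
    print(padded_tokens(lengths))  # (bucketed total, pad-to-512 total)
```

For the six example lengths, bucketing pads to 688 tokens in total instead of 3,072, so the GPU spends far less work on padding.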

Quickstart

Docker (just needs Docker + NVIDIA runtime):

python3 quickstart.py

Native (needs Rust, CUDA 12+, TensorRT 11+):

python3 quickstart.py --native

Both commands download a public dataset, embed it, and write the output. The first run takes ~5 minutes for TensorRT engine compilation. Later runs reuse the cached engines and skip compilation.
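
One way to picture the engine cache: "Why it's fast" says engines are reused until the model, runtime version, or batch profile changes, so a cache key can be derived from exactly those inputs. The sketch below is an assumption about how such a key could be built, not IgniteMS's actual scheme. The function names, file layout, and profile fields are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def engine_cache_key(model_id, runtime_version, batch_profile):
    """Stable hash over the inputs that should invalidate a compiled engine."""
    payload = json.dumps(
        {"model": model_id, "runtime": runtime_version, "profile": batch_profile},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def engine_path(cache_dir, model_id, runtime_version, batch_profile):
    """Hypothetical on-disk layout: one .engine file per cache key."""
    key = engine_cache_key(model_id, runtime_version, batch_profile)
    return Path(cache_dir) / f"{key}.engine"


if __name__ == "__main__":
    profile = {"max_batch": 256, "max_seq_len": 128}  # illustrative values
    path = engine_path("/cache", "intfloat/e5-small-v2", "11.0", profile)
    print(path, "hit" if path.exists() else "miss: compile and store")
```

Because TensorRT engines are built for a specific GPU architecture, a real key would likely include the GPU type as well.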

Docker

docker run --rm --gpus all \
  -v "$PWD/data:/data" \
  -v ignite-ms-cache:/cache \
  ghcr.io/artain-ai/ignite-ms:v1.1.0 \
  embed \
  --model intfloat/e5-small-v2 \
  --input /data/input.jsonl \
  --output /data/embeddings.npy \
  --cache-dir /cache \
  --gpus all

Use the versioned image for reproducible deployments. The current v1.1.0 release targets TensorRT 11 mixed-precision engines. The latest tag may move and is intended for quick experiments, not production pinning.

Docker hosts need an NVIDIA driver, Docker, and the NVIDIA container runtime. They do not need the CUDA toolkit or TensorRT installed on the host. The image has the production CLI (ignite-ms), benchmark CLI (ignite-ms-bench), and all dependencies for model prep.

Benchmark

Reproduce the numbers:

python3 benchmark.py                                          # Docker, defaults
python3 benchmark.py --mode native --model e5-small-v2        # native
python3 benchmark.py --gpu-counts 1,8 --skip-tei              # IgniteMS only

Downloads data, prepares models, runs both IgniteMS and TEI, reports results. See BENCHMARKING.md for full results, methodology, and caveats.

The benchmark reports messages/sec plus token-oriented metrics such as tokens/sec, padded tokens/sec, average sequence length, batch fill, and estimated TFLOP/s. Messages/sec is useful for corpus throughput; token metrics are better for comparing runs with different text lengths.
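
As a rough illustration of how these metrics relate, the sketch below computes them from per-batch token counts. The definitions are assumptions, in particular batch fill as real tokens divided by padded tokens. BENCHMARKING.md is the authority on how ignite-ms-bench actually defines them.

```python
def throughput_metrics(seq_lens_per_batch, padded_len_per_batch, elapsed_s):
    """Derive message- and token-level throughput from batch shapes.

    seq_lens_per_batch: list of lists, real token count per message in a batch
    padded_len_per_batch: padded sequence length used for each batch
    elapsed_s: wall-clock seconds for the whole run
    """
    messages = sum(len(batch) for batch in seq_lens_per_batch)
    real_tokens = sum(sum(batch) for batch in seq_lens_per_batch)
    padded_tokens = sum(
        len(batch) * padded
        for batch, padded in zip(seq_lens_per_batch, padded_len_per_batch)
    )
    return {
        "msg_per_s": messages / elapsed_s,
        "tok_per_s": real_tokens / elapsed_s,
        "padded_tok_per_s": padded_tokens / elapsed_s,
        "avg_seq_len": real_tokens / messages,
        # Assumed definition: share of computed tokens that are not padding.
        "batch_fill": real_tokens / padded_tokens,
    }


if __name__ == "__main__":
    # Two batches: three short texts padded to 16, two longer ones padded to 64.
    metrics = throughput_metrics([[6, 12, 7], [40, 64]], [16, 64], elapsed_s=0.5)
    for name, value in metrics.items():
        print(f"{name}: {value:.2f}")
```

If tok/s in the Numbers table counts real tokens, dividing it by msg/s gives the average sequence length, about 51 tokens for e5-small-v2 on 1 GPU.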

Input / Output

Input: JSONL (one {"text": "..."} object per line) or plain text (one text per line). Handles .zst and .gz compression.

Output: .npy (NumPy array) or .parquet (with IDs). Row order preserved.

ignite-ms embed \
  --model intfloat/e5-small-v2 \
  --input corpus.jsonl.zst \
  --output embeddings.npy \
  --gpus all
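
For preparing input, a gzip-compressed JSONL file in the shape described above can be written with the Python standard library alone. This is a minimal sketch; the helper names and the file name are illustrative, not part of IgniteMS.

```python
import gzip
import json


def write_jsonl_gz(texts, path):
    """Write one {"text": ...} object per line, gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def read_jsonl_gz(path):
    """Read the texts back in file order, e.g. to pair them with output rows."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]


# Example: write_jsonl_gz(["first passage", "second passage"], "corpus.jsonl.gz")
```

Since row order is preserved, row i of the output corresponds to line i of the input file. An .npy output can then be loaded with numpy.load and indexed by input line number.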

Layout

crates/ignite-ms/          core engine
crates/ignite-ms-embed/    production CLI (ignite-ms)
crates/ignite-ms-bench/    benchmark CLI (ignite-ms-bench)
native/                    TensorRT C++ bridge
examples/                  library usage
benchmark.py               IgniteMS vs TEI benchmark
quickstart.py              one-command demo

Building from source

cargo build --release -p ignite-ms-embed
cargo build --release -p ignite-ms-bench

Needs CUDA 12+ and TensorRT 11+ headers on the host.

Requirements

Docker mode: NVIDIA GPU, NVIDIA driver, Docker, NVIDIA container runtime. CUDA and TensorRT are included in the image.

Native mode: NVIDIA GPU, CUDA 12+, TensorRT 11+, Rust 1.85+, Python 3.10+.

Security

Report vulnerabilities privately. See SECURITY.md.

Contributing

Contributions require a CLA. See CONTRIBUTING.md.

License

Apache 2.0.

Artain may offer future versions under different terms. Versions released under Apache 2.0 stay Apache 2.0.