ARCQuant is a high-performance quantization framework for low-bit LLMs that improves accuracy under fine-grained formats such as NVFP4, while preserving a unified and efficient inference pipeline.
While fine-grained quantization formats such as NVFP4 effectively isolate quantization noise, activation outliers can still cause severe accuracy degradation in critical channels. Traditional mixed-precision methods address this by splitting computations into separate branches, which introduces additional kernel launch overhead and memory fragmentation.
ARCQuant takes a different approach. Instead of treating outliers as a separate computation path, we leverage the structural sparsity of quantization errors in fine-grained settings. We capture the quantization residuals of critical channels and fuse them back into the computation as Augmented Residual Channels (ARC).
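As a rough illustration of the idea (not the repository's implementation: `fake_quant` below is a generic group-wise stand-in for NVFP4, and `arc_linear` and the channel indices are hypothetical names), the residual channels of critical inputs can be re-quantized and fused into a single augmented GEMM instead of a separate high-precision branch:

```python
import numpy as np

def fake_quant(w, bits=4, group=16):
    """Per-group symmetric fake quantization (illustrative stand-in for NVFP4)."""
    orig = w.shape
    g = w.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    scale = np.maximum(scale, 1e-8)
    q = np.clip(np.round(g / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (q * scale).reshape(orig)

def arc_linear(x, w, idx):
    """y = x @ w.T with quantized weights plus Augmented Residual Channels.

    idx: indices of critical input channels whose quantization residual is
    itself quantized and appended as extra columns of the SAME low-bit GEMM.
    """
    w_q = fake_quant(w)
    resid_q = fake_quant(w[:, idx] - w_q[:, idx])   # low-bit residual channels
    x_aug = np.concatenate([x, x[:, idx]], axis=1)  # duplicate critical inputs
    w_aug = np.concatenate([w_q, resid_q], axis=1)  # augmented weight matrix
    return x_aug @ w_aug.T
```

Because the residual channels ride along in the same GEMM, no extra kernel launch or mixed-precision branch is needed.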
- [2026/04] 🏆 ARCQuant has been accepted to ACL 2026 Main Conference!
- [2026/03] 🔥 ARCQuant has been integrated into NVIDIA TensorRT-LLM, with contributions from Tracin!
- [2026/01] 🔥 ARCQuant is publicly available on arXiv! Check our paper here.
- Release code for reproducing results.
- Release CUDA kernels for NVFP4.
- Support vLLM integration.
- Model Support: Add support for more model families and architectures:
  - Qwen3
  - Mixtral
  - Wan2.2
```bash
conda create -n arcquant python=3.10 -y
conda activate arcquant
```

Please make sure that CUDA 12.8 is available in your environment.

```bash
git clone --recurse-submodules https://github.com/actypedef/ARCQuant.git
cd ARCQuant
pip install -r requirements.txt
```

```bash
sudo apt-get update
sudo apt-get install python3-dev
conda install pybind11
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```

```bash
cd kernels/
bash remake.sh
```

This process may take a few minutes.
Precomputed `reorder_indices` and `select_num` are required for quantization:

```bash
python reorder_indices.py --model /PATH/TO/YOUR/MODEL/ --samples 128 --seqlen 2048 --act_sort_metric max
```

The generated files will be saved to `./saved/`.
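Conceptually, the `max` activation-sort metric ranks channels by their largest absolute activation over the calibration samples, so outlier channels come first. A hypothetical sketch (the function name and the script's actual logic are assumptions, not the repository's `reorder_indices.py`):

```python
import numpy as np

def reorder_by_max_activation(acts, select_num):
    """acts: [num_tokens, num_channels] calibration activations.

    Returns a channel permutation (outlier channels first) and the indices
    of the `select_num` channels treated as critical.
    """
    metric = np.abs(acts).max(axis=0)   # per-channel max-abs (the "max" metric)
    order = np.argsort(-metric)         # descending: largest outliers first
    return order, order[:select_num]
```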
ARCQuant supports multiple formats, including NVFP4, MXFP4, HiF4, and INT4. You can modify the `quant_type` parameter as needed.
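For intuition, NVFP4 stores values on the FP4 (E2M1) grid with a shared scale per 16-element block. A minimal round-trip sketch, assuming a plain max-abs block scale (the real format quantizes the scale itself to FP8 E4M3 and uses hardware rounding):

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_block_roundtrip(x):
    """Quantize one block (e.g. 16 values) to the E2M1 grid and back."""
    x = np.asarray(x, dtype=np.float64)
    scale = np.abs(x).max() / 6.0       # map the block max to the largest FP4 value
    if scale == 0.0:
        return x.copy()
    n = x / scale
    # nearest representable magnitude, sign preserved
    idx = np.argmin(np.abs(np.abs(n)[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(n) * FP4_GRID[idx] * scale
```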
```bash
bash evaluate.sh /PATH/TO/YOUR/MODEL/
```

FlashInfer:

```bash
cd third-party/flashinfer
python -m pip install -v .
```

vLLM-based efficiency evaluation scripts will be released in a future update.
```bibtex
@article{meng2026arcquant,
  title={ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs},
  author={Meng, Haoqian and Luo, Yilun and Zhao, Yafei and Liu, Wenyuan and Zhang, Peng and Ma, Xindian},
  journal={arXiv preprint arXiv:2601.07475},
  year={2026}
}
```
This project builds on several excellent open-source efforts. We sincerely thank the community for their contributions:
