ARCQuant is a high-performance quantization framework for low-bit LLMs that improves accuracy under fine-grained formats such as NVFP4, while preserving a unified and efficient inference pipeline.
While fine-grained quantization formats such as NVFP4 effectively isolate quantization noise, activation outliers can still cause severe accuracy degradation in critical channels. Traditional mixed-precision methods address this by splitting computations into separate branches, which introduces additional kernel launch overhead and memory fragmentation.
ARCQuant takes a different approach. Instead of treating outliers as a separate computation path, we leverage the structural sparsity of quantization errors in fine-grained settings. We capture the quantization residuals of critical channels and fuse them back into the computation as Augmented Residual Channels (ARC).
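As a rough illustration of the idea (not the repository's implementation: `fake_quant` below is a generic group-wise stand-in for NVFP4, and `arc_linear` and the channel indices are hypothetical names), the residual channels of critical inputs can be re-quantized and fused into a single augmented GEMM instead of a separate high-precision branch:

```python
import numpy as np

def fake_quant(w, bits=4, group=16):
    """Per-group symmetric fake quantization (illustrative stand-in for NVFP4)."""
    orig = w.shape
    g = w.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    scale = np.maximum(scale, 1e-8)
    q = np.clip(np.round(g / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (q * scale).reshape(orig)

def arc_linear(x, w, idx):
    """y = x @ w.T with quantized weights plus Augmented Residual Channels.

    idx: indices of critical input channels whose quantization residual is
    itself quantized and appended as extra columns of the SAME low-bit GEMM.
    """
    w_q = fake_quant(w)
    resid_q = fake_quant(w[:, idx] - w_q[:, idx])   # low-bit residual channels
    x_aug = np.concatenate([x, x[:, idx]], axis=1)  # duplicate critical inputs
    w_aug = np.concatenate([w_q, resid_q], axis=1)  # augmented weight matrix
    return x_aug @ w_aug.T
```

Because the residual channels ride along in the same GEMM, no extra kernel launch or mixed-precision branch is needed.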
- [2026/04] 🏆 ARCQuant has been accepted to ACL 2026 Main Conference!
- [2026/03] 🔥 ARCQuant has been integrated into NVIDIA TensorRT-LLM, with contributions from Tracin!
- [2026/01] 🔥 ARCQuant is publicly available on arXiv! Check our paper here.
- Release code for reproducing results.
- Release CUDA kernels for NVFP4.
- Support vLLM integration.
- Model Support: Add support for more model families and architectures:
  - Qwen3
  - Mixtral
  - Wan2.2
```bash
conda create -n arcquant python=3.10 -y
conda activate arcquant
```

Please make sure that CUDA 12.8 is available in your environment.

```bash
git clone --recurse-submodules https://github.com/actypedef/ARCQuant.git
cd ARCQuant
pip install -r requirements.txt
```

```bash
sudo apt-get update
sudo apt-get install python3-dev
conda install pybind11
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```

```bash
cd kernels/
bash remake.sh
```

This process may take a few minutes.
Precomputed `reorder_indices` and `select_num` are required for quantization:

```bash
python reorder_indices.py --model /PATH/TO/YOUR/MODEL/ --samples 128 --seqlen 2048 --act_sort_metric max
```

The generated files will be saved to `./saved/`.
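Conceptually, the `max` activation-sort metric ranks channels by their largest absolute activation over the calibration samples, so outlier channels come first. A hypothetical sketch (the function name and the script's actual logic are assumptions, not the repository's `reorder_indices.py`):

```python
import numpy as np

def reorder_by_max_activation(acts, select_num):
    """acts: [num_tokens, num_channels] calibration activations.

    Returns a channel permutation (outlier channels first) and the indices
    of the `select_num` channels treated as critical.
    """
    metric = np.abs(acts).max(axis=0)   # per-channel max-abs (the "max" metric)
    order = np.argsort(-metric)         # descending: largest outliers first
    return order, order[:select_num]
```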
ARCQuant supports multiple formats, including NVFP4, MXFP4, HiF4, and INT4. You can modify the `quant_type` parameter as needed.
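For intuition, NVFP4 stores values on the FP4 (E2M1) grid with a shared scale per 16-element block. A minimal round-trip sketch, assuming a plain max-abs block scale (the real format quantizes the scale itself to FP8 E4M3 and uses hardware rounding):

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_block_roundtrip(x):
    """Quantize one block (e.g. 16 values) to the E2M1 grid and back."""
    x = np.asarray(x, dtype=np.float64)
    scale = np.abs(x).max() / 6.0       # map the block max to the largest FP4 value
    if scale == 0.0:
        return x.copy()
    n = x / scale
    # nearest representable magnitude, sign preserved
    idx = np.argmin(np.abs(np.abs(n)[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(n) * FP4_GRID[idx] * scale
```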
```bash
bash evaluate.sh /PATH/TO/YOUR/MODEL/
```

FlashInfer:

```bash
cd third-party/flashinfer
python -m pip install -v .
```

vLLM-based efficiency evaluation scripts will be released in a future update.
```bibtex
@article{meng2026arcquant,
  title={ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs},
  author={Meng, Haoqian and Luo, Yilun and Zhao, Yafei and Liu, Wenyuan and Zhang, Peng and Ma, Xindian},
  journal={arXiv preprint arXiv:2601.07475},
  year={2026}
}
```
This project builds on several excellent open-source efforts. We sincerely thank the community for their contributions:
