# RSR: Redundant Segment Reduction for efficient low-bit inference (matrix-vector multiplication)
This repository contains the core kernels, model integrations, and benchmarking code for RSR across CPU and CUDA backends. RSR accelerates matrix-vector multiplication with a low-bit quantized matrix: it groups repeated column patterns, aggregates the corresponding input values once per pattern, and scatters each pattern's contribution to the affected output rows.

This is especially useful for workloads such as low-bit LLM inference, where decoding repeatedly applies quantized matvec operations. For the original algorithm, see UIC-InDeXLab/RSR and docs/ALGORITHM.md.
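To make the reduction concrete, here is a minimal NumPy sketch of the idea (an illustration only, not the repository's optimized kernels): within each block of `k` output rows, columns of the ternary matrix that share the same `k`-entry pattern are collapsed into a single aggregated input value, and each distinct pattern is applied just once.

```python
import numpy as np

def rsr_matvec_reference(W, x, k=4):
    """Reference RSR-style matvec for a ternary W (entries in {-1, 0, 1}).

    Rows are processed in blocks of height k. Within a block, columns that
    share the same k-entry ternary pattern are reduced by summing their x
    values once; each unique pattern is then applied a single time and the
    partial result is scattered to the block's output rows.
    """
    m, n = W.shape
    y = np.zeros(m, dtype=x.dtype)
    for r0 in range(0, m, k):
        block = W[r0:r0 + k, :]                  # up to k rows of ternary weights
        # Encode each column's pattern as one base-3 integer key.
        keys = np.zeros(n, dtype=np.int64)
        for row in block:
            keys = keys * 3 + (row + 1)          # map {-1, 0, 1} -> {0, 1, 2}
        # Aggregate x over columns sharing a pattern (the "reduction").
        uniq, inv = np.unique(keys, return_inverse=True)
        sums = np.zeros(len(uniq), dtype=x.dtype)
        np.add.at(sums, inv, x)
        # Apply each unique pattern once, scattering into the output rows.
        for key, s in zip(uniq, sums):
            col = int(np.argmax(keys == key))    # any column with this pattern
            y[r0:r0 + block.shape[0]] += block[:, col] * s
    return y
```

Once `3^k` is smaller than the number of columns, repeated patterns are guaranteed by pigeonhole, so the inner work drops from one multiply-add per matrix entry to one per unique pattern per block.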
## Demo

Inference on CPU for a 1.58-bit LLM decoding step. Click the image to view the original high-quality video. HF denotes the Hugging Face baseline running bfloat16 on PyTorch.

**Prompt:** "Write the numbers from one to sixty in words separated by commas only:"
## Installation

Prerequisites: Python >= 3.10, a C compiler for the CPU kernels, and optionally a CUDA toolchain for GPU support.

```bash
git clone https://github.com/UIC-InDeXLab/RSR-Core.git
cd RSR-Core
pip install -e .
```

## Preprocess a model

Run `integrations/hf/model_prep.py` once per model to preprocess the ternary weights and save the RSR metadata needed for inference.
```bash
python -m integrations.hf.model_prep \
    --model microsoft/bitnet-b1.58-2B-4T \
    --output ./preprocessed_model \
    --device cpu \
    --trust-remote-code \
    --best-k-json benchmarking/bit_1_58/reports/best_k_cpu.json
```

CLI args for `integrations/hf/model_prep.py`:
| Flag | Description |
|---|---|
| `--model`, `-m` | HuggingFace model ID or local path (required) |
| `--output`, `-o` | Output directory for the preprocessed model (required) |
| `--k` | Block height for the RSR decomposition (default: from `best_k_{device}.json`) |
| `--version` | RSR multiplier version to use (default: adaptive) |
| `--device` | Device for model loading: `cpu` or `cuda` (default: `cpu`) |
| `--trust-remote-code` | Allow remote code when loading HuggingFace models |
| `--best-k-json` | Optional path to a per-layer best-k JSON file (default: `benchmarking/bit_1_58/reports/best_k_{device}.json`) |
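For example, to preprocess the same model for CUDA with a fixed block height instead of the per-layer JSON (the `_cuda` suffix keeps `model_infer`'s device auto-detection working; `--k 8` is only an illustrative value):

```bash
python -m integrations.hf.model_prep \
    --model microsoft/bitnet-b1.58-2B-4T \
    --output ./preprocessed_model_cuda \
    --device cuda \
    --trust-remote-code \
    --k 8
```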
## Run inference

Use `integrations/hf/model_infer.py` to run generation from a preprocessed model directory. The default backend is `rsr`.
```bash
python -m integrations.hf.model_infer \
    --model-dir ./preprocessed_model \
    --backend rsr \
    --device cpu \
    --prompt "Write the numbers from one to ten in words." \
    --max-new-tokens 64 \
    --stream
```

CLI args for `integrations/hf/model_infer.py`:
| Flag | Description |
|---|---|
| `--model-dir` | Directory with `rsr_config.json` and safetensors artifacts (default: `integrations/hf`) |
| `--backend` | Inference backend: `rsr` or `hf` (default: `rsr`) |
| `--tokenizer` | Optional tokenizer source (default: `model_name` from `rsr_config.json`) |
| `--device` | Target device; auto-detected from the model-dir suffix (`_cpu` / `_cuda`) if omitted |
| `--dtype` | Optional dtype cast: `float32`, `float16`, or `bfloat16` |
| `--prompt` | Prompt text to generate from (required) |
| `--max-new-tokens` | Maximum number of tokens to generate (default: 64) |
| `--no-chat-template` | Tokenize the raw prompt directly |
| `--stream` | Stream decoded output as tokens are generated |
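For a quick side-by-side against the Hugging Face baseline, the same script can be rerun with `--backend hf` and an optional `--dtype` cast:

```bash
python -m integrations.hf.model_infer \
    --model-dir ./preprocessed_model \
    --backend hf \
    --dtype bfloat16 \
    --device cpu \
    --prompt "Write the numbers from one to ten in words." \
    --max-new-tokens 64
```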
## Benchmarking

Use the scripts under `benchmarking/` to reproduce the local numbers for the kernel-level matvec benchmarks and end-to-end LLM inference.
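All of these scripts share the measurement scheme exposed by the `--warmup`/`--repeats` flags: a few untimed warmup iterations, then timed repeats. A generic sketch of that pattern (not the repository's actual harness):

```python
import time

def bench(fn, warmup=10, repeats=30):
    """Run `warmup` untimed iterations, then `repeats` timed ones, and
    report the minimum wall-clock time. A generic warmup/repeats sketch,
    not the benchmarking scripts' own code."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)
```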
### Find the best k for ternary RSR
```bash
python -m benchmarking.bit_1_58.bench_best_k \
    --device cpu \
    --shapes 2560x2560 4096x14336 \
    --k-values 2 4 6 8 10 12 \
    --warmup 10 \
    --repeats 30
```

CLI args for `benchmarking/bit_1_58/bench_best_k.py`:
| Flag | Description |
|---|---|
| `--device` | Target device: `cpu` or `cuda` (required) |
| `--shapes` | Optional list of matrix shapes in `NxM` format (default: all known preprocessed model shapes) |
| `--k-values` | Optional list of k values to test (default: `2 4 6 8 10 12`) |
| `--warmup` | Warmup iterations before timing (default: 10) |
| `--repeats` | Timed iterations per shape/k (default: 30) |
This writes `benchmarking/bit_1_58/reports/best_k_{device}.csv` and `benchmarking/bit_1_58/reports/best_k_{device}.json`.
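`model_prep` consumes this JSON through `--best-k-json`. The schema is whatever `bench_best_k` emits; purely as a hypothetical illustration of the "shape to fastest k" mapping, it could resemble:

```python
import json

# Hypothetical best_k JSON contents -- the real schema is defined by
# bench_best_k; the shapes and k values below are only examples.
hypothetical = {"2560x2560": 8, "4096x14336": 10}
print(json.dumps(hypothetical, indent=2))
```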
### Benchmark matrix-vector multiplication
The shape benchmark scripts take no CLI arguments. Configure them by editing the constants at the top of each script: `SHAPES`, `K_VALUES`, `METHODS`, `REPEATS`, and `WARMUP`, as in the sketch below.
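A sketch of what that configuration might look like (the constant names come from the scripts; the values and method identifiers are only illustrative):

```python
# Illustrative constants for benchmarking/bit_1_58/bench_shapes_cpu.py.
SHAPES = [(2560, 2560), (4096, 14336)]  # (rows, cols) pairs to benchmark
K_VALUES = [2, 4, 6, 8, 10, 12]         # RSR block heights to sweep
METHODS = ["rsr", "naive"]              # hypothetical method identifiers
REPEATS = 30                            # timed iterations per configuration
WARMUP = 10                             # untimed warmup iterations
```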
```bash
python benchmarking/bit_1/bench_shapes_cpu.py
python benchmarking/bit_1/bench_shapes_cuda.py
python benchmarking/bit_1_58/bench_shapes_cpu.py
python benchmarking/bit_1_58/bench_shapes_cuda.py
```

Reports are written to `benchmarking/bit_1/reports/results_shapes_{device}.csv` and `benchmarking/bit_1_58/reports/results_shapes_{device}.csv`.
### Benchmark end-to-end LLM inference

Pass either a single preprocessed model directory or a parent directory that contains multiple `*_cpu` or `*_cuda` model directories.
```bash
python -m benchmarking.llms.bench_inference \
    --model-dir integrations/hf/preprocessed \
    --device cpu \
    --prompt "Write the numbers from one to two hundred in words separated by commas only:" \
    --max-new-tokens 64 \
    --warmup 1 \
    --repeats 3 \
    --backends rsr hf_float32 hf_bfloat16
```

CLI args for `benchmarking/llms/bench_inference.py`:
| Flag | Description |
|---|---|
| `--model-dir` | Single preprocessed model directory, or a parent directory containing multiple preprocessed models (required) |
| `--prompt` | Prompt text to generate from (default: "Write the numbers from one to two hundred in words separated by commas only:") |
| `--max-new-tokens` | Maximum number of generated tokens (default: 64) |
| `--warmup` | Warmup generations before timing (default: 1) |
| `--repeats` | Timed generations per backend/model (default: 3) |
| `--no-chat-template` | Tokenize the raw prompt directly |
| `--device` | Target device and model suffix filter: `cpu` or `cuda` (required) |
| `--backends` | Optional backend list: `rsr`, `hf_float32`, `hf_bfloat16`, `hf_float16` (default: `rsr` plus the standard HF dtypes for the device) |
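For example, to benchmark every preprocessed `*_cuda` model under a parent directory against just the bfloat16 baseline:

```bash
python -m benchmarking.llms.bench_inference \
    --model-dir integrations/hf/preprocessed \
    --device cuda \
    --backends rsr hf_bfloat16
```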
## Results

*(Figures: 1-bit and 1.58-bit benchmark plots.)*
Speedup is computed against the HuggingFace bfloat16 baseline for the same model.

**CPU**

| Model | HF Tok/s | RSR Tok/s | Speedup |
|---|---|---|---|
| Falcon3-10B-Instruct-1.58bit | 0.2 | 11.3 | 62.0x |
| Llama3-8B-1.58-100B-tokens | 0.2 | 13.4 | 53.8x |
| bitnet-b1.58-2B-4T-bf16 | 2.1 | 28.8 | 13.9x |
| bitnet-b1.58-2B-4T | 14.2 | 29.3 | 2.1x |

**CUDA**

| Model | HF Tok/s | RSR Tok/s | Speedup |
|---|---|---|---|
| Falcon3-10B-Instruct-1.58bit | 25.2 | 47.4 | 1.9x |
| Llama3-8B-1.58-100B-tokens | 31.9 | 59.3 | 1.9x |
| bitnet-b1.58-2B-4T-bf16 | 33.1 | 57.4 | 1.7x |
| bitnet-b1.58-2B-4T | 41.6 | 57.1 | 1.4x |
## News

- [03/25/2026] Added support for the HuggingFace model interface.
## Repository structure

```
RSR-Core/
├── multiplier/          # Python wrappers for kernels
│   ├── bit_1/           # 1-bit (binary) multipliers (CPU/CUDA)
│   └── bit_1_58/        # 1.58-bit (ternary) multipliers (CPU/CUDA)
├── kernels/             # Low-level C/CUDA kernel source
│   ├── bit_1/
│   │   ├── cpu/         # C kernels
│   │   └── cuda/        # CUDA kernels (.cu)
│   └── bit_1_58/
│       ├── cpu/         # C kernels
│       └── cuda/        # CUDA kernels (.cu)
├── integrations/        # Model integrations
│   └── hf/              # HuggingFace integration
├── benchmarking/        # Benchmarking scripts & results
└── tests/               # Unit and integration tests
```