RSR-core

RSR (Redundant Segment Reduction): efficient matrix-vector multiplication for low-bit inference.

This repository contains the core kernels, model integrations, and benchmarking code for RSR across CPU and CUDA backends. RSR targets fast matrix-vector multiplication when the matrix is low-bit quantized by grouping repeated column patterns, aggregating the corresponding input values once, and then scattering the result to the affected output rows.

This is especially useful for workloads such as low-bit LLM inference, where decoding repeatedly applies quantized matvec operations. For the original algorithm, see UIC-InDeXLab/RSR and docs/ALGORITHM.md.
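The grouping idea described above can be sketched in plain NumPy. This is an illustrative, unoptimized version (the function name and per-block decomposition are assumptions for illustration, not the repository's kernels): for one block of k matrix rows, columns that share the same k-entry ternary pattern are grouped, their input values are summed once, and each distinct pattern is scattered to the block's k outputs. Since there are at most 3**k distinct patterns, the scatter step touches far fewer items than the columns themselves.

```python
import numpy as np

def rsr_matvec_block(A_block, x):
    """Illustrative RSR-style matvec for one k-row block.

    A_block: (k, n) ternary matrix with entries in {-1, 0, 1}.
    x:       (n,) input vector.
    """
    k, n = A_block.shape
    agg = {}  # column pattern -> aggregated input value
    for j in range(n):
        pattern = tuple(int(v) for v in A_block[:, j])
        agg[pattern] = agg.get(pattern, 0.0) + float(x[j])
    y = np.zeros(k)
    for pattern, s in agg.items():
        y += np.array(pattern, dtype=np.float64) * s  # scatter once per pattern
    return y
```

The optimized kernels in kernels/ implement this reduction in C and CUDA; the sketch only shows why aggregating repeated column patterns saves work.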

Demo 🎬

Inference on CPU for a 1.58-bit LLM decoding step. Click the image to view the original high-quality video. HF denotes the Hugging Face baseline running bfloat16 on PyTorch.

PROMPT: "Write the numbers from one to sixty in words separated by commas only:"

RSR vs Baseline

Usage 🛠️

Installation 📦

Prerequisites: Python >= 3.10, a C compiler for CPU kernels, and optionally CUDA for GPU support.

git clone https://github.com/UIC-InDeXLab/RSR-Core.git
cd RSR-Core
pip install -e .

Prepare a model (once) 🧱

Run integrations/hf/model_prep.py once per model to preprocess the ternary weights and save the RSR metadata needed for inference.

python -m integrations.hf.model_prep \
  --model microsoft/bitnet-b1.58-2B-4T \
  --output ./preprocessed_model \
  --device cpu \
  --trust-remote-code \
  --best-k-json benchmarking/bit_1_58/reports/best_k_cpu.json
CLI args for integrations/hf/model_prep.py:
  --model, -m           HuggingFace model ID or local path (required)
  --output, -o          Output directory for the preprocessed model (required)
  --k                   Block height for RSR decomposition (default: from best_k_{device}.json)
  --version             RSR multiplier version to use (default: adaptive)
  --device              Device for model loading: cpu or cuda (default: cpu)
  --trust-remote-code   Allow remote code when loading HuggingFace models
  --best-k-json         Optional path to a per-layer best-k JSON file
                        Default:
                        benchmarking/bit_1_58/reports/best_k_{device}.json

Run model inference 🤖

Use integrations/hf/model_infer.py to run generation from a preprocessed model directory. The default backend is rsr.

python -m integrations.hf.model_infer \
  --model-dir ./preprocessed_model \
  --backend rsr \
  --device cpu \
  --prompt "Write the numbers from one to ten in words." \
  --max-new-tokens 64 \
  --stream
CLI args for integrations/hf/model_infer.py:
  --model-dir          Directory with rsr_config.json and safetensors artifacts
                       (default: integrations/hf)
  --backend            Inference backend: rsr or hf (default: rsr)
  --tokenizer          Optional tokenizer source
                       Default: rsr_config.json:model_name
  --device             Target device; auto-detected from model-dir suffix
                       (_cpu / _cuda) if omitted
  --dtype              Optional dtype cast: float32, float16, or bfloat16
  --prompt             Prompt text to generate from (required)
  --max-new-tokens     Maximum number of tokens to generate (default: 64)
  --no-chat-template   Tokenize the raw prompt directly
  --stream             Stream decoded output as tokens are generated

Benchmark on your machine ⏱️

Use the scripts under benchmarking/ to reproduce the local numbers for kernel-level matvec benchmarks and end-to-end LLM inference.

Find the best k for ternary RSR

python -m benchmarking.bit_1_58.bench_best_k \
  --device cpu \
  --shapes 2560x2560 4096x14336 \
  --k-values 2 4 6 8 10 12 \
  --warmup 10 \
  --repeats 30
CLI args for benchmarking/bit_1_58/bench_best_k.py:
  --device             Target device: cpu or cuda (required)
  --shapes             Optional list of matrix shapes in NxM format
                       Default: all known preprocessed model shapes
  --k-values           Optional list of k values to test
                       Default: 2 4 6 8 10 12
  --warmup             Warmup iterations before timing (default: 10)
  --repeats            Timed iterations per shape/k (default: 30)

This writes:
  benchmarking/bit_1_58/reports/best_k_{device}.csv
  benchmarking/bit_1_58/reports/best_k_{device}.json

Benchmark matrix-vector multiplication

The shape benchmark scripts do not take CLI arguments. Configure them by editing the constants at the top of the script: SHAPES, K_VALUES, METHODS, REPEATS, and WARMUP.
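For orientation, the constants at the top of each script look roughly like this (the constant names come from the paragraph above; the values and method strings are illustrative assumptions, so check each script for its actual defaults):

```python
# Illustrative values only -- edit the real constants in the script itself.
SHAPES = [(2560, 2560), (4096, 14336)]  # (rows, cols) matrices to benchmark
K_VALUES = [2, 4, 6, 8, 10, 12]         # RSR block heights to sweep
METHODS = ["rsr", "baseline"]           # multiplier implementations to compare
REPEATS = 30                            # timed iterations per configuration
WARMUP = 10                             # untimed warmup iterations
```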

python benchmarking/bit_1/bench_shapes_cpu.py
python benchmarking/bit_1/bench_shapes_cuda.py
python benchmarking/bit_1_58/bench_shapes_cpu.py
python benchmarking/bit_1_58/bench_shapes_cuda.py

Reports are written to:
  benchmarking/bit_1/reports/results_shapes_{device}.csv
  benchmarking/bit_1_58/reports/results_shapes_{device}.csv

Benchmark end-to-end LLM inference

Pass either a single preprocessed model directory or a parent directory that contains multiple *_cpu or *_cuda model directories.

python -m benchmarking.llms.bench_inference \
  --model-dir integrations/hf/preprocessed \
  --device cpu \
  --prompt "Write the numbers from one to two hundred in words separated by commas only:" \
  --max-new-tokens 64 \
  --warmup 1 \
  --repeats 3 \
  --backends rsr hf_float32 hf_bfloat16
CLI args for benchmarking/llms/bench_inference.py:
  --model-dir          Single preprocessed model directory or parent directory
                       containing multiple preprocessed models (required)
  --prompt             Prompt text to generate from
                       Default: "Write the numbers from one to two hundred in
                       words separated by commas only:"
  --max-new-tokens     Maximum number of generated tokens (default: 64)
  --warmup             Warmup generations before timing (default: 1)
  --repeats            Timed generations per backend/model (default: 3)
  --no-chat-template   Tokenize the raw prompt directly
  --device             Target device and model suffix filter: cpu or cuda
                       (required)
  --backends           Optional backend list:
                       rsr, hf_float32, hf_bfloat16, hf_float16
                       Default: rsr + the standard HF dtypes for the device

Benchmark Results 📊

Matrix-Vector Multiplication 🧮

CPU 🖥️

[Plots: 1-bit CPU and 1.58-bit CPU matvec benchmarks]

CUDA ⚡

[Plots: 1-bit CUDA and 1.58-bit CUDA matvec benchmarks]

Ternary (1.58-bit) LLMs 🤖

Speedup is computed against the HuggingFace bfloat16 baseline for the same model.
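For example, taking the bitnet-b1.58-2B-4T CPU row from the table below, the speedup column is just the ratio of decode throughputs:

```python
hf_tok_s, rsr_tok_s = 14.2, 29.3  # tokens/s from the CPU table
speedup = rsr_tok_s / hf_tok_s
print(f"{speedup:.1f}x")  # → 2.1x
```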

CPU 🖥️

Model                          HF Tok/s   RSR Tok/s   Speedup
Falcon3-10B-Instruct-1.58bit      0.2        11.3      62.0x
Llama3-8B-1.58-100B-tokens        0.2        13.4      53.8x
bitnet-b1.58-2B-4T-bf16           2.1        28.8      13.9x
bitnet-b1.58-2B-4T               14.2        29.3       2.1x

CUDA ⚡

Model                          HF Tok/s   RSR Tok/s   Speedup
Falcon3-10B-Instruct-1.58bit     25.2        47.4       1.9x
Llama3-8B-1.58-100B-tokens       31.9        59.3       1.9x
bitnet-b1.58-2B-4T-bf16          33.1        57.4       1.7x
bitnet-b1.58-2B-4T               41.6        57.1       1.4x

Updates 📝

  • [03/25/2026] Added support for the HuggingFace models interface.

Project Structure 🗂️

RSR-core/
├── multiplier/             # Python wrappers for kernels
│   ├── bit_1/              # 1-bit (binary) multipliers (CPU/CUDA)
│   └── bit_1_58/           # 1.58-bit (ternary) multipliers (CPU/CUDA)
├── kernels/                # Low-level C/CUDA kernel source
│   ├── bit_1/
│   │   ├── cpu/            #   C kernels
│   │   └── cuda/           #   CUDA kernels (.cu)
│   └── bit_1_58/
│       ├── cpu/            #   C kernels
│       └── cuda/           #   CUDA kernels (.cu)
├── integrations/           # Model integrations
│   └── hf/                 #   HuggingFace integration
├── benchmarking/           # Benchmarking scripts & results
└── tests/                  # Unit and integration tests

About

RSR-core: A High-Performance Engine for Low-Bit Matrix-Vector Multiplication
