RSR-core

RSR (Redundant Segment Reduction): efficient matrix-vector multiplication for low-bit inference.

This repository contains the core kernels, model integrations, and benchmarking code for RSR across CPU and CUDA backends. RSR targets fast matrix-vector multiplication when the matrix is low-bit quantized by grouping repeated column patterns, aggregating the corresponding input values once, and then scattering the result to the affected output rows.

This is especially useful for workloads such as low-bit LLM inference, where decoding repeatedly applies quantized matvec operations. For the original algorithm, see UIC-InDeXLab/RSR and docs/ALGORITHM.md.
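The grouping idea described above can be sketched in plain NumPy. This is an illustrative, unoptimized version (the function name and per-block decomposition are assumptions for illustration, not the repository's kernels): for one block of k matrix rows, columns that share the same k-entry ternary pattern are grouped, their input values are summed once, and each distinct pattern is scattered to the block's k outputs. Since there are at most 3**k distinct patterns, the scatter step touches far fewer items than the columns themselves.

```python
import numpy as np

def rsr_matvec_block(A_block, x):
    """Illustrative RSR-style matvec for one k-row block.

    A_block: (k, n) ternary matrix with entries in {-1, 0, 1}.
    x:       (n,) input vector.
    """
    k, n = A_block.shape
    agg = {}  # column pattern -> aggregated input value
    for j in range(n):
        pattern = tuple(int(v) for v in A_block[:, j])
        agg[pattern] = agg.get(pattern, 0.0) + float(x[j])
    y = np.zeros(k)
    for pattern, s in agg.items():
        y += np.array(pattern, dtype=np.float64) * s  # scatter once per pattern
    return y
```

The optimized kernels in kernels/ implement this reduction in C and CUDA; the sketch only shows why aggregating repeated column patterns saves work.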

Demo 🎬

Inference on CPU for a 1.58-bit LLM decoding step. Click the image to view the original high-quality video. HF denotes the Hugging Face baseline running bfloat16 on PyTorch.

PROMPT: "Write the numbers from one to sixty in words separated by commas only:"

RSR vs Baseline

Usage 🛠️

Installation 📦

Prerequisites: Python >= 3.10, a C compiler for CPU kernels, and optionally CUDA for GPU support.

git clone https://github.com/UIC-InDeXLab/RSR-Core.git
cd RSR-Core
pip install -e .

Prepare a model (once) 🧱

Run integrations/hf/model_prep.py once per model to preprocess the ternary weights and save the RSR metadata needed for inference.

python -m integrations.hf.model_prep \
  --model microsoft/bitnet-b1.58-2B-4T \
  --output ./preprocessed_model \
  --device cpu \
  --trust-remote-code \
  --best-k-json benchmarking/bit_1_58/reports/best_k_cpu.json
CLI args for integrations/hf/model_prep.py:
  --model, -m           HuggingFace model ID or local path (required)
  --output, -o          Output directory for the preprocessed model (required)
  --k                   Block height for RSR decomposition (default: from best_k_{device}.json)
  --version             RSR multiplier version to use (default: adaptive)
  --device              Device for model loading: cpu or cuda (default: cpu)
  --trust-remote-code   Allow remote code when loading HuggingFace models
  --best-k-json         Optional path to a per-layer best-k JSON file
                        Default:
                        benchmarking/bit_1_58/reports/best_k_{device}.json

Run model inference 🤖

Use integrations/hf/model_infer.py to run generation from a preprocessed model directory. The default backend is rsr.

python -m integrations.hf.model_infer \
  --model-dir ./preprocessed_model \
  --backend rsr \
  --device cpu \
  --prompt "Write the numbers from one to ten in words." \
  --max-new-tokens 64 \
  --stream
CLI args for integrations/hf/model_infer.py:
  --model-dir          Directory with rsr_config.json and safetensors artifacts
                       (default: integrations/hf)
  --backend            Inference backend: rsr or hf (default: rsr)
  --tokenizer          Optional tokenizer source
                       Default: rsr_config.json:model_name
  --device             Target device; auto-detected from model-dir suffix
                       (_cpu / _cuda) if omitted
  --dtype              Optional dtype cast: float32, float16, or bfloat16
  --prompt             Prompt text to generate from (required)
  --max-new-tokens     Maximum number of tokens to generate (default: 64)
  --no-chat-template   Tokenize the raw prompt directly
  --stream             Stream decoded output as tokens are generated

Benchmark on your machine ⏱️

Use the scripts under benchmarking/ to reproduce the local numbers for kernel-level matvec benchmarks and end-to-end LLM inference.

Find the best k for ternary RSR

python -m benchmarking.bit_1_58.bench_best_k \
  --device cpu \
  --shapes 2560x2560 4096x14336 \
  --k-values 2 4 6 8 10 12 \
  --warmup 10 \
  --repeats 30
CLI args for benchmarking/bit_1_58/bench_best_k.py:
  --device             Target device: cpu or cuda (required)
  --shapes             Optional list of matrix shapes in NxM format
                       Default: all known preprocessed model shapes
  --k-values           Optional list of k values to test
                       Default: 2 4 6 8 10 12
  --warmup             Warmup iterations before timing (default: 10)
  --repeats            Timed iterations per shape/k (default: 30)

This writes:
  benchmarking/bit_1_58/reports/best_k_{device}.csv
  benchmarking/bit_1_58/reports/best_k_{device}.json

Benchmark matrix-vector multiplication

The shape benchmark scripts do not take CLI arguments. Configure them by editing the constants at the top of the script: SHAPES, K_VALUES, METHODS, REPEATS, and WARMUP.
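For orientation, the constants at the top of each script look roughly like this (the constant names come from the paragraph above; the values and method strings are illustrative assumptions, so check each script for its actual defaults):

```python
# Illustrative values only -- edit the real constants in the script itself.
SHAPES = [(2560, 2560), (4096, 14336)]  # (rows, cols) matrices to benchmark
K_VALUES = [2, 4, 6, 8, 10, 12]         # RSR block heights to sweep
METHODS = ["rsr", "baseline"]           # multiplier implementations to compare
REPEATS = 30                            # timed iterations per configuration
WARMUP = 10                             # untimed warmup iterations
```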

python benchmarking/bit_1/bench_shapes_cpu.py
python benchmarking/bit_1/bench_shapes_cuda.py
python benchmarking/bit_1_58/bench_shapes_cpu.py
python benchmarking/bit_1_58/bench_shapes_cuda.py

Reports are written to:
  benchmarking/bit_1/reports/results_shapes_{device}.csv
  benchmarking/bit_1_58/reports/results_shapes_{device}.csv

Benchmark end-to-end LLM inference

Pass either a single preprocessed model directory or a parent directory that contains multiple *_cpu or *_cuda model directories.

python -m benchmarking.llms.bench_inference \
  --model-dir integrations/hf/preprocessed \
  --device cpu \
  --prompt "Write the numbers from one to two hundred in words separated by commas only:" \
  --max-new-tokens 64 \
  --warmup 1 \
  --repeats 3 \
  --backends rsr hf_float32 hf_bfloat16
CLI args for benchmarking/llms/bench_inference.py:
  --model-dir          Single preprocessed model directory or parent directory
                       containing multiple preprocessed models (required)
  --prompt             Prompt text to generate from
                       Default: "Write the numbers from one to two hundred in
                       words separated by commas only:"
  --max-new-tokens     Maximum number of generated tokens (default: 64)
  --warmup             Warmup generations before timing (default: 1)
  --repeats            Timed generations per backend/model (default: 3)
  --no-chat-template   Tokenize the raw prompt directly
  --device             Target device and model suffix filter: cpu or cuda
                       (required)
  --backends           Optional backend list:
                       rsr, hf_float32, hf_bfloat16, hf_float16
                       Default: rsr + the standard HF dtypes for the device

Benchmark Results 📊

Matrix-Vector Multiplication 🧮

CPU 🖥️

[Plots: 1-bit CPU and 1.58-bit CPU matvec benchmarks]

CUDA ⚡

[Plots: 1-bit CUDA and 1.58-bit CUDA matvec benchmarks]

Ternary (1.58-bit) LLMs 🤖

Speedup is computed against the HuggingFace bfloat16 baseline for the same model.
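For example, taking the bitnet-b1.58-2B-4T CPU row from the table below, the speedup column is just the ratio of decode throughputs:

```python
hf_tok_s, rsr_tok_s = 14.2, 29.3  # tokens/s from the CPU table
speedup = rsr_tok_s / hf_tok_s
print(f"{speedup:.1f}x")  # → 2.1x
```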

CPU 🖥️

Model                          HF Tok/s   RSR Tok/s   Speedup
Falcon3-10B-Instruct-1.58bit      0.2        11.3      62.0x
Llama3-8B-1.58-100B-tokens        0.2        13.4      53.8x
bitnet-b1.58-2B-4T-bf16           2.1        28.8      13.9x
bitnet-b1.58-2B-4T               14.2        29.3       2.1x

CUDA ⚡

Model                          HF Tok/s   RSR Tok/s   Speedup
Falcon3-10B-Instruct-1.58bit     25.2        47.4       1.9x
Llama3-8B-1.58-100B-tokens       31.9        59.3       1.9x
bitnet-b1.58-2B-4T-bf16          33.1        57.4       1.7x
bitnet-b1.58-2B-4T               41.6        57.1       1.4x

Updates 📝

  • [03/25/2026] Added support for the HuggingFace models interface.

Project Structure 🗂️

RSR-core/
├── multiplier/             # Python wrappers for kernels
│   ├── bit_1/              # 1-bit (binary) multipliers (CPU/CUDA)
│   └── bit_1_58/           # 1.58-bit (ternary) multipliers (CPU/CUDA)
├── kernels/                # Low-level C/CUDA kernel source
│   ├── bit_1/
│   │   ├── cpu/            #   C kernels
│   │   └── cuda/           #   CUDA kernels (.cu)
│   └── bit_1_58/
│       ├── cpu/            #   C kernels
│       └── cuda/           #   CUDA kernels (.cu)
├── integrations/           # Model integrations
│   └── hf/                 #   HuggingFace integration
├── benchmarking/           # Benchmarking scripts & results
└── tests/                  # Unit and integration tests

About

RSR-core: A High-Performance Engine for Low-Bit Matrix-Vector Multiplication
