MaxCompression

High-ratio lossless data compression library and CLI



MaxCompression (MCX) is a lossless compression library and CLI written in portable C99. It combines multiple compression strategies — LZ77 with adaptive entropy coding, Burrows-Wheeler Transform with multi-table rANS, LZRC (LZ + range coder), and stride-delta preprocessing — under a unified API that automatically selects the best pipeline for each data type.

MCX targets maximum compression ratio while maintaining practical speeds. It matches or beats bzip2 on the standard benchmark corpora and competes with xz/LZMA2 on most data types.

Highlights

| Metric | MCX | Best Alternative |
|---|---|---|
| kennedy.xls (structured binary) | 50.1× | xz: 21.0× — 2.4× better |
| nci (chemical text, 33 MB) | 25.7× | xz: 19.3× — 33% better |
| alice29.txt (English text, L20) | 3.52× | bzip2: 3.52× — matches bzip2 |
| alice29.txt (English text, L28 CM) | 4.28× | PAQ8l: 4.28× — beats PAQ8l |
| mozilla (50 MB binary archive) | 3.22× | xz: 3.55× — 91% of xz |
| enwik8 (100 MB Wikipedia) | 4.04× | xz: 3.89× — beats xz by 4% |
| Silesia corpus (202 MB total) | 4.35× | bzip2: 3.89× — +12% |

Features

Compression Engines

  • Smart Mode (L20) — automatically detects data type and selects the optimal pipeline
  • LZ77 (L1–L9) — fast compression with greedy/lazy matching and hash chain match finders
  • BWT + multi-table rANS (L10–L14) — Burrows-Wheeler Transform with K-means clustered frequency tables
  • LZRC v2.0 (L24–L26) — LZ + adaptive range coder with binary tree or hash chain match finder, LZMA-style matched literal coding, 4-state machine, rep-match distances
  • Context Mixing (L28) — PAQ8-class bit-level compressor: 58 context models, 8 logit-space neural mixers, 3-stage APM cascade, adaptive StateMap — beats bzip2 by 17–30% on text, beats PAQ8l on alice29
  • Stride-Delta — auto-detects fixed-width records (1–512 byte stride) for structured binary data

Entropy Coding

  • Multi-table rANS — 4–6 frequency tables with K-means clustering, within 0.01 bits/symbol of entropy
  • Adaptive Arithmetic Coding — order-1 AC with Fenwick-tree accelerated decoding (O(log n) per symbol)
  • Adaptive Range Coder — bit-level context modeling with matched literal coding for LZRC
  • tANS/FSE — 4-stream interleaved table ANS for fast LZ decompression

Preprocessing

  • E8/E9 x86 filter — CALL/JMP address normalization (+16% on x86 binaries)
  • RLE2 — bijective base-2 zero-run encoding (≈log₂(N) symbols for a run of N zeros)
  • Genetic optimizer — evolves the pipeline configuration per block at L10–L14

CLI

  • 30+ subcommands — compress, decompress, verify, diff, bench, stat, hash, checksum, upgrade, pipe, and more
  • Multi-file and recursive — mcx compress -r ./data/ with glob exclusion patterns
  • Rich benchmarking — JSON/CSV/Markdown output, --compare against gzip/bzip2/xz, --aggregate for directories
  • Decompress aliases — mcx x, mcx d, mcx extract
  • Shell completions — Bash, Zsh, Fish

Library

  • Simple C API — mcx_compress(), mcx_decompress(), mcx_get_frame_info()
  • Python bindings — ctypes-based, pip-installable
  • OpenMP parallel — block-level parallelism, configurable thread count
  • Pure C99 — no C++ dependency, compiles with GCC, Clang, MSVC
  • Cross-platform — Linux, macOS, Windows

Quick Start

# Build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Compress
./build/bin/mcx compress myfile.txt                # fast (L3)
./build/bin/mcx compress --best myfile.txt          # max compression (L20)

# Decompress
./build/bin/mcx decompress myfile.txt.mcx

# Benchmark
./build/bin/mcx bench myfile.txt
./build/bin/mcx bench --compare mydir/              # vs gzip/bzip2/xz

# Run tests
cd build && ctest --output-on-failure

Installation

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
sudo cmake --install build       # installs to /usr/local

# Shell completions
cp completions/mcx.bash ~/.local/share/bash-completion/completions/mcx
cp completions/mcx.zsh /usr/local/share/zsh/site-functions/_mcx
cp completions/mcx.fish ~/.config/fish/completions/mcx.fish

# Man page
sudo cp docs/mcx.1 /usr/local/share/man/man1/

Requirements: C99 compiler (GCC, Clang, MSVC), CMake ≥ 3.10. Optional: OpenMP for multi-threading.

Usage

CLI

# Compression
mcx compress input.bin                     # default level (L3)
mcx compress -l 20 input.bin               # max compression
mcx compress --fast input.bin              # L3
mcx compress --best input.bin              # L20 Smart Mode
mcx compress -l 26 binary.bin              # LZRC (best for binaries)
mcx compress -l 28 archival.txt            # Context Mixing (max ratio)

# Multi-file & recursive
mcx compress *.txt                         # compress all .txt files
mcx compress -r ./data/ --exclude "*.log"  # recursive with exclusion
mcx decompress *.mcx                       # decompress all

# Inspection
mcx info archive.mcx                       # detailed frame info
mcx info --blocks archive.mcx              # per-block details
mcx ls *.mcx                               # compact multi-file listing
mcx diff old.mcx new.mcx                   # compare two archives
mcx stat rawfile.bin                       # entropy and byte distribution

# Integrity
mcx verify archive.mcx                     # decompress and verify CRC
mcx verify archive.mcx original.bin        # verify against original
mcx checksum archive.mcx                   # verify header CRC32
mcx hash archive.mcx                       # CRC32/FNV of content

# Utilities
mcx cat archive.mcx                        # decompress to stdout
mcx cat archive.mcx | head -c 1024        # pipe first 1KB
mcx pipe -l 6 < input > output.mcx        # stdin/stdout mode
mcx upgrade -l 20 --in-place old.mcx      # recompress at higher level

# Benchmarking
mcx bench input.bin                        # all default levels
mcx bench --all-levels input.bin           # L1-L26
mcx bench --compare input.bin             # vs gzip/bzip2/xz
mcx bench -r ./corpus/ --aggregate        # directory totals
mcx bench --format json input.bin         # JSON output
mcx bench --format csv input.bin          # CSV output
mcx bench --format markdown input.bin     # Markdown table

# Advanced
mcx compress --decompress-check input.bin  # roundtrip verify in memory
mcx compress --atomic input.bin            # crash-safe write
mcx compress --preserve-mtime input.bin    # preserve timestamps
mcx compress --dry-run input.bin           # analyze without writing
mcx compress --estimate input.bin          # fast size estimate
mcx compress --adaptive-level input.bin    # entropy-based auto level

# Self-test
mcx test                                   # built-in roundtrip tests
mcx version --build                        # detailed build info

C API

#include <maxcomp/maxcomp.h>

// Compress
size_t bound = mcx_compress_bound(src_size);
uint8_t* dst = malloc(bound);
size_t comp_size = mcx_compress(dst, bound, src, src_size, 20);
if (mcx_is_error(comp_size)) {
    fprintf(stderr, "Error: %s\n", mcx_get_error_name(comp_size));
}

// Decompress
size_t orig_size = mcx_decompress(out, out_cap, dst, comp_size);

// Inspect
mcx_frame_info info;
mcx_get_frame_info(dst, comp_size, &info);
printf("Original: %zu, Level: %d\n", info.original_size, info.level);

// Version
printf("MCX %s\n", mcx_version_string());

Python

import maxcomp

data = open("input.bin", "rb").read()
compressed = maxcomp.compress(data, level=20)
restored = maxcomp.decompress(compressed)
assert restored == data

info = maxcomp.get_frame_info(compressed)
print(f"Original: {info['original_size']}, Level: {info['level']}")

Compression Levels

| Level | Strategy | Compress Speed | Decompress Speed | Use Case |
|---|---|---|---|---|
| 1–3 | LZ77 greedy + tANS | ~5–10 MB/s | ~15–35 MB/s | Real-time, streaming |
| 6 | LZ77 lazy + rANS | ~3–5 MB/s | ~10–20 MB/s | General purpose |
| 7–9 | LZ77 lazy + adaptive AC | ~2–4 MB/s | ~3–14 MB/s | Best LZ ratio |
| 10–14 | BWT + MTF + RLE2 + multi-rANS | ~1–3 MB/s | ~5–10 MB/s | Text, structured data |
| 20 | Smart Mode (auto-detect) | ~0.3–1 MB/s | ~3–7 MB/s | Maximum compression |
| 24 | LZRC fast (hash chains) | ~1–2 MB/s | ~4–5 MB/s | Fast binary compression |
| 26 | LZRC best (binary tree) | ~0.3–0.5 MB/s | ~4–5 MB/s | Best for binary data |
| 28 | Context Mixing (CM) | ~10–15 KB/s | ~10–15 KB/s | Archival, maximum ratio |

Shortcuts: --fast (L3), --default (L6), --best (L20)

Smart Mode (Level 20)

Analyzes each block and automatically routes to the best pipeline:

  • Structured binary (spreadsheets, audio) → stride-delta + RLE2 + rANS → kennedy.xls 50×
  • Text (UTF-8, source code) → BWT + MTF + RLE2 + multi-rANS → alice29 3.52×
  • x86 executables → E8/E9 filter + BWT → ooffice 2.56×
  • Mixed/binary → multi-trial (tries BWT, LZ, LZRC, keeps smallest)
  • Incompressible → stored uncompressed (no expansion)

Benchmarks

Single-threaded, in-memory, roundtrip-verified. System gzip, bzip2, and xz for baselines.

Canterbury Corpus

| File | Size | gzip -9 | bzip2 -9 | xz -6 | MCX L20 | Winner |
|---|---|---|---|---|---|---|
| alice29.txt | 152 KB | 2.81× | 3.52× | 3.14× | 3.52× | MCX ≈ bzip2 |
| asyoulik.txt | 125 KB | 2.56× | 3.16× | 2.81× | 3.15× | bzip2 ≈ MCX |
| lcet10.txt | 427 KB | 2.95× | 3.96× | 3.57× | 3.98× | MCX |
| plrabn12.txt | 482 KB | 2.48× | 3.31× | 2.91× | 3.33× | MCX |
| kennedy.xls | 1.0 MB | 4.91× | 7.90× | 20.97× | 50.1× | MCX (2.4× better than xz) |
| ptt5 | 513 KB | 9.80× | 10.31× | 12.22× | 10.19× | xz |

Silesia Corpus (202 MB)

The standard benchmark for evaluating compression on real-world data.

| File | Size | gzip -9 | bzip2 -9 | xz -9 | MCX L20 | vs bzip2 | vs xz |
|---|---|---|---|---|---|---|---|
| dickens | 9.7 MB | 2.65× | 3.64× | 3.60× | 4.07× | +12% | +13% |
| mozilla | 48.8 MB | 2.70× | 2.86× | 3.83× | 3.22× | +13% | -16% |
| mr | 9.5 MB | 2.71× | 4.08× | 3.63× | 4.28× | +5% | +18% |
| nci | 32.0 MB | 11.23× | 18.51× | 19.30× | 25.65× | +39% | +33% |
| ooffice | 5.9 MB | 1.99× | 2.15× | 2.54× | 2.56× | +19% | +1% |
| osdb | 9.6 MB | 2.71× | 3.60× | 3.54× | 4.04× | +12% | +14% |
| reymont | 6.3 MB | 3.64× | 5.32× | 5.03× | 5.93× | +11% | +18% |
| samba | 20.6 MB | 4.00× | 4.75× | 5.74× | 5.05× | +6% | -12% |
| sao | 6.9 MB | 1.36× | 1.47× | 1.64× | 1.48× | +1% | -10% |
| webster | 39.5 MB | 3.44× | 4.80× | 4.94× | 5.81× | +21% | +18% |
| xml | 5.1 MB | 8.07× | 12.12× | 11.79× | 12.86× | +6% | +9% |
| x-ray | 8.1 MB | 1.40× | 2.09× | 1.89× | 2.15× | +3% | +14% |
| Total | 202 MB | 3.13× | 3.89× | 4.34× | 4.35× | +12% | +0% |

Score: MCX beats gzip 12/12, bzip2 12/12, xz 9/12.

xz leads on 3 binary-heavy files (mozilla, samba, sao) where LZMA2's large-window optimal parsing has an advantage. MCX's LZRC engine (L26) narrows this gap: mozilla 3.22× vs xz 3.55×.

Context Mixing (Level 28) — Maximum Compression

Level 28 enables the context mixing engine — a PAQ8-class bit-level compressor for archival use. Extremely slow (~10 KB/s) but achieves the best compression ratios.

| File | Size | bzip2 -9 | MCX L20 | MCX L28 (CM) | vs bzip2 |
|---|---|---|---|---|---|
| alice29.txt | 152 KB | 3.52× | 3.52× | 4.28× | +22% |
| lcet10.txt | 427 KB | 3.96× | 3.98× | 4.93× | +25% |
| plrabn12.txt | 482 KB | 3.31× | 3.33× | 3.89× | +17% |
| asyoulik.txt | 125 KB | 3.16× | 3.15× | 3.74× | +18% |
| xml | 5.1 MB | 12.12× | 12.86× | 15.12× | +25% |
| dickens | 9.7 MB | 3.64× | 4.07× | 4.60× | +26% |
| reymont | 6.3 MB | 5.32× | 5.93× | 6.89× | +30% |

The CM engine uses 58 context models (order-0 through order-14, word, sparse, indirect, cross-context, linguistic), 8 logit-space neural network mixers with cross-terms, and a 3-stage Adaptive Probability Map cascade. It beats bzip2 by 17–30% on all text data and surpasses PAQ8l on alice29.txt.

Large Files

| File | Size | xz -9 | MCX L20 | Notes |
|---|---|---|---|---|
| enwik8 | 95.4 MB | 3.89× | 4.04× | Wikipedia — beats xz by 4% |
| enwik9 | 953 MB | 4.12× | 4.28× | 1 GB Wikipedia dump |

Reproducing Benchmarks

All benchmarks are reproducible. Download the standard corpora and run:

# Canterbury Corpus
mkdir -p /tmp/cantrbry && cd /tmp/cantrbry
wget -q https://corpus.canterbury.ac.nz/resources/cantrbry.tar.gz
tar xzf cantrbry.tar.gz

# Benchmark with comparison against system compressors
mcx bench --compare /tmp/cantrbry/

# Silesia Corpus (download from https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia)
mcx bench --compare --format markdown /path/to/silesia/

# Context Mixing (Level 28) — warning: very slow (~10 KB/s)
mcx compress -l 28 /tmp/cantrbry/alice29.txt -o /tmp/alice.mcx
mcx decompress /tmp/alice.mcx -o /tmp/alice.dec
diff /tmp/cantrbry/alice29.txt /tmp/alice.dec  # verify roundtrip

Baseline compressors for comparison: gzip -9, bzip2 -9, xz -9 (system packages).

Architecture

Input → [Block Analyzer] → Strategy Selection
                                 │
     ┌──────────┬──────────┬─────┴────┬──────────┬──────────┐
     ▼          ▼          ▼          ▼          ▼          ▼
LZ Pipeline  BWT Pipe   Stride-Δ   LZRC-HC    LZRC-BT    CM Engine
(L1–L9)      (L10–14)   (L20 auto) (L24)      (L26)      (L28)
     │          │          │          │          │          │
LZ77 Match  divsufsort  Delta @    HC Match   BT Match   58 Context
Finding     +MTF+RLE2   stride     Finder     Finder     Models
     │          │          │          │          │          │
tANS/FSE/   Multi-tbl   RLE2+rANS  Adaptive   Adaptive   8 Neural
Adaptive AC  rANS                  Range RC   Range RC   Mixers+APM
     │          │          │          │          │          │
     └──────────┴──────────┴─────┬────┴──────────┴──────────┘
                                 ▼
                        [Block Multiplexer]
                        OpenMP Parallelism
                                 ▼
                           .mcx output

File Format

MCX uses a frame-based format with a 20-byte header and variable-size blocks (up to 64 MB). See docs/FORMAT.md for the full specification.

Quality & Safety

MaxCompression is built with production-grade engineering practices:

| Practice | Status |
|---|---|
| Continuous Integration | GitHub Actions — every push triggers build + test on Linux (GCC + Clang, Release + Debug), macOS, and Windows |
| Test Suites | 21 test suites — unit, roundtrip, fuzz, stress, regression, integration, streaming, edge cases, malformed input |
| Memory Safety | Valgrind memcheck runs in CI — leak-check, track-origins, error-exitcode on every level |
| Code Coverage | lcov + Codecov integration in CI pipeline |
| Roundtrip Verification | Canterbury corpus roundtrip at all compression levels (L1–L26) in CI |
| Cross-Platform | CI builds and tests on Linux, macOS, Windows with multiple compilers |
| WASM | Emscripten build + Node.js roundtrip test in CI |
| Python Bindings | Automated binding test in CI (build .so, compress/decompress, verify) |
| pkg-config | Integration test: install, discover via pkg-config, build + link an external program |
| API Documentation | Doxygen generation + undocumented symbol check in CI |
| Security | SECURITY.md — vulnerability reporting policy, supported versions |
| Releases | Semantic versioning, prebuilt binaries (Linux/macOS/Windows) on every tagged release |

Project Stats

  • ~17,400 lines of C code
  • 770+ commits across the project
  • 21 test suites — unit tests, roundtrip, fuzz, stress, regression, integration
  • CI — Linux (GCC + Clang), macOS, Windows, Valgrind, WASM, coverage, Python bindings

Project Structure

maxcomp/
├── include/maxcomp/    Public API (maxcomp.h)
├── lib/
│   ├── entropy/        tANS, FSE, rANS, multi-rANS, adaptive AC, range coder
│   ├── lz/             LZ77, LZRC v2.0, binary tree + hash chain match finders
│   ├── preprocess/     BWT (divsufsort), MTF, RLE2, delta, E8/E9 filter
│   ├── babel/          Stride-delta transform
│   ├── optimizer/      Genetic pipeline optimizer
│   ├── analyzer/       Block analysis (entropy, structure, stride detection)
│   ├── external/       Embedded libdivsufsort (MIT license)
│   └── compat.h        Cross-platform portability layer
├── cli/                Command-line tool (30+ subcommands)
├── bindings/python/    Python ctypes bindings
├── completions/        Bash, Zsh, Fish shell completions
├── tests/              21 test suites (unit, integration, fuzz, stress)
├── docs/               Format spec, API docs, benchmarks, man page
├── valgrind.supp       Valgrind suppressions
└── CMakeLists.txt

Documentation

Roadmap

Completed ✅

  • BWT + multi-table rANS — beats bzip2 on all standard benchmarks
  • Adaptive arithmetic coding on LZ output — best-in-class LZ ratios
  • Smart Mode (L20) with stride-delta, E8/E9, multi-trial strategy selection
  • LZRC v2.0 — LZ + range coder with BT/HC match finders, rep-matches, matched literals
  • OpenMP block parallelism
  • Embedded libdivsufsort (2× faster BWT)
  • Rich CLI with 30+ commands, multi-file, recursive, benchmarking
  • Python bindings with pip install support
  • Cross-platform CI (Linux, macOS, Windows)

Future

  • Context-mixed literal coding for LZRC
  • ARM/ARM64 BCJ filter
  • Streaming API for arbitrary-length input
  • WASM build for browser usage
  • v3.0 format: Huffman-coded LZ tokens (close gap with gzip at same speed)

Contributing

Contributions are welcome! Please ensure all changes pass the test suite:

cd build && ctest --output-on-failure

For compression ratio changes, include before/after benchmarks on Canterbury and Silesia corpora. See CONTRIBUTING.md.

License

GNU General Public License v3.0 — Free for everyone, forever.


MaxCompression is developed by Dreams-Makers Studio.
