High-ratio lossless data compression library and CLI
MaxCompression (MCX) is a lossless compression library and CLI written in portable C99. It combines multiple compression strategies — LZ77 with adaptive entropy coding, Burrows-Wheeler Transform with multi-table rANS, LZRC (LZ + range coder), and stride-delta preprocessing — under a unified API that automatically selects the best pipeline for each data type.
MCX targets maximum compression ratio while maintaining practical speeds. It matches or beats bzip2 on every file in the standard Canterbury and Silesia corpora and competes with xz/LZMA2 on most data types.
| File | MCX | Best Alternative |
|---|---|---|
| kennedy.xls (structured binary) | 50.1× | xz: 21.0× — 2.4× better |
| nci (chemical text, 33 MB) | 25.7× | xz: 19.3× — 33% better |
| alice29.txt (English text, L20) | 3.52× | bzip2: 3.52× — matches bzip2 |
| alice29.txt (English text, L28 CM) | 4.28× | PAQ8l: 4.28× — beats PAQ8l |
| mozilla (50 MB binary archive) | 3.22× | xz: 3.55× — 91% of xz |
| enwik8 (100 MB Wikipedia) | 4.04× | xz: 3.89× — beats xz by 4% |
| Silesia corpus (202 MB total) | 4.35× | xz: 4.34× — edges out xz; +12% over bzip2 |
- Smart Mode (L20) — automatically detects data type and selects the optimal pipeline
- LZ77 (L1–L9) — fast compression with greedy/lazy matching and hash chain match finders
- BWT + multi-table rANS (L10–L14) — Burrows-Wheeler Transform with K-means clustered frequency tables
- LZRC v2.0 (L24–L26) — LZ + adaptive range coder with binary tree or hash chain match finder, LZMA-style matched literal coding, 4-state machine, rep-match distances
- Context Mixing (L28) — PAQ8-class bit-level compressor: 58 context models, 8 logit-space neural mixers, 3-stage APM cascade, adaptive StateMap — beats bzip2 by 17–30% on text, beats PAQ8l on alice29
- Stride-Delta — auto-detects fixed-width records (1–512 byte stride) for structured binary data
- Multi-table rANS — 4–6 frequency tables with K-means clustering, within 0.01 bits/symbol of entropy
- Adaptive Arithmetic Coding — order-1 AC with Fenwick-tree accelerated decoding (O(log n) per symbol)
- Adaptive Range Coder — bit-level context modeling with matched literal coding for LZRC
- tANS/FSE — 4-stream interleaved table ANS for fast LZ decompression
- E8/E9 x86 filter — CALL/JMP address normalization (+16% on x86 binaries)
- RLE2 — bijective base-2 zero-run encoding (log₂(N) symbols for N zeros)
- Genetic optimizer — evolves pipeline configuration per block at L10–L14
- 30+ subcommands — compress, decompress, verify, diff, bench, stat, hash, checksum, upgrade, pipe, and more
- Multi-file and recursive — `mcx compress -r ./data/` with glob exclusion patterns
- Rich benchmarking — JSON/CSV/Markdown output, `--compare` against gzip/bzip2/xz, `--aggregate` for directories
- Decompress aliases — `mcx x`, `mcx d`, `mcx extract`
- Shell completions — Bash, Zsh, Fish
- Simple C API — `mcx_compress()`, `mcx_decompress()`, `mcx_get_frame_info()`
- Python bindings — ctypes-based, pip-installable
- OpenMP parallel — block-level parallelism, configurable thread count
- Pure C99 — no C++ dependency, compiles with GCC, Clang, MSVC
- Cross-platform — Linux, macOS, Windows
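The RLE2 bullet's log₂(N) claim follows from bijective base-2 numeration, the same trick bzip2 applies to zero runs after MTF. A minimal Python sketch of the idea (illustrative, not MCX's actual code):

```python
def encode_zero_run(n):
    """Encode a run of n zeros (n >= 1) in bijective base-2.

    Emits ~log2(n) symbols from a two-symbol alphabet (here 1 and 2),
    analogous to bzip2's RUNA/RUNB coding.
    """
    digits = []
    while n > 0:
        d = n % 2
        if d == 0:
            d = 2          # bijective base-2 uses digits {1, 2}; there is no zero digit
        digits.append(d)   # least-significant digit first
        n = (n - d) // 2
    return digits

def decode_zero_run(digits):
    """Inverse: recover the run length from its digit string."""
    n = 0
    for d in reversed(digits):
        n = n * 2 + d
    return n

# Every run length round-trips, and code length grows logarithmically:
for n in range(1, 2000):
    assert decode_zero_run(encode_zero_run(n)) == n
```

Because every digit string decodes to exactly one run length, the code is bijective: no run length is wasted on padding or escape symbols.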
```sh
# Build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Compress
./build/bin/mcx compress myfile.txt        # fast (L3)
./build/bin/mcx compress --best myfile.txt # max compression (L20)

# Decompress
./build/bin/mcx decompress myfile.txt.mcx

# Benchmark
./build/bin/mcx bench myfile.txt
./build/bin/mcx bench --compare mydir/     # vs gzip/bzip2/xz

# Run tests
cd build && ctest --output-on-failure
```

```sh
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
sudo cmake --install build                 # installs to /usr/local

# Shell completions
cp completions/mcx.bash ~/.local/share/bash-completion/completions/mcx
cp completions/mcx.zsh /usr/local/share/zsh/site-functions/_mcx
cp completions/mcx.fish ~/.config/fish/completions/mcx.fish

# Man page
sudo cp docs/mcx.1 /usr/local/share/man/man1/
```

Requirements: C99 compiler (GCC, Clang, MSVC), CMake ≥ 3.10. Optional: OpenMP for multi-threading.
```sh
# Compression
mcx compress input.bin                     # default level (L3)
mcx compress -l 20 input.bin               # max compression
mcx compress --fast input.bin              # L3
mcx compress --best input.bin              # L20 Smart Mode
mcx compress -l 26 binary.bin              # LZRC (best for binaries)
mcx compress -l 28 archival.txt            # Context Mixing (max ratio)

# Multi-file & recursive
mcx compress *.txt                         # compress all .txt files
mcx compress -r ./data/ --exclude "*.log"  # recursive with exclusion
mcx decompress *.mcx                       # decompress all

# Inspection
mcx info archive.mcx                       # detailed frame info
mcx info --blocks archive.mcx              # per-block details
mcx ls *.mcx                               # compact multi-file listing
mcx diff old.mcx new.mcx                   # compare two archives
mcx stat rawfile.bin                       # entropy and byte distribution

# Integrity
mcx verify archive.mcx                     # decompress and verify CRC
mcx verify archive.mcx original.bin        # verify against original
mcx checksum archive.mcx                   # verify header CRC32
mcx hash archive.mcx                       # CRC32/FNV of content

# Utilities
mcx cat archive.mcx                        # decompress to stdout
mcx cat archive.mcx | head -c 1024         # pipe first 1KB
mcx pipe -l 6 < input > output.mcx         # stdin/stdout mode
mcx upgrade -l 20 --in-place old.mcx       # recompress at higher level

# Benchmarking
mcx bench input.bin                        # all default levels
mcx bench --all-levels input.bin           # L1–L26
mcx bench --compare input.bin              # vs gzip/bzip2/xz
mcx bench -r ./corpus/ --aggregate         # directory totals
mcx bench --format json input.bin          # JSON output
mcx bench --format csv input.bin           # CSV output
mcx bench --format markdown input.bin      # Markdown table

# Advanced
mcx compress --decompress-check input.bin  # roundtrip verify in memory
mcx compress --atomic input.bin            # crash-safe write
mcx compress --preserve-mtime input.bin    # preserve timestamps
mcx compress --dry-run input.bin           # analyze without writing
mcx compress --estimate input.bin          # fast size estimate
mcx compress --adaptive-level input.bin    # entropy-based auto level

# Self-test
mcx test                                   # built-in roundtrip tests
mcx version --build                        # detailed build info
```

```c
#include <maxcomp/maxcomp.h>

// Compress
size_t bound = mcx_compress_bound(src_size);
uint8_t* dst = malloc(bound);
size_t comp_size = mcx_compress(dst, bound, src, src_size, 20);
if (mcx_is_error(comp_size)) {
    fprintf(stderr, "Error: %s\n", mcx_get_error_name(comp_size));
}

// Decompress
size_t orig_size = mcx_decompress(out, out_cap, dst, comp_size);

// Inspect
mcx_frame_info info;
mcx_get_frame_info(dst, comp_size, &info);
printf("Original: %zu, Level: %d\n", info.original_size, info.level);

// Version
printf("MCX %s\n", mcx_version_string());
```

```python
import maxcomp

data = open("input.bin", "rb").read()
compressed = maxcomp.compress(data, level=20)
restored = maxcomp.decompress(compressed)
assert restored == data

info = maxcomp.get_frame_info(compressed)
print(f"Original: {info['original_size']}, Level: {info['level']}")
```

| Level | Strategy | Compress Speed | Decompress Speed | Use Case |
|---|---|---|---|---|
| 1–3 | LZ77 greedy + tANS | ~5–10 MB/s | ~15–35 MB/s | Real-time, streaming |
| 6 | LZ77 lazy + rANS | ~3–5 MB/s | ~10–20 MB/s | General purpose |
| 7–9 | LZ77 lazy + adaptive AC | ~2–4 MB/s | ~3–14 MB/s | Best LZ ratio |
| 10–14 | BWT + MTF + RLE2 + multi-rANS | ~1–3 MB/s | ~5–10 MB/s | Text, structured data |
| 20 | Smart Mode (auto-detect) | ~0.3–1 MB/s | ~3–7 MB/s | Maximum compression |
| 24 | LZRC fast (hash chains) | ~1–2 MB/s | ~4–5 MB/s | Fast binary compression |
| 26 | LZRC best (binary tree) | ~0.3–0.5 MB/s | ~4–5 MB/s | Best for binary data |
| 28 | Context Mixing (CM) | ~10–15 KB/s | ~10–15 KB/s | Archival, maximum ratio |
Shortcuts: `--fast` (L3), `--default` (L6), `--best` (L20)
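The `--adaptive-level` flag and `mcx stat` both hinge on order-0 entropy. Below is a sketch of how an entropy-driven level heuristic could look; the thresholds and function names are invented for illustration and are not MCX's actual logic:

```python
import math
from collections import Counter

def order0_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte over the byte histogram."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    h = 0.0
    for c in counts.values():
        p = c / n
        h -= p * math.log2(p)   # each symbol contributes -p*log2(p) bits
    return h

def pick_level(data: bytes) -> int:
    """Hypothetical mapping from entropy to a compression level."""
    h = order0_entropy(data)
    if h > 7.9:
        return 1    # near-random: spend little effort, data is incompressible
    if h > 6.0:
        return 6    # moderately redundant: general-purpose LZ
    return 20       # highly redundant: worth Smart Mode's multi-trial cost

print(order0_entropy(b"aaaa"))  # 0.0, a single repeated symbol carries no information
```

A real implementation would also look at match statistics and stride structure, but order-0 entropy alone already separates "stored" candidates from text-like blocks.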
Smart Mode (L20) analyzes each block and automatically routes it to the best pipeline:
- Structured binary (spreadsheets, audio) → stride-delta + RLE2 + rANS → kennedy.xls 50×
- Text (UTF-8, source code) → BWT + MTF + RLE2 + multi-rANS → alice29 3.52×
- x86 executables → E8/E9 filter + BWT → ooffice 2.56×
- Mixed/binary → multi-trial (tries BWT, LZ, LZRC, keeps smallest)
- Incompressible → stored uncompressed (no expansion)
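The multi-trial strategy in the last two bullets is easy to sketch. Here zlib, bz2, and lzma stand in for MCX's LZ, BWT, and LZRC pipelines (an assumption for illustration; MCX's real pipelines are internal):

```python
import bz2
import lzma
import zlib

def multi_trial_compress(block: bytes) -> tuple[str, bytes]:
    """Try several pipelines on a block and keep the smallest output.

    The "stored" candidate guarantees no expansion on incompressible
    blocks, mirroring the fallback described above.
    """
    candidates = {
        "lz":     zlib.compress(block, 9),   # stand-in for the LZ pipeline
        "bwt":    bz2.compress(block, 9),    # stand-in for the BWT pipeline
        "lzrc":   lzma.compress(block),      # stand-in for LZRC
        "stored": block,                     # uncompressed fallback
    }
    name = min(candidates, key=lambda k: len(candidates[k]))
    return name, candidates[name]

name, out = multi_trial_compress(b"abracadabra " * 1000)
assert name != "stored" and len(out) < 12000
```

The cost is one trial per pipeline per block, which is why this approach lives at the slow L20 level rather than in the fast paths.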
All benchmarks are single-threaded, in-memory, and roundtrip-verified; system gzip, bzip2, and xz provide the baselines.
| File | Size | gzip -9 | bzip2 -9 | xz -6 | MCX L20 | Winner |
|---|---|---|---|---|---|---|
| alice29.txt | 152 KB | 2.81× | 3.52× | 3.14× | 3.52× | MCX ≈ bzip2 |
| asyoulik.txt | 125 KB | 2.56× | 3.16× | 2.81× | 3.15× | bzip2 ≈ MCX |
| lcet10.txt | 427 KB | 2.95× | 3.96× | 3.57× | 3.98× | MCX |
| plrabn12.txt | 482 KB | 2.48× | 3.31× | 2.91× | 3.33× | MCX |
| kennedy.xls | 1.0 MB | 4.91× | 7.90× | 20.97× | 50.1× | MCX (2.4× better than xz) |
| ptt5 | 513 KB | 9.80× | 10.31× | 12.22× | 10.19× | xz |
The Silesia corpus is the standard benchmark for evaluating compression on real-world data.
| File | Size | gzip -9 | bzip2 -9 | xz -9 | MCX L20 | vs bzip2 | vs xz |
|---|---|---|---|---|---|---|---|
| dickens | 9.7 MB | 2.65× | 3.64× | 3.60× | 4.07× | +12% | +13% |
| mozilla | 48.8 MB | 2.70× | 2.86× | 3.83× | 3.22× | +13% | -16% |
| mr | 9.5 MB | 2.71× | 4.08× | 3.63× | 4.28× | +5% | +18% |
| nci | 32.0 MB | 11.23× | 18.51× | 19.30× | 25.65× | +39% | +33% |
| ooffice | 5.9 MB | 1.99× | 2.15× | 2.54× | 2.56× | +19% | +1% |
| osdb | 9.6 MB | 2.71× | 3.60× | 3.54× | 4.04× | +12% | +14% |
| reymont | 6.3 MB | 3.64× | 5.32× | 5.03× | 5.93× | +11% | +18% |
| samba | 20.6 MB | 4.00× | 4.75× | 5.74× | 5.05× | +6% | -12% |
| sao | 6.9 MB | 1.36× | 1.47× | 1.64× | 1.48× | +1% | -10% |
| webster | 39.5 MB | 3.44× | 4.80× | 4.94× | 5.81× | +21% | +18% |
| xml | 5.1 MB | 8.07× | 12.12× | 11.79× | 12.86× | +6% | +9% |
| x-ray | 8.1 MB | 1.40× | 2.09× | 1.89× | 2.15× | +3% | +14% |
| Total | 202 MB | 3.13× | 3.89× | 4.34× | 4.35× | +12% | ≈ |
Score: MCX beats gzip 12/12, bzip2 12/12, xz 9/12.
xz leads on 3 binary-heavy files (mozilla, samba, sao) where LZMA2's large-window optimal parsing has an advantage. MCX's LZRC engine (L26) narrows this gap: mozilla 3.22× vs xz 3.55×.
Level 28 enables the context mixing engine — a PAQ8-class bit-level compressor for archival use. Extremely slow (~10 KB/s) but achieves the best compression ratios.
| File | Size | bzip2 -9 | MCX L20 | MCX L28 (CM) | vs bzip2 |
|---|---|---|---|---|---|
| alice29.txt | 152 KB | 3.52× | 3.52× | 4.28× | +22% |
| lcet10.txt | 427 KB | 3.96× | 3.98× | 4.93× | +25% |
| plrabn12.txt | 482 KB | 3.31× | 3.33× | 3.89× | +17% |
| asyoulik.txt | 125 KB | 3.16× | 3.15× | 3.74× | +18% |
| xml | 5.1 MB | 12.12× | 12.86× | 15.12× | +25% |
| dickens | 9.7 MB | 3.64× | 4.07× | 4.60× | +26% |
| reymont | 6.3 MB | 5.32× | 5.93× | 6.89× | +30% |
The CM engine uses 58 context models (order-0 through order-14, word, sparse, indirect, cross-context, linguistic), 8 logit-space neural network mixers with cross-terms, and a 3-stage Adaptive Probability Map cascade. It beats bzip2 by 17–30% on all text data and surpasses PAQ8l on alice29.txt.
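Logit-space mixing is standard PAQ machinery: each model's probability is stretched to a logit, combined with learned weights, and squashed back to a probability. A toy mixer in Python (a sketch of the technique, not MCX's 8-mixer implementation):

```python
import math

def stretch(p: float) -> float:
    """Probability -> logit."""
    return math.log(p / (1 - p))

def squash(x: float) -> float:
    """Logit -> probability (inverse of stretch)."""
    return 1 / (1 + math.exp(-x))

class LogisticMixer:
    """Mix model predictions in logit space with an online gradient update."""
    def __init__(self, n_models: int, lr: float = 0.02):
        self.w = [0.0] * n_models
        self.lr = lr

    def mix(self, probs):
        self.x = [stretch(p) for p in probs]
        self.p = squash(sum(w * xi for w, xi in zip(self.w, self.x)))
        return self.p

    def update(self, bit: int):
        err = bit - self.p               # gradient of log loss w.r.t. the mixed logit
        for i, xi in enumerate(self.x):
            self.w[i] += self.lr * err * xi

# The mixer learns to trust the model whose predictions match the bits:
good, uninformative = 0.9, 0.5
m = LogisticMixer(2)
for _ in range(500):
    m.mix([good, uninformative])
    m.update(1)                          # the true bit is always 1
assert m.mix([good, uninformative]) > 0.9
```

Because the update minimizes coding cost (log loss) rather than squared error, a confident-and-correct model gains weight quickly while an uninformative one (logit 0) is simply ignored.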
| File | Size | xz -9 | MCX L20 | Notes |
|---|---|---|---|---|
| enwik8 | 95.4 MB | 3.89× | 4.04× | Wikipedia — beats xz by 4% |
| enwik9 | 953 MB | 4.12× | 4.28× | 1 GB Wikipedia dump |
All benchmarks are reproducible. Download the standard corpora and run:
```sh
# Canterbury Corpus
mkdir -p /tmp/cantrbry && cd /tmp/cantrbry
wget -q https://corpus.canterbury.ac.nz/resources/cantrbry.tar.gz
tar xzf cantrbry.tar.gz

# Benchmark with comparison against system compressors
mcx bench --compare /tmp/cantrbry/

# Silesia Corpus (download from https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia)
mcx bench --compare --format markdown /path/to/silesia/

# Context Mixing (Level 28) — warning: very slow (~10 KB/s)
mcx compress -l 28 /tmp/cantrbry/alice29.txt -o /tmp/alice.mcx
mcx decompress /tmp/alice.mcx -o /tmp/alice.dec
diff /tmp/cantrbry/alice29.txt /tmp/alice.dec  # verify roundtrip
```

Baseline compressors for comparison: gzip -9, bzip2 -9, xz -9 (system packages).
```
Input → [Block Analyzer] → Strategy Selection
                           │
     ┌─────────┬──────────┼──────────┬──────────┬──────────┐
     ▼         ▼          ▼          ▼          ▼          ▼
LZ Pipeline  BWT Pipe  Stride-Δ   LZRC-HC    LZRC-BT   CM Engine
 (L1–L9)    (L10–L14) (L20 auto)   (L24)      (L26)      (L28)
     │         │          │          │          │          │
LZ77 Match  divsufsort  Delta @   HC Match   BT Match  58 Context
 Finding    +MTF+RLE2   stride     Finder     Finder     Models
     │         │          │          │          │          │
tANS/FSE/   Multi-tbl  RLE2+rANS  Adaptive   Adaptive   8 Neural
Adaptive AC   rANS               Range RC   Range RC   Mixers+APM
     │         │          │          │          │          │
     └─────────┴──────────┼──────────┴──────────┴──────────┘
                          ▼
                  [Block Multiplexer]
                  OpenMP Parallelism
                          ▼
                     .mcx output
```
MCX uses a frame-based format with a 20-byte header and variable-size blocks (up to 64 MB). See docs/FORMAT.md for the full specification.
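For illustration only, a fixed-size 20-byte header could be parsed like this; the field layout and magic value below are hypothetical, and the authoritative layout is in docs/FORMAT.md:

```python
import struct

# Hypothetical 20-byte header layout (NOT the real spec):
#   4s  magic
#   B   version
#   B   level
#   H   flags
#   Q   original_size (uint64)
#   I   header CRC32
HEADER_FMT = "<4sBBHQI"
assert struct.calcsize(HEADER_FMT) == 20   # little-endian, no padding

def parse_header(frame: bytes) -> dict:
    """Unpack the fixed-size header from the start of a frame."""
    magic, version, level, flags, orig_size, crc = struct.unpack_from(HEADER_FMT, frame)
    if magic != b"MCX\x00":                # hypothetical magic value
        raise ValueError("not an MCX frame")
    return {"version": version, "level": level,
            "flags": flags, "original_size": orig_size, "crc32": crc}

demo = struct.pack(HEADER_FMT, b"MCX\x00", 2, 20, 0, 1_000_000, 0xDEADBEEF)
info = parse_header(demo)
assert info["original_size"] == 1_000_000 and info["level"] == 20
```

A fixed-width little-endian header like this is what makes `mcx info` and `mcx checksum` cheap: both can validate a frame without decompressing any block.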
MaxCompression is built with production-grade engineering practices:
| Practice | Status |
|---|---|
| Continuous Integration | GitHub Actions — every push triggers build + test on Linux (GCC + Clang, Release + Debug), macOS, and Windows |
| Test Suites | 21 test suites — unit, roundtrip, fuzz, stress, regression, integration, streaming, edge cases, malformed input |
| Memory Safety | Valgrind memcheck runs in CI — leak-check, track-origins, error-exitcode on every level |
| Code Coverage | lcov + Codecov integration in CI pipeline |
| Roundtrip Verification | Canterbury corpus roundtrip at all compression levels (L1–L26) in CI |
| Cross-Platform | CI builds and tests on Linux, macOS, Windows with multiple compilers |
| WASM | Emscripten build + Node.js roundtrip test in CI |
| Python Bindings | Automated binding test in CI (build .so, compress/decompress, verify) |
| pkg-config | Integration test: install, discover via pkg-config, build+link external program |
| API Documentation | Doxygen generation + undocumented symbol check in CI |
| Security | SECURITY.md — vulnerability reporting policy, supported versions |
| Releases | Semantic versioning, prebuilt binaries (Linux/macOS/Windows) on every tagged release |
- ~17,400 lines of C code
- 770+ commits across the project
- 21 test suites — unit tests, roundtrip, fuzz, stress, regression, integration
- CI — Linux (GCC + Clang), macOS, Windows, Valgrind, WASM, coverage, Python bindings
```
maxcomp/
├── include/maxcomp/   Public API (maxcomp.h)
├── lib/
│   ├── entropy/       tANS, FSE, rANS, multi-rANS, adaptive AC, range coder
│   ├── lz/            LZ77, LZRC v2.0, binary tree + hash chain match finders
│   ├── preprocess/    BWT (divsufsort), MTF, RLE2, delta, E8/E9 filter
│   ├── babel/         Stride-delta transform
│   ├── optimizer/     Genetic pipeline optimizer
│   ├── analyzer/      Block analysis (entropy, structure, stride detection)
│   ├── external/      Embedded libdivsufsort (MIT license)
│   └── compat.h       Cross-platform portability layer
├── cli/               Command-line tool (30+ subcommands)
├── bindings/python/   Python ctypes bindings
├── completions/       Bash, Zsh, Fish shell completions
├── tests/             21 test suites (unit, integration, fuzz, stress)
├── docs/              Format spec, API docs, benchmarks, man page
├── valgrind.supp      Valgrind suppressions
└── CMakeLists.txt
```
- FORMAT.md — MCX file format specification
- API.md — C API reference
- DESIGN.md — v2.0 architecture and design decisions
- BENCHMARKS.md — Comprehensive benchmark tables
- ROADMAP.md — Development roadmap and research log
- CHANGELOG.md — Version history
- CONTRIBUTING.md — Contribution guidelines
- `man mcx` — Man page (installed with `cmake --install`)
- BWT + multi-table rANS — beats bzip2 on all standard benchmarks
- Adaptive arithmetic coding on LZ output — best-in-class LZ ratios
- Smart Mode (L20) with stride-delta, E8/E9, multi-trial strategy selection
- LZRC v2.0 — LZ + range coder with BT/HC match finders, rep-matches, matched literals
- OpenMP block parallelism
- Embedded libdivsufsort (2× faster BWT)
- Rich CLI with 30+ commands, multi-file, recursive, benchmarking
- Python bindings with pip install support
- Cross-platform CI (Linux, macOS, Windows)
- Context-mixed literal coding for LZRC
- ARM/ARM64 BCJ filter
- Streaming API for arbitrary-length input
- WASM build for browser usage
- v3.0 format: Huffman-coded LZ tokens (close gap with gzip at same speed)
Contributions are welcome! Please ensure all changes pass the test suite:
```sh
cd build && ctest --output-on-failure
```

For compression ratio changes, include before/after benchmarks on the Canterbury and Silesia corpora. See CONTRIBUTING.md.
GNU General Public License v3.0 — Free for everyone, forever.
MaxCompression is developed by Dreams-Makers Studio.