MaxCompression

High-ratio lossless data compression library and CLI



MaxCompression (MCX) is a lossless compression library and CLI written in portable C99. It combines multiple compression strategies — LZ77 with adaptive entropy coding, Burrows-Wheeler Transform with multi-table rANS, LZRC (LZ + range coder), and stride-delta preprocessing — under a unified API that automatically selects the best pipeline for each data type.

MCX targets maximum compression ratio while maintaining practical speeds. It matches or beats bzip2 on the standard benchmark corpora and competes with xz/LZMA2 on most data types.

Highlights

| Metric | MCX | Best Alternative |
|---|---|---|
| kennedy.xls (structured binary) | 50.1× | xz: 21.0× — 2.4× better |
| nci (chemical text, 33 MB) | 25.7× | xz: 19.3× — 33% better |
| alice29.txt (English text, L20) | 3.52× | bzip2: 3.52× — matches bzip2 |
| alice29.txt (English text, L28 CM) | 4.28× | PAQ8l: 4.28× — beats PAQ8l |
| mozilla (50 MB binary archive) | 3.22× | xz: 3.55× — 91% of xz |
| enwik8 (100 MB Wikipedia) | 4.04× | xz: 3.89× — beats xz by 4% |
| Silesia corpus (202 MB total) | 4.35× | bzip2: 3.89× — +12% |

Features

Compression Engines

  • Smart Mode (L20) — automatically detects data type and selects the optimal pipeline
  • LZ77 (L1–L9) — fast compression with greedy/lazy matching and hash chain match finders
  • BWT + multi-table rANS (L10–L14) — Burrows-Wheeler Transform with K-means clustered frequency tables
  • LZRC v2.0 (L24–L26) — LZ + adaptive range coder with binary tree or hash chain match finder, LZMA-style matched literal coding, 4-state machine, rep-match distances
  • Context Mixing (L28) — PAQ8-class bit-level compressor: 58 context models, 8 logit-space neural mixers, 3-stage APM cascade, adaptive StateMap — beats bzip2 by 17–30% on text, beats PAQ8l on alice29
  • Stride-Delta — auto-detects fixed-width records (1–512 byte stride) for structured binary data

Entropy Coding

  • Multi-table rANS — 4–6 frequency tables with K-means clustering, within 0.01 bits/symbol of entropy
  • Adaptive Arithmetic Coding — order-1 AC with Fenwick-tree accelerated decoding (O(log n) per symbol)
  • Adaptive Range Coder — bit-level context modeling with matched literal coding for LZRC
  • tANS/FSE — 4-stream interleaved table ANS for fast LZ decompression

Preprocessing

  • E8/E9 x86 filter — CALL/JMP address normalization (+16% on x86 binaries)
  • RLE2 — bijective base-2 zero-run encoding (≈log₂(N) symbols for a run of N zeros)
  • Genetic optimizer — evolves the pipeline configuration per block at L10–L14

CLI

  • 30+ subcommands — compress, decompress, verify, diff, bench, stat, hash, checksum, upgrade, pipe, and more
  • Multi-file and recursive — mcx compress -r ./data/ with glob exclusion patterns
  • Rich benchmarking — JSON/CSV/Markdown output, --compare against gzip/bzip2/xz, --aggregate for directories
  • Decompress aliases — mcx x, mcx d, mcx extract
  • Shell completions — Bash, Zsh, Fish

Library

  • Simple C API — mcx_compress(), mcx_decompress(), mcx_get_frame_info()
  • Python bindings — ctypes-based, pip-installable
  • OpenMP parallel — block-level parallelism, configurable thread count
  • Pure C99 — no C++ dependency, compiles with GCC, Clang, MSVC
  • Cross-platform — Linux, macOS, Windows

Quick Start

# Build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Compress
./build/bin/mcx compress myfile.txt                # fast (L3)
./build/bin/mcx compress --best myfile.txt          # max compression (L20)

# Decompress
./build/bin/mcx decompress myfile.txt.mcx

# Benchmark
./build/bin/mcx bench myfile.txt
./build/bin/mcx bench --compare mydir/              # vs gzip/bzip2/xz

# Run tests
cd build && ctest --output-on-failure

Installation

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
sudo cmake --install build       # installs to /usr/local

# Shell completions
cp completions/mcx.bash ~/.local/share/bash-completion/completions/mcx
cp completions/mcx.zsh /usr/local/share/zsh/site-functions/_mcx
cp completions/mcx.fish ~/.config/fish/completions/mcx.fish

# Man page
sudo cp docs/mcx.1 /usr/local/share/man/man1/

Requirements: C99 compiler (GCC, Clang, MSVC), CMake ≥ 3.10. Optional: OpenMP for multi-threading.

Usage

CLI

# Compression
mcx compress input.bin                     # default level (L3)
mcx compress -l 20 input.bin               # max compression
mcx compress --fast input.bin              # L3
mcx compress --best input.bin              # L20 Smart Mode
mcx compress -l 26 binary.bin              # LZRC (best for binaries)
mcx compress -l 28 archival.txt            # Context Mixing (max ratio)

# Multi-file & recursive
mcx compress *.txt                         # compress all .txt files
mcx compress -r ./data/ --exclude "*.log"  # recursive with exclusion
mcx decompress *.mcx                       # decompress all

# Inspection
mcx info archive.mcx                       # detailed frame info
mcx info --blocks archive.mcx              # per-block details
mcx ls *.mcx                               # compact multi-file listing
mcx diff old.mcx new.mcx                   # compare two archives
mcx stat rawfile.bin                       # entropy and byte distribution

# Integrity
mcx verify archive.mcx                     # decompress and verify CRC
mcx verify archive.mcx original.bin        # verify against original
mcx checksum archive.mcx                   # verify header CRC32
mcx hash archive.mcx                       # CRC32/FNV of content

# Utilities
mcx cat archive.mcx                        # decompress to stdout
mcx cat archive.mcx | head -c 1024        # pipe first 1KB
mcx pipe -l 6 < input > output.mcx        # stdin/stdout mode
mcx upgrade -l 20 --in-place old.mcx      # recompress at higher level

# Benchmarking
mcx bench input.bin                        # all default levels
mcx bench --all-levels input.bin           # L1-L26
mcx bench --compare input.bin             # vs gzip/bzip2/xz
mcx bench -r ./corpus/ --aggregate        # directory totals
mcx bench --format json input.bin         # JSON output
mcx bench --format csv input.bin          # CSV output
mcx bench --format markdown input.bin     # Markdown table

# Advanced
mcx compress --decompress-check input.bin  # roundtrip verify in memory
mcx compress --atomic input.bin            # crash-safe write
mcx compress --preserve-mtime input.bin    # preserve timestamps
mcx compress --dry-run input.bin           # analyze without writing
mcx compress --estimate input.bin          # fast size estimate
mcx compress --adaptive-level input.bin    # entropy-based auto level

# Self-test
mcx test                                   # built-in roundtrip tests
mcx version --build                        # detailed build info

C API

#include <maxcomp/maxcomp.h>

// Compress
size_t bound = mcx_compress_bound(src_size);
uint8_t* dst = malloc(bound);
size_t comp_size = mcx_compress(dst, bound, src, src_size, 20);
if (mcx_is_error(comp_size)) {
    fprintf(stderr, "Error: %s\n", mcx_get_error_name(comp_size));
}

// Decompress
size_t orig_size = mcx_decompress(out, out_cap, dst, comp_size);

// Inspect
mcx_frame_info info;
mcx_get_frame_info(dst, comp_size, &info);
printf("Original: %zu, Level: %d\n", info.original_size, info.level);

// Version
printf("MCX %s\n", mcx_version_string());

Python

import maxcomp

data = open("input.bin", "rb").read()
compressed = maxcomp.compress(data, level=20)
restored = maxcomp.decompress(compressed)
assert restored == data

info = maxcomp.get_frame_info(compressed)
print(f"Original: {info['original_size']}, Level: {info['level']}")

Compression Levels

| Level | Strategy | Compress Speed | Decompress Speed | Use Case |
|---|---|---|---|---|
| 1–3 | LZ77 greedy + tANS | ~5–10 MB/s | ~15–35 MB/s | Real-time, streaming |
| 6 | LZ77 lazy + rANS | ~3–5 MB/s | ~10–20 MB/s | General purpose |
| 7–9 | LZ77 lazy + adaptive AC | ~2–4 MB/s | ~3–14 MB/s | Best LZ ratio |
| 10–14 | BWT + MTF + RLE2 + multi-rANS | ~1–3 MB/s | ~5–10 MB/s | Text, structured data |
| 20 | Smart Mode (auto-detect) | ~0.3–1 MB/s | ~3–7 MB/s | Maximum compression |
| 24 | LZRC fast (hash chains) | ~1–2 MB/s | ~4–5 MB/s | Fast binary compression |
| 26 | LZRC best (binary tree) | ~0.3–0.5 MB/s | ~4–5 MB/s | Best for binary data |
| 28 | Context Mixing (CM) | ~10–15 KB/s | ~10–15 KB/s | Archival, maximum ratio |

Shortcuts: --fast (L3), --default (L6), --best (L20)

Smart Mode (Level 20)

Analyzes each block and automatically routes to the best pipeline:

  • Structured binary (spreadsheets, audio) → stride-delta + RLE2 + rANS → kennedy.xls 50×
  • Text (UTF-8, source code) → BWT + MTF + RLE2 + multi-rANS → alice29 3.52×
  • x86 executables → E8/E9 filter + BWT → ooffice 2.56×
  • Mixed/binary → multi-trial (tries BWT, LZ, LZRC, keeps smallest)
  • Incompressible → stored uncompressed (no expansion)

Benchmarks

Single-threaded, in-memory, roundtrip-verified. System gzip, bzip2, and xz for baselines.

Canterbury Corpus

| File | Size | gzip -9 | bzip2 -9 | xz -6 | MCX L20 | Winner |
|---|---|---|---|---|---|---|
| alice29.txt | 152 KB | 2.81× | 3.52× | 3.14× | 3.52× | MCX ≈ bzip2 |
| asyoulik.txt | 125 KB | 2.56× | 3.16× | 2.81× | 3.15× | bzip2 ≈ MCX |
| lcet10.txt | 427 KB | 2.95× | 3.96× | 3.57× | 3.98× | MCX |
| plrabn12.txt | 482 KB | 2.48× | 3.31× | 2.91× | 3.33× | MCX |
| kennedy.xls | 1.0 MB | 4.91× | 7.90× | 20.97× | 50.1× | MCX (2.4× better than xz) |
| ptt5 | 513 KB | 9.80× | 10.31× | 12.22× | 10.19× | xz |

Silesia Corpus (202 MB)

The standard benchmark for evaluating compression on real-world data.

| File | Size | gzip -9 | bzip2 -9 | xz -9 | MCX L20 | vs bzip2 | vs xz |
|---|---|---|---|---|---|---|---|
| dickens | 9.7 MB | 2.65× | 3.64× | 3.60× | 4.07× | +12% | +13% |
| mozilla | 48.8 MB | 2.70× | 2.86× | 3.83× | 3.22× | +13% | -16% |
| mr | 9.5 MB | 2.71× | 4.08× | 3.63× | 4.28× | +5% | +18% |
| nci | 32.0 MB | 11.23× | 18.51× | 19.30× | 25.65× | +39% | +33% |
| ooffice | 5.9 MB | 1.99× | 2.15× | 2.54× | 2.56× | +19% | +1% |
| osdb | 9.6 MB | 2.71× | 3.60× | 3.54× | 4.04× | +12% | +14% |
| reymont | 6.3 MB | 3.64× | 5.32× | 5.03× | 5.93× | +11% | +18% |
| samba | 20.6 MB | 4.00× | 4.75× | 5.74× | 5.05× | +6% | -12% |
| sao | 6.9 MB | 1.36× | 1.47× | 1.64× | 1.48× | +1% | -10% |
| webster | 39.5 MB | 3.44× | 4.80× | 4.94× | 5.81× | +21% | +18% |
| xml | 5.1 MB | 8.07× | 12.12× | 11.79× | 12.86× | +6% | +9% |
| x-ray | 8.1 MB | 1.40× | 2.09× | 1.89× | 2.15× | +3% | +14% |
| Total | 202 MB | 3.13× | 3.89× | 4.34× | 4.35× | +12% | +0% |

Score: MCX beats gzip 12/12, bzip2 12/12, xz 9/12.

xz leads on 3 binary-heavy files (mozilla, samba, sao) where LZMA2's large-window optimal parsing has an advantage. MCX's LZRC engine (L26) narrows this gap: mozilla 3.22× vs xz 3.55×.

Context Mixing (Level 28) — Maximum Compression

Level 28 enables the context mixing engine — a PAQ8-class bit-level compressor for archival use. Extremely slow (~10 KB/s) but achieves the best compression ratios.

| File | Size | bzip2 -9 | MCX L20 | MCX L28 (CM) | vs bzip2 |
|---|---|---|---|---|---|
| alice29.txt | 152 KB | 3.52× | 3.52× | 4.28× | +22% |
| lcet10.txt | 427 KB | 3.96× | 3.98× | 4.93× | +25% |
| plrabn12.txt | 482 KB | 3.31× | 3.33× | 3.89× | +17% |
| asyoulik.txt | 125 KB | 3.16× | 3.15× | 3.74× | +18% |
| xml | 5.1 MB | 12.12× | 12.86× | 15.12× | +25% |
| dickens | 9.7 MB | 3.64× | 4.07× | 4.60× | +26% |
| reymont | 6.3 MB | 5.32× | 5.93× | 6.89× | +30% |

The CM engine uses 58 context models (order-0 through order-14, word, sparse, indirect, cross-context, linguistic), 8 logit-space neural network mixers with cross-terms, and a 3-stage Adaptive Probability Map cascade. It beats bzip2 by 17–30% on all text data and surpasses PAQ8l on alice29.txt.

Large Files

| File | Size | xz -9 | MCX L20 | Notes |
|---|---|---|---|---|
| enwik8 | 95.4 MB | 3.89× | 4.04× | Wikipedia — beats xz by 4% |
| enwik9 | 953 MB | 4.12× | 4.28× | 1 GB Wikipedia dump |

Reproducing Benchmarks

All benchmarks are reproducible. Download the standard corpora and run:

# Canterbury Corpus
mkdir -p /tmp/cantrbry && cd /tmp/cantrbry
wget -q https://corpus.canterbury.ac.nz/resources/cantrbry.tar.gz
tar xzf cantrbry.tar.gz

# Benchmark with comparison against system compressors
mcx bench --compare /tmp/cantrbry/

# Silesia Corpus (download from https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia)
mcx bench --compare --format markdown /path/to/silesia/

# Context Mixing (Level 28) — warning: very slow (~10 KB/s)
mcx compress -l 28 /tmp/cantrbry/alice29.txt -o /tmp/alice.mcx
mcx decompress /tmp/alice.mcx -o /tmp/alice.dec
diff /tmp/cantrbry/alice29.txt /tmp/alice.dec  # verify roundtrip

Baseline compressors for comparison: gzip -9, bzip2 -9, xz -9 (system packages).

Architecture

Input → [Block Analyzer] → Strategy Selection
                                 │
     ┌──────────┬──────────┬─────┴────┬──────────┬──────────┐
     ▼          ▼          ▼          ▼          ▼          ▼
LZ Pipeline  BWT Pipe   Stride-Δ   LZRC-HC    LZRC-BT    CM Engine
(L1–L9)      (L10–14)   (L20 auto) (L24)      (L26)      (L28)
     │          │          │          │          │          │
LZ77 Match  divsufsort  Delta @    HC Match   BT Match   58 Context
Finding     +MTF+RLE2   stride     Finder     Finder     Models
     │          │          │          │          │          │
tANS/FSE/   Multi-tbl   RLE2+rANS  Adaptive   Adaptive   8 Neural
Adaptive AC  rANS                  Range RC   Range RC   Mixers+APM
     │          │          │          │          │          │
     └──────────┴──────────┴─────┬────┴──────────┴──────────┘
                                 ▼
                        [Block Multiplexer]
                        OpenMP Parallelism
                                 ▼
                           .mcx output

File Format

MCX uses a frame-based format with a 20-byte header and variable-size blocks (up to 64 MB). See docs/FORMAT.md for the full specification.

Quality & Safety

MaxCompression is built with production-grade engineering practices:

| Practice | Status |
|---|---|
| Continuous Integration | GitHub Actions — every push triggers build + test on Linux (GCC + Clang, Release + Debug), macOS, and Windows |
| Test Suites | 21 test suites — unit, roundtrip, fuzz, stress, regression, integration, streaming, edge cases, malformed input |
| Memory Safety | Valgrind memcheck runs in CI — leak-check, track-origins, error-exitcode on every level |
| Code Coverage | lcov + Codecov integration in CI pipeline |
| Roundtrip Verification | Canterbury corpus roundtrip at all compression levels (L1–L26) in CI |
| Cross-Platform | CI builds and tests on Linux, macOS, Windows with multiple compilers |
| WASM | Emscripten build + Node.js roundtrip test in CI |
| Python Bindings | Automated binding test in CI (build .so, compress/decompress, verify) |
| pkg-config | Integration test: install, discover via pkg-config, build + link an external program |
| API Documentation | Doxygen generation + undocumented symbol check in CI |
| Security | SECURITY.md — vulnerability reporting policy, supported versions |
| Releases | Semantic versioning, prebuilt binaries (Linux/macOS/Windows) on every tagged release |

Project Stats

  • ~17,400 lines of C code
  • 770+ commits across the project
  • 21 test suites — unit tests, roundtrip, fuzz, stress, regression, integration
  • CI — Linux (GCC + Clang), macOS, Windows, Valgrind, WASM, coverage, Python bindings

Project Structure

maxcomp/
├── include/maxcomp/    Public API (maxcomp.h)
├── lib/
│   ├── entropy/        tANS, FSE, rANS, multi-rANS, adaptive AC, range coder
│   ├── lz/             LZ77, LZRC v2.0, binary tree + hash chain match finders
│   ├── preprocess/     BWT (divsufsort), MTF, RLE2, delta, E8/E9 filter
│   ├── babel/          Stride-delta transform
│   ├── optimizer/      Genetic pipeline optimizer
│   ├── analyzer/       Block analysis (entropy, structure, stride detection)
│   ├── external/       Embedded libdivsufsort (MIT license)
│   └── compat.h        Cross-platform portability layer
├── cli/                Command-line tool (30+ subcommands)
├── bindings/python/    Python ctypes bindings
├── completions/        Bash, Zsh, Fish shell completions
├── tests/              21 test suites (unit, integration, fuzz, stress)
├── docs/               Format spec, API docs, benchmarks, man page
├── valgrind.supp       Valgrind suppressions
└── CMakeLists.txt

Documentation

Roadmap

Completed ✅

  • BWT + multi-table rANS — beats bzip2 on all standard benchmarks
  • Adaptive arithmetic coding on LZ output — best-in-class LZ ratios
  • Smart Mode (L20) with stride-delta, E8/E9, multi-trial strategy selection
  • LZRC v2.0 — LZ + range coder with BT/HC match finders, rep-matches, matched literals
  • OpenMP block parallelism
  • Embedded libdivsufsort (2× faster BWT)
  • Rich CLI with 30+ commands, multi-file, recursive, benchmarking
  • Python bindings with pip install support
  • Cross-platform CI (Linux, macOS, Windows)

Future

  • Context-mixed literal coding for LZRC
  • ARM/ARM64 BCJ filter
  • Streaming API for arbitrary-length input
  • WASM build for browser usage
  • v3.0 format: Huffman-coded LZ tokens (close gap with gzip at same speed)

Contributing

Contributions are welcome! Please ensure all changes pass the test suite:

cd build && ctest --output-on-failure

For compression ratio changes, include before/after benchmarks on Canterbury and Silesia corpora. See CONTRIBUTING.md.

License

GNU General Public License v3.0 — Free for everyone, forever.


MaxCompression is developed by Dreams-Makers Studio.
