MCX File Format Specification

Version: 1
Status: Stable
Last updated: 2026-03-18

Overview

MCX is a frame-based lossless compression format. Each .mcx file contains a single frame with a fixed header followed by one or more independently-decompressible blocks.

Frame Header (20 bytes)

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Magic (0x0158434D = "MCX\x01")                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Version     |    Flags      |    Level      |   Strategy    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Original Size (uint64 LE)                  |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Header CRC32                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Offset	Size	Field	Description
0	4	`magic`	`0x0158434D` (`"MCX\x01"` in little-endian)
4	1	`version`	Format version (currently `1`)
5	1	`flags`	Bitfield (see below)
6	1	`level`	Compression level used (1–26)
7	1	`strategy`	Compression strategy (see below)
8	8	`original_size`	Uncompressed size in bytes (LE uint64)
16	4	`header_crc32`	CRC32 of bytes 0–15

Flags

Bit	Value	Name	Description
0	`0x01`	`HAS_ORIG_SIZE`	Original size field is valid
1	`0x02`	`STREAMING`	Stream mode (original size unknown)
2	`0x04`	`E8E9`	E8/E9 x86 filter applied at frame level
3	`0x08`	`ADAPTIVE_BLOCKS`	Variable-sized blocks (original block sizes stored in frame)
4	`0x10`	`INT_DELTA`	Sorted integer delta preprocessing applied
5	`0x20`	`INT_DELTA_W4`	Int-delta width: 0=16-bit, 1=32-bit (only valid with `INT_DELTA`)
6	`0x40`	`LZP`	LZP preprocessing applied (repeated block removal before compression)
7	`0x80`	`NIBBLE_SPLIT`	Nibble-split preprocessing (high/low nibble streams before BWT)

Note: All 8 flag bits are now allocated. Future preprocessing flags require a format extension (v2+ header).

Strategies

Value	Name	Description
0	`STORE`	Uncompressed
1	`DEFAULT`	BWT pipeline (genome-optimized)
2	`LZ_FAST`	LZ77 greedy + tANS
3	`LZ_HC`	LZ77 lazy + hash chains + tANS/FSE/AAC
4	`BWT`	Forced BWT + MTF + RLE + entropy
6	`BABEL`	Smart Mode (L20+, auto-detect)
7	`STRIDE`	Stride-delta preprocessing
8	`LZ24`	LZ77 with 24-bit offsets (16 MB window)
9	`LZRC`	LZ + Range Coder (v2.0, binary tree match finder)
10	`CM`	Context Mixing (L28, bit-level adaptive compression)

Block Layout

After the frame header:

[num_blocks : uint32 LE]
[block_0_compressed_size : uint32 LE] [block_0_data : ...]
[block_1_compressed_size : uint32 LE] [block_1_data : ...]
...

Maximum block size: 64 MB (67,108,864 bytes).

Blocks are independently decompressible, enabling parallel decoding.

Block Types

The first byte of each block's data identifies the compression method:

Byte	Type	Description
`0x00`	STORE	Uncompressed (block data follows directly)
`0xA8`	LZ16+rANS	LZ77 (16-bit offsets) + rANS entropy coding
`0xA9`	LZ24+rANS	LZ77 (24-bit offsets) + rANS entropy coding
`0xAA`	LZ16+FSE	LZ77 (16-bit offsets) + FSE entropy coding
`0xAB`	LZ16+RAW	LZ77 (16-bit offsets), no entropy coding
`0xAC`	LZ24+FSE	LZ77 (24-bit offsets) + FSE entropy coding
`0xAD`	LZ24+RAW	LZ77 (24-bit offsets), no entropy coding
`0xAE`	LZ16+AAC	LZ77 (16-bit offsets) + adaptive arithmetic coding
`0xAF`	LZ24+AAC	LZ77 (24-bit offsets) + adaptive arithmetic coding
`0xB0`	LZRC	LZ + Range Coder (v2.0, adaptive models)
Other	BWT Genome	BWT pipeline — byte encodes configuration (see below)

BWT Genome Byte

For BWT-pipeline blocks, the first byte encodes the processing pipeline configuration:

Bits	Field	Values
0	`use_bwt`	0 = skip, 1 = apply BWT
1	`use_mtf_rle`	0 = skip, 1 = apply MTF + RLE
2	`use_delta`	0 = skip, 1 = apply delta coding
3–4	`entropy_coder`	0 = Huffman, 1 = rANS, 2 = CM-rANS
5–7	`cm_learning`	6 = multi-table rANS, 7 = RLE2 active

BWT Block Data

[genome_byte : 1 byte]
[primary_index : uint32 LE]    — BWT primary index for inverse transform
[compressed_data : ...]         — entropy-coded output

Multi-table rANS Format

Used when cm_learning = 6 in the genome byte.

[original_size  : uint32 LE]    — pre-entropy size
[n_tables       : uint8]        — number of frequency tables (4–6)
[n_groups       : uint32 LE]    — number of 50-byte groups

For each table (n_tables times):
  [bitmap        : 32 bytes]    — bitfield of active symbols (bit i = symbol i present)
  [freq_0        : varint]      — 14-bit frequency for first active symbol
  [freq_1        : varint]      — ...
  ...

[sel_comp_size  : uint32 LE]    — compressed selector stream size
[sel_data       : ...]          — rANS-compressed table selectors (MTF'd)

[body           : ...]          — interleaved 2-state rANS coded data

Varint encoding: Values 0–127 use 1 byte. Values 128–16383 use 2 bytes (high bit set in first byte).

Adaptive Arithmetic Coding Format

Used for LZ block types 0xAE and 0xAF. Order-1 adaptive model with 256 contexts.

[original_size  : uint32 LE]    — uncompressed LZ token stream size
[ac_data        : ...]          — arithmetic-coded bitstream

The decoder reconstructs the model adaptively — no explicit frequency tables are stored.

LZRC Format (Block Type `0xB0`)

v2.0 LZ + Range Coder. Uses a binary tree match finder with adaptive range-coded models.

[original_size  : uint32 LE]    — uncompressed block size
[window_log     : uint8]        — window size as log2 (20=1MB, 24=16MB)
[rc_data        : ...]          — range-coded token stream

Token types (range-coded):

is_match (1 bit, context-dependent): 0 = literal, 1 = match
Literal: 8-bit byte coded via bit-tree (16 contexts based on previous byte)
Match:
- is_rep (1 bit): 0 = new distance, 1 = repeat distance
- rep_index (if is_rep=1): binary tree encoding of rep0–rep3
- length: 3-tier model (short 4–11, medium 12–19, extra 20–275)
- distance (if is_rep=0): 6-bit slot tree + extra bits + alignment

Distance slot encoding (64 slots):

Slots 0–3: no extra bits (distances 0–3)
Slots 4–17: 1–6 context-coded extra bits
Slots 18+: direct bits (fixed 50/50) + 4 alignment bits

All models are adaptive — no tables stored in the stream. Encoder and decoder maintain identical state.

E8/E9 x86 Filter

When the E8E9 flag (bit 2) is set in the frame header:

Encoding: Before compression, scan for 0xE8 (CALL) and 0xE9 (JMP) opcodes. Convert the following 4-byte relative address to absolute (add current position).
Decoding: After decompression, apply the inverse transform (subtract position from addresses).

Auto-detected when ≥ 0.5% of bytes are 0xE8 or 0xE9.

Stride-Delta Preprocessing

For structured binary data (detected by Smart Mode):

Detect optimal stride width (1–512 bytes) via autocorrelation
Apply delta coding at the detected stride: out[i] = in[i] - in[i - stride]
Result contains many zeros → compresses efficiently with RLE2 + rANS

The stride value is encoded in the block's genome/metadata.

Preprocessing Flags

Adaptive Blocks (`ADAPTIVE_BLOCKS`, 0x08)

When set, block sizes vary based on data entropy. High-entropy regions use smaller blocks (≤4 MB) and low-entropy regions use larger blocks (≤64 MB). An additional table of original block sizes is stored after the block count. Only used on BWT strategies for files >64 MB.

Integer Delta (`INT_DELTA`, 0x10 + `INT_DELTA_W4`, 0x20)

Auto-detected on sorted integer sequences. The encoder detects runs of monotonically increasing 16-bit or 32-bit integers and replaces them with their deltas (differences between consecutive values). Width is indicated by bit 5: 0 = 16-bit integers, 1 = 32-bit integers. Achieves up to 9× improvement on sorted uint16 arrays.

LZP Preprocessing (`LZP`, 0x40)

Lempel-Ziv Prediction removes repeated blocks before the main compression pipeline. The outer frame stores the LZP-compressed data and original size; decompression reverses LZP first, then passes the result through the standard decoder.

Nibble Split (`NIBBLE_SPLIT`, 0x80)

Splits each byte into high and low nibbles, grouping all high nibbles together followed by all low nibbles. Improves BWT compression on structured binary data where nibble-level patterns exist. Applied as a trial — only used when it reduces output size.

Constants

Constant	Value	Description
Magic	`0x0158434D`	Frame identifier
Max block size	64 MB	Maximum uncompressed block
rANS precision	14 bits	Frequency table resolution (M = 16384)
Multi-rANS tables	4–6	K-means clustered tables
Multi-rANS group size	50 bytes	Symbols per group for table selection
Max match length	258 (LZ) / 273 (LZRC)	Maximum match length
LZ16 max offset	65535	16-bit offset limit
LZ24 max offset	16777216	24-bit offset (16 MB window)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

MCX File Format Specification

Overview

Frame Header (20 bytes)

Flags

Strategies

Block Layout

Block Types

BWT Genome Byte

BWT Block Data

Multi-table rANS Format

Adaptive Arithmetic Coding Format

LZRC Format (Block Type `0xB0`)

E8/E9 x86 Filter

Stride-Delta Preprocessing

Preprocessing Flags

Adaptive Blocks (`ADAPTIVE_BLOCKS`, 0x08)

Integer Delta (`INT_DELTA`, 0x10 + `INT_DELTA_W4`, 0x20)

LZP Preprocessing (`LZP`, 0x40)

Nibble Split (`NIBBLE_SPLIT`, 0x80)

Constants

Uh oh!

FilesExpand file tree

FORMAT.md

Latest commit

History

FORMAT.md

File metadata and controls

MCX File Format Specification

Overview

Frame Header (20 bytes)

Flags

Strategies

Block Layout

Block Types

BWT Genome Byte

BWT Block Data

Multi-table rANS Format

Adaptive Arithmetic Coding Format

LZRC Format (Block Type 0xB0)

E8/E9 x86 Filter

Stride-Delta Preprocessing

Preprocessing Flags

Adaptive Blocks (ADAPTIVE_BLOCKS, 0x08)

Integer Delta (INT_DELTA, 0x10 + INT_DELTA_W4, 0x20)

LZP Preprocessing (LZP, 0x40)

Nibble Split (NIBBLE_SPLIT, 0x80)

Constants

LZRC Format (Block Type `0xB0`)

Adaptive Blocks (`ADAPTIVE_BLOCKS`, 0x08)

Integer Delta (`INT_DELTA`, 0x10 + `INT_DELTA_W4`, 0x20)

LZP Preprocessing (`LZP`, 0x40)

Nibble Split (`NIBBLE_SPLIT`, 0x80)