"c[allum's|ustom|ompressor for] gzip"
cgzip is a standards-compliant, GNU Gzip-compatible compressor, implemented from scratch in C++23. Files compressed using cgzip can be decompressed using gzip -d.
Install the cgzip executable using CMake. An install make target is provided for convenience.
❯ make install

cgzip accepts an input bitstream through stdin and outputs a compressed bitstream to stdout.
❯ install/bin/cgzip < data/calgary_corpus/bib > bib.gz

Compressed files can be decompressed using any Gzip decompressor.
> gzip -cd bib.gz | diff - data/calgary_corpus/bib -s
Files - and data/calgary_corpus/bib are identical
❯ echo "original size = $(stat -c%s data/calgary_corpus/bib | numfmt --to=iec) bytes, compressed size = $(stat -c%s bib.gz | numfmt --to=iec) bytes"
original size = 109K bytes, compressed size = 36K bytes

A high-level description of Gzip is in order!
Gzip refers to both a file format and a program. The gzip program understands the .gz
file format, as well as the DEFLATE bitstream found within .gz files.
The .gz file format is a container format. Although originally designed to support multiple
compression formats, it was never expanded beyond a single compression format: DEFLATE.
The DEFLATE algorithm achieves compression by combining two tricks: LZSS and Huffman coding.
LZSS works by reusing previously-seen patterns through back-references.
Back-references are themselves pairs containing (1) the length of the pattern, and (2)
the distance backwards to the start of a previous occurrence of the pattern.
For instance, ABCZABC could be encoded as ABCZ(3,4). Back-references can also overlap
into the future. For instance, ABABABAB could be encoded as AB(6,2). In DEFLATE, back-references
are bounded to lengths of at least 3 and at most 258; distances are at most 2^15.
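On the decompressor side, resolving a back-reference is just a byte-at-a-time copy from earlier in the output, which is exactly what makes overlapping references like AB(6,2) legal. A minimal sketch (the function name is illustrative, not part of cgzip):

```cpp
#include <cstddef>
#include <string>

// Expand a back-reference (length, distance) against already-decoded
// output. Copying byte-by-byte allows length > distance, so the copy
// may overlap into bytes it has just produced.
std::string resolve_back_reference(std::string out, std::size_t length, std::size_t distance) {
    for (std::size_t i = 0; i < length; ++i) {
        out.push_back(out[out.size() - distance]);
    }
    return out;
}
```

For example, `resolve_back_reference("ABCZ", 3, 4)` yields `ABCZABC`, and `resolve_back_reference("AB", 6, 2)` yields `ABABABAB`.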
Huffman coding is used to determine optimal prefix code lengths for symbols. As Huffman coding is applied
after LZSS, possible symbols include literals (A, B, C, Z in the examples above), as well as
back-references ((3,4) and (6,2) in the examples above). Huffman coding ensures that symbols
are encoded efficiently according to their frequency. How Huffman code lengths are mapped to
prefix codes is, by nature, ambiguous. As the compressor and decompressor must share an
understanding of how Huffman code lengths map to actual prefix codes, this is problematic.
Thankfully, the RFC for DEFLATE (RFC 1951) defines how prefix codes should be formed from lengths.
According to the RFC, the input AABCCDC would have the following prefix code lengths and codes:
| symbol | count | prefix code length | prefix code |
|---|---|---|---|
| A | 2 | 2 | 10 |
| B | 1 | 3 | 110 |
| C | 3 | 1 | 0 |
| D | 1 | 3 | 111 |
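RFC 1951's mapping from lengths to codes is short: count how many codes exist at each length, derive the smallest code of each length, then hand codes out to symbols in increasing symbol order. The following is a sketch of that procedure (not cgzip's actual implementation), with symbols represented as indices into the lengths vector:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assign canonical prefix codes from code lengths, following
// RFC 1951 section 3.2.2. A length of 0 means the symbol is unused.
std::vector<std::string> canonical_codes(const std::vector<int>& lengths) {
    int max_len = 0;
    for (int len : lengths) max_len = std::max(max_len, len);

    // Step 1: count the number of codes of each length.
    std::vector<int> count(max_len + 1, 0);
    for (int len : lengths) if (len > 0) ++count[len];

    // Step 2: compute the smallest code value for each length.
    std::vector<std::uint32_t> next(max_len + 1, 0);
    std::uint32_t code = 0;
    for (int len = 1; len <= max_len; ++len) {
        code = (code + count[len - 1]) << 1;
        next[len] = code;
    }

    // Step 3: assign consecutive codes to symbols in symbol order.
    std::vector<std::string> codes(lengths.size());
    for (std::size_t sym = 0; sym < lengths.size(); ++sym) {
        int len = lengths[sym];
        if (len == 0) continue;
        std::uint32_t c = next[len]++;
        std::string bits;
        for (int b = len - 1; b >= 0; --b) bits += ((c >> b) & 1) ? '1' : '0';
        codes[sym] = bits;
    }
    return codes;
}
```

Applied to the lengths from the AABCCDC example (A=2, B=3, C=1, D=3), it reproduces the codes 10, 110, 0, and 111.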
Although these two tricks are enough to achieve compression, a host of additional techniques and considerations are involved in creating a Gzip compressor. For instance, DEFLATE supports three block types: block type 0, block type 1, and block type 2. Block type 0 is uncompressed. Block type 1 is compressed using a hardcoded table of prefix codes. Block type 2 is compressed using a custom table of prefix codes. The compressor chooses the sizes and types of blocks. This gives rise to a number of design choices:
- How does the compressor choose the block types and block sizes?
- How is the custom prefix code table communicated to the decompressor?
- How is the custom prefix code table itself compressed using another hardcoded prefix code table?
- How are the custom prefix code lengths derived such that the maximum length is upper-bounded by DEFLATE's maximum of 15?
- How are back-references found efficiently?
- If multiple back-references exist, which should be used, given that Huffman coding will later be applied?
- What if encoding a set of literals directly is more efficient than using a back-reference?
- ...
As the DEFLATE specification only defines the protocol between the compressor and decompressor, there is significant leeway in terms of how these questions can be answered. The decompressor is "dumb" in the sense that it does not expect these questions to be answered in any specific way; it just understands how to reverse the effects of Huffman coding and LZSS. This allows iterative algorithmic improvements to Gzip compressors without requiring changes to the specification or decompressors. This also means any two implementations of a Gzip compressor can achieve different compression and speed, and it is up to the designer of a given compressor to address the significant number of design choices.
cgzip is fully-compliant with the DEFLATE specification and .gz file format, making it
drop-in replaceable for any other Gzip compressor.
Refer to the Design Choices section for more on how cgzip
approaches the key design considerations described above.
The data/ folder contains all files used for speed and compression ratio benchmarks.
Within the data/ folder are two corpora frequently used in the domain of data compression for evaluating the performance
of different compression schemes: the Calgary and Canterbury corpora.
Additional custom corpora are also included.
The dust output below illustrates the absolute and relative file sizes within the data/ folder.
> dust data
  4.0K   ├── abc.txt                             │   0%
  4.0K   ├── banana.txt                          │   0%
  4.0K   ├── two_cities.txt                      │   0%
   32K   ├── a32768.txt                          │   0%
   64K   ├── a65536.txt                          │   0%
  332K   ├── english_words.txt                   │   2%
  444K ┌─┴ other                                 │   2%
  4.0K │  ├── grammar.lsp                        │   0%
  8.0K │  ├── xargs.1                            │   0%
   12K │  ├── fields.c                           │   0%
   28K │  ├── cp.html                            │   0%
   40K │  ├── sum                                │   0%
  124K │  ├── asyoulik.txt                       │   1%
  152K │  ├── alice29.txt                        │   1%
  420K │  ├── lcet10.txt                         │   2%
  472K │  ├── plrabn12.txt                       │   2%
  504K │  ├── ptt5                               │   2%
 1008K │  ├── kennedy.xls                        │   5%
  2.7M ┌─┴ canterbury_corpus                     │  14%
   12K │  ├── paper5                             │   0%
   16K │  ├── paper4                             │   0%
   24K │  ├── obj1                               │   0%
   40K │  ├── paper6                             │   0%
   40K │  ├── progc                              │   0%
   48K │  ├── paper3                             │   0%
   52K │  ├── paper1                             │   0%
   52K │  ├── progp                              │   0%
   72K │  ├── progl                              │   0%
   84K │  ├── paper2                             │   0%
   92K │  ├── trans                              │   0%
  100K │  ├── geo                                │   0%
  112K │  ├── bib                                │   1%
  244K │  ├── obj2                               │   1%
  372K │  ├── news                               │   2%
  504K │  ├── pic                                │   2%
  600K │  ├── book2                              │   3%
  752K │  ├── book1                              │   4%
  3.1M ┌─┴ calgary_corpus                        │  16%
  4.0K │  ├── README.txt                         │   0%
  4.0K │  ├── compressed-file-svgrepo-com.svg    │   0%
   32K │  ├── GCA_009858895.3.fasta              │   0%
   36K │  ├── reboot.c                           │   0%
  144K │  ├── pg1513.epub                        │   1%
  164K │  ├── pear.jpg                           │   1%
  256K │  ├── regional_district_weekly_2021.xlsx │   1%
  288K │  ├── jquery-3.6.4.js                    │   1%
  360K │  ├── TheWarofTheWorlds.txt              │   2%
   12M │  ├── Historical_Dates.json              │  62%
   13M ┌─┴ millbay_corpus                        │  69%
   20M ┌─┴ data                                  │ 100%

"Speed" is used here instead of "performance", as "performance" may refer to either speed or compression ratio.
The following system was used for benchmarking.
❯ fastfetch
callumcurtis@wind
-----------------
OS: NixOS 25.05 (Warbler) x86_64
Kernel: Linux 6.12.45
Uptime: 28 mins
Packages: 2252 (nix-system)
Shell: bash 5.2.37
Display (VG27A): 2560x1440 @ 165 Hz in 27" [External]
Display (VG27A): 1440x2560 @ 165 Hz in 27" [External]
WM: Hyprland 0.51.0 (Wayland)
Theme: adw-gtk3 [GTK2/3/4]
Font: DejaVu Sans (10pt) [GTK2/3/4]
Cursor: Bibata-Original-Ice (24px)
Terminal: zellij 0.43.1
CPU: AMD Ryzen 9 5900X (24) @ 3.70 GHz
GPU: NVIDIA GeForce RTX 3070 [Discrete]
Memory: 4.57 GiB / 31.23 GiB (15%)
Swap: Disabled
Disk (/): 0 B / 15.61 GiB (0%) - tmpfs
Disk (/nix): 70.11 GiB / 301.06 GiB (23%) - ext4 [Read-only]

The data/ folder was combined into a single data.tar archive of size 20MB.
❯ tar -cf data.tar data
❯ stat -c%s data.tar | numfmt --to=iec
20M

The archive was then compressed using cgzip and benchmarked using hyperfine.
❯ hyperfine './install/bin/cgzip < data.tar > data.tar.gz'
Benchmark 1: ./install/bin/cgzip < data.tar > data.tar.gz
Time (mean ± σ): 14.224 s ± 0.624 s [User: 14.148 s, System: 0.017 s]
Range (min … max): 13.232 s … 15.102 s 10 runs

The chart below compares cgzip and gzip compression ratios across all files in the data/ folder.
Differences are measured in relative percent using gzip as the baseline, i.e. relative difference = (cgzip ratio − gzip ratio) / gzip ratio × 100%. A compression ratio of 300% from gzip and a compression ratio of 290% from cgzip would therefore translate to a (290 − 300) / 300 × 100% ≈ −3.3% relative difference.
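For clarity, the relative-difference calculation can be written as a small helper (hypothetical; not part of the benchmark tooling):

```cpp
// Relative difference in percent, using the gzip compression ratio as
// the baseline. Negative values mean cgzip compressed less effectively.
double relative_difference(double gzip_ratio, double cgzip_ratio) {
    return (cgzip_ratio - gzip_ratio) / gzip_ratio * 100.0;
}
```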
The following gzip version was used in the comparison (using the default compression level of 6):
β― gzip --version
gzip 1.14
Copyright (C) 2025 Free Software Foundation, Inc.
Copyright (C) 1993 Jean-loup Gailly.
This is free software. You may redistribute copies of it under the terms of
the GNU General Public License <https://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.
Written by Jean-loup Gailly.

The following sections describe some of the design choices taken in cgzip
(make sure to read What is cgzip Really Doing?
first to understand why design choices like these are required under the DEFLATE specification!).
Package merge is used to generate Huffman code lengths based on symbol frequencies. See the implementation and reference for more details. Huffman code lengths are translated into prefix codes according to RFC 1951. See the implementation for more details.
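As a rough illustration of package merge (a sketch under simplifying assumptions, not cgzip's implementation): leaves re-enter the candidate list at every level, adjacent pairs of the previous level are combined into "packages", and after max_len rounds the first 2n − 2 items each contribute one bit of code length to every symbol they contain.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Length-limited Huffman code lengths via package-merge (sketch).
// Each item is a weight plus the multiset of symbols it covers;
// packaging two items sums their weights and unions their symbols.
std::vector<int> package_merge(const std::vector<std::uint64_t>& freqs, int max_len) {
    struct Item {
        std::uint64_t weight;
        std::vector<std::size_t> symbols;
    };
    auto by_weight = [](const Item& a, const Item& b) { return a.weight < b.weight; };

    std::vector<Item> leaves;
    for (std::size_t s = 0; s < freqs.size(); ++s) {
        if (freqs[s] > 0) leaves.push_back({freqs[s], {s}});
    }
    std::sort(leaves.begin(), leaves.end(), by_weight);

    std::vector<int> lengths(freqs.size(), 0);
    if (leaves.size() == 1) {  // degenerate alphabet: one symbol, one-bit code
        lengths[leaves.front().symbols.front()] = 1;
        return lengths;
    }

    std::vector<Item> list = leaves;
    for (int level = 1; level < max_len; ++level) {
        std::vector<Item> next = leaves;  // leaves re-enter at every level
        for (std::size_t i = 0; i + 1 < list.size(); i += 2) {  // package adjacent pairs
            Item pkg{list[i].weight + list[i + 1].weight, list[i].symbols};
            pkg.symbols.insert(pkg.symbols.end(),
                               list[i + 1].symbols.begin(), list[i + 1].symbols.end());
            next.push_back(std::move(pkg));
        }
        std::sort(next.begin(), next.end(), by_weight);
        list = std::move(next);
    }

    // The cheapest 2n - 2 items each add one bit of code length to the
    // symbols they cover.
    for (std::size_t i = 0; i < 2 * leaves.size() - 2; ++i) {
        for (std::size_t s : list[i].symbols) ++lengths[s];
    }
    return lengths;
}
```

On the AABCCDC example (frequencies 2, 1, 3, 1 for A, B, C, D) with a length limit of 3, this yields the lengths 2, 3, 1, 3 from the table above.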
Ring buffers are used to represent look-ahead and look-back buffers, ensuring that memory overhead is upper-bounded by the maximum look-ahead and look-back sizes. The position of the most recent occurrence of each length-three pattern is represented in a hash map. An additional ring buffer is used to represent the "chain" of additional occurrences of the same pattern in the look-back buffer. For example, the hash map could indicate that the pattern "ABC" occurred at position 100 in the look-back buffer. At the same position in the chain buffer, there could be an entry with the value 90, indicating that the pattern "ABC" previously occurred at position 90. In this fashion, the full set of positions for "ABC" are cached in the hash map and chain buffer. See the implementation for more details.
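The head/chain scheme can be sketched as follows (identifiers are illustrative, and the ring-buffer wrap-around of the real look-back window is omitted for brevity):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of hash-chain match finding: `head` maps each length-three
// pattern to its most recent position; `chain[p]` stores the previous
// position of the same pattern, forming a linked list through earlier
// occurrences.
struct MatchFinder {
    static constexpr std::size_t kNone = static_cast<std::size_t>(-1);
    std::unordered_map<std::string, std::size_t> head;
    std::vector<std::size_t> chain;

    // Record the length-three pattern starting at `pos` and return the
    // positions of earlier occurrences, most recent first.
    std::vector<std::size_t> insert(const std::string& data, std::size_t pos) {
        std::string key = data.substr(pos, 3);
        auto it = head.find(key);
        std::size_t prev = (it == head.end()) ? kNone : it->second;

        std::vector<std::size_t> occurrences;
        for (std::size_t p = prev; p != kNone; p = chain[p]) occurrences.push_back(p);

        if (chain.size() <= pos) chain.resize(pos + 1, kNone);
        chain[pos] = prev;  // link this occurrence to the previous one
        head[key] = pos;
        return occurrences;
    }
};
```

Since every position is inserted as the window advances, walking the chain from the hash map entry enumerates all cached occurrences of a pattern, most recent first.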
The block type 2 header is optimized to reduce the number of bits required to represent the header in the presence of trailing zero-length meta-prefix-codes. See the implementation for more details.
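This optimization follows directly from the header layout in RFC 1951 section 3.2.7: the lengths of the meta prefix codes are written in a fixed permutation order, trailing zero lengths may be omitted, and the HCLEN field records how many entries remain (minimum 4). A sketch of the trimming:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Order in which the code lengths for the code-length ("meta") alphabet
// appear in a block type 2 header (RFC 1951 section 3.2.7).
constexpr std::array<int, 19> kOrder = {
    16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15};

// Given the 19 code lengths indexed by symbol, return them permuted into
// header order with trailing zeros trimmed (at least 4 entries kept).
// HCLEN would then be encoded as the returned size minus 4.
std::vector<int> header_code_lengths(const std::array<int, 19>& lengths) {
    std::vector<int> permuted(19);
    for (std::size_t i = 0; i < 19; ++i) permuted[i] = lengths[kOrder[i]];
    std::size_t n = 19;
    while (n > 4 && permuted[n - 1] == 0) --n;
    permuted.resize(n);
    return permuted;
}
```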
The block size is dynamically determined based on a CUSUM algorithm for change-point detection in a categorical data stream. This article gives a good overview of the algorithm for the numerical data stream case. To adapt the algorithm for the categorical data stream case, the probability of each symbol is tracked, rather than the mean of the data stream. Using the log-likelihood ratio, the algorithm determines whether the probability of the current symbol is more likely given the current distribution, rather than the distribution from the warmup period. Once the cumulative sum of log-likelihood ratios has accumulated beyond a threshold, the distribution of the data stream is deemed to have shifted, and a change-point is reported. A new block is then created. See the implementation for more details.
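The categorical CUSUM described above can be sketched as follows (a simplified illustration with assumed names, Laplace smoothing, and threshold handling; not cgzip's actual code). The sum of log-likelihood ratios is clamped at zero, as in standard CUSUM, and crossing the threshold signals a change-point:

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>

// Illustrative CUSUM change-point detector for a categorical (byte)
// stream. During warmup it estimates a baseline distribution; afterwards
// it keeps a running "current" estimate and accumulates the
// log-likelihood ratio log(P_current(x) / P_baseline(x)) per symbol.
class CusumDetector {
public:
    CusumDetector(std::size_t warmup, double threshold)
        : warmup_(warmup), threshold_(threshold) {
        baseline_.fill(1.0);  // Laplace smoothing: start every count at 1
        current_.fill(1.0);
    }

    // Feed one symbol; returns true when a change-point is detected.
    bool observe(unsigned char symbol) {
        ++seen_;
        if (seen_ <= warmup_) {  // still estimating the baseline
            ++baseline_[symbol];
            ++current_[symbol];
            return false;
        }
        ++current_[symbol];
        double p_base = prob(baseline_, warmup_, symbol);
        double p_cur = prob(current_, seen_, symbol);
        sum_ = std::max(0.0, sum_ + std::log(p_cur / p_base));
        return sum_ > threshold_;
    }

private:
    static double prob(const std::array<double, 256>& counts, std::size_t n,
                       unsigned char s) {
        return counts[s] / (static_cast<double>(n) + 256.0);  // smoothed estimate
    }
    std::array<double, 256> baseline_{};
    std::array<double, 256> current_{};
    std::size_t warmup_;
    std::size_t seen_ = 0;
    double threshold_;
    double sum_ = 0.0;
};
```

After the warmup distribution is fixed, a sustained shift in symbol frequencies drives the clamped sum upward until it crosses the threshold and a new block is started.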
The chart below presents the impact of adaptive block sizing on cgzip's compression ratios
across all files in the data/ folder.
The optimal block type is selected by simulating the contents of each block type and comparing the number of bits. Due to the warmup period required for change-point detection, the minimum size of a block is 2^13 bytes. In practice, this is larger than desirable for a block of type 1, as beyond this size, the overhead of block type 2 is relatively small, making block type 2 preferable in essentially all cases. Since block type 1 will never be used, it is disabled to improve speed. It can be re-enabled by increasing the breakpoint value for block type 1 from 0 to a value larger than 2^13 in main.cpp, allowing block type 1 to be considered.

