
Custom dictionary #45

Open
danielrh wants to merge 10 commits into master from custom-dictionary


@danielrh

Collaborator

No description provided.

danielrh and others added 9 commits June 14, 2026 11:42
Port the compound-dictionary mechanism from the C implementation
(c/dec/decode.c). The custom dictionary used to be prepended into the
ring buffer and folded into max_distance; once output exceeded the
window, the wrap overwrote the dictionary bytes and dictionary-range
distances failed with ERROR_FORMAT_DICTIONARY. The C decoder instead
keeps attached dictionaries in separate buffers: max_distance is
min(pos, max_backward_distance), distances in
(max_distance, max_distance + dict_size] address the dictionary
directly, and static-dictionary word ids start beyond that range.
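
The distance layout described above can be sketched as follows. This is a minimal illustration, not the code in decode.rs: the enum, the function name and the exact address arithmetic (nearest dictionary byte at distance `max_distance + 1`) are assumptions modeled on the description.

```rust
/// Where a decoded backward distance points. Illustrative names only.
#[derive(Debug, PartialEq)]
enum DistanceTarget {
    /// Ordinary LZ77 copy from already-produced output.
    RingBuffer { distance: usize },
    /// Byte address inside the attached dictionary (0 = first byte).
    AttachedDictionary { address: usize },
    /// Zero-based offset into the static-dictionary word id space.
    StaticDictionary { word_offset: usize },
}

/// Classifies `distance` (>= 1) given the current output position,
/// the window's maximum backward distance and the dictionary size.
fn classify_distance(
    distance: usize,
    pos: usize,
    max_backward_distance: usize,
    dict_size: usize,
) -> DistanceTarget {
    // The dictionary is no longer folded into max_distance.
    let max_distance = pos.min(max_backward_distance);
    if distance <= max_distance {
        DistanceTarget::RingBuffer { distance }
    } else if distance <= max_distance + dict_size {
        // Assumed mapping: the smallest dictionary distance addresses
        // the last dictionary byte, the largest addresses byte 0.
        DistanceTarget::AttachedDictionary {
            address: dict_size - (distance - max_distance),
        }
    } else {
        // Static-dictionary word ids start beyond the dictionary range.
        DistanceTarget::StaticDictionary {
            word_offset: distance - max_distance - dict_size - 1,
        }
    }
}
```

Because the dictionary lives in its own buffer, the classification keeps working after `pos` exceeds the window, which is the case the old prepend scheme got wrong.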

The two schemes are byte-identical until the ring buffer wraps, so
existing streams decode unchanged (all prior tests pass). Streams where
content + dictionary exceed the window -- which the C encoder happily
produces -- now decode correctly too, as do dictionaries larger than
the window (the old code silently truncated them to ring buffer size).

The BrotliDecoderCompoundDictionary struct supports up to 15 chunks in
preparation for multi-dictionary attach (#27); the dictionary passed to
new_with_custom_dictionary becomes chunk 0 at stream initialization.
A copy interrupted by ring buffer exhaustion resumes via
BROTLI_STATE_COMMAND_POST_WRITE_1, mirroring the C state machine.

Tested with a checked-in fixture (4KiB dict, 64KiB output, 1KiB window,
produced by brotli 1.1.0 -w 10 -q 9 -D) that fails on the previous
code, plus differential testing against the C decoder over random
dict/content pairs at lgwin 10..24, q 5/9/11 and large_window=26.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Expose the compound-dictionary machinery introduced for #42 as a public
API, the first half of shared-dictionary support (#27):

- BrotliState::attach_dictionary attaches up to 15 raw LZ77 prefix
  dictionaries; allowed only before any compressed data is processed,
  matching the C BrotliDecoderAttachDictionary contract. The most
  recently attached dictionary is nearest in backward-distance space,
  and a dictionary passed to new_with_custom_dictionary is always the
  furthest chunk.
- attach_dictionary plumbed through Decompressor / DecompressorWriter
  and their CustomAlloc/CustomIo layers.
- FFI: BrotliDecoderAttachDictionary with BrotliSharedDictionaryType
  (RAW supported; SERIALIZED reserved, returns failure for now). The
  data is copied with the decoder's allocator, so callers need not
  keep it alive.

Attaching a dictionary in N chunks is byte-equivalent to attaching the
concatenation, so the tool's repeated -dict= flags (which concatenate)
already match the new semantics; tests cover chunk-boundary-crossing
copies and rejection of late attachment.
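
The chunk ordering and the concatenation equivalence can be modeled with a small sketch. `CompoundDictionary`, `attach` and `byte_at` here are hypothetical names for illustration; they are not the crate's types, and the real decoder copies ranges rather than single bytes.

```rust
/// Illustrative model of a compound dictionary: chunks in attach order.
struct CompoundDictionary<'a> {
    chunks: Vec<&'a [u8]>,
}

impl<'a> CompoundDictionary<'a> {
    /// Chunk limit stated in the commit message.
    const MAX_CHUNKS: usize = 15;

    fn new() -> Self {
        Self { chunks: Vec::new() }
    }

    /// Appends a chunk; returns false once the limit is reached.
    fn attach(&mut self, chunk: &'a [u8]) -> bool {
        if self.chunks.len() >= Self::MAX_CHUNKS {
            return false;
        }
        self.chunks.push(chunk);
        true
    }

    fn total_size(&self) -> usize {
        self.chunks.iter().map(|c| c.len()).sum()
    }

    /// Byte located `back` positions before the end of the dictionary
    /// space; `back == 1` is the last byte of the most recent chunk.
    fn byte_at(&self, back: usize) -> Option<u8> {
        let total = self.total_size();
        if back == 0 || back > total {
            return None;
        }
        // Address within the virtual concatenation of all chunks.
        let mut address = total - back;
        for chunk in &self.chunks {
            if address < chunk.len() {
                return Some(chunk[address]);
            }
            address -= chunk.len();
        }
        None
    }
}
```

Under this model, attaching `b"hello "` and then `b"world"` yields the same bytes at every backward offset as attaching `b"hello world"` once, and the most recently attached chunk occupies the smallest offsets.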

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Port the shared-brotli serialized dictionary format (draft-vandevenne-
shared-brotli-format, c/common/shared_dictionary.c) to complete #27:

- New shared_dictionary module parses the 0x91 0x00 container: an
  optional LZ77 prefix chunk (attached as a compound dictionary chunk),
  up to 64 custom word lists (word lengths 4..=31) and transform lists
  (length-prefixed prefix/suffix stringlets, transform types including
  the UTF-8-aware SHIFT_FIRST/SHIFT_ALL with parameters), a dictionary
  table mixing custom and built-in lists, and an optional 64-entry
  literal-context map for per-context dictionary selection.
- Parsed metadata is packed into a u32 arena allocated with the
  decoder's existing allocator so no_std builds need no new machinery;
  word/transform data is referenced by offset into the owned blob.
- decode.rs gains the generalized dictionary-word path from
  c/dec/decode.c: context-map dispatch via the current literal context,
  identity-cutoff fast path, and the cross-dictionary fallback scan for
  out-of-range word addresses. Streams that attach no custom lists take
  the unchanged built-in path. Ring buffer write-ahead slack grows to
  542 bytes since custom transforms may emit 255+31+255 bytes per word.
- API: BrotliState::attach_serialized_dictionary plus reader/writer
  plumbing; FFI BrotliDecoderAttachDictionary now accepts
  BROTLI_SHARED_DICTIONARY_SERIALIZED; tool flag -serialized_dict=.
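
As a rough check of the slack figure above, the sketch below applies only a prefix and a suffix to a word and bounds the output length. It assumes stringlets of at most 255 bytes and words of 4..=31 bytes, as the numbers in this commit message suggest; `apply_affixes` is a hypothetical helper that ignores the transform type (SHIFT_FIRST, SHIFT_ALL and so on).

```rust
/// Limits taken from the commit message; names are illustrative.
const MAX_STRINGLET_LEN: usize = 255;
const MIN_WORD_LEN: usize = 4;
const MAX_WORD_LEN: usize = 31;
const WRITE_AHEAD_SLACK: usize = 542;

/// Emits prefix + word + suffix, or None if an input is out of range.
fn apply_affixes(prefix: &[u8], word: &[u8], suffix: &[u8]) -> Option<Vec<u8>> {
    if prefix.len() > MAX_STRINGLET_LEN || suffix.len() > MAX_STRINGLET_LEN {
        return None;
    }
    if word.len() < MIN_WORD_LEN || word.len() > MAX_WORD_LEN {
        return None;
    }
    let mut out = Vec::with_capacity(prefix.len() + word.len() + suffix.len());
    out.extend_from_slice(prefix);
    out.extend_from_slice(word);
    out.extend_from_slice(suffix);
    Some(out)
}
```

The longest output under these limits is 255 + 31 + 255 = 541 bytes, which fits inside the 542-byte slack.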

Tested against the C implementation built with BROTLI_EXPERIMENTAL:
checked-in fixtures (custom words + transforms; context-based selection
which also exercises the fallback scan) decode identically, as do
randomized serialized dictionaries across q5/9/11 and lgwin 12/18/22.
Parser and transform unit tests cover malformed-input rejection, SHIFT
on multi-byte UTF-8, and prefix/suffix application.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Check in 20 (dictionary, content, compressed) fixtures produced by the
reference C encoder with dictionaries attached: 8 randomized serialized
dictionaries (random word lists, transform lists incl. SHIFT params,
context maps, LZ77 prefixes) each at two of q5/q11 x lgwin 12/18/22,
plus 4 randomized raw dictionaries at lgwin 10..26 (including large
window) sized so dictionary references outlive the ring buffer wrap.
Every fixture was verified to roundtrip with the C decoder at
generation time.

test_dictionary_corpus sweeps the corpus directory, so the Rust decoder
stays differentially tested against the C implementation without
needing a C toolchain at test time. scripts/dict_corpus/generate.py
plus harness.c regenerate the corpus from a google/brotli checkout.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- README notes for the #42 fix, the attach_dictionary API and serialized
  shared dictionary support (#27).
- New fuzz target decompress_with_dictionaries splits its input into a
  serialized dictionary, a raw dictionary and a stream, covering the
  serialized parser, compound-dictionary copies and the generalized
  word path.
- Deterministic mutation sweep test: every byte of a valid serialized
  dictionary is corrupted and attach/decode must fail cleanly rather
  than panic (runs in debug too, with overflow and bounds checks).
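
The shape of such a mutation sweep might look like the sketch below. It is not the test in this PR: `toy_parse` is a hypothetical stand-in for the real attach/decode call, and the XOR mutation is one arbitrary way to corrupt a byte.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Outcome counts for one sweep over a valid input.
#[derive(Debug, Default)]
struct SweepReport {
    rejected: usize,
    accepted: usize,
    /// Offsets whose corruption made the parser panic.
    panicked_at: Vec<usize>,
}

/// Corrupts each byte of `valid` in turn and records how `parse` reacts.
fn mutation_sweep<F>(valid: &[u8], parse: F) -> SweepReport
where
    F: Fn(&[u8]) -> Result<(), String>,
{
    let mut report = SweepReport::default();
    for offset in 0..valid.len() {
        let mut mutated = valid.to_vec();
        mutated[offset] ^= 0xFF; // guaranteed to change the byte
        match catch_unwind(AssertUnwindSafe(|| parse(mutated.as_slice()))) {
            Ok(Ok(())) => report.accepted += 1,
            Ok(Err(_)) => report.rejected += 1,
            Err(_) => report.panicked_at.push(offset),
        }
    }
    report
}

/// Hypothetical toy parser: two magic bytes, then a one-byte payload
/// length that must match the remaining input exactly.
fn toy_parse(data: &[u8]) -> Result<(), String> {
    if data.len() < 3 || data[0] != 0x91 || data[1] != 0x00 {
        return Err("bad magic".to_string());
    }
    if 3 + data[2] as usize != data.len() {
        return Err("length mismatch".to_string());
    }
    Ok(())
}
```

A test built on this would assert that `panicked_at` is empty. Whether an accepted mutant also counts as a failure depends on the format; the toy parser accepts payload corruption, whereas the commit message expects attach or decode to fail.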

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Found by the differential_dictionaries fuzzer within the first minute:
the C implementation checks meta_block_remaining_len < 0 when a
metablock completes (c/dec/decode.c BROTLI_STATE_METABLOCK_DONE) and
fails with BROTLI_DECODER_ERROR_FORMAT_BLOCK_LENGTH_2, but the Rust
port was missing the check, silently accepting corrupted streams whose
final copy or dictionary word ran past the declared metablock length.
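
The missing check can be illustrated with a simplified model. The function and error names below are hypothetical, and the command loop is reduced to a list of output lengths; the only behavior taken from the description is that a negative remaining length at metablock completion is an error.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    /// Stand-in for BROTLI_DECODER_ERROR_FORMAT_BLOCK_LENGTH_2.
    FormatBlockLength2,
}

/// Consumes `declared_len` bytes of metablock budget with the given
/// per-command output lengths, then applies the completion check.
fn check_metablock(declared_len: i32, command_lengths: &[i32]) -> Result<(), DecodeError> {
    let mut remaining = declared_len;
    for &len in command_lengths {
        if remaining <= 0 {
            break; // the metablock is complete, stop issuing commands
        }
        // The final copy or dictionary word may overshoot the budget.
        remaining -= len;
    }
    // The check the Rust port was missing: overshoot is a format error.
    if remaining < 0 {
        return Err(DecodeError::FormatBlockLength2);
    }
    Ok(())
}
```

For example, a metablock declaring 10 bytes with commands of 4 and 6 bytes passes, while commands of 4 and 9 bytes leave the remaining length at -3 and are rejected.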

This is a pre-existing divergence (reproduced on v5.0.1, with no
dictionary attached); it matters more with shared dictionaries since
custom transforms make oversized dictionary words easier to construct.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cargo fuzz target differential_dictionaries (feature c-compat) compiles
a google/brotli checkout -- located via BROTLI_C_ROOT, with
BROTLI_EXPERIMENTAL so serialized dictionaries work -- into the fuzz
binary and checks three properties per input:

1. Round trip: the input is shaped into a valid serialized shared
   dictionary, up to two raw dictionaries, and content referencing
   them; whatever stream the C encoder emits with those dictionaries
   attached (quality 1..11, lgwin 10..26), the Rust decoder must
   reproduce the content byte-for-byte (the C decoder is run as a
   sanity check).
2. Attach agreement: mutated serialized dictionaries must be accepted
   or rejected identically by BrotliDecoderAttachDictionary and
   attach_serialized_dictionary.
3. Verdict agreement: mutated and truncated streams must yield the
   same success/failure verdict from both decoders, with identical
   output on success.
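
The verdict-agreement property (3) reduces to a small comparison. The sketch below is an assumption about how it could be expressed; `verdicts_agree` is a hypothetical helper, and the actual fuzz target may compare results differently.

```rust
/// Returns true when two decoders agree: both fail (error details are
/// not compared), or both succeed with identical output.
fn verdicts_agree<E1, E2>(rust: Result<Vec<u8>, E1>, c: Result<Vec<u8>, E2>) -> bool {
    match (rust, c) {
        (Ok(a), Ok(b)) => a == b,
        (Err(_), Err(_)) => true,
        _ => false,
    }
}
```

Ignoring the error values keeps the property robust to the two implementations reporting different error codes for the same malformed stream.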

Within the first minute the fuzzer caught the missing metablock-length
check fixed in the previous commit (a pre-existing divergence dating
back to at least 5.0.1); after the fix, 192k executions over 25
minutes under AddressSanitizer found no further divergence.

Also silences new rustc lifetime-elision warnings in
shared_dictionary.rs and documents the differential testing workflow
in the README.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@danielrh danielrh force-pushed the custom-dictionary branch from 825cbdd to 61c640c Compare June 14, 2026 18:45