Custom dictionary #45
Open
danielrh wants to merge 10 commits into
Port the compound-dictionary mechanism from the C implementation (c/dec/decode.c).

The custom dictionary used to be prepended into the ring buffer and folded into max_distance; once output exceeded the window, the wrap overwrote the dictionary bytes and dictionary-range distances failed with ERROR_FORMAT_DICTIONARY.

The C decoder instead keeps attached dictionaries in separate buffers: max_distance is min(pos, max_backward_distance), distances in (max_distance, max_distance + dict_size] address the dictionary directly, and static-dictionary word ids start beyond that range.

The two schemes are byte-identical until the ring buffer wraps, so existing streams decode unchanged (all prior tests pass). Streams where content + dictionary exceed the window -- which the C encoder happily produces -- now decode correctly too, as do dictionaries larger than the window (the old code silently truncated them to ring buffer size).

The BrotliDecoderCompoundDictionary struct supports up to 15 chunks in preparation for multi-dictionary attach (#27); the dictionary passed to new_with_custom_dictionary becomes chunk 0 at stream initialization. A copy interrupted by ring buffer exhaustion resumes via BROTLI_STATE_COMMAND_POST_WRITE_1, mirroring the C state machine.

Tested with a checked-in fixture (4KiB dict, 64KiB output, 1KiB window, produced by brotli 1.1.0 -w 10 -q 9 -D) that fails on the previous code, plus differential testing against the C decoder over random dict/content pairs at lgwin 10..24, q 5/9/11 and large_window=26.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
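For reviewers, the distance ranges described in this commit can be sketched as a small classifier. This is an illustration only, not code from the PR: the `DistanceTarget` enum, the `classify_distance` function, and the exact address arithmetic (dictionary bytes addressed from the end, static-dictionary offsets zero-based) are assumptions made for the example.

```rust
/// Hypothetical classification of a backward distance (names are illustrative).
#[derive(Debug, PartialEq)]
enum DistanceTarget {
    /// Ordinary LZ77 copy from already-decoded output.
    RingBuffer { distance: usize },
    /// Copy from the attached dictionary; `address` is a byte offset from its start.
    CompoundDictionary { address: usize },
    /// Reference past the attached dictionary, into static-dictionary space.
    StaticDictionary { offset: usize },
}

/// Classify `distance` (>= 1) given the current output position, the window
/// limit, and the total size of the attached dictionary.
fn classify_distance(
    distance: usize,
    pos: usize,
    max_backward_distance: usize,
    dict_size: usize,
) -> DistanceTarget {
    // max_distance no longer includes the dictionary size.
    let max_distance = pos.min(max_backward_distance);
    if distance <= max_distance {
        DistanceTarget::RingBuffer { distance }
    } else if distance - max_distance <= dict_size {
        // Assumption: distance max_distance + 1 is the last dictionary byte.
        DistanceTarget::CompoundDictionary {
            address: dict_size - (distance - max_distance),
        }
    } else {
        // Assumption: static-dictionary offsets are zero-based.
        DistanceTarget::StaticDictionary {
            offset: distance - max_distance - dict_size - 1,
        }
    }
}
```

With a 1024-byte window, a 4096-byte dictionary and 2000 bytes already output, distance 1025 lands on the last dictionary byte and distance 5121 is the first one past the dictionary range.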
Expose the compound-dictionary machinery introduced for #42 as a public API, the first half of shared-dictionary support (#27):

- BrotliState::attach_dictionary attaches up to 15 raw LZ77 prefix dictionaries; allowed only before any compressed data is processed, matching the C BrotliDecoderAttachDictionary contract. The most recently attached dictionary is nearest in backward-distance space, and a dictionary passed to new_with_custom_dictionary is always the furthest chunk.
- attach_dictionary plumbed through Decompressor / DecompressorWriter and their CustomAlloc/CustomIo layers.
- FFI: BrotliDecoderAttachDictionary with BrotliSharedDictionaryType (RAW supported; SERIALIZED reserved, returns failure for now). The data is copied with the decoder's allocator, so callers need not keep it alive.

Attaching a dictionary in N chunks is byte-equivalent to attaching the concatenation, so the tool's repeated -dict= flags (which concatenate) already match the new semantics; tests cover chunk-boundary-crossing copies and rejection of late attachment.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
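The chunk-ordering claim (most recent chunk nearest, N chunks equivalent to their concatenation) can be illustrated with a standalone sketch. `copy_from_chunks` is a hypothetical helper, not the PR's API; it assumes chunks are stored in attach order and that a copy running past the end of the last chunk is rejected.

```rust
/// Copy `len` bytes starting `distance_past_window` bytes before the end of
/// the concatenated chunks (1 = last byte of the most recently attached chunk).
/// Returns None if the request falls outside the attached data.
fn copy_from_chunks(
    chunks: &[&[u8]],
    distance_past_window: usize,
    len: usize,
) -> Option<Vec<u8>> {
    let total: usize = chunks.iter().map(|c| c.len()).sum();
    if distance_past_window == 0 || distance_past_window > total {
        return None;
    }
    // Address within the virtual concatenation of all chunks.
    let mut address = total - distance_past_window;
    if len > total - address {
        return None; // assumption: the copy may not run past the last chunk
    }
    let mut out = Vec::with_capacity(len);
    for chunk in chunks {
        if out.len() == len {
            break;
        }
        if address >= chunk.len() {
            // The start of the copy lies in a later chunk.
            address -= chunk.len();
            continue;
        }
        let take = (len - out.len()).min(chunk.len() - address);
        out.extend_from_slice(&chunk[address..address + take]);
        address = 0; // continue at the start of the next chunk
    }
    Some(out)
}
```

Under these assumptions, copying 4 bytes at distance 5 from the chunks `abc` and `defg` crosses the chunk boundary and yields `cdef`, the same bytes as copying from the single chunk `abcdefg`.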
Port the shared-brotli serialized dictionary format (draft-vandevenne-shared-brotli-format, c/common/shared_dictionary.c) to complete #27:

- New shared_dictionary module parses the 0x91 0x00 container: an optional LZ77 prefix chunk (attached as a compound dictionary chunk), up to 64 custom word lists (word lengths 4..=31) and transform lists (length-prefixed prefix/suffix stringlets, transform types including the UTF-8-aware SHIFT_FIRST/SHIFT_ALL with parameters), a dictionary table mixing custom and built-in lists, and an optional 64-entry literal-context map for per-context dictionary selection.
- Parsed metadata is packed into a u32 arena allocated with the decoder's existing allocator so no_std builds need no new machinery; word/transform data is referenced by offset into the owned blob.
- decode.rs gains the generalized dictionary-word path from c/dec/decode.c: context-map dispatch via the current literal context, identity-cutoff fast path, and the cross-dictionary fallback scan for out-of-range word addresses. Streams that attach no custom lists take the unchanged built-in path. Ring buffer write-ahead slack grows to 542 bytes since custom transforms may emit 255+31+255 bytes per word.
- API: BrotliState::attach_serialized_dictionary plus reader/writer plumbing; FFI BrotliDecoderAttachDictionary now accepts BROTLI_SHARED_DICTIONARY_SERIALIZED; tool flag -serialized_dict=.

Tested against the C implementation built with BROTLI_EXPERIMENTAL: checked-in fixtures (custom words + transforms; context-based selection which also exercises the fallback scan) decode identically, as do randomized serialized dictionaries across q5/9/11 and lgwin 12/18/22. Parser and transform unit tests cover malformed-input rejection, SHIFT on multi-byte UTF-8, and prefix/suffix application.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
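To make the write-ahead slack figure concrete, here is a minimal sketch of prefix/suffix application with the bounds quoted in this commit. The function `apply_affixes` and the constant names are hypothetical; only the numbers (word lengths 4..=31, 255-byte affixes, 542-byte slack) come from the commit message, and the sketch covers the plain identity case rather than SHIFT or other transform types.

```rust
// Bounds taken from the commit message; the names are illustrative.
const MAX_AFFIX_LEN: usize = 255;
const MIN_WORD_LEN: usize = 4;
const MAX_WORD_LEN: usize = 31;
const WRITE_AHEAD_SLACK: usize = 542;

/// Emit prefix + word + suffix, rejecting inputs outside the stated bounds.
fn apply_affixes(prefix: &[u8], word: &[u8], suffix: &[u8]) -> Option<Vec<u8>> {
    let word_ok = (MIN_WORD_LEN..=MAX_WORD_LEN).contains(&word.len());
    if !word_ok || prefix.len() > MAX_AFFIX_LEN || suffix.len() > MAX_AFFIX_LEN {
        return None;
    }
    let mut out = Vec::with_capacity(prefix.len() + word.len() + suffix.len());
    out.extend_from_slice(prefix);
    out.extend_from_slice(word);
    out.extend_from_slice(suffix);
    // The largest possible output (255 + 31 + 255 bytes) fits in the slack.
    debug_assert!(out.len() <= WRITE_AHEAD_SLACK);
    Some(out)
}
```

The worst case here is a 255-byte prefix, a 31-byte word and a 255-byte suffix, which stays within the 542-byte slack, so a single transformed word can be written past the nominal ring buffer end without a bounds check per byte.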
Check in 20 (dictionary, content, compressed) fixtures produced by the reference C encoder with dictionaries attached: 8 randomized serialized dictionaries (random word lists, transform lists incl. SHIFT params, context maps, LZ77 prefixes) each at two of q5/q11 x lgwin 12/18/22, plus 4 randomized raw dictionaries at lgwin 10..26 (including large window) sized so dictionary references outlive the ring buffer wrap. Every fixture was verified to roundtrip with the C decoder at generation time.

test_dictionary_corpus sweeps the corpus directory, so the Rust decoder stays differentially tested against the C implementation without needing a C toolchain at test time. scripts/dict_corpus/generate.py plus harness.c regenerate the corpus from a google/brotli checkout.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- README notes for the #42 fix, the attach_dictionary API and serialized shared dictionary support (#27).
- New fuzz target decompress_with_dictionaries splits its input into a serialized dictionary, a raw dictionary and a stream, covering the serialized parser, compound-dictionary copies and the generalized word path.
- Deterministic mutation sweep test: every byte of a valid serialized dictionary is corrupted and attach/decode must fail cleanly rather than panic (runs in debug too, with overflow and bounds checks).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
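The shape of such a mutation sweep can be sketched against a toy parser. Everything below is hypothetical: `toy_parse` is a stand-in container format (two magic bytes, a one-byte length, a payload), not the real serialized-dictionary parser, and `mutation_sweep` only checks the "no panic" half of the property.

```rust
use std::panic;

/// Toy stand-in for a parser: magic 0x91 0x00, one length byte, then payload.
fn toy_parse(data: &[u8]) -> Result<usize, &'static str> {
    if data.len() < 3 || data[0] != 0x91 || data[1] != 0x00 {
        return Err("bad magic");
    }
    let len = data[2] as usize;
    // `get` avoids an out-of-bounds panic on a corrupted length byte.
    let payload = data.get(3..3 + len).ok_or("truncated")?;
    if 3 + payload.len() != data.len() {
        return Err("trailing bytes");
    }
    Ok(len)
}

/// Corrupt every byte of `valid` with a few bit patterns and run `parse` on
/// each mutant. Returns Err(index) for the first byte whose mutation panics.
fn mutation_sweep(valid: &[u8], parse: impl Fn(&[u8])) -> Result<(), usize> {
    for i in 0..valid.len() {
        for flip in [0x01u8, 0x80, 0xFF] {
            let mut mutated = valid.to_vec();
            mutated[i] ^= flip;
            let outcome = panic::catch_unwind(panic::AssertUnwindSafe(|| parse(&mutated)));
            if outcome.is_err() {
                return Err(i);
            }
        }
    }
    Ok(())
}
```

A call such as `mutation_sweep(&valid, |d| { let _ = toy_parse(d); })` returns `Ok(())` when every mutant is either accepted or rejected through the `Result` path. Running the sweep in a debug build keeps overflow and bounds checks active, which is what turns latent arithmetic bugs into observable panics.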
Found by the differential_dictionaries fuzzer within the first minute: the C implementation checks meta_block_remaining_len < 0 when a metablock completes (c/dec/decode.c BROTLI_STATE_METABLOCK_DONE) and fails with BROTLI_DECODER_ERROR_FORMAT_BLOCK_LENGTH_2, but the Rust port was missing the check, silently accepting corrupted streams whose final copy or dictionary word ran past the declared metablock length.

This is a pre-existing divergence (reproduced on v5.0.1, with no dictionary attached); it matters more with shared dictionaries since custom transforms make oversized dictionary words easier to construct.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
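The missing check can be modelled in a few lines. This is a simplified sketch, not the decoder's state machine: `run_metablock`, the `DecodeError` enum, and the representation of commands as a list of output lengths are all assumptions for illustration.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    /// Models BROTLI_DECODER_ERROR_FORMAT_BLOCK_LENGTH_2.
    FormatBlockLength2,
}

/// Consume commands until the declared metablock length is used up, then
/// verify that the last command did not overshoot it.
fn run_metablock(declared_len: i32, command_lengths: &[i32]) -> Result<(), DecodeError> {
    let mut remaining = declared_len;
    for &len in command_lengths {
        if remaining <= 0 {
            break; // metablock complete; later commands belong elsewhere
        }
        remaining -= len;
    }
    // The check the port was missing: a final copy or dictionary word may
    // drive the counter negative, which means the stream is corrupt.
    if remaining < 0 {
        return Err(DecodeError::FormatBlockLength2);
    }
    // A real decoder would wait for more input if remaining > 0; this sketch
    // simply reports success in that case.
    Ok(())
}
```

For a declared length of 10, commands of lengths 4 and 6 finish exactly, while lengths 4 and 9 overshoot by 3 and are rejected.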
cargo fuzz target differential_dictionaries (feature c-compat) compiles a google/brotli checkout -- located via BROTLI_C_ROOT, with BROTLI_EXPERIMENTAL so serialized dictionaries work -- into the fuzz binary and checks three properties per input:

1. Round trip: the input is shaped into a valid serialized shared dictionary, up to two raw dictionaries, and content referencing them; whatever stream the C encoder emits with those dictionaries attached (quality 1..11, lgwin 10..26), the Rust decoder must reproduce the content byte-for-byte (the C decoder is run as a sanity check).
2. Attach agreement: mutated serialized dictionaries must be accepted or rejected identically by BrotliDecoderAttachDictionary and attach_serialized_dictionary.
3. Verdict agreement: mutated and truncated streams must yield the same success/failure verdict from both decoders, with identical output on success.

Within the first minute the fuzzer caught the missing metablock-length check fixed in the previous commit (a pre-existing divergence dating back to at least 5.0.1); after the fix, 192k executions over 25 minutes under AddressSanitizer found no further divergence.

Also silences new rustc lifetime-elision warnings in shared_dictionary.rs and documents the differential testing workflow in the README.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
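Property 3 (verdict agreement) reduces to a small comparison. A minimal sketch follows, assuming each decoder's outcome is represented as a `Result<Vec<u8>, E>`; the helper name `verdicts_agree` is hypothetical and the error types are deliberately left generic, since the two implementations need not share error codes.

```rust
/// True when both decoders succeed with identical output, or both fail.
/// Error values are not compared, only the success/failure verdict.
fn verdicts_agree<E1, E2>(
    rust_result: &Result<Vec<u8>, E1>,
    c_result: &Result<Vec<u8>, E2>,
) -> bool {
    match (rust_result, c_result) {
        (Ok(a), Ok(b)) => a == b,
        (Err(_), Err(_)) => true,
        _ => false, // one accepted what the other rejected
    }
}
```

In a fuzz target, a `false` result would typically be turned into a panic so the fuzzer records the input as a crash.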
825cbdd to
61c640c
Compare