Add opt-in graceful bail-out for memory and content handler errors#320
Merged
Conversation
kornelski
approved these changes
May 20, 2026
Contributor
kornelski
left a comment
There was a problem hiding this comment.
This will need non_exhaustive and semver-major bump.
By default the rewriter aborts processing when `max_allowed_memory_usage` is hit, leaving the output sink in a potentially inconsistent state (i.e. the sink has received whatever transformed bytes the rewriter had already produced, but the remaining input bytes are lost). Downstream this typically manifests as a truncated, broken HTTP response. This change adds an opt-in `MemorySettings::graceful_bail_out_on_memory_limit_exceeded` flag. When set, before returning `MemoryLimitExceededError` the rewriter flushes every input byte it has received but not yet emitted to the sink, as-is. The caller can then continue the response by writing any subsequent input bytes directly to its own downstream sink, bypassing the (now poisoned) rewriter. The resulting response has the rewriter's transformations applied up to some boundary and the original bytes after that, but it is not broken. The flush is wired in at all three memory-failure sites in `TransformStream::write()`/`end()`: 1. `Arena::append()`: the buffered (still-unparsed) bytes from previous calls plus the new chunk can't fit. 2. `Parser::parse()`: the selectors VM stack push exceeds the limit mid-parse. The dispatcher's `remaining_content_start` correctly points at the first byte of the failing lexeme (the limit check runs before `lexeme_consumed()`), so flushing from there preserves everything. 3. `Arena::init_with()`: the leftover unconsumed bytes can't be buffered for the next call. A new `Dispatcher::flush_for_bail_out()` does the actual flushing. Unlike `flush_remaining_input()` it ignores `emission_enabled`, which means a caller that was actively removing element content at the time of the bail-out can end up with mismatched tags (the removed content's start tag was suppressed but the trailing bytes flushed raw may include the matching end tag). The alternative (skipping the flush when emission is disabled) would lose bytes from the input chunk entirely, which the caller can't recover from since it doesn't buffer its input. The caveat is documented on the new field. The C API gains the corresponding `bool graceful_bail_out_on_memory_limit_exceeded` field in `lol_html_memory_settings_t`. Existing C clients using designated initializers continue to work: the field defaults to `false` / `0`. Tested with 10 new tests in `rewriter::tests::fatal_errors`: 1. Three failure-site tests: one per memory-error origin (`test_graceful_bail_out_in_buffer_append`, `_in_buffer_init_with`, `_in_parser`). 2. `test_graceful_bail_out_preserves_prefix_transformations`: verifies handler transformations on tokens processed before the failure survive the bail-out. 3. `test_no_graceful_bail_out_by_default`: sanity check that the flag is opt-in. 4. Five response-reconstruction tests, each verifying `sink_output + unfed_remaining == original_html` for a different real-world high-memory shape: a tag with a huge base64 attribute (the INCIDENT-6638 shape), hundreds of small attributes on one tag, a huge HTML comment, deeply nested elements (selectors VM stack overflow), and many never-closed tags.
This is the symmetric companion to the previous `MemoryLimitExceeded` bail-out. Until now, a content handler returning `Err` would abort the rewriter and leave the output sink in a potentially inconsistent state (the sink has whatever transformed bytes the rewriter had already produced, but the remaining input bytes are lost). Downstream this typically manifests as a truncated, broken HTTP response. This change adds an opt-in `Settings::graceful_bail_out_on_content_handler_error` flag. When set, before propagating `RewritingError::ContentHandlerError` the rewriter flushes every input byte it has received but not yet emitted to the sink, as-is. The caller can then continue the response by writing subsequent bytes directly to its own downstream sink, bypassing the (now poisoned) rewriter. The flag lives on `Settings` rather than `MemorySettings` because the underlying error has nothing to do with memory: a handler returning `Err` is an explicit signal from the application, not an environmental constraint. The two flags are intentionally independent so callers can pick the recovery posture per error class. The implementation requires a small but important refactor of the dispatcher. Previously, `lexeme_consumed()` did two things: emit the chunk of input before the lexeme, and advance `remaining_content_start` past the lexeme. The second step happened before the lexeme's token was actually serialized to the sink, so a handler error from `token_produced()` (or text emission) left `remaining_content_start` past a lexeme whose bytes had never been emitted. Flushing from there as-is would lose the lexeme. We split that into: 1. `emit_chunk_before_lexeme()`: emit the raw input chunk between the previously emitted byte and the start of the lexeme, advance `remaining_content_start` to the lexeme's start. 2. `consume_lexeme()`: advance `remaining_content_start` past the lexeme's end. Called only after the lexeme's token has been successfully emitted. With this split, on a `ContentHandlerError` mid-lexeme `remaining_content_start` still points at the failing lexeme's start, and a graceful bail-out can flush the lexeme + the rest of the chunk raw. The bail-out decision in `TransformStream::write()` and `end()` is centralized in `should_bail_out_for()`, which considers each `RewritingError` variant against its own flag. `ParsingAmbiguity` is never recovered from (the whole point of strict mode is to refuse uncertain markup). The C API is left on the original behavior for now. The previous memory flag fit cleanly into `MemorySettings` (a `#[repr(C)]` struct), so it was an additive change to the C ABI. The new flag lives on `Settings`, which is not `repr(C)`, so exposing it through the C API needs either a new function signature or a new settings struct — a maintainer call. A TODO captures this in `c-api/src/rewriter.rs`, and the CHANGELOG notes it. Tested with 6 new tests in `rewriter::tests::fatal_errors`: 1. `test_graceful_bail_out_on_element_handler_error`: element handler returns `Err`, sink ends up with the transformed prefix + the failing element and everything after it raw. 2. `test_no_graceful_bail_out_on_content_handler_error_by_default`: sanity check that the flag is opt-in. 3. `test_graceful_bail_out_on_comment_handler_error`: comment handler error path (exercises the same `lexeme_consumed` restructure). 4. `test_graceful_bail_out_on_end_handler_error`: `handle_end()` returns `Err` after `flush_remaining_input` has already emitted every input byte; sink already has the complete document. 5. `test_bail_out_reconstruct_handler_error_midstream`: end-to-end reconstruction (`sink_output == original_html`). 6. `test_bail_out_flags_independent`: enabling content-handler bail-out doesn't accidentally enable memory bail-out, and vice versa. Caveats documented on the new field's doc comment: 1. Active content removal at the time of the bail-out can produce mismatched tags. Same caveat as the memory flag; uncommon in practice and not byte-truncating. 2. Multi-chunk text emission where a later chunk's handler errors will re-emit the raw input bytes, duplicating the already-emitted chunks. Rare (only on encoding-converted text); the response is byte-bigger but not truncated.
The two preceding commits added public fields to `MemorySettings` and `Settings`, externally-constructible structs without `#[non_exhaustive]`. `cargo-semver-checks` flags this as requiring a major version bump. Also bumps `js-api` to match. `lol_html_c_api` keeps its own version cadence; the maintainer can bump it separately if desired.
24faa2d to
402a19a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Today a
MemoryLimitExceededorContentHandlerErroraborts the rewriter mid-stream and leavesthe downstream response truncated.
This PR adds two opt-in flags:
MemorySettings::graceful_bail_out_on_memory_limit_exceededSettings::graceful_bail_out_on_content_handler_errorWhen set, before returning the error the rewriter flushes every input byte it has received but
not yet emitted to the sink, as-is. The caller can continue the response by writing subsequent
bytes directly to its downstream sink. Both flags default to
false: existing behavior isunchanged.