Skip to content

Add opt-in graceful bail-out for memory and content handler errors#320

Merged
orium merged 3 commits into
mainfrom
memory-limit-better-bail
May 21, 2026
Merged

Add opt-in graceful bail-out for memory and content handler errors#320
orium merged 3 commits into
mainfrom
memory-limit-better-bail

Conversation

@orium
Copy link
Copy Markdown
Member

@orium orium commented May 20, 2026

Today a MemoryLimitExceeded or ContentHandlerError aborts the rewriter mid-stream and leaves
the downstream response truncated.

This PR adds two opt-in flags:

  1. MemorySettings::graceful_bail_out_on_memory_limit_exceeded
  2. Settings::graceful_bail_out_on_content_handler_error

When set, before returning the error the rewriter flushes every input byte it has received but
not yet emitted to the sink, as-is. The caller can continue the response by writing subsequent
bytes directly to its downstream sink. Both flags default to false: existing behavior is
unchanged.

@orium orium requested review from a team, Noah-Kennedy and jasnell as code owners May 20, 2026 17:38
Copy link
Copy Markdown
Contributor

@kornelski kornelski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need non_exhaustive and semver-major bump.

orium added 3 commits May 20, 2026 19:35
By default the rewriter aborts processing when `max_allowed_memory_usage` is hit, leaving the
output sink in a potentially inconsistent state (i.e. the sink has received whatever transformed
bytes the rewriter had already produced, but the remaining input bytes are lost). Downstream
this typically manifests as a truncated, broken HTTP response.

This change adds an opt-in `MemorySettings::graceful_bail_out_on_memory_limit_exceeded` flag.
When set, before returning `MemoryLimitExceededError` the rewriter flushes every input byte it
has received but not yet emitted to the sink, as-is. The caller can then continue the response
by writing any subsequent input bytes directly to its own downstream sink, bypassing the (now
poisoned) rewriter. The resulting response has the rewriter's transformations applied up to
some boundary and the original bytes after that, but it is not broken.

The flush is wired in at all three memory-failure sites in `TransformStream::write()`/`end()`:

1. `Arena::append()`: the buffered (still-unparsed) bytes from previous calls plus the new
   chunk can't fit.
2. `Parser::parse()`: the selectors VM stack push exceeds the limit mid-parse. The
   dispatcher's `remaining_content_start` correctly points at the first byte of the failing
   lexeme (the limit check runs before `lexeme_consumed()`), so flushing from there preserves
   everything.
3. `Arena::init_with()`: the leftover unconsumed bytes can't be buffered for the next call.

A new `Dispatcher::flush_for_bail_out()` does the actual flushing. Unlike
`flush_remaining_input()` it ignores `emission_enabled`, which means a caller that was actively
removing element content at the time of the bail-out can end up with mismatched tags (the
removed content's start tag was suppressed but the trailing bytes flushed raw may include the
matching end tag). The alternative (skipping the flush when emission is disabled) would lose
bytes from the input chunk entirely, which the caller can't recover from since it doesn't
buffer its input. The caveat is documented on the new field.

The C API gains the corresponding `bool graceful_bail_out_on_memory_limit_exceeded` field in
`lol_html_memory_settings_t`. Existing C clients using designated initializers continue to
work: the field defaults to `false` / `0`.

Tested with 10 new tests in `rewriter::tests::fatal_errors`:

1. Three failure-site tests: one per memory-error origin
   (`test_graceful_bail_out_in_buffer_append`, `_in_buffer_init_with`, `_in_parser`).
2. `test_graceful_bail_out_preserves_prefix_transformations`: verifies handler transformations
   on tokens processed before the failure survive the bail-out.
3. `test_no_graceful_bail_out_by_default`: sanity check that the flag is opt-in.
4. Five response-reconstruction tests, each verifying `sink_output + unfed_remaining ==
   original_html` for a different real-world high-memory shape: a tag with a huge base64
   attribute (the INCIDENT-6638 shape), hundreds of small attributes on one tag, a huge HTML
   comment, deeply nested elements (selectors VM stack overflow), and many never-closed tags.
This is the symmetric companion to the previous `MemoryLimitExceeded` bail-out. Until now,
a content handler returning `Err` would abort the rewriter and leave the output sink in a
potentially inconsistent state (the sink has whatever transformed bytes the rewriter had
already produced, but the remaining input bytes are lost). Downstream this typically
manifests as a truncated, broken HTTP response.

This change adds an opt-in `Settings::graceful_bail_out_on_content_handler_error` flag.
When set, before propagating `RewritingError::ContentHandlerError` the rewriter flushes
every input byte it has received but not yet emitted to the sink, as-is. The caller can
then continue the response by writing subsequent bytes directly to its own downstream
sink, bypassing the (now poisoned) rewriter.

The flag lives on `Settings` rather than `MemorySettings` because the underlying error has
nothing to do with memory: a handler returning `Err` is an explicit signal from the
application, not an environmental constraint. The two flags are intentionally independent
so callers can pick the recovery posture per error class.

The implementation requires a small but important refactor of the dispatcher. Previously,
`lexeme_consumed()` did two things: emit the chunk of input before the lexeme, and advance
`remaining_content_start` past the lexeme. The second step happened before the lexeme's
token was actually serialized to the sink, so a handler error from `token_produced()` (or
text emission) left `remaining_content_start` past a lexeme whose bytes had never been
emitted. Flushing from there as-is would lose the lexeme.

We split that into:

1. `emit_chunk_before_lexeme()`: emit the raw input chunk between the previously emitted
   byte and the start of the lexeme, advance `remaining_content_start` to the lexeme's
   start.
2. `consume_lexeme()`: advance `remaining_content_start` past the lexeme's end. Called
   only after the lexeme's token has been successfully emitted.

With this split, on a `ContentHandlerError` mid-lexeme `remaining_content_start` still
points at the failing lexeme's start, and a graceful bail-out can flush the lexeme + the
rest of the chunk raw.

The bail-out decision in `TransformStream::write()` and `end()` is centralized in
`should_bail_out_for()`, which considers each `RewritingError` variant against its own
flag. `ParsingAmbiguity` is never recovered from (the whole point of strict mode is to
refuse uncertain markup).

The C API is left on the original behavior for now. The previous memory flag fit cleanly
into `MemorySettings` (a `#[repr(C)]` struct), so it was an additive change to the C ABI.
The new flag lives on `Settings`, which is not `repr(C)`, so exposing it through the C
API needs either a new function signature or a new settings struct — a maintainer call.
A TODO captures this in `c-api/src/rewriter.rs`, and the CHANGELOG notes it.

Tested with 6 new tests in `rewriter::tests::fatal_errors`:

1. `test_graceful_bail_out_on_element_handler_error`: element handler returns `Err`, sink
   ends up with the transformed prefix + the failing element and everything after it raw.
2. `test_no_graceful_bail_out_on_content_handler_error_by_default`: sanity check that the
   flag is opt-in.
3. `test_graceful_bail_out_on_comment_handler_error`: comment handler error path
   (exercises the same `lexeme_consumed` restructure).
4. `test_graceful_bail_out_on_end_handler_error`: `handle_end()` returns `Err` after
   `flush_remaining_input` has already emitted every input byte; sink already has the
   complete document.
5. `test_bail_out_reconstruct_handler_error_midstream`: end-to-end reconstruction
   (`sink_output == original_html`).
6. `test_bail_out_flags_independent`: enabling content-handler bail-out doesn't
   accidentally enable memory bail-out, and vice versa.

Caveats documented on the new field's doc comment:

1. Active content removal at the time of the bail-out can produce mismatched tags. Same
   caveat as the memory flag; uncommon in practice and not byte-truncating.
2. Multi-chunk text emission where a later chunk's handler errors will re-emit the raw
   input bytes, duplicating the already-emitted chunks. Rare (only on encoding-converted
   text); the response is byte-bigger but not truncated.
The two preceding commits added public fields to `MemorySettings` and `Settings`,
externally-constructible structs without `#[non_exhaustive]`. `cargo-semver-checks` flags
this as requiring a major version bump.

Also bumps `js-api` to match. `lol_html_c_api` keeps its own version cadence; the
maintainer can bump it separately if desired.
@orium orium force-pushed the memory-limit-better-bail branch from 24faa2d to 402a19a Compare May 20, 2026 20:23
@orium orium merged commit e97d16c into main May 21, 2026
6 checks passed
@orium orium deleted the memory-limit-better-bail branch May 21, 2026 11:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants