feat: charset-aware, binary-safe body previews in request/response logging#99
Merged
Merged
Conversation
…gging Body previews in the default sync and async instrumentation steps decoded every captured payload as UTF-8. That produced mojibake for a text body declared in another charset (a latin-1 response logged `café` as `caf?`) and a run of U+FFFD replacement characters for a binary body (gzip, an image), which is both unreadable and capable of corrupting log viewers. Add a shared `BodyPreview` renderer that the two steps delegate to: - Text bodies are decoded with the charset declared on the body's `MediaType`, falling back to UTF-8 when the media type omits a charset or names one the JVM cannot resolve. - A short text/binary sniff (NUL byte, or a high ratio of non-whitespace control bytes in a leading sample) classifies binary bodies, which are rendered as a size-only `[binary N bytes]` summary instead of being decoded as text. Decoding still substitutes the replacement character for malformed input rather than throwing, so a snapshot truncated mid-multibyte-sequence by the bounded capture cannot crash the log line. The bounded/streaming capture behaviour of the loggable bodies is unchanged.
…eview seam The binary heuristic treated any NUL byte as a definitive binary signal and ran before the declared charset was consulted, so a body declared as a fixed-width multi-byte charset (UTF-16/UTF-32) was rendered as [binary N bytes] even though its NUL padding is legitimate text. render() now consults the declared charset first: when the MediaType names a resolvable charset whose ASCII encoding contains NUL bytes, the body is decoded with that charset and the NUL-based heuristic is skipped. Single-byte charsets and bodies with no declared charset still run the heuristic. Document the wide-charset handling and the best-effort nature of the content sniff in the BodyPreview KDoc. Mark previewText and DEFAULT_CHARSET private so the shared/tested seam is exactly render + isProbablyText. Add an async end-to-end test mirroring the sync ISO-8859-1 response-preview assertion so the async step's mediaType() pass-through is covered.
Polish the BodyPreview seam in response to review: - Memoise encodesAsciiWithNul per Charset in a ConcurrentHashMap so the ASCII probe encodes at most once per distinct charset instead of on every render; subsequent renders are a plain map lookup. - Rename the binary summary to [binary N bytes captured]. The byte count is the bounded preview snapshot, not necessarily the full body length, so the wording no longer implies a size it never observed. - Drop the unreachable empty-array branch in previewText; render already returns early for empty input before any decode path is reached.
d96831c to
d0d76f8
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Body previews in the instrumentation steps hard-coded a UTF-8 decode, so a latin-1 body logged as mojibake and a binary body logged as replacement-character noise.
This adds an internal
BodyPreview.render(bytes, mediaType):MediaType, falling back to UTF-8 when the media type is absent, declares no charset, or names one the JVM can't resolve;[binary N bytes captured]rather than decoded; high bytes are not counted against the body, so UTF-8 and latin-1 text are still treated as text.Nis the captured (bounded-snapshot) byte count, so the summary never implies a length it didn't observe.MediaTypeexplicitly declares a resolvable fixed-width charset (UTF-16/UTF-32, whose ASCII encoding contains NUL padding) is decoded with that charset rather than being misclassified as binary. The probe is derived from the charset's own encoder and memoized per charset.Both the sync and async instrumentation steps use it, and the duplicated per-step
utf8Previewis removed. The bounded-capture logic is untouched. Internal-only — no API change.Tests cover ISO-8859-1 decoding, the UTF-8 default (declared, absent, and unknown-charset), binary summarization (including a body mislabeled as text), and UTF-16BE/LE decoding, at both the unit and instrumentation-event level.
Closes #25