Skip to content

feat: charset-aware, binary-safe body previews in request/response logging#99

Merged
OmarAlJarrah merged 3 commits into
mainfrom
feat/charset-aware-body-previews
Jun 16, 2026
Merged

feat: charset-aware, binary-safe body previews in request/response logging#99
OmarAlJarrah merged 3 commits into
mainfrom
feat/charset-aware-body-previews

Conversation

@OmarAlJarrah

@OmarAlJarrah OmarAlJarrah commented Jun 16, 2026

Copy link
Copy Markdown
Member

Summary

Body previews in the instrumentation steps hard-coded a UTF-8 decode, so a latin-1 body logged as mojibake and a binary body logged as replacement-character noise.

This adds an internal BodyPreview.render(bytes, mediaType):

  • charset-aware — text bodies are decoded with the charset from the body's MediaType, falling back to UTF-8 when the media type is absent, declares no charset, or names one the JVM can't resolve;
  • binary-safe — a body that looks binary (a NUL byte, or a high ratio of non-whitespace control bytes in the leading bytes) is summarized as [binary N bytes captured] rather than decoded; high bytes are not counted against the body, so UTF-8 and latin-1 text are still treated as text. N is the captured (bounded-snapshot) byte count, so the summary never implies a length it didn't observe.
  • wide-charset–aware — a body whose MediaType explicitly declares a resolvable fixed-width charset (UTF-16/UTF-32, whose ASCII encoding contains NUL padding) is decoded with that charset rather than being misclassified as binary. The probe is derived from the charset's own encoder and memoized per charset.

Both the sync and async instrumentation steps use it, and the duplicated per-step utf8Preview is removed. The bounded-capture logic is untouched. Internal-only — no API change.

Tests cover ISO-8859-1 decoding, the UTF-8 default (declared, absent, and unknown-charset), binary summarization (including a body mislabeled as text), and UTF-16BE/LE decoding, at both the unit and instrumentation-event level.

Closes #25

@OmarAlJarrah OmarAlJarrah changed the base branch from fix/response-body-capture-drain-and-close to main June 16, 2026 20:20
…gging

Body previews in the default sync and async instrumentation steps decoded
every captured payload as UTF-8. That produced mojibake for a text body
declared in another charset (a latin-1 response logged `café` as `caf?`)
and a run of U+FFFD replacement characters for a binary body (gzip, an
image), which is both unreadable and capable of corrupting log viewers.

Add a shared `BodyPreview` renderer that the two steps delegate to:

- Text bodies are decoded with the charset declared on the body's
  `MediaType`, falling back to UTF-8 when the media type omits a charset
  or names one the JVM cannot resolve.
- A short text/binary sniff (NUL byte, or a high ratio of non-whitespace
  control bytes in a leading sample) classifies binary bodies, which are
  rendered as a size-only `[binary N bytes]` summary instead of being
  decoded as text.

Decoding still substitutes the replacement character for malformed input
rather than throwing, so a snapshot truncated mid-multibyte-sequence by the
bounded capture cannot crash the log line. The bounded/streaming capture
behaviour of the loggable bodies is unchanged.
…eview seam

The binary heuristic treated any NUL byte as a definitive binary signal and
ran before the declared charset was consulted, so a body declared as a
fixed-width multi-byte charset (UTF-16/UTF-32) was rendered as
[binary N bytes] even though its NUL padding is legitimate text. render()
now consults the declared charset first: when the MediaType names a
resolvable charset whose ASCII encoding contains NUL bytes, the body is
decoded with that charset and the NUL-based heuristic is skipped. Single-byte
charsets and bodies with no declared charset still run the heuristic.

Document the wide-charset handling and the best-effort nature of the
content sniff in the BodyPreview KDoc. Mark previewText and DEFAULT_CHARSET
private so the shared/tested seam is exactly render + isProbablyText. Add an
async end-to-end test mirroring the sync ISO-8859-1 response-preview
assertion so the async step's mediaType() pass-through is covered.
Polish the BodyPreview seam in response to review:

- Memoise encodesAsciiWithNul per Charset in a ConcurrentHashMap so the
  ASCII probe encodes at most once per distinct charset instead of on
  every render; subsequent renders are a plain map lookup.
- Rename the binary summary to [binary N bytes captured]. The byte count
  is the bounded preview snapshot, not necessarily the full body length,
  so the wording no longer implies a size it never observed.
- Drop the unreachable empty-array branch in previewText; render already
  returns early for empty input before any decode path is reached.
@OmarAlJarrah OmarAlJarrah force-pushed the feat/charset-aware-body-previews branch from d96831c to d0d76f8 Compare June 16, 2026 20:45
@OmarAlJarrah OmarAlJarrah merged commit 26578ba into main Jun 16, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Charset-aware, binary-safe body previews in request/response logging

1 participant