feat: charset-aware, binary-safe body previews in request/response logging by OmarAlJarrah · Pull Request #99 · dexpace/java-sdk

OmarAlJarrah · 2026-06-16T13:41:28Z

Summary

Body previews in the instrumentation steps hard-coded a UTF-8 decode, so a latin-1 body logged as mojibake and a binary body logged as replacement-character noise.

This adds an internal BodyPreview.render(bytes, mediaType):

charset-aware — text bodies are decoded with the charset from the body's MediaType, falling back to UTF-8 when the media type is absent, declares no charset, or names one the JVM can't resolve;
binary-safe — a body that looks binary (a NUL byte, or a high ratio of non-whitespace control bytes in the leading bytes) is summarized as [binary N bytes captured] rather than decoded; high bytes are not counted against the body, so UTF-8 and latin-1 text are still treated as text. N is the captured (bounded-snapshot) byte count, so the summary never implies a length it didn't observe.
wide-charset–aware — a body whose MediaType explicitly declares a resolvable fixed-width charset (UTF-16/UTF-32, whose ASCII encoding contains NUL padding) is decoded with that charset rather than being misclassified as binary. The probe is derived from the charset's own encoder and memoized per charset.

Both the sync and async instrumentation steps use it, and the duplicated per-step utf8Preview is removed. The bounded-capture logic is untouched. Internal-only — no API change.

Tests cover ISO-8859-1 decoding, the UTF-8 default (declared, absent, and unknown-charset), binary summarization (including a body mislabeled as text), and UTF-16BE/LE decoding, at both the unit and instrumentation-event level.

Closes #25

…gging Body previews in the default sync and async instrumentation steps decoded every captured payload as UTF-8. That produced mojibake for a text body declared in another charset (a latin-1 response logged `café` as `caf?`) and a run of U+FFFD replacement characters for a binary body (gzip, an image), which is both unreadable and capable of corrupting log viewers. Add a shared `BodyPreview` renderer that the two steps delegate to: - Text bodies are decoded with the charset declared on the body's `MediaType`, falling back to UTF-8 when the media type omits a charset or names one the JVM cannot resolve. - A short text/binary sniff (NUL byte, or a high ratio of non-whitespace control bytes in a leading sample) classifies binary bodies, which are rendered as a size-only `[binary N bytes]` summary instead of being decoded as text. Decoding still substitutes the replacement character for malformed input rather than throwing, so a snapshot truncated mid-multibyte-sequence by the bounded capture cannot crash the log line. The bounded/streaming capture behaviour of the loggable bodies is unchanged.

…eview seam The binary heuristic treated any NUL byte as a definitive binary signal and ran before the declared charset was consulted, so a body declared as a fixed-width multi-byte charset (UTF-16/UTF-32) was rendered as [binary N bytes] even though its NUL padding is legitimate text. render() now consults the declared charset first: when the MediaType names a resolvable charset whose ASCII encoding contains NUL bytes, the body is decoded with that charset and the NUL-based heuristic is skipped. Single-byte charsets and bodies with no declared charset still run the heuristic. Document the wide-charset handling and the best-effort nature of the content sniff in the BodyPreview KDoc. Mark previewText and DEFAULT_CHARSET private so the shared/tested seam is exactly render + isProbablyText. Add an async end-to-end test mirroring the sync ISO-8859-1 response-preview assertion so the async step's mediaType() pass-through is covered.

Polish the BodyPreview seam in response to review: - Memoise encodesAsciiWithNul per Charset in a ConcurrentHashMap so the ASCII probe encodes at most once per distinct charset instead of on every render; subsequent renders are a plain map lookup. - Rename the binary summary to [binary N bytes captured]. The byte count is the bounded preview snapshot, not necessarily the full body length, so the wording no longer implies a size it never observed. - Drop the unreachable empty-array branch in previewText; render already returns early for empty input before any decode path is reached.

OmarAlJarrah changed the base branch from fix/response-body-capture-drain-and-close to main June 16, 2026 20:20

OmarAlJarrah added 3 commits June 16, 2026 23:41

OmarAlJarrah force-pushed the feat/charset-aware-body-previews branch from d96831c to d0d76f8 Compare June 16, 2026 20:45

OmarAlJarrah merged commit 26578ba into main Jun 16, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: charset-aware, binary-safe body previews in request/response logging#99

feat: charset-aware, binary-safe body previews in request/response logging#99
OmarAlJarrah merged 3 commits into
mainfrom
feat/charset-aware-body-previews

OmarAlJarrah commented Jun 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

OmarAlJarrah commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

OmarAlJarrah commented Jun 16, 2026 •

edited

Loading