
fix: SSE-aware rehydration for PII split across chunks #3

Open
dfein38347g wants to merge 9 commits into chandika:main from dfein38347g:force-no-stream


dfein38347g commented May 21, 2026

Summary

  • Add --no-stream CLI flag and force_no_stream config option to buffer SSE responses before rehydration, eliminating 128-byte chunk-boundary fragility
  • Buffer all raw bytes first, rehydrate the complete body in one pass (no per-chunk rehydration)
  • Handle compressed SSE: decompress → check signed thinking blocks → rehydrate → recompress (matching handle_regular_response pattern)
  • Drop Layer 1 approach (injecting stream: false into request body) — breaks client SDK contract
  • Document why forward_request fast-path is excluded
  • 40/40 existing tests pass, no new clippy warnings
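The chunk-boundary fragility that buffering removes can be illustrated with a small sketch. Everything here is hypothetical: `rehydrate` is a stand-in for the proxy's real rehydration step (it is not the actual `faker.rehydrate()` signature), the fake/real pair is passed in directly, and a 20-byte chunk size is used instead of 128 to keep the example short.

```rust
/// Hypothetical stand-in for the proxy's rehydration step: swaps one fake
/// value back to its original using plain substring replacement.
fn rehydrate(text: &str, fake: &str, real: &str) -> String {
    text.replace(fake, real)
}

/// Per-chunk strategy: rehydrate each fixed-size chunk independently.
/// A fake value that straddles a chunk boundary is never seen whole.
/// (Assumes ASCII input; lossy decoding could mangle split multi-byte chars.)
fn rehydrate_per_chunk(body: &str, chunk_size: usize, fake: &str, real: &str) -> String {
    body.as_bytes()
        .chunks(chunk_size)
        .map(|chunk| rehydrate(&String::from_utf8_lossy(chunk), fake, real))
        .collect()
}

/// Buffered strategy: collect every chunk first, then rehydrate once.
fn rehydrate_buffered(body: &str, chunk_size: usize, fake: &str, real: &str) -> String {
    let mut buffer: Vec<u8> = Vec::new();
    for chunk in body.as_bytes().chunks(chunk_size) {
        buffer.extend_from_slice(chunk);
    }
    rehydrate(&String::from_utf8_lossy(&buffer), fake, real)
}
```

With the body `"data: server at 84.106.142.195 is up\n\n"` and 20-byte chunks, the fake IP starts at byte 16 and crosses the first boundary, so `rehydrate_per_chunk` returns the body unchanged while `rehydrate_buffered` replaces it.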

Test Plan

  • cargo build — clean (1 pre-existing dead_code warning)
  • cargo test — 40 passed, 0 failed
  • cargo clippy — no new warnings

Files changed

 README.md           |  3 +++
 mirage.default.yaml |  4 +++
 src/config.rs       |  3 ++
 src/main.rs         | 10 ++++++
 src/proxy.rs        | 98 ++++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 116 insertions(+), 2 deletions(-)

Closes issue #4

rehydrate_sse_body() parses the SSE body, joins all delta.content
values into one contiguous string, rehydrates there, then reconstructs
the SSE with all content in the first chunk.

Fixes the case where a fake value (IP, email, API key, etc.) is split
across multiple SSE content fields — naive string replacement on the
raw buffered body fails because the full value never appears contiguously.
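A minimal reproduction of that failure mode, as a sketch. The JSON event shape is an assumption modelled on OpenAI-style streaming chunks, and `sse_body` and `naive_rehydrate` are hypothetical helpers written for this illustration, not functions from the PR.

```rust
/// Builds a minimal SSE body with one event per content fragment.
/// The event shape is an assumption (OpenAI-style `choices[].delta.content`).
fn sse_body(fragments: &[&str]) -> String {
    fragments
        .iter()
        .map(|f| format!("data: {{\"choices\":[{{\"delta\":{{\"content\":\"{}\"}}}}]}}\n\n", f))
        .collect()
}

/// Naive rehydration: plain substring replacement on the raw buffered body.
fn naive_rehydrate(body: &str, fake: &str, real: &str) -> String {
    body.replace(fake, real)
}
```

Calling `sse_body(&["84.106", ".142", ".195"])` produces three events whose raw text never contains `84.106.142.195`, so `naive_rehydrate` returns that body unchanged even though the client would display the full fake IP once the fragments are concatenated.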
dfein38347g changed the title from "feat: add --no-stream flag to buffer SSE responses and rehydrate in one pass" to "fix: SSE-aware rehydration for PII split across chunks" on May 22, 2026
@dfein38347g
Author

SSE-aware rehydration

The original --no-stream implementation buffered all SSE bytes then called faker.rehydrate() directly on the raw body. This fails when a fake value is split across multiple delta.content JSON fields — the full value never appears as a contiguous substring in the buffered text.

What changed

rehydrate_sse_body() (replaces the naive faker.rehydrate(body) call):

  1. Parses SSE events, joins all delta.content values into one contiguous string
  2. Calls faker.rehydrate() on the joined string
  3. Reconstructs the SSE with corrected content in the first chunk, empty content in subsequent chunks
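The three steps above can be sketched with the standard library only. This is an illustration under stated assumptions, not the code in src/proxy.rs: `rehydrate_sse_sketch` and `content_range` are hypothetical names, the fake/real pair replaces the faker's mapping, content values are located by searching for the literal `"content":"` rather than with a JSON parser, and content values are assumed to contain no escaped quotes. A real implementation would more likely parse each event with a JSON library such as serde_json.

```rust
const KEY: &str = "\"content\":\"";

/// Returns the byte range of the first `"content":"..."` value in a line.
/// Simplification: assumes the value contains no escaped quotes.
fn content_range(line: &str) -> Option<(usize, usize)> {
    let start = line.find(KEY)? + KEY.len();
    let end = start + line[start..].find('"')?;
    Some((start, end))
}

/// Sketch of the three steps; `fake`/`real` stand in for the faker's mapping.
fn rehydrate_sse_sketch(body: &str, fake: &str, real: &str) -> String {
    // Step 1: join every content value into one contiguous string.
    let joined: String = body
        .lines()
        .filter_map(|l| content_range(l).map(|(s, e)| &l[s..e]))
        .collect();

    // Step 2: rehydrate the joined string.
    let fixed = joined.replace(fake, real);
    if fixed == joined {
        // Nothing to rehydrate: return the body byte-for-byte unchanged.
        return body.to_string();
    }

    // Step 3: corrected text goes into the first content field,
    // every later content field is emptied.
    let mut pending = Some(fixed);
    let mut out = String::with_capacity(body.len());
    for line in body.split_inclusive('\n') {
        match content_range(line) {
            Some((s, e)) => {
                out.push_str(&line[..s]);
                out.push_str(&pending.take().unwrap_or_default());
                out.push_str(&line[e..]);
            }
            None => out.push_str(line),
        }
    }
    out
}
```

Emptying the later content fields instead of dropping their events keeps the event count and any non-content fields intact, at the cost of the client receiving all text in a single delta.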

Why it matters

Any PII type whose value can be split across SSE content boundaries benefits: IPs, emails, API keys, SSNs, and so on. For example, the tokenizer can split 84.106.142.195 into several tokens, each delivered in its own SSE event, so naive string replacement on the raw body never sees the full value and cannot match it.

Tests added

  • Split IP reassembly (SSE-aware parsing correctly rehydrates across chunks)
  • No-op preservation (clean SSE bodies pass through unchanged)
  • Contiguous IP rehydration (baseline for replace_token_bounded)
  • SSE-split baseline (proves naive rehydration fails)
