fix: UTF-8 corruption in streaming — read chunks instead of byte-by-byte #11

Closed
mitre88 wants to merge 1 commit into walter-grace:main from mitre88:fix/utf8-streaming-corruption

mitre88 commented Apr 18, 2026

Problem

stream_llm() in agent.py:547 reads SSE responses byte-by-byte (resp.read(1)) and decodes each byte individually as UTF-8:

ch = resp.read(1)
buf += ch.decode("utf-8", errors="replace")

This corrupts all multi-byte UTF-8 characters:

  • Emojis: 🍎 → `����` (4 bytes decoded as 4 replacement chars)
  • Accented: ñ, é, ü → `��` each (2 bytes each)
  • CJK: 中文 → `������` (3 bytes each)

Every streamed response with non-ASCII characters is garbled.
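
To make the failure concrete, here is a minimal sketch that reproduces the per-byte decode outside of agent.py. The helper name decode_bytewise is hypothetical and only mimics the two lines quoted above; it is not the project's code.

```python
def decode_bytewise(data: bytes) -> str:
    """Mimic the old loop: decode every single byte on its own."""
    pieces = []
    for i in range(len(data)):
        one_byte = data[i:i + 1]  # same shape as resp.read(1)
        pieces.append(one_byte.decode("utf-8", errors="replace"))
    return "".join(pieces)


raw = "🍎".encode("utf-8")   # 4 bytes: F0 9F 8D 8E
garbled = decode_bytewise(raw)   # four U+FFFD replacement characters
intact = raw.decode("utf-8")     # the original emoji

# Pure ASCII survives because every character is exactly one byte.
ascii_ok = decode_bytewise(b"data: hello")
```

ASCII-only streams are unaffected, which is likely why the bug only shows up once a response contains emojis, accents, or CJK text.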

Fix

Read 4096-byte chunks instead of single bytes:

chunk = resp.read(4096)
buf += chunk.decode("utf-8", errors="replace")

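Multi-byte characters that fall entirely within one chunk are now decoded as a unit. A character that straddles a chunk boundary can still be split, so this narrows the problem rather than eliminating it.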
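
A fixed-size read can still end in the middle of a multi-byte character, so a possible follow-up is to keep decoder state between reads. The sketch below is not part of this PR: it assumes resp is any object with a read(n) method that returns bytes (io.BytesIO stands in for the HTTP response), and iter_decoded is a hypothetical helper name. It uses the standard library's incremental UTF-8 decoder, which buffers a trailing partial sequence until the next chunk arrives.

```python
import codecs
import io


def iter_decoded(resp, chunk_size=4096):
    """Yield decoded text from resp, carrying partial characters across reads."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:
            # End of stream: flush any bytes that never completed a character.
            tail = decoder.decode(b"", final=True)
            if tail:
                yield tail
            return
        text = decoder.decode(chunk)  # incomplete trailing bytes stay buffered
        if text:
            yield text


payload = "a🍎ñ中文".encode("utf-8")

# A tiny chunk size forces reads to land inside multi-byte characters.
streamed = "".join(iter_decoded(io.BytesIO(payload), chunk_size=3))

# For comparison: decoding each 3-byte chunk independently, as a plain
# chunk.decode(...) per read would do.
naive = "".join(
    payload[i:i + 3].decode("utf-8", errors="replace")
    for i in range(0, len(payload), 3)
)
```

With a 4096-byte chunk the boundary case is much rarer than with single-byte reads, but it depends on where the network happens to split the data, so it would tend to show up intermittently.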

Impact

  • Before: Any emoji, accent, or non-ASCII char in streamed output is corrupted
  • After: UTF-8 characters that arrive within a single chunk stream correctly
  • Performance: 4096-byte chunks are also faster (fewer syscalls)
  • Zero breaking changes: Same SSE parsing logic, just reads bigger chunks

stream_llm() reads SSE response byte-by-byte (resp.read(1)) and decodes
each byte individually as UTF-8. This corrupts all multi-byte characters:
emojis (🍎→????), accented chars (ñ→??), CJK text, etc.

Fix: read 4096-byte chunks and decode the full chunk. Multi-byte
characters are now correctly assembled before decoding.

This is the same issue reported in PR walter-grace#10.

mitre88 commented Apr 18, 2026

@walter-grace One-line fix: stream_llm() reads byte-by-byte, which corrupts all multi-byte UTF-8 (emojis, accents, CJK). This PR reads 4KB chunks instead. Same issue as PR #10.

mitre88 closed this May 16, 2026