fix: UTF-8 decoding corrupts multi-byte characters in streaming #10
Open
TheBlueHouse75 wants to merge 1 commit into
The SSE reader was calling resp.read(1) and decoding each single byte with utf-8. Multi-byte characters (é, à, 中, emoji, etc.) span 2–4 bytes, so each byte was individually replaced by U+FFFD, producing garbled output for any non-ASCII language. Fixed by reading 4 KiB chunks and feeding them through an incremental UTF-8 decoder, which correctly handles multi-byte sequences that span chunk boundaries.
mitre88 pushed a commit to mitre88/mac-code that referenced this pull request on Apr 18, 2026
stream_llm() reads SSE response byte-by-byte (resp.read(1)) and decodes each byte individually as UTF-8. This corrupts all multi-byte characters: emojis (🍎→????), accented chars (ñ→??), CJK text, etc. Fix: read 4096-byte chunks and decode the full chunk. Multi-byte characters are now correctly assembled before decoding. This is the same issue reported in PR walter-grace#10.
Summary
Fixes garbled output for any non-ASCII text streamed from the LLM (French accents, Chinese, Japanese, emoji, etc.).
The bug
In stream_llm(), the SSE reader was calling resp.read(1) and decoding each byte on its own as UTF-8.
Multi-byte UTF-8 characters span 2–4 bytes (é = 0xC3 0xA9, 中 = 3 bytes, emoji = 4 bytes). Decoding each byte individually fails for every byte of a multi-byte sequence, so each byte is replaced by U+FFFD. The result is unreadable output for any response containing non-ASCII text.
Example before fix (French)
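The failure can be reproduced in isolation. The following is a minimal sketch, not the code from stream_llm(): decode_per_byte is a hypothetical helper, and it assumes the decode call used errors="replace", which the U+FFFD substitution described above implies.

```python
def decode_per_byte(data: bytes) -> str:
    """Hypothetical helper mirroring the buggy pattern: decode one byte at a time."""
    # Assumption: errors="replace", so undecodable bytes become U+FFFD
    # instead of raising UnicodeDecodeError.
    return "".join(
        data[i:i + 1].decode("utf-8", errors="replace")
        for i in range(len(data))
    )


# "à" and "é" are two bytes each, so each one turns into two U+FFFD characters.
garbled = decode_per_byte("Voilà un café".encode("utf-8"))
print(garbled)
```

ASCII bytes are complete characters on their own, so English-only text passes through unchanged, which is why the bug is easy to miss.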
The fix
Use codecs.getincrementaldecoder("utf-8") and read in larger chunks. The incremental decoder keeps state across decode() calls, so multi-byte sequences that straddle chunk boundaries are handled correctly.
Switching from 1-byte to 4 KiB reads is also noticeably cheaper on CPU.
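A minimal sketch of the approach, under stated assumptions: iter_decoded and CHUNK_SIZE are illustrative names rather than the identifiers used in the actual patch, errors="replace" is assumed, and an io.BytesIO object stands in for the HTTP response.

```python
import codecs
import io
from typing import BinaryIO, Iterator

CHUNK_SIZE = 4096  # 4 KiB, the read size named in this PR


def iter_decoded(resp: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield decoded text from a binary stream, one chunk at a time."""
    # Assumption: errors="replace", so only genuinely invalid bytes become U+FFFD.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:
            break
        # Bytes of an incomplete trailing sequence stay buffered in the decoder
        # and are emitted once the rest of the character arrives.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the decoder; a truncated sequence at end of stream is replaced here.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Stand-in for the response: with 4-byte reads, "é" (0xC3 0xA9) is split
# across the first and second chunk.
fake_resp = io.BytesIO("café au lait".encode("utf-8"))
pieces = list(iter_decoded(fake_resp, chunk_size=4))
print(pieces)
```

The first piece ends before the "é" because the decoder holds back the lone 0xC3 byte until 0xA9 arrives in the next read.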
Test plan
python3 agent.py against a French prompt: accents render correctly
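The manual check could be complemented by an automated one that does not need a running model. This is a sketch only: decode_in_chunks and check_all_chunk_sizes are hypothetical helpers, not part of agent.py. It decodes the same text at every possible chunk size, so every multi-byte character gets split at every possible boundary.

```python
import codecs


def decode_in_chunks(data: bytes, chunk_size: int) -> str:
    """Decode data with an incremental UTF-8 decoder, chunk_size bytes at a time."""
    # errors="strict" so any mishandled boundary raises instead of hiding as U+FFFD.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    parts = []
    for start in range(0, len(data), chunk_size):
        parts.append(decoder.decode(data[start:start + chunk_size]))
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


def check_all_chunk_sizes(text: str) -> bool:
    """Return True if text round-trips at every chunk size from 1 to len(data)."""
    data = text.encode("utf-8")
    return all(
        decode_in_chunks(data, size) == text
        for size in range(1, len(data) + 1)
    )


# Covers 2-byte (à, é), 3-byte (中文) and 4-byte (🍎) sequences.
print(check_all_chunk_sizes("Voilà un café, 中文, 🍎"))
```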