fix: UTF-8 corruption in streaming — read chunks instead of byte-by-byte #11
Closed
mitre88 wants to merge 1 commit into
Conversation
stream_llm() reads the SSE response byte-by-byte (resp.read(1)) and decodes each byte individually as UTF-8. This corrupts every multi-byte character: emojis (🍎→????), accented characters (ñ→??), CJK text, etc. Fix: read 4096-byte chunks and decode each chunk as a whole, so the bytes of a multi-byte character are decoded together. This is the same issue reported in PR walter-grace#10.
Author
@walter-grace One-line fix: stream_llm() reads byte-by-byte, which corrupts all multi-byte UTF-8 (emojis, accents, CJK). This PR reads 4 KB chunks instead. Same issue as PR #10.
Problem
stream_llm() in agent.py:547 reads SSE responses byte-by-byte (resp.read(1)) and decodes each byte individually as UTF-8.
This corrupts all multi-byte UTF-8 characters: emojis (🍎→????), accented characters (ñ→??), CJK text, etc.
Every streamed response with non-ASCII characters is garbled.
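The corruption described above can be reproduced in isolation. The snippet below is a minimal sketch, not the code from agent.py: decode_bytewise is a hypothetical stand-in for the read loop, and errors="replace" is an assumption about how undecodable bytes are handled (it is consistent with the one-placeholder-per-byte output reported above).

```python
def decode_bytewise(data: bytes) -> str:
    """Decode each byte on its own, mimicking a resp.read(1) loop (hypothetical)."""
    pieces = []
    for i in range(len(data)):
        single = data[i:i + 1]  # one byte, as resp.read(1) would return
        # Assumption: undecodable bytes become U+FFFD instead of raising.
        pieces.append(single.decode("utf-8", errors="replace"))
    return "".join(pieces)


payload = "ñ 🍎".encode("utf-8")  # ñ is 2 bytes, 🍎 is 4 bytes in UTF-8
print(decode_bytewise(payload))   # each non-ASCII byte turns into its own placeholder
print(payload.decode("utf-8"))    # decoding the whole payload at once is intact
```

ASCII text survives this loop because every ASCII character is a single byte, which is why the bug only shows up once a response contains non-ASCII characters.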
Fix
Read 4096-byte chunks instead of single bytes.
Multi-byte characters are now correctly assembled before decoding.
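The PR's diff is not reproduced here, so the following is only a sketch of chunked reading under stated assumptions. One caveat: a multi-byte character can still straddle the boundary between two chunks, so this sketch swaps in Python's incremental UTF-8 decoder, which buffers a partial character until the rest arrives. That is a variation on the approach described above, not necessarily what the PR does. The names iter_decoded_chunks and fake_resp are hypothetical, and resp is assumed to be any object with a read(n) method that returns bytes.

```python
import codecs
import io


def iter_decoded_chunks(resp, chunk_size=4096):
    """Yield decoded text from resp, reading chunk_size bytes at a time (hypothetical helper)."""
    # The incremental decoder keeps incomplete byte sequences between calls.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:
            break
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush whatever is left; a truncated character becomes U+FFFD here.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Stand-in for an HTTP response; the tiny chunk size forces splits inside characters.
fake_resp = io.BytesIO("data: ñ 🍎 日本語\n\n".encode("utf-8"))
print("".join(iter_decoded_chunks(fake_resp, chunk_size=3)))
```

With chunk_size=3 most multi-byte characters in the sample are cut mid-sequence, yet the joined output matches the original text, which is the property a plain chunk.decode("utf-8") on each chunk does not guarantee.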
Impact