fix: ROUGE-1 eval returns 0 for non-English languages (ASCII-only tokenizer) #6136
Open
tcconnally wants to merge 3 commits into
The default RougeScorer tokenizer uses the regex r'\w+', which only matches ASCII [a-zA-Z0-9_]. For non-Latin scripts (Thai, Chinese, Japanese, etc.), this returns zero tokens, causing ROUGE scores of 0.0 even when the response matches the expected output exactly.

Added a _unicode_tokenize function that uses the re.UNICODE flag and falls back to character-level tokenization for non-ASCII scripts.

Closes google#3111
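The failure mode described in this commit message can be illustrated with a minimal sketch. It does not call `rouge_score`; the helper names `ascii_only_tokens` and `rouge1_f1` are hypothetical, and `re.ASCII` is used only to emulate the ASCII-only matching described above, not to reproduce the library's actual code.

```python
import re
from collections import Counter


def ascii_only_tokens(text: str) -> list[str]:
    # re.ASCII restricts \w to [a-zA-Z0-9_]. This emulates the ASCII-only
    # behavior described in the PR (an assumption for illustration).
    return re.findall(r"\w+", text.lower(), re.ASCII)


def rouge1_f1(reference: list[str], candidate: list[str]) -> float:
    # Unigram-overlap F-measure; 0.0 when either side has no tokens.
    if not reference or not candidate:
        return 0.0
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


expected = "你好世界"
response = "你好世界"  # identical to the expected output

print(ascii_only_tokens(expected))  # []
print(rouge1_f1(ascii_only_tokens(expected), ascii_only_tokens(response)))  # 0.0
print(rouge1_f1(ascii_only_tokens("hello world"), ascii_only_tokens("hello world")))  # 1.0
```

With no tokens on either side there is nothing to overlap, so an exact match in a non-Latin script scores the same as a completely wrong answer.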
e275a87 to 6dff0a2
- Replace function _unicode_tokenize with _UnicodeTokenizer class
  implementing the tokenize() method expected by RougeScorer
- Move import re to module level
- Fix double-escaped regex patterns (\\w -> \w, remove unsupported \p{P})
- Add return type annotation for tokenize() to satisfy mypy strict mode
- Fix RougeScorer constructor indentation
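A rough sketch of what a tokenizer class along these lines might look like follows. The PR's actual `_UnicodeTokenizer` is not shown on this page, so `UnicodeTokenizerSketch` and `_WORD_RE` are hypothetical names, and the commented-out wiring assumes the installed `rouge_score` version accepts a `tokenizer` argument. The last lines show why a double-escaped pattern fails: in a raw string, `\\w+` matches a literal backslash followed by the letter `w`.

```python
import re

# Compiled once at module level; note the single backslash in the raw string.
_WORD_RE = re.compile(r"\w+")


class UnicodeTokenizerSketch:
    """Hypothetical stand-in for the PR's _UnicodeTokenizer."""

    def tokenize(self, text: str) -> list[str]:
        # For str patterns in Python 3, \w matches Unicode word characters.
        return _WORD_RE.findall(text.lower())


tokenizer = UnicodeTokenizerSketch()
print(tokenizer.tokenize("Hello, 世界!"))  # ['hello', '世界']

# Double-escaped pattern: looks for a literal backslash, so nothing matches.
print(re.findall(r"\\w+", "hello world"))  # []

# Assumed wiring, if rouge_score is installed:
# from rouge_score import rouge_scorer
# scorer = rouge_scorer.RougeScorer(["rouge1"], tokenizer=tokenizer)
```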
Problem
When evaluating text in non-Latin scripts (Thai, Chinese, Japanese, Arabic, etc.), the v1 ROUGE-1 evaluator returns scores of 0.0 even when the response matches the expected output exactly.
Root cause: The `rouge_score` library's default tokenizer uses `re.findall(r'\w+', text)`, which only matches ASCII `[a-zA-Z0-9_]`. Non-Latin characters produce zero tokens → ROUGE-1 score of 0.0 regardless of correctness.
Reproduction (from #3111)
Fix
Added `_unicode_tokenize` function that:
- Uses the `re.UNICODE` flag for ASCII-majority text (preserves existing behavior)
- Falls back to character-level tokenization for non-ASCII scripts
Closes #3111
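The two behaviors listed under Fix could be sketched roughly as below. This is not the PR's code: the name `unicode_tokenize_sketch`, the `ascii_threshold` parameter, and the 0.5 cutoff for "ASCII-majority" are all assumptions made for illustration, as is the choice to keep only letters, marks, and numbers in the character-level branch.

```python
import re
import unicodedata


def unicode_tokenize_sketch(text: str, ascii_threshold: float = 0.5) -> list[str]:
    text = text.lower()
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return []

    # Share of non-whitespace characters that are ASCII (assumed heuristic).
    ascii_share = sum(ch.isascii() for ch in visible) / len(visible)

    if ascii_share >= ascii_threshold:
        # Word-level tokens. re.UNICODE is already the default for str
        # patterns in Python 3; it is passed explicitly to mirror the text.
        return re.findall(r"\w+", text, re.UNICODE)

    # Character-level fallback: keep letters (L), marks (M), and numbers (N),
    # dropping punctuation and symbols.
    return [ch for ch in visible if unicodedata.category(ch)[0] in "LMN"]


print(unicode_tokenize_sketch("Hello, world!"))  # ['hello', 'world']
print(unicode_tokenize_sketch("你好，世界。"))  # ['你', '好', '世', '界']
```

Character-level tokens are a coarse choice for scripts without whitespace word boundaries, but they make an exact match score as a full overlap without requiring a language-specific segmenter.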