Skip to content

fix: ROUGE-1 eval returns 0 for non-English languages (ASCII-only tokenizer)#6136

Open
tcconnally wants to merge 3 commits into
google:mainfrom
Perseus-Computing-LLC:fix/non-english-eval-rouge
Open

fix: ROUGE-1 eval returns 0 for non-English languages (ASCII-only tokenizer)#6136
tcconnally wants to merge 3 commits into
google:mainfrom
Perseus-Computing-LLC:fix/non-english-eval-rouge

Conversation

@tcconnally

Copy link
Copy Markdown

Problem

When evaluating text in non-Latin scripts (Thai, Chinese, Japanese, Arabic, etc.), the v1 ROUGE-1 evaluator returns scores of 0.0 even when the response matches the expected output exactly.

Root cause: The rouge_score library's default tokenizer uses re.findall(r'\\w+', text) which only matches ASCII [a-zA-Z0-9_]. Non-Latin characters produce zero tokens → ROUGE-1 score of 0.0 regardless of correctness.

Reproduction (from #3111)

agent = Agent(
    model="gemini-2.5-flash",
    instruction='Reply with only the word "สวัสดี"',
)
# Agent responds "สวัสดี" → ROUGE-1 score: 0.0 (should be 1.0)

Fix

Added _unicode_tokenize function that:

  1. Uses re.UNICODE flag for ASCII-majority text (preserves existing behavior)
  2. Splits on Unicode whitespace/punctuation for non-ASCII text
  3. Falls back to character-level tokens for scripts without word boundaries (Chinese, Japanese)

Closes #3111

The default RougeScorer tokenizer uses r'\\w+' regex which only matches
ASCII [a-zA-Z0-9_]. For non-Latin scripts (Thai, Chinese, Japanese,
etc.), this returns zero tokens, causing ROUGE scores of 0.0 even when
the response matches the expected output exactly.

Added _unicode_tokenize function that uses re.UNICODE flag and falls
back to character-level tokenization for non-ASCII scripts.

Closes google#3111
@tcconnally tcconnally force-pushed the fix/non-english-eval-rouge branch from e275a87 to 6dff0a2 Compare June 15, 2026 21:22
@rohityan rohityan self-assigned this Jun 15, 2026
rohityan and others added 2 commits June 15, 2026 15:45
- Replace function _unicode_tokenize with _UnicodeTokenizer class
  implementing the tokenize() method expected by RougeScorer
- Move import re to module level
- Fix double-escaped regex patterns (\w -> \w, remove unsupported \p{P})
- Add return type annotation for tokenize() to satisfy mypy strict mode
- Fix RougeScorer constructor indentation
@wyf7107 wyf7107 self-assigned this Jun 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Eval fails for non-English languages

3 participants