Commit 6dff0a2

fix: ROUGE-1 eval fails for non-English languages (ASCII-only tokenizer)
The default RougeScorer tokenizer lowercases the text and keeps only ASCII letters and digits ([a-z0-9]). For non-Latin scripts (Thai, Chinese, Japanese, etc.), this returns zero tokens, causing ROUGE scores of 0.0 even when the response matches the expected output exactly.

Added a _unicode_tokenize function that uses Unicode-aware matching (re.UNICODE) and falls back to character-level tokenization for non-ASCII scripts.

Closes #3111
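The failure mode described above can be reproduced without the ROUGE library. This is a minimal sketch, assuming a tokenizer that keeps only runs of ASCII letters and digits; `ascii_only_tokenize` and `unicode_word_tokenize` are hypothetical helpers written for illustration, not functions from rouge_score or ADK.

```python
import re


def ascii_only_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in: keep only runs of ASCII letters and digits."""
    # Everything outside [a-z0-9] becomes a separator, so non-Latin text vanishes.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def unicode_word_tokenize(text: str) -> list[str]:
    """Keep runs of Unicode word characters (str patterns are Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


if __name__ == "__main__":
    for sample in ("Hello, World 42", "你好 世界"):
        print(sample, ascii_only_tokenize(sample), unicode_word_tokenize(sample))
```

With the ASCII-only variant, the Chinese sample produces an empty token list, so any unigram-overlap metric computed on it can only be zero.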
1 parent 3f505d2 commit 6dff0a2

1 file changed

Lines changed: 29 additions & 1 deletion

src/google/adk/evaluation/final_response_match_v1.py

@@ -92,6 +92,30 @@ def _get_eval_status(score: float, threshold: float):
   return EvalStatus.PASSED if score >= threshold else EvalStatus.FAILED
 
 
+def _unicode_tokenize(text: str):
+  """Tokenizes text without discarding non-ASCII characters.
+
+  The default RougeScorer tokenizer keeps only [a-z0-9] after lowercasing, so
+  non-Latin scripts (Thai, Chinese, Japanese, Arabic, etc.) yield zero tokens
+  and ROUGE scores of 0.0 on matching responses. This tokenizer keeps Unicode
+  word characters and falls back to character-level tokenization.
+  """
+  import re
+  import unicodedata
+  ascii_chars = sum(1 for c in text if ord(c) < 128)
+  if ascii_chars > len(text) * 0.5:
+    return re.findall(r'\w+', text.lower(), re.UNICODE)
+  # Stdlib re has no \p{P}; find punctuation by Unicode category (P*).
+  spaced = ''.join(
+      ' ' if unicodedata.category(c).startswith('P') else c for c in text)
+  return spaced.lower().split() or list(text.lower())
+
+
+class _UnicodeTokenizer:
+  """Adapter: RougeScorer calls tokenizer.tokenize(text)."""
+  tokenize = staticmethod(_unicode_tokenize)
+
+
 def _calculate_rouge_1_scores(candidate: str, reference: str):
   """Calculates the ROUGE-1 score between a candidate and reference text.
 
@@ -110,7 +134,11 @@ def _calculate_rouge_1_scores(candidate: str, reference: str):
   Returns:
     A dictionary containing the ROUGE-1 precision, recall, and f-measure.
   """
-  scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
+  scorer = rouge_scorer.RougeScorer(
+      ["rouge1"],
+      use_stemmer=True,
+      tokenizer=_UnicodeTokenizer(),
+  )
 
   # The score method returns a dictionary where keys are the ROUGE types
   # and values are Score objects (tuples) with precision, recall, and fmeasure.
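ROUGE-1 is computed over token unigrams, so the tokenizer determines whether two identical strings can score above 0.0 at all. The following is a minimal sketch of that computation, assuming whitespace tokenization and a tokenizer object that exposes a `tokenize(text)` method; `WhitespaceTokenizer` and `rouge_1` are hypothetical names for illustration and are not part of rouge_score or ADK.

```python
from collections import Counter


class WhitespaceTokenizer:
    """Hypothetical tokenizer object exposing a tokenize(text) method."""

    def tokenize(self, text: str) -> list[str]:
        return text.lower().split()


def rouge_1(candidate: str, reference: str, tokenizer) -> dict[str, float]:
    """Unigram-overlap precision, recall, and F-measure (illustrative only)."""
    cand = Counter(tokenizer.tokenize(candidate))
    ref = Counter(tokenizer.tokenize(reference))
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


if __name__ == "__main__":
    print(rouge_1("สวัสดี ครับ", "สวัสดี ครับ", WhitespaceTokenizer()))
```

If the tokenizer returns no tokens for either input, every score in this sketch collapses to 0.0, which matches the symptom the commit addresses.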
