Description
LRMs and CLI agents often emit reasoning traces inside <think>...</think> blocks before the final answer. A single answer may contain several such blocks followed by the actual final response.
Currently, the evaluator scores this reasoning text along with the final answer unless the answer is normalized first. This makes scoring unreliable, especially for tasks that expect a single integer or a JSON object.
Expected Behavior
Before evaluation, the evaluator should remove all <think>...</think> blocks from the trajectory answer and score only the remaining final answer.
Proposed Fix
Add answer normalization in the evaluator before passing the answer to the scorer.
Example:
answer = _strip_think_blocks(traj.answer)
score = scorer(scenario, answer, trajectory_text)
The cleaned answer should also be recorded in the evaluation report.