Ignore <think> blocks before evaluating agent answers #417

Description

@ChathurangiShyalika

Large reasoning models (LRMs) and CLI agents return reasoning traces inside <think>...</think> blocks before the final answer. A single answer may contain several such blocks followed by the actual final response.

Currently, the reasoning text is passed to the scorer along with the final answer unless the answer is normalized first. This can make scoring unreliable, especially for tasks that expect a single integer or a JSON object, where any extra text can break parsing or exact matching.

Expected Behavior

Before evaluation, the evaluator should remove all <think>...</think> blocks from the trajectory answer and score only the remaining final answer.

Proposed Fix

Add answer normalization in the evaluator before passing the answer to the scorer.

Example:

answer = _strip_think_blocks(traj.answer)
score = scorer(scenario, answer, trajectory_text)

The cleaned answer should also be recorded in the evaluation report.
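As a rough illustration of recording the cleaned answer, the sketch below keeps both the raw and the cleaned text next to the score. `EvalRecord`, `evaluate_one`, and the field names are hypothetical and do not reflect the project's real report schema; the scorer is assumed to be a callable taking `(scenario, answer, trajectory_text)` as in the example above.

```python
import re
from dataclasses import dataclass
from typing import Callable

_THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def _strip_think_blocks(answer: str) -> str:
    # Assumed helper: drop all <think>...</think> blocks, then trim whitespace.
    return _THINK_BLOCK_RE.sub("", answer or "").strip()


@dataclass
class EvalRecord:
    """Hypothetical report row; the real report format may differ."""
    scenario: str
    raw_answer: str
    cleaned_answer: str
    score: float


def evaluate_one(
    scenario: str,
    raw_answer: str,
    trajectory_text: str,
    scorer: Callable[[str, str, str], float],
) -> EvalRecord:
    # Normalize first, score the cleaned text, and keep both versions for auditing.
    cleaned = _strip_think_blocks(raw_answer)
    score = scorer(scenario, cleaned, trajectory_text)
    return EvalRecord(scenario, raw_answer, cleaned, score)
```

Keeping the raw answer alongside the cleaned one makes it possible to check later what was stripped before scoring.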
