Ignore `<think>` blocks before evaluating agent answers #417

## Description
Large reasoning models (LRMs) and CLI agents often emit reasoning traces inside `<think>...</think>` blocks before the final answer. A single answer may contain several such blocks followed by the actual final response.

Currently, this reasoning text can reach the scorer unless the answer is normalized first. That makes scoring unreliable, especially for tasks that expect a single integer or a JSON object, where any extra text can break parsing or exact matching.
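
To illustrate the failure mode, here is a minimal sketch. It assumes a hypothetical scorer that parses the answer strictly as an integer. `parse_int_answer` and `raw_answer` are illustrative names, not part of the evaluator.

```python
def parse_int_answer(answer: str):
    """Hypothetical strict parser: return the int, or None if the answer is not a bare integer."""
    try:
        return int(answer.strip())
    except ValueError:
        return None


# Illustrative raw answer: a reasoning block followed by the final answer.
raw_answer = "<think>17 + 25 is 42, I should double-check.</think>\n42"

parsed_raw = parse_int_answer(raw_answer)  # None: the reasoning text breaks the parse
parsed_final = parse_int_answer("42")      # 42: the final answer alone parses cleanly
```

A correct final answer is rejected here only because the reasoning trace is still attached.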

## Expected Behavior
Before evaluation, the evaluator should remove all `<think>...</think>` blocks from the trajectory answer and score only the remaining final answer.

## Proposed Fix
Add answer normalization in the evaluator before passing the answer to the scorer.

Example:

```python
# Remove <think>...</think> blocks so only the final answer is scored.
answer = _strip_think_blocks(traj.answer)
score = scorer(scenario, answer, trajectory_text)
```

The cleaned answer should also be recorded in the evaluation report, so the report shows the text that was actually scored.

