[Bug] Evaluation stage crashes - IndexError when app_agent_log is empty

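# UFO Evaluation-Stage Crash Bug Report

**Project**: UFO (Windows desktop agent framework)
**Component**: Evaluation module (Evaluation Agent)
**Severity**: 🔴 High - causes task execution to fail
**Status**: Open, awaiting fix
**Report date**: 2025-04-10
**Test environment**: UFO 3.0.0 + NVIDIA LLaMA 4 Maverick (OpenRouterService)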

---

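## 📋 Summary

When the HostAgent completes a task on its own, without any AppAgent involvement, the evaluation stage of the UFO framework raises an `IndexError` and the whole program exits abnormally.

**Error message**: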
```
IndexError: list index out of range
  File "D:\apps\UFO-3.0.0\ufo\prompter\eva_prompter.py", line 108, in user_content_construction_head_tail
    trajectory.app_agent_log[0]
```

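**Trigger conditions**:
- The HostAgent decides the task is complete at its first step (for example, the target application is already open)
- The task requires no AppAgent steps
- The evaluation stage tries to index into the empty `app_agent_log` list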

---

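## 🔍 Steps to reproduce

### Method 1: Simplest reproduction

```bash
# 1. Make sure Calculator is already open (or that another target application is running)
# 2. Run the UFO task (the request text means "Open Calculator")
python -m ufo --task test_bug --request "打开计算器" --log-level ERROR
```

**Expected**: the task completes successfully
**Actual**: the evaluation stage crashes

### Method 2: A task that already reproduced the crash

```bash
python -m ufo --task test_simple --request "打开计算器" --log-level ERROR
```

### Method 3: Close an application that is already open

```bash
# First open Calculator manually
calc.exe
# Then run (the request text means "Close the currently open Calculator window")
python -m ufo --task test_shortcuts --request "关闭当前打开的计算器窗口" --log-level ERROR
```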

---

## 📊 Full error stack trace

```python
Traceback (most recent call last):
  File "D:\apps\UFO-3.0.0\ufo\module\basic.py", line 827, in evaluation
    result, cost = evaluator.evaluate(
                   ^^^^^^^^^^^^^^^^^^^
  File "D:\apps\UFO-3.0.0\ufo\agents\agent\evaluation_agent.py", line 135, in evaluate
    result = json_parser(result)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\apps\UFO-3.0.utils\__init__.py", line 91, in json_parser
    return json.loads(json_string)
           ^^^^^^^^^^^^^^^^^^^^^^^
  json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\apps\UFO-3.0.0\ufo\agents\agent\evaluation_agent.py", line 128, in evaluate
    message = self.message_constructor(
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\apps\UFO-3.0.0\ufo\agents\agent\evaluation_agent.py", line 78, in message_constructor
    evaagent_prompt_user_message = self.prompter.user_content_construction(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\apps\UFO-3.0.0\ufo\prompter\eva_prompter.py", line 89, in user_content_construction
    return self.user_content_construction_head_tail(log_path, request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\apps\UFO-3.0.0\ufo\prompter\eva_prompter.py", line 108, in user_content_construction_head_tail
    trajectory.app_agent_log[0]
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
```

---

## 🔎 Root cause analysis

### Location of the faulty code

**File**: `ufo/prompter/eva_prompter.py`
**Line**: 108

```python
def user_content_construction_head_tail(self, log_path, request):
    # ... preceding code ...
    # Faulty line: indexes the first element without checking whether the list is empty
    app_agent_first_step = trajectory.app_agent_log[0]
    # ...
```
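
The failure can be reproduced in isolation, without running UFO. The snippet below is a minimal sketch: `FakeTrajectory` and `first_app_agent_step` are hypothetical stand-ins that model only the three log lists and the unguarded access. They are not UFO's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTrajectory:
    """Hypothetical stand-in for the trajectory object; only the log lists are modeled."""
    host_agent_log: list = field(default_factory=list)
    app_agent_log: list = field(default_factory=list)
    computer_log: list = field(default_factory=list)


def first_app_agent_step(trajectory):
    # Mirrors the unguarded access reported at eva_prompter.py:108.
    return trajectory.app_agent_log[0]


# A session in which only the HostAgent produced a record.
host_only = FakeTrajectory(host_agent_log=[{"step": 1}])

try:
    first_app_agent_step(host_only)
    outcome = "no error"
except IndexError as exc:
    outcome = f"IndexError: {exc}"

print(outcome)  # prints: IndexError: list index out of range
```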

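### Data structure analysis

When the HostAgent completes the task directly, `trajectory.app_agent_log` is an empty list `[]`:

| Log type | Content | Notes |
|---------|------|------|
| `host_agent_log` | `[...]` | Contains 1-2 HostAgent records |
| `app_agent_log` | `[]` | ❌ Empty list (no AppAgent steps) |
| `computer_log` | `[...]` | Optional |

**Why is it empty?**
- The HostAgent decides the task is complete in the first round (for example, the application is already open)
- AppAgent execution is never triggered
- The evaluation code wrongly assumes that `app_agent_log` has at least one element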

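### Where the evaluation flow triggers the crash

1. The session finishes → `basic.py:run()` → `self.evaluation()`
2. The evaluator creates an EvaluationAgent → reads all log files
3. A `Trajectory` object is built → the three log lists are populated
4. `prompter.user_content_construction()` is called → accesses `app_agent_log[0]`
5. **Crash**: index out of range on the empty list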
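
The sequence above can be sketched with plain functions to show why a task that already succeeded still ends in an abnormal exit. All function names and the log layout here are simplified assumptions for illustration, not UFO's real signatures.

```python
def build_user_content(app_agent_log):
    # Step 4: the prompter reads the first AppAgent step unconditionally.
    return {"first_step": app_agent_log[0]}


def evaluate(logs):
    # Steps 2-3: the evaluator collects the logs and builds its prompt.
    return build_user_content(logs["app_agent_log"])


def run_session(logs):
    # Step 1: the task itself has already finished successfully at this point.
    task_outcome = "task completed"
    evaluate(logs)  # Step 5: with an empty list, IndexError escapes from this call
    return task_outcome


host_only_logs = {"host_agent_log": [{"step": 1}], "app_agent_log": []}

try:
    status = run_session(host_only_logs)
except IndexError:
    status = "crashed during evaluation"

print(status)  # prints: crashed during evaluation
```

Because the exception is raised after the task outcome is known but before `run_session` returns, the caller never sees the successful result.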

---

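## 📈 Impact

### Affected scenarios

| Scenario | Trigger probability | Impact |
|------|---------|------|
| Target application already running | High | 🗑️ Task reported as failed |
| Simple task (no follow-up actions needed) | Medium | 🗑️ Task reported as failed |
| Complex task (requires the AppAgent) | Low | ✅ Works normally |
| Multi-round task (AppAgent involved) | Low | ✅ Works normally |

### Statistics (from testing)

- **Tasks tested**: 6
- **Completed by the HostAgent alone**: 2 (33%)
- **Evaluation crashes**: 2/2 (100%)
- **Actual task success rate**: 100%
- **Abnormal program exit rate**: 33%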
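
The percentages follow directly from the raw counts. A short check that uses only the numbers reported in this section:

```python
total_tasks = 6          # tasks tested
host_only_tasks = 2      # tasks the HostAgent completed alone
evaluation_crashes = 2   # crashes observed, all in host-only tasks

host_only_share = host_only_tasks / total_tasks
crash_rate_when_host_only = evaluation_crashes / host_only_tasks
abnormal_exit_rate = evaluation_crashes / total_tasks

print(f"{host_only_share:.0%} {crash_rate_when_host_only:.0%} {abnormal_exit_rate:.0%}")
# prints: 33% 100% 33%
```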

---

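## 💡 Proposed solutions

### Option 1: Quick fix (recommended)

Add an empty-list check in `eva_prompter.py`:

```python
def user_content_construction_head_tail(self, log_path, request):
    # Helper names below are placeholders for the existing loading logic
    # and for a new fallback path.
    trajectory = self._load_trajectory(log_path)

    # New: check whether app_agent_log is empty
    if not trajectory.app_agent_log:
        return self._handle_empty_app_agent_log(trajectory, request)

    # Original code stays unchanged
    app_agent_first_step = trajectory.app_agent_log[0]
    # ...
```

**Advantages**:
- Minimal change
- Does not affect existing logic
- Handles the edge case explicitly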
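
One way to make the guard reusable is a small helper that returns the first and last steps of a log, or `None` for both when the log is empty. This is a sketch under the assumption that the head-tail prompt only needs those two entries; `head_and_tail` is a hypothetical name, not an existing UFO function.

```python
def head_and_tail(steps):
    """Return (first, last) from a list of steps, or (None, None) if it is empty."""
    if not steps:
        return None, None
    return steps[0], steps[-1]


print(head_and_tail([]))                        # (None, None)
print(head_and_tail([{"step": 1}]))             # ({'step': 1}, {'step': 1})
print(head_and_tail(["open", "type", "save"]))  # ('open', 'save')
```

The caller can then branch on `first is None` instead of relying on an `IndexError`.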

### Option 2: Improve the evaluation logic

Change the evaluation logic so that it supports tasks with no AppAgent steps:

```python
# eva_prompter.py
def user_content_construction_head_tail(self, log_path, request):
    trajectory = self._load_trajectory(log_path)

    # Choose the evaluation template based on whether app_agent_log has entries
    if trajectory.app_agent_log:
        return self._construct_with_app_agent(trajectory, request)
    else:
        return self._construct_host_only(trajectory, request)
```

**Advantages**:
- A more sensible evaluation model
- Distinguishes between task types
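
A minimal sketch of this dispatch, assuming the prompt is assembled as a list of text parts. `FakeTrajectory`, both builder functions, and the prompt wording are illustrative assumptions; the real templates would live in UFO's prompter.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTrajectory:
    """Hypothetical stand-in that models only two of the log lists."""
    host_agent_log: list = field(default_factory=list)
    app_agent_log: list = field(default_factory=list)


def construct_with_app_agent(trajectory, request):
    # Normal path: describe the first and last AppAgent steps.
    steps = trajectory.app_agent_log
    return [
        f"Request: {request}",
        f"First AppAgent step: {steps[0]}",
        f"Last AppAgent step: {steps[-1]}",
    ]


def construct_host_only(trajectory, request):
    # Host-only path: evaluate against the HostAgent records instead.
    parts = [f"Request: {request}", "No AppAgent steps were recorded."]
    parts += [f"HostAgent step: {step}" for step in trajectory.host_agent_log]
    return parts


def construct_user_content(trajectory, request):
    # Choose the template based on whether any AppAgent step exists.
    if trajectory.app_agent_log:
        return construct_with_app_agent(trajectory, request)
    return construct_host_only(trajectory, request)


host_only = FakeTrajectory(host_agent_log=["calculator already open"])
for line in construct_user_content(host_only, "open the calculator"):
    print(line)
```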

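### Option 3: Temporary workaround (user level)

If you do not want to modify the framework code, you can temporarily disable evaluation:

```bash
# Edit ufo.py or the configuration file to skip the self.evaluation() call
# Or modify the BasicSession.run() method
```

**Disadvantages**:
- No task quality evaluation is produced
- Users have to make the change themselves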
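
If a local patch is acceptable, the skip can be as small as a boolean guard around the evaluation call. The sketch below is hypothetical: `finish_session` and the `run_evaluation` flag are invented for illustration and do not correspond to a known UFO setting.

```python
def finish_session(evaluate, run_evaluation=True):
    """Run the evaluation callback only when it is enabled."""
    if not run_evaluation:
        # Evaluation skipped: the session ends without a quality score.
        return None
    return evaluate()


def crashing_evaluate():
    # Stand-in for an evaluator that fails on an empty app_agent_log.
    return [][0]


print(finish_session(crashing_evaluate, run_evaluation=False))  # prints: None
```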

---

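## 🧪 Verification cases

### Test case 1: HostAgent completes the task alone

```bash
# Precondition: Calculator is already open
calc.exe
sleep 2

# Run (the request text means "Open Calculator")
python -m ufo --task test_host_only --request "打开计算器" --log-level ERROR
```

**Expected after the fix**:
- ✅ The program exits normally
- ✅ The evaluation result is correct (task completed)
- ✅ No exception stack trace

### Test case 2: Normal multi-step task

```bash
# The request text means "Open Notepad and type some test text"
python -m ufo --task test_normal --request "打开记事本并输入测试文字" --log-level ERROR
```

**Expected**:
- ✅ Existing functionality is not broken
- ✅ Evaluation works normally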
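
The two cases can also be covered by fast unit tests that do not need a desktop session. The sketch below tests a hypothetical guarded helper, `summarize_app_steps`, against stand-in log lists; in the real code base the tests would target the patched prompter method instead.

```python
def summarize_app_steps(app_agent_log):
    """Guarded head/tail access: never indexes an empty list."""
    if not app_agent_log:
        return {"has_app_steps": False, "first": None, "last": None}
    return {
        "has_app_steps": True,
        "first": app_agent_log[0],
        "last": app_agent_log[-1],
    }


def test_host_only_trajectory():
    # Test case 1: the HostAgent finished alone, so the AppAgent log is empty.
    summary = summarize_app_steps([])
    assert summary == {"has_app_steps": False, "first": None, "last": None}


def test_multi_step_trajectory():
    # Test case 2: the normal path must still return the first and last steps.
    summary = summarize_app_steps(["open notepad", "type text"])
    assert summary["has_app_steps"] is True
    assert summary["first"] == "open notepad"
    assert summary["last"] == "type text"


if __name__ == "__main__":
    test_host_only_trajectory()
    test_multi_step_trajectory()
    print("all checks passed")
```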

---

## 📎 Related files

- **Faulty code**: `ufo/prompter/eva_prompter.py:108`
- **Call stack**: `ufo/agents/agent/evaluation_agent.py:78-128`
- **Test log**: `logs/test_simple/response.log`
- **Configuration**: `config/ufo/agents.yaml`
- **Custom service**: `ufo/llm/openrouter.py` (unrelated to this issue)

---

## 🏷️ Metadata

- **UFO version**: 3.0.0
- **Git commit**: (to be added)
- **Python**: 3.12
- **Platform**: Windows 10/11
- **LLM**: NVIDIA LLaMA 4 Maverick 17B
- **API**: OpenRouterService

---

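## 📬 Submission details

**Title**: [Bug] Evaluation stage crashes - IndexError: list index out of range when app_agent_log is empty

**Labels**: bug, evaluation, crash, edge-case

**Assigned to**: @microsoft/UFO-team

**Priority**: P2 (high)

**Description**:
```
When the HostAgent completes a task on its own, without AppAgent involvement,
the evaluation stage crashes because it indexes into an empty list.
This is a regression defect caused by an unhandled edge case in the evaluation code.

See the report for reproduction steps. Suggested fix: add an empty-list check before eva_prompter.py:108.
```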

---

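## ✅ Verifying the fix

After applying the fix, run the following tests:

```bash
# Test 1: HostAgent completes the task alone (request means "Open Calculator")
python -m ufo --task test_fix_1 --request "打开计算器" --log-level ERROR
# Expected: exits successfully, no exception

# Test 2: normal multi-step task (request means "Open Notepad and type 'Hello'")
python -m ufo --task test_fix_2 --request "打开记事本并输入'Hello'" --log-level ERROR
# Expected: runs and is evaluated normally

# Check the logs
cat logs/test_fix_*/evaluation.log
# Should contain valid evaluation content and no "IndexError"
```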
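
The same log check can be scripted in Python, which avoids depending on a Unix-like shell. A minimal sketch, assuming the log layout `logs/test_fix_*/evaluation.log` used above; `suspicious_evaluation_logs` is a hypothetical helper name.

```python
from pathlib import Path


def suspicious_evaluation_logs(log_root):
    """Return evaluation logs that are empty or mention IndexError."""
    flagged = []
    for log_file in sorted(Path(log_root).glob("test_fix_*/evaluation.log")):
        text = log_file.read_text(encoding="utf-8", errors="replace")
        if not text.strip() or "IndexError" in text:
            flagged.append(log_file)
    return flagged


if __name__ == "__main__":
    flagged = suspicious_evaluation_logs("logs")
    print("OK" if not flagged else f"Check these logs: {flagged}")
```

Note that an empty result only means no flagged log was found; it does not confirm that any `evaluation.log` file exists.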

---

**Report completed**: 2025-04-10
**Reporter**: Claude Code AI Assistant
**Attachment**: `bug_reports/UFO_EVALUATION_CRASH_20250410.md`

