Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
69 changes: 60 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,7 @@ Unlike prior approaches that simply concatenate full interaction histories, `ver
- [3. Sokoban](#3-sokoban)
- [4. Gym Cards](#4-gym-cards)
- [5. AppWorld (Experimental)](#5-appworld-experimental)
- [6. Search](#6-search)
- [Run Examples](#run-examples)
- [RL Training](#rl-training)
- [1. GiGPO](#1-gigpo)
Expand Down Expand Up @@ -139,7 +140,7 @@ We have released our models on [HuggingFace](https://huggingface.co/collections/
# Installation
## Install veRL
```bash
conda create -n verl-agent python==3.12 -y
conda create -n verl-agent python==3.10 -y
conda activate verl-agent

pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
Expand Down Expand Up @@ -177,12 +178,6 @@ alfworld-play-tw
---

### 2. WebShop
WebShop requires Python <=3.10, so begin by creating a new `verl-agent-webshop` environment
```bash
conda create -n verl-agent-webshop python==3.10 -y
conda activate verl-agent-webshop
```

Install WebShop
```bash
cd ./agent_system/environments/env_package/webshop/webshop
Expand Down Expand Up @@ -241,9 +236,59 @@ appworld install
appworld download data
```

### 6. Search
```bash
conda activate verl-agent
cd ./agent_system/environments/env_package/search/third_party
pip install -e .
pip install gym==0.26.2
```

Prepare dataset (data will be saved at `~/data/searchR1_processed_direct`):
```bash
cd repo_root/
python examples/data_preprocess/preprocess_search_r1_dataset.py
```


Since faiss-gpu is not available via pip, we setup a separate conda environment for the local retrieval server. Running this server will use around 6GB of GPU memory per GPU, so make sure to account for this in your training run configuration. Build Retriever environments:
```bash
# Create and activate the retriever environment with Python 3.10
conda create -n retriever python=3.10 -y
conda activate retriever

# Install PyTorch (with GPU support) and related libraries
conda install numpy==1.26.4 # needed to stop incompatible version of numpy from being installed via pip
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

<!-- > ⚠️ **Important:**
To run an agent in any of these environments, you must first install and configure the corresponding environment. Please refer to the [Environment Setup Guide](agent_system/environments/README.md) for step-by-step installation instructions. -->
# Install other Python packages
pip install transformers datasets pyserini huggingface_hub

# Install the GPU version of faiss
conda install faiss-gpu==1.8.0 -c pytorch -c nvidia -y

# Install the API service framework
pip install uvicorn fastapi
```

Download the index:
```bash
conda activate retriever

local_dir=~/data/searchR1
python examples/search/searchr1_download.py --local_dir $local_dir
cat $local_dir/part_* > $local_dir/e5_Flat.index
gzip -d $local_dir/wiki-18.jsonl.gz
```

Start the local flat e5 retrieval server:
```bash
conda activate retriever

# redirect the output to a file to avoid cluttering the terminal
# we have observed outputting to the terminal causing spikes in server response times
bash examples/search/retriever/retrieval_launch.sh > retrieval_server.log
```

# Run Examples
## RL Training
Expand All @@ -266,6 +311,9 @@ bash examples/gigpo_trainer/run_webshop.sh # WebShop
```bash
bash examples/gigpo_trainer/run_sokoban.sh # Sokoban
```
```bash
bash examples/gigpo_trainer/run_search.sh # Search
```
### 2. GRPO
GRPO is a critic-free algorithm that estimates relative advantages based on a group of full episode trajectories.
```bash
Expand All @@ -274,6 +322,9 @@ bash examples/grpo_trainer/run_alfworld.sh # ALFWorld
```bash
bash examples/grpo_trainer/run_webshop.sh # WebShop
```
```bash
bash examples/grpo_trainer/run_search.sh # Search
```
### 3. PPO
PPO is a classic actor-critic algorithm that updates the policy using a clipped objective to ensure stable learning. It requires a separate value network (critic) to estimate state values.
```bash
Expand Down
247 changes: 0 additions & 247 deletions agent_system/agent/agent.py

This file was deleted.

14 changes: 14 additions & 0 deletions agent_system/agent/agents/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# agent_system/agent/agents/__init__.py
from .action_agent import ActionAgent
from .memory_agent import MemoryAgent
from .planning_agent import PlanningAgent
from .reflexion_agent import ReflexionAgent
from .search_agent import SearchAgent

__all__ = [
"ActionAgent",
"MemoryAgent",
"PlanningAgent",
"ReflexionAgent",
"SearchAgent",
]
Loading