BuilderBench

_{Can AI models build a world which today's generative models can only dream of?}

Paper · Blogpost · Leaderboard

BuilderBench is a benchmark to accelerate research into training that centers around exploration. The vision for BuilderBench is to enable an open-ended stream of potential interactions, where pre-training could only ever cover a tiny slice of all possible behaviors. Our hypothesis is that the space of skills and discoveries that an agent has to know to build all possible structures is so vast, that it is impossible to memorize them at design time. This motivates the use of block-building to evaluate AI models in BuilderBench.

Please check out our paper for details about BuilderBench and check out the blogpost to see the failure modes of some of the strongest models (as of March 2026). We encourage you to try new ideas and submit your solutions to the live leaderboard!

Features include:

A simulated environment consisting of a robot interacting with building blocks.
A task suite of 51 tasks for evaluating AI models.
A wrapper for evaluating language model based agents.
Single interface for evaluating OpenAI, Claude and Gemini language models and open-source model.
Implementations of Chain of Thought and Reflexion agents.

Installation

Prerequisites: uv package manager.

Clone the repository:
```
git clone <repo-url>
cd builderbench
```
Install the package and its dependencies:
```
uv venv --python 3.12
uv sync
```
Set up your API keys by creating a SECRETS file in the project root:
```
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GEMINI_API_KEY=your-gemini-key
```
The above step is not needed for self-hosted models via vLLM.

Usage

Before running experiments, one must generate task meta-data files. To create task data (.npz files in builderbench/tasks/):

uv run python builderbench/create_task_data.py

This saves meta-data configurations for all tasks to builderbench/tasks/.

To run a Reflexion agent powered by GPT 5.2 on the cube-9-task-3:

uv run python run.py --client_name openai --model_id gpt-5.2-2025-12-11 --level_id cube-9-task-3 --agent_name cot --num_episode 3

To run a chain-of-thought agent powered by a self-hosted model (for e.g., Qwen3-4B-Instruct-2507) served via vLLM on the cube-1-task-1:

vllm serve Qwen/Qwen3-4B-Instruct-2507 --port 8080

uv run python run.py --client_name vllm --base_url http://localhost:8080/v1 --model_id Qwen/Qwen3-4B-Instruct-2507 --level_id cube-1-task-1 --agent_name cot

Agents Implemented

Agent	File
Naive `naive`	agents/naive.py
Chain-of-thought `cot`	agents/cot.py
Reflexion `reflexion`	agents/reflexion.py

Leaderboard Submission

After running run.py, results are stored under:

outputs/<agent_name>/<model_id>/<level_id>-seed-<seed>-timestamp-<...>/

Each run directory contains eval_summary.jsonl, run_config.json, and run_metadata.json.

To generate a submission-ready file for a single task:

python submit.py \
    --results_dir outputs/ \
    --level_id cube-5-task-3 \
    --model_id your-model-id \
    --agent_name your-agent-name \
    --website_url https://your-model-url

submit.py aggregates results across seeds, runs a consistency check (task version, git commit, model ID must match across seeds), and writes a leaderboard ready JSON file to tmp/<level_id>-leaderboard.json. See scripts/submit_claude_opus4.6.sh for an example script of how to submit all tasks for an agent at once.

Once the tmp folder is ready, follow the instructions in leaderboard repository and make a new pull request. You need to place the tmp folder in the cloned repository and run place_data.py file in the leaderboard repository and then make a new pull request.

Paper Experiments

The scripts to replicate experiments from the paper are in the scripts/ folder.

Agents evaluated:

Claude Opus 4.6 — reflexion agent, 3 episodes
Gemini 3 Flash Preview — reflexion agent, 3 episodes
GPT-5.2 — cot agent, 1 episode

To run all tasks for a model:

bash scripts/run_claude_opus4.6.sh
bash scripts/run_gemini_flash3.1.sh
bash scripts/run_openai_gpt5.2.sh

Tasks are processed in parallel (default is 1 job(s) in parallel) and logs are written inside the scripts/ folder.

To create submission-ready results:

bash scripts/submit_claude_opus4.6.sh
bash scripts/submit_gemini_flash3.1.sh
bash scripts/submit_openai_gpt5.2.sh

Submission scripts read the results of completed runs from outputs/ and calls submit.py for each task.

RL experiments

The code for tabula-rasa RL experiments can be found in the rl/ folder.

Teleoperation

The teleop folder provides two scripts for manually controlling the robot in any environment — useful for exploring tasks or debugging. You can control the robot using the keyboard or a slider based GUI window. See the teleop readme for details.

Acknowledgements

OGBench for providing the backbone code to control the robot.
Balrog for providing backbone code for the language model clients.
MuJoCo for the physics simulation.
MuJoCo Menagerie for the robot model.

Citation

@misc{ghugare2025builderbench,
      title={BuilderBench: The Building Blocks of Intelligent Agents}, 
      author={Raj Ghugare and Roger Creus Castanyer and Catherine Ji and Kathryn Wantlin and Jin Schofield and Karthik Narasimhan and Benjamin Eysenbach},
      year={2026},
      eprint={2510.06288},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.06288}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
agents		agents
builderbench		builderbench
media		media
rl		rl
scripts		scripts
teleop		teleop
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
run.py		run.py
submit.py		submit.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BuilderBench

Paper · Blogpost · Leaderboard

Installation

Usage

Leaderboard Submission

Paper Experiments

RL experiments

Teleoperation

Acknowledgements

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

BuilderBench

Paper · Blogpost · Leaderboard

Installation

Usage

Leaderboard Submission

Paper Experiments

RL experiments

Teleoperation

Acknowledgements

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages