Can AI models build a world which today's generative models can only dream of?
Paper · Blogpost · Leaderboard
BuilderBench is a benchmark to accelerate research into training that centers on exploration. The vision for BuilderBench is to enable an open-ended stream of potential interactions, of which pre-training could only ever cover a tiny slice. Our hypothesis is that the space of skills and discoveries an agent needs in order to build all possible structures is so vast that it cannot be memorized at design time. This motivates the use of block-building to evaluate AI models in BuilderBench.
Please check out our paper for details about BuilderBench and check out the blogpost to see the failure modes of some of the strongest models (as of March 2026). We encourage you to try new ideas and submit your solutions to the live leaderboard!
Features include:
- A simulated environment consisting of a robot interacting with building blocks.
- A task suite of 51 tasks for evaluating AI models.
- A wrapper for evaluating language model based agents.
- A single interface for evaluating OpenAI, Claude, and Gemini language models as well as open-source models.
- Implementations of Chain of Thought and Reflexion agents.
Prerequisites: uv package manager.
- Clone the repository:
  git clone <repo-url>
  cd builderbench
- Install the package and its dependencies:
  uv venv --python 3.12
  uv sync
- Set up your API keys by creating a SECRETS file in the project root:
  OPENAI_API_KEY=your-openai-key
  ANTHROPIC_API_KEY=your-anthropic-key
  GEMINI_API_KEY=your-gemini-key
  This step is not needed for self-hosted models served via vLLM.
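The SECRETS file uses plain KEY=value lines. How BuilderBench itself reads this file is not described here, so the following is only a minimal sketch, assuming one KEY=value pair per line (blank lines and `#` comments are an added assumption). `parse_secrets` and `load_secrets` are hypothetical helpers, not part of the repository.

```python
import os
from pathlib import Path


def parse_secrets(text: str) -> dict[str, str]:
    """Parse KEY=value lines; skip blank lines and '#' comments (assumed format)."""
    secrets = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        secrets[key.strip()] = value.strip()
    return secrets


def load_secrets(path: str = "SECRETS") -> dict[str, str]:
    """Hypothetical helper: export keys from the file unless already set."""
    secrets = parse_secrets(Path(path).read_text())
    for key, value in secrets.items():
        os.environ.setdefault(key, value)
    return secrets
```

Using `os.environ.setdefault` means a key already exported in the shell takes precedence over the file, which is one reasonable choice among several.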
Before running experiments, generate the task meta-data files (.npz files in builderbench/tasks/):
uv run python builderbench/create_task_data.py
This saves meta-data configurations for all tasks to builderbench/tasks/.
To run a Reflexion agent powered by GPT-5.2 on cube-9-task-3 for 3 episodes:
uv run python run.py --client_name openai --model_id gpt-5.2-2025-12-11 --level_id cube-9-task-3 --agent_name reflexion --num_episode 3
To run a chain-of-thought agent powered by a self-hosted model (e.g., Qwen3-4B-Instruct-2507) served via vLLM on cube-1-task-1:
vllm serve Qwen/Qwen3-4B-Instruct-2507 --port 8080
uv run python run.py --client_name vllm --base_url http://localhost:8080/v1 --model_id Qwen/Qwen3-4B-Instruct-2507 --level_id cube-1-task-1 --agent_name cot
Agents Implemented
| Agent | File |
|---|---|
| Naive (naive) | agents/naive.py |
| Chain-of-thought (cot) | agents/cot.py |
| Reflexion (reflexion) | agents/reflexion.py |
After running run.py, results are stored under:
outputs/<agent_name>/<model_id>/<level_id>-seed-<seed>-timestamp-<...>/
Each run directory contains eval_summary.jsonl, run_config.json, and run_metadata.json.
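As a rough illustration of the layout above, the sketch below walks outputs/ and gathers every eval_summary.jsonl together with the level ID and seed parsed from its directory name. It assumes the seed is an integer and that the directory name follows the pattern shown above exactly. The record schema inside eval_summary.jsonl is not documented here, so records are kept as plain dictionaries. `parse_run_dir`, `read_jsonl`, and `collect_runs` are hypothetical helpers.

```python
import json
import re
from pathlib import Path

# Assumed pattern: <level_id>-seed-<seed>-timestamp-<...>
RUN_DIR_RE = re.compile(
    r"^(?P<level_id>.+)-seed-(?P<seed>\d+)-timestamp-(?P<timestamp>.+)$"
)


def parse_run_dir(name: str) -> dict:
    """Split a run directory name into level_id, seed, and timestamp."""
    match = RUN_DIR_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected run directory name: {name!r}")
    info = match.groupdict()
    info["seed"] = int(info["seed"])
    return info


def read_jsonl(path) -> list:
    """Read one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def collect_runs(outputs_dir: str = "outputs") -> list:
    """Find every eval_summary.jsonl under outputs_dir and attach its run info."""
    runs = []
    for summary in sorted(Path(outputs_dir).rglob("eval_summary.jsonl")):
        try:
            info = parse_run_dir(summary.parent.name)
        except ValueError:
            continue  # skip directories that do not follow the assumed pattern
        info["records"] = read_jsonl(summary)
        runs.append(info)
    return runs
```

The recursive search is deliberate: a model ID that contains a slash, such as Qwen/Qwen3-4B-Instruct-2507, would likely produce an extra directory level under outputs/<agent_name>/.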
To generate a submission-ready file for a single task:
python submit.py \
--results_dir outputs/ \
--level_id cube-5-task-3 \
--model_id your-model-id \
--agent_name your-agent-name \
--website_url https://your-model-url
submit.py aggregates results across seeds, runs a consistency check (task version, git commit, and model ID must match across seeds), and writes a leaderboard-ready JSON file to tmp/<level_id>-leaderboard.json. See scripts/submit_claude_opus4.6.sh for an example of how to submit all tasks for an agent at once.
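The consistency check described above can be pictured with a short sketch. This is not the code in submit.py: the field names `task_version`, `git_commit`, and `model_id` are assumed for illustration, and `find_mismatches` is a hypothetical helper that takes one metadata dictionary per seed.

```python
def find_mismatches(
    run_metadata: list[dict],
    keys: tuple = ("task_version", "git_commit", "model_id"),  # assumed field names
) -> dict:
    """Return, for each key, the set of distinct values if the seeds disagree."""
    mismatches = {}
    for key in keys:
        values = {meta.get(key) for meta in run_metadata}
        if len(values) > 1:
            mismatches[key] = values
    return mismatches


def check_consistency(run_metadata: list[dict]) -> None:
    """Raise if any of the checked fields differs across seeds."""
    mismatches = find_mismatches(run_metadata)
    if mismatches:
        raise ValueError(f"inconsistent runs across seeds: {mismatches}")
```

Reporting every mismatched field at once, rather than stopping at the first, makes it easier to see whether a stale checkout or a mixed-up model ID is to blame.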
Once the tmp folder is ready, follow the instructions in the leaderboard repository: place the tmp folder in your clone of that repository, run its place_data.py script, and then open a new pull request.
The scripts to replicate experiments from the paper are in the scripts/ folder.
Agents evaluated:
- Claude Opus 4.6: reflexion agent, 3 episodes
- Gemini 3 Flash Preview: reflexion agent, 3 episodes
- GPT-5.2: cot agent, 1 episode
To run all tasks for a model:
bash scripts/run_claude_opus4.6.sh
bash scripts/run_gemini_flash3.1.sh
bash scripts/run_openai_gpt5.2.sh
Tasks can be processed in parallel (the default is 1 parallel job), and logs are written inside the scripts/ folder.
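For readers who prefer to drive runs from Python, the idea of processing tasks with a bounded number of parallel jobs can be sketched as follows. This is not what the shell scripts above do internally. It only reuses the run.py flags shown earlier in this README, and `build_command` and `run_all` are hypothetical helpers.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(level_id, client_name, model_id, agent_name, num_episode):
    """Assemble a run.py invocation from the flags shown in this README."""
    return [
        "uv", "run", "python", "run.py",
        "--client_name", client_name,
        "--model_id", model_id,
        "--level_id", level_id,
        "--agent_name", agent_name,
        "--num_episode", str(num_episode),
    ]


def run_all(level_ids, max_jobs=1, **run_kwargs):
    """Run one process per task, at most max_jobs at a time; return exit codes."""
    level_ids = list(level_ids)
    commands = [build_command(level_id, **run_kwargs) for level_id in level_ids]
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        exit_codes = pool.map(lambda cmd: subprocess.run(cmd).returncode, commands)
        return dict(zip(level_ids, exit_codes))


# Example call (not executed here):
# run_all(["cube-1-task-1"], max_jobs=1, client_name="openai",
#         model_id="gpt-5.2-2025-12-11", agent_name="cot", num_episode=1)
```

Threads are sufficient here because each worker only waits on a child process; the heavy work happens in the spawned run.py processes.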
To create submission-ready results:
bash scripts/submit_claude_opus4.6.sh
bash scripts/submit_gemini_flash3.1.sh
bash scripts/submit_openai_gpt5.2.sh
The submission scripts read the results of completed runs from outputs/ and call submit.py for each task.
The code for tabula-rasa RL experiments can be found in the rl/ folder.
The teleop folder provides two scripts for manually controlling the robot in any environment, which is useful for exploring tasks or debugging. You can control the robot using the keyboard or a slider-based GUI window. See the teleop readme for details.
- OGBench for providing the backbone code to control the robot.
- Balrog for providing backbone code for the language model clients.
- MuJoCo for the physics simulation.
- MuJoCo Menagerie for the robot model.
@misc{ghugare2025builderbench,
title={BuilderBench: The Building Blocks of Intelligent Agents},
author={Raj Ghugare and Roger Creus Castanyer and Catherine Ji and Kathryn Wantlin and Jin Schofield and Karthik Narasimhan and Benjamin Eysenbach},
year={2026},
eprint={2510.06288},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2510.06288},
}