BuildBench: LLM-Assisted Compilation Benchmark

BuildBench is a benchmark and multi-agent system for automatically compiling C/C++ GitHub repositories using Large Language Models. Agents iteratively generate and execute shell commands, diagnosing and fixing build errors through dialogue. You can find the paper at arXiv:2509.25248.

Setup

Prerequisites

Python 3.10+
Docker
(Optional) Kubernetes cluster for batch execution

Installation

conda create -n BuildBench python=3.10
conda activate BuildBench
pip install -r requirements.txt

Docker Images

# Build base image (compilers, build tools, Python deps)
sudo docker build -t docker_image_compilation -f src/Dockerfile_compilation .

# Build K8S worker image
sudo docker build -t docker_image_compilation_k8s -f src/Dockerfile_k8s .

Running

Single Repository (Local Docker)

python3 src/main.py --api_key=$API_KEY --model_name=o3-mini \
  --github_repo=https://github.com/user/repo.git --random=-1 \
  --host_project_dir=$(pwd)

All Repositories (Local Docker)

python3 src/main.py --api_key=$API_KEY --model_name=o3-mini \
  --host_project_dir=$(pwd)

Kubernetes Batch Execution

python3 orchestrator.py --api_key=$API_KEY --model_name=claude-3-7-sonnet-20250219 \
  --docker_image=your_registry/llm_assisted_compilation_k8s:latest \
  --k8s_parallelism=50 --data_path=data/sampled_repos_149_cleaned_higher_split_compilable.jsonl \
  --retrieval --host_project_dir=$(pwd)

Tests

python3 tests/API_key_test.py
python3 tests/docker_test.py
python3 tests/get_repo_test.py
python3 tests/link_extraction_test.py

Architecture

Execution Flow

orchestrator.py → K8S Jobs → worker_main.py → compilation.py → Agent.compile_repo()
                                                                   ↓
src/main.py (local) ──────→ compilation.py → Agent.compile_repo()

Entry points: orchestrator.py (K8S batch), src/main.py (local Docker), src/compilation.py (inside container)
Agent dialogue: src/agents.py — creates a GroupChat with 2–3 AutoGen agents (Bash Command Generator + Executor, optionally a Build Instructions Retriever). Agents loop: generate commands → execute → analyze errors → retry.
Termination: compilation ends when an agent outputs the "terminate" keyword or the dialogue reaches max_turns.
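The loop described above (generate commands → execute → analyze errors → retry, stopping on the "terminate" keyword or at max_turns) can be sketched roughly as follows. This is a minimal illustration and not the code in src/agents.py: `run_build_dialogue`, `DialogueResult`, and the two callables are hypothetical names, and the real system drives the exchange through an AutoGen GroupChat instead of a plain loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class DialogueResult:
    """Outcome of one build dialogue (hypothetical structure)."""
    terminated_by_agent: bool  # True if the generator emitted "terminate"
    turns_used: int
    transcript: List[Tuple[str, str]] = field(default_factory=list)


def run_build_dialogue(
    generate: Callable[[str], str],
    execute: Callable[[str], str],
    max_turns: int = 5,
) -> DialogueResult:
    """Alternate between command generation and execution.

    `generate` stands in for the LLM-backed command generator and receives
    the previous execution output; `execute` stands in for the executor.
    """
    transcript: List[Tuple[str, str]] = []
    feedback = ""  # no execution output exists before the first turn
    for turn in range(1, max_turns + 1):
        message = generate(feedback)
        if "terminate" in message.lower():
            return DialogueResult(True, turn, transcript)
        output = execute(message)
        transcript.append((message, output))
        feedback = output  # errors flow back into the next generation step
    return DialogueResult(False, max_turns, transcript)


# Stand-in agents: issue one command, then stop.
def fake_generator(feedback: str) -> str:
    return "make" if feedback == "" else "terminate"


def fake_executor(command: str) -> str:
    return f"ran: {command}"


result = run_build_dialogue(fake_generator, fake_executor, max_turns=5)
print(result.terminated_by_agent, result.turns_used)
```

Note that stopping on "terminate" only means the agents chose to end the dialogue; whether the build actually succeeded is determined separately during evaluation.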

Key Components

File	Description
`src/agents.py`	Multi-agent orchestration using Microsoft AutoGen
`src/prompts.py`	Agent system prompts with build rules
`src/bash_executor.py`	Safe command executor with sanitization
`src/build_info_retrieval.py`	Heuristic retrieval from README/docs
`src/RAG_retrieval.py`	LangChain/Chroma vector search over docs
`src/validation_pipeline.py`	Compilation validation (source vs binary functions)
`src/worker_main.py`	K8S pod entry point

Retrieval Modes

Three retrieval strategies supply build context to agents:

Perfect Retrieval (--perfect_retrieval): Hand-curated URLs from PERFECT_RETRIEVAL_DICT
RAG Retrieval (--RAG_retrieval): Vector search over repo docs using LangChain/Chroma
Heuristic Retrieval (--retrieval): LLM-driven extraction from README and build files
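As an illustration of how the three flags above might map to a single strategy, here is a small argparse sketch. The flag names come from this README; the function name `select_retrieval_mode`, the returned labels, and the precedence order (perfect, then RAG, then heuristic) are assumptions made for the example and may not match how src/main.py resolves the flags.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing only the three retrieval flags named in this README."""
    parser = argparse.ArgumentParser(
        description="Retrieval flag sketch", allow_abbrev=False
    )
    parser.add_argument("--perfect_retrieval", action="store_true")
    parser.add_argument("--RAG_retrieval", action="store_true")
    parser.add_argument("--retrieval", action="store_true")
    return parser


def select_retrieval_mode(argv: Optional[List[str]] = None) -> str:
    """Return one strategy label; the precedence order is an assumption."""
    args = build_parser().parse_args(argv)
    if args.perfect_retrieval:
        return "perfect"
    if args.RAG_retrieval:
        return "rag"
    if args.retrieval:
        return "heuristic"
    return "none"  # no retrieval context is supplied to the agents


print(select_retrieval_mode(["--retrieval"]))
```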

Supported LLM Providers

OpenAI (GPT-4o, o3-mini): standard OpenAI API
Claude (Claude 3.5/3.7 Sonnet): api_type="anthropic"
Gemini: Google Generative AI API
Qwen: via HuggingFace router

Baselines

The src/compilation_baselines/ directory contains all baseline build methods evaluated in Table 1 of the paper:

Baseline	Type	Description
GHCC	Rule-based	Heuristic Makefile-based compilation (huzecong/ghcc)
Assemblage	Rule-based	Automated build system detection (Assemblage-Dataset/Assemblage)
CompileAgent	Multi-Agent	Flow-based agent with CompileNavigator + ErrorSolver (Hu et al., 2025)
LLM Baseline	Single-Turn	One-shot LLM command generation (o3-mini, Claude 3.7 Sonnet)
LLM Multi-Turn	Multi-Turn	Iterative LLM with error refinement

# Run a baseline
python3 src/compilation_baselines/main.py --compilation_method assemblage \
  --input_file_path data/sampled_repos_149_cleaned_higher_split_compilable.jsonl

See src/compilation_baselines/README.md for detailed baseline documentation.

Project Structure

├── src/                    # Core compilation system
│   ├── compilation_baselines/  # Baseline build methods (GHCC, Assemblage, CompileAgent, LLM)
│   └── ...                 # OSS-BUILD-AGENT components
├── postprocessing/         # Result analysis pipeline
├── data_labeling/          # Human evaluation Dockerfiles
├── data/                   # Benchmark data & references
├── tests/                  # Test suite
├── scripts/                # Build & deployment scripts
├── orchestrator.py         # K8S batch orchestration
└── job-template.yaml       # K8S job template

See postprocessing/README.md, data_labeling/README.md, data/README.md, and src/compilation_baselines/README.md for component-specific documentation.

Evaluating Results

After compilation, evaluate Strict/Flexible success metrics against expert-labeled binary names:

python3 postprocessing/evaluate_success.py \
  --ground_truth data/compilation_label.json \
  --compiled_dir compiled_repos/ \
  --output compiled_results/evaluation.json

Run postprocessing analysis on compiled binaries:

python3 postprocessing/postprocessing.py --compiled_dir compiled_repos/ --results_dir compiled_results/

Data

Validation Set (70 repos): Included in data/Compilation_Bench_Validation_Set_70.csv
Full Benchmark (385 repos) and Compilable Subset (149 repos): Available on HuggingFace

See data/README.md for schema details and download instructions.

Environment Variables

Variable	Description
`API_KEY`	LLM provider API key
`MODEL_NAME`	Model to use (e.g., `o3-mini`, `claude-3-7-sonnet-20250219`)
`TIMEOUT_BASH`	Command timeout in seconds
`MAX_TURNS`	Maximum agent conversation turns
`CORES`	Number of CPU cores for parallel compilation
`RETRIEVAL`	Enable heuristic retrieval (`True`/`False`)
`RAG_RETRIEVAL`	Enable RAG retrieval (`True`/`False`)
`PERFECT_RETRIEVAL`	Enable perfect retrieval (`True`/`False`)
`TAVILY_API_KEY`	(Optional) Tavily API key for web search
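A worker process could collect the variables in the table above into a typed configuration object along these lines. This is a sketch under stated assumptions: the variable names are taken from the table, but `RunConfig`, `load_config`, and every default value (for example a 600-second timeout) are hypothetical and are not taken from the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret the `True`/`False` strings used by the retrieval switches."""
    return value.strip().lower() == "true"


@dataclass(frozen=True)
class RunConfig:
    api_key: str
    model_name: str
    timeout_bash: int  # seconds
    max_turns: int
    cores: int
    retrieval: bool
    rag_retrieval: bool
    perfect_retrieval: bool
    tavily_api_key: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> RunConfig:
    """Build a RunConfig from environment variables.

    All fallback values below are placeholders chosen for this example.
    """
    return RunConfig(
        api_key=env.get("API_KEY", ""),
        model_name=env.get("MODEL_NAME", "o3-mini"),
        timeout_bash=int(env.get("TIMEOUT_BASH", "600")),
        max_turns=int(env.get("MAX_TURNS", "20")),
        cores=int(env.get("CORES", "1")),
        retrieval=_as_bool(env.get("RETRIEVAL", "False")),
        rag_retrieval=_as_bool(env.get("RAG_RETRIEVAL", "False")),
        perfect_retrieval=_as_bool(env.get("PERFECT_RETRIEVAL", "False")),
        tavily_api_key=env.get("TAVILY_API_KEY"),
    )


# Passing a plain dict instead of os.environ keeps the example reproducible.
example = load_config({"MODEL_NAME": "claude-3-7-sonnet-20250219", "RETRIEVAL": "True"})
print(example.model_name, example.retrieval)
```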

Citation

@article{zhang2025buildbench,
  title={BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software},
  author={Zhang, Zehua and Bajaj, Ati Priya and Handa, Divij and Liu, Siyu and Raj, Arvind S and Chen, Hongkai and Wang, Hulin and Liu, Yibo and Basque, Zion Leonahenahe and Nath, Souradip and Juneja, Vishal and Chapre, Nikhil and Bao, Tiffany and Shoshitaishvili, Yan and Doup{\'e}, Adam and Baral, Chitta and Wang, Ruoyu},
  journal={arXiv preprint arXiv:2509.25248},
  year={2025}
}

License

MIT License — see LICENSE for details.
