StevenZ904/BuildBench

BuildBench: LLM-Assisted Compilation Benchmark

BuildBench is a benchmark and multi-agent system for automatically compiling C/C++ GitHub repositories using Large Language Models. Agents iteratively generate and execute shell commands, diagnosing and fixing build errors through dialogue. You can find the paper at arXiv:2509.25248.

Setup

Prerequisites

  • Python 3.10+
  • Docker
  • (Optional) Kubernetes cluster for batch execution

Installation

conda create -n BuildBench python=3.10
conda activate BuildBench
pip install -r requirements.txt

Docker Images

# Build base image (compilers, build tools, Python deps)
sudo docker build -t docker_image_compilation -f src/Dockerfile_compilation .

# Build K8S worker image
sudo docker build -t docker_image_compilation_k8s -f src/Dockerfile_k8s .

Running

Single Repository (Local Docker)

python3 src/main.py --api_key=$API_KEY --model_name=o3-mini \
  --github_repo=https://github.com/user/repo.git --random=-1 \
  --host_project_dir=$(pwd)

All Repositories (Local Docker)

python3 src/main.py --api_key=$API_KEY --model_name=o3-mini \
  --host_project_dir=$(pwd)

Kubernetes Batch Execution

python3 orchestrator.py --api_key=$API_KEY --model_name=claude-3-7-sonnet-20250219 \
  --docker_image=your_registry/llm_assisted_compilation_k8s:latest \
  --k8s_parallelism=50 --data_path=data/sampled_repos_149_cleaned_higher_split_compilable.jsonl \
  --retrieval --host_project_dir=$(pwd)

Tests

python3 tests/API_key_test.py
python3 tests/docker_test.py
python3 tests/get_repo_test.py
python3 tests/link_extraction_test.py

Architecture

Execution Flow

orchestrator.py → K8S Jobs → worker_main.py → compilation.py → Agent.compile_repo()
                                                                   ↓
src/main.py (local) ──────→ compilation.py → Agent.compile_repo()

  1. Entry points: orchestrator.py (K8S batch), src/main.py (local Docker), src/compilation.py (inside container).
  2. Agent dialogue: src/agents.py creates a GroupChat with 2–3 AutoGen agents (Bash Command Generator and Executor, optionally a Build Instructions Retriever). The agents loop: generate commands → execute → analyze errors → retry.
  3. Termination: compilation ends when an agent outputs the "terminate" keyword or the dialogue reaches max_turns.
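
The loop and its two stopping conditions can be sketched in plain Python. This is an illustration only, not the code in src/agents.py: it does not use AutoGen, and `compile_loop`, `fake_generate`, and `fake_execute` are hypothetical names standing in for the LLM-backed generator and the command executor.

```python
from typing import Callable, List, Tuple


def compile_loop(
    generate: Callable[[List[str]], str],
    execute: Callable[[str], Tuple[int, str]],
    max_turns: int = 5,
) -> Tuple[bool, List[str]]:
    """Alternate between generating and executing commands.

    Returns (terminated_by_keyword, history). The loop stops when the
    generator emits "terminate" or after max_turns iterations.
    """
    history: List[str] = []
    for _ in range(max_turns):
        command = generate(history)
        if "terminate" in command.lower():
            return True, history
        exit_code, output = execute(command)
        # The next generation step sees the command, exit code, and output.
        history.append(f"$ {command}\nexit={exit_code}\n{output}")
    return False, history


# Hypothetical stand-ins: a scripted "generator" and a fake "executor".
def fake_generate(history: List[str]) -> str:
    script = ["make", "./configure && make", "terminate"]
    return script[len(history)]


def fake_execute(command: str) -> Tuple[int, str]:
    if command == "make":
        return 2, "make: *** No targets specified and no makefile found."
    return 0, "ok"


finished, log = compile_loop(fake_generate, fake_execute, max_turns=5)
```

In this toy run the first command fails, the second succeeds, and the third turn emits the keyword, so the loop ends before max_turns is reached.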

Key Components

| File | Description |
| --- | --- |
| src/agents.py | Multi-agent orchestration using Microsoft AutoGen |
| src/prompts.py | Agent system prompts with build rules |
| src/bash_executor.py | Safe command executor with sanitization |
| src/build_info_retrieval.py | Heuristic retrieval from README/docs |
| src/RAG_retrieval.py | LangChain/Chroma vector search over docs |
| src/validation_pipeline.py | Compilation validation (source vs binary functions) |
| src/worker_main.py | K8S pod entry point |
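
To make the executor's role concrete, here is a minimal sketch of running one shell command under a timeout and returning its exit code and combined output. It is not the implementation in src/bash_executor.py: `run_command` is a hypothetical name, the sanitization step is omitted entirely, the use of exit code 124 on timeout is an assumption borrowed from the GNU `timeout` convention, and `bash` is assumed to be on the PATH.

```python
import subprocess
from typing import Tuple


def run_command(command: str, timeout_s: float = 60.0) -> Tuple[int, str]:
    """Run a shell command and return (exit_code, combined stdout/stderr)."""
    try:
        proc = subprocess.run(
            ["bash", "-c", command],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            timeout=timeout_s,
        )
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        # Assumed convention: report a timeout as exit code 124.
        return 124, f"timed out after {timeout_s}s"
```

Merging stderr into stdout keeps compiler diagnostics in the order they were printed, which is the form an error-analysis step would most likely want to read.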

Retrieval Modes

Three retrieval strategies supply build context to agents:

  1. Perfect Retrieval (--perfect_retrieval): Hand-curated URLs from PERFECT_RETRIEVAL_DICT
  2. RAG Retrieval (--RAG_retrieval): Vector search over repo docs using LangChain/Chroma
  3. Heuristic Retrieval (--retrieval): LLM-driven extraction from README and build files
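
As a sketch of how the three flags might be parsed, assuming each is a boolean switch (the Kubernetes example passes --retrieval without a value), the snippet below reports which strategies were requested. The function names and the short mode labels are hypothetical, and how the real entry points behave when several flags are combined is not covered here.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Retrieval flag sketch")
    # Flag names come from the list above; store_true is an assumption.
    parser.add_argument("--perfect_retrieval", action="store_true")
    parser.add_argument("--RAG_retrieval", action="store_true")
    parser.add_argument("--retrieval", action="store_true")
    return parser


def enabled_modes(argv: List[str]) -> List[str]:
    """Return the requested retrieval strategies, in the order listed above."""
    args = build_parser().parse_args(argv)
    modes: List[str] = []
    if args.perfect_retrieval:
        modes.append("perfect")
    if args.RAG_retrieval:
        modes.append("rag")
    if args.retrieval:
        modes.append("heuristic")
    return modes
```

For example, `enabled_modes(["--retrieval"])` yields a single-element list for the heuristic strategy, and an empty argument list yields no retrieval at all.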

Supported LLM Providers

  • OpenAI (GPT-4o, o3-mini): standard OpenAI API
  • Claude (Claude 3.5/3.7 Sonnet): api_type="anthropic"
  • Gemini: Google Generative AI API
  • Qwen: via HuggingFace router

Baselines

The src/compilation_baselines/ directory contains all baseline build methods evaluated in Table 1 of the paper:

| Baseline | Type | Description |
| --- | --- | --- |
| GHCC | Rule-based | Heuristic Makefile-based compilation (huzecong/ghcc) |
| Assemblage | Rule-based | Automated build system detection (Assemblage-Dataset/Assemblage) |
| CompileAgent | Multi-Agent | Flow-based agent with CompileNavigator + ErrorSolver (Hu et al., 2025) |
| LLM Baseline | Single-Turn | One-shot LLM command generation (o3-mini, Claude 3.7-Sonnet) |
| LLM Multi-Turn | Multi-Turn | Iterative LLM with error refinement |

# Run a baseline
python3 src/compilation_baselines/main.py --compilation_method assemblage \
  --input_file_path data/sampled_repos_149_cleaned_higher_split_compilable.jsonl

See src/compilation_baselines/README.md for detailed baseline documentation.

Project Structure

├── src/                    # Core compilation system
│   ├── compilation_baselines/  # Baseline build methods (GHCC, Assemblage, CompileAgent, LLM)
│   └── ...                 # OSS-BUILD-AGENT components
├── postprocessing/         # Result analysis pipeline
├── data_labeling/          # Human evaluation Dockerfiles
├── data/                   # Benchmark data & references
├── tests/                  # Test suite
├── scripts/                # Build & deployment scripts
├── orchestrator.py         # K8S batch orchestration
└── job-template.yaml       # K8S job template

See postprocessing/README.md, data_labeling/README.md, data/README.md, and src/compilation_baselines/README.md for component-specific documentation.

Evaluating Results

After compilation, evaluate Strict/Flexible success metrics against expert-labeled binary names:

python3 postprocessing/evaluate_success.py \
  --ground_truth data/compilation_label.json \
  --compiled_dir compiled_repos/ \
  --output compiled_results/evaluation.json

Run postprocessing analysis on compiled binaries:

python3 postprocessing/postprocessing.py --compiled_dir compiled_repos/ --results_dir compiled_results/

Data

  • Validation Set (70 repos): Included in data/Compilation_Bench_Validation_Set_70.csv
  • Full Benchmark (385 repos) and Compilable Subset (149 repos): Available on HuggingFace

See data/README.md for schema details and download instructions.

Environment Variables

| Variable | Description |
| --- | --- |
| API_KEY | LLM provider API key |
| MODEL_NAME | Model to use (e.g., o3-mini, claude-3-7-sonnet-20250219) |
| TIMEOUT_BASH | Command timeout in seconds |
| MAX_TURNS | Maximum agent conversation turns |
| CORES | Number of CPU cores for parallel compilation |
| RETRIEVAL | Enable heuristic retrieval (True/False) |
| RAG_RETRIEVAL | Enable RAG retrieval (True/False) |
| PERFECT_RETRIEVAL | Enable perfect retrieval (True/False) |
| TAVILY_API_KEY | (Optional) Tavily API key for web search |
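
The following sketch shows one way these variables could be read into a typed settings object. The variable names come from the table above; the `Settings` class, the `load_settings` helper, and every default value are assumptions made for illustration and may differ from what the project actually uses.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def as_bool(value: str) -> bool:
    """Interpret the True/False strings used by the retrieval variables."""
    return value.strip().lower() == "true"


@dataclass(frozen=True)
class Settings:
    api_key: str
    model_name: str
    timeout_bash: int
    max_turns: int
    cores: int
    retrieval: bool
    rag_retrieval: bool
    perfect_retrieval: bool
    tavily_api_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # All defaults below are placeholders, not the project's real defaults.
    return Settings(
        api_key=env.get("API_KEY", ""),
        model_name=env.get("MODEL_NAME", "o3-mini"),
        timeout_bash=int(env.get("TIMEOUT_BASH", "600")),
        max_turns=int(env.get("MAX_TURNS", "20")),
        cores=int(env.get("CORES", str(os.cpu_count() or 1))),
        retrieval=as_bool(env.get("RETRIEVAL", "False")),
        rag_retrieval=as_bool(env.get("RAG_RETRIEVAL", "False")),
        perfect_retrieval=as_bool(env.get("PERFECT_RETRIEVAL", "False")),
        tavily_api_key=env.get("TAVILY_API_KEY"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check without touching the real environment.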

Citation

@article{zhang2025buildbench,
  title={BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software},
  author={Zhang, Zehua and Bajaj, Ati Priya and Handa, Divij and Liu, Siyu and Raj, Arvind S and Chen, Hongkai and Wang, Hulin and Liu, Yibo and Basque, Zion Leonahenahe and Nath, Souradip and Juneja, Vishal and Chapre, Nikhil and Bao, Tiffany and Shoshitaishvili, Yan and Doup{\'e}, Adam and Baral, Chitta and Wang, Ruoyu},
  journal={arXiv preprint arXiv:2509.25248},
  year={2025}
}

License

MIT License — see LICENSE for details.
