PwP-Bench unifies existing software engineering (SWE) benchmarks to evaluate AI agents' ability to interact with visual interfaces and perform programming tasks. Each benchmark focuses on a different aspect of programming and interface interaction.
| Task | Description | Input | Output | Evaluation |
|---|---|---|---|---|
| HumanEval | Python coding problems | Problem specification | Python function | Functional correctness |
| Design2Code | Converting design mockups to code | Design image | HTML/CSS | Visual similarity |
| ChartMimic | Recreating charts from visual references | Chart image | Code to generate chart | Visual similarity |
| InterCode | Interactive coding (bash, SQL, CTF) | Task specification | Interactive solution | Task completion |
| RES-Q | Reasoning about SQL queries | Database + question | SQL query | Query correctness |
| CanitEdit | Code editing tasks | Code + edit instruction | Edited code | Edit correctness |
| VSCode | VSCode-specific tasks | Task specification | VSCode operations | Task completion |
| Bird | BI reporting dashboard tasks | Business question | SQL query | Query correctness |
| DSBench | Data science benchmark | Data + task | Analysis code | Output correctness |
| SWE-bench | Software engineering tasks | Code + bug report | Fixed code | Functional correctness |
| SWE-bench-java | Java software engineering tasks | Code + task | Java solution | Functional correctness |
| SWE-bench-mm | Multimodal software engineering | Code + screenshots | Fixed code | Functional correctness |
| MiniCtx | Minimal context understanding | Code snippet | Solution | Output correctness |
| NoCode | No-code tool interaction | Task | Interface operations | Task completion |
To download and set up a particular task, go to its corresponding folder and run the first_time.sh script:
cd pwp_bench/Design2Code
./first_time.sh
Alternatively, you can use the provided setup script:
python -m pwp_bench.setup_benchmark --task Design2Code
Here's how to use a benchmark task:
from pwp.bench import PwPBench
# Create a benchmark instance
bench = PwPBench('humaneval')
# Get the dataset (all tasks in the benchmark)
dataset = bench.get_dataset()
# Create an environment for a specific task
task_env = bench.get_env(dataset[0])
# Interact with the environment (for agent implementation)
task_env.step("ls -la")
screenshot = task_env.render()
# Evaluate the solution
reward = bench.get_reward(task_env, dataset[0])
print(f"Task reward: {reward}")
Each benchmark follows a standard directory structure:
benchmark_name/
├── data.json # Task data in JSON format
├── data.jsonl # Or in JSONL format (one task per line)
├── first_time.sh # Setup script
├── Dockerfile # Docker configuration
└── setup_files/
    ├── setup.py # Environment setup for tasks
    └── eval.py # Evaluation functions for tasks
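The two data layouts in the tree above can be read with a small helper. This is a minimal sketch, not the loader PwP-Bench itself uses: `load_tasks` is a hypothetical name, and it assumes that data.json holds a JSON array of task objects while data.jsonl holds one JSON object per non-empty line.

```python
import json
from pathlib import Path


def load_tasks(benchmark_dir):
    """Return the list of task dicts stored in a benchmark directory.

    Hypothetical helper: assumes data.json is a JSON array of tasks and
    data.jsonl holds one JSON object per non-empty line.
    """
    root = Path(benchmark_dir)
    json_path = root / "data.json"
    jsonl_path = root / "data.jsonl"

    if json_path.exists():
        with json_path.open(encoding="utf-8") as f:
            tasks = json.load(f)
        if not isinstance(tasks, list):
            raise ValueError(f"{json_path} should contain a JSON array of tasks")
        return tasks

    if jsonl_path.exists():
        with jsonl_path.open(encoding="utf-8") as f:
            # Skip blank lines so a trailing newline does not break parsing.
            return [json.loads(line) for line in f if line.strip()]

    raise FileNotFoundError(f"no data.json or data.jsonl in {root}")
```

When both files are present this sketch prefers data.json; that ordering is an arbitrary choice made for the example.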
If you have a benchmark you'd like to see included in PwP-Bench, please contact us to discuss integration. We're actively looking to expand the range of tasks and welcome community contributions.
To add a new benchmark to the PwP-Bench suite:
- Create a new directory in pwp_bench/ with your benchmark name
- Create a data.json or data.jsonl file with your task examples
- Create a setup_files directory with setup.py and eval.py
- Create a first_time.sh script to handle any necessary downloads
- Add a Dockerfile for any custom environment setup
- Add your benchmark to task_configs in pwp.bench.benchmark
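Before opening a pull request, it can help to check that a new benchmark directory contains the files from the steps above. The following is a minimal sketch under stated assumptions: `missing_files` and `REQUIRED` are hypothetical names, the Dockerfile is treated as required even though a benchmark without custom environment needs might not have one, and the final registration step in task_configs is not checked.

```python
from pathlib import Path

# Files named in the checklist above; data.json and data.jsonl are alternatives
# and are handled separately below.
REQUIRED = [
    "first_time.sh",
    "Dockerfile",
    "setup_files/setup.py",
    "setup_files/eval.py",
]


def missing_files(benchmark_dir):
    """Return the expected files that are absent from benchmark_dir."""
    root = Path(benchmark_dir)
    missing = [name for name in REQUIRED if not (root / name).is_file()]
    # Either data format satisfies the task-data requirement.
    if not any((root / name).is_file() for name in ("data.json", "data.jsonl")):
        missing.append("data.json or data.jsonl")
    return missing
```

An empty list means every expected file was found, for example `missing_files("pwp_bench/MyBenchmark")` for a hypothetical benchmark named MyBenchmark.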
The task data should follow this general format (adjust as needed for your specific benchmark):
{
"task_id": "unique_task_id",
"prompt": "Task description the agent sees",
"reference_solution": "Reference solution for evaluation",
"setup_script": "setup_function_name", // Optional
"eval_script": "eval_function_name", // Optional
"eval_arguments": "[arg1, arg2, ...]", // Optional
"additional_fields": "task-specific data"
}
The setup.py file should contain a setup(env, task) function that prepares the environment for a specific task.
The eval.py file should contain an eval(env, task) function that evaluates the agent's solution.
For more details on how to add your own benchmark, please refer to our Contributing Guidelines.
The following benchmarks are currently available:
- Design2Code
- InterCode (bash, SQL, CTF)
- RES-Q
- ChartMimic
- HumanEval
- CanitEdit
- VSCode
- Bird
- DSBench
- MiniCtx
- SWE-bench
- SWE-bench-java
- SWTbench
- SWE-bench-mm
- GeneralSWE
Please hang tight while we complete the upload of the pending tasks!
We welcome contributions of new benchmarks or improvements to existing ones! Please follow these steps:
- Fork the repository
- Create your benchmark following the structure above
- Submit a pull request with a clear description of your benchmark
For detailed guidance, please refer to our Contributing Guidelines.
If you use PwP-Bench in your research, please cite our paper:
@article{pwp2025,
title={Programming with Pixels: Computer-Use Meets Software Engineering},
author={Aggarwal, Pranjal and Welleck, Sean},
journal={Preprint. Under Review},
year={2025}
}