**Bench2Drive-*Robust***

Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations

Bench2Drive-Robust is a closed-loop robustness benchmark for end-to-end autonomous driving, evaluating deployment-oriented perturbations including camera-stream failures, ego-state estimation errors, and compute-control latency.

✨ Overview

End-to-end autonomous driving (E2E-AD) systems have achieved strong performance under clean closed-loop evaluation settings, but their robustness under deployment-side perturbations remains insufficiently understood.

Bench2Drive-Robust fills this gap by introducing a device-centric robustness benchmark for closed-loop E2E-AD. Instead of only testing external appearance changes such as weather or image corruptions, we evaluate system-level imperfections that can arise from onboard sensing, localization, and computation pipelines.

Bench2Drive-Robust introduces:

📷 Camera-stream failures — cached burst frame drop and partial observation
📍 Ego-state estimation errors — GPS localization noise and multiplicative speed noise
⏱️ Compute-control latency — delayed control execution through FIFO action buffering
🔁 Closed-loop robustness evaluation — perturbations affect future states and observations
📊 Cross-model robustness analysis — robustness profiles across representative E2E-AD agents

🔄 Compatibility with Bench2Drive

Bench2Drive-Robust maintains 100% compatibility with Bench2Drive:

Environment Setup — Identical to Bench2Drive (CARLA 0.9.15, Python dependencies)
Model Integration — Same symlinks and directory structure for Baseline models
Agent Code — Your existing agent.py works without modification
Evaluation Routes — Same route files and scenarios

To enable perturbations, simply add environment variables to your evaluation script:

export ROBUSTNESS_ENABLE=1
export FRAME_DROP_ENABLE=1
export BURST_PROBABILITY=0.01
# ... run your existing Bench2Drive evaluation

See How to Adapt from Bench2Drive for the 4 files you need to replace.

📊 Benchmark

Evaluated Models

Bench2Drive-Robust is designed as a model-agnostic extension of Bench2Drive and can be applied to any Bench2Drive-compatible closed-loop E2E-AD agent without modifying the model architecture or checkpoint.

In our benchmark experiments, we evaluate four representative E2E-AD models:

SimLingo-base — vision-only closed-loop autonomous driving with language-action alignment
TCP-traj — trajectory-guided control prediction baseline
UniAD — unified perception-prediction-planning autonomous driving model
VAD — vectorized scene representation for end-to-end autonomous driving

All evaluated models are tested as fixed policies without robustness finetuning.

Perturbation Types

| Perturbation | Settings | Description |
| --- | --- | --- |
| GPS Localization Noise | 5 m, 15 m | Gaussian noise on GPS localization input |
| Inference Latency | 100 ms, 200 ms, 500 ms | Delayed control execution through FIFO action buffering |
| Burst Frame Drop | 20 ticks, 60 ticks | Cached frozen-frame camera-stream failure |
| Partial Observation | 50%, 80% | Random rectangular camera occlusion |
| Speed Noise | η ~ N(0.5, 0.2²), η ~ N(0.2, 0.2²) | Multiplicative ego-speed noise |
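
As a rough illustration of the two ego-state perturbations in the table, the sketch below applies Gaussian position noise and multiplicative speed noise using only the Python standard library. The function names, the use of local metric x/y coordinates, and the reading of 5 m / 15 m as the noise standard deviation are assumptions made for illustration; this is not the benchmark's actual implementation.

```python
import random


def perturb_position(xy_m, sigma_m, rng):
    """Add zero-mean Gaussian noise (std sigma_m, in metres) to an (x, y) position.

    Assumption: the position is already expressed in local metric coordinates.
    """
    x, y = xy_m
    return (x + rng.gauss(0.0, sigma_m), y + rng.gauss(0.0, sigma_m))


def perturb_speed(speed_mps, mean, std, rng):
    """Scale the measured speed by a multiplier eta ~ N(mean, std^2)."""
    eta = rng.gauss(mean, std)
    return speed_mps * eta


rng = random.Random(2026)  # seeded so that a run is reproducible
print(perturb_position((10.0, -3.0), sigma_m=5.0, rng=rng))
print(perturb_speed(8.0, mean=0.5, std=0.2, rng=rng))
```

With a multiplier mean below 1, the agent observes a speed that is lower than the true speed on average, which corresponds to the speed-underestimation case discussed in the results.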

Main Results

We evaluate the four models listed above (TCP-traj, UniAD, VAD, and SimLingo-base) under camera-stream, ego-state (GPS and speed), and latency perturbations.

Our results show that deployment-side perturbations induce heterogeneous closed-loop degradation patterns. Models with strong clean-driving performance can still degrade substantially under specific system-level perturbations such as severe occlusion, GPS localization noise, speed underestimation, or inference latency.

Relative Degradation (RD)

To quantify robustness, we use Relative Degradation (RD):

RD = (baseline_score - perturbed_score) / baseline_score × 100%

Positive RD: performance degrades under the perturbation
Lower RD: better robustness (less degradation)
RD = 0%: no degradation (perfect robustness)

Compute RD from evaluation results:

# After evaluation, you'll have JSON result files
python tools/compute_rd.py results/baseline.json results/gps_drift_severe.json

# Example output:
# Baseline score:     75.20%
# Perturbed score:    58.30%
# RD:                 22.47%
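
The RD formula can also be checked by hand. The snippet below is a minimal sketch of the arithmetic only; the function name `relative_degradation` is hypothetical and this is not the content of `tools/compute_rd.py`.

```python
def relative_degradation(baseline_score, perturbed_score):
    """Return RD in percent: (baseline - perturbed) / baseline * 100."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (baseline_score - perturbed_score) / baseline_score * 100.0


# Scores taken from the example output above.
rd = relative_degradation(75.20, 58.30)
print(f"RD: {rd:.2f}%")  # RD: 22.47%
```

A negative value would indicate that the perturbed run scored higher than the baseline.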

📦 Setup

1. Environment Setup

This release requires CARLA 0.9.15 and a Python environment with the required Bench2Drive dependencies.

Please refer to the original Bench2Drive repository for basic environment setup, including CARLA installation, Python dependencies, route files, and evaluation tools.

For agent-specific setup:

SimLingo: Please refer to SimLingo's official documentation. Clone the SimLingo repository and place the required team_code files in this directory.
TCP-traj, UniAD, VAD: Please refer to the Bench2Drive repository and each agent's original setup instructions.

2. Directory Preparation

Your directory structure should look like this:

Bench2Drive-Robust/
├── assets/
├── docs/
├── leaderboard/
│   └── team_code/
│       --> Please add your agent HERE
├── scenario_runner/
└── tools/

3. How to Adapt from Bench2Drive

To adapt an existing Bench2Drive installation to Bench2Drive-Robust, replace three files and add one new directory (4 entries in total):

| File | Location | Purpose |
| --- | --- | --- |
| `sensor_interface.py` | `leaderboard/leaderboard/envs/` | Redirects to new robust sensor interface |
| `sensor_interface_with_perturbations/` | `leaderboard/leaderboard/envs/` | New modular robustness implementation |
| `agent_wrapper.py` | `leaderboard/leaderboard/autoagents/` | Adds inference latency support |
| `autonomous_agent.py` | `leaderboard/leaderboard/autoagents/` | Updated with debug logging |

Steps:

1. Back up your original files.
2. Copy the 4 entries listed above from Bench2Drive-Robust into your Bench2Drive installation.
3. Keep all other files unchanged (routes, scenarios, etc.).
4. To run clean (no perturbations), set `export ROBUSTNESS_ENABLE=0`.
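
The backup-and-copy steps could be scripted roughly as follows. This is a sketch under assumptions: the function name `install_robust_files`, the `.orig` backup suffix, and the two root paths in the usage comment are hypothetical, and the relative paths are taken from the table above.

```python
import shutil
from pathlib import Path

# Relative paths of the 4 entries, as listed in the table above.
ITEMS = [
    "leaderboard/leaderboard/envs/sensor_interface.py",
    "leaderboard/leaderboard/envs/sensor_interface_with_perturbations",
    "leaderboard/leaderboard/autoagents/agent_wrapper.py",
    "leaderboard/leaderboard/autoagents/autonomous_agent.py",
]


def install_robust_files(robust_root, b2d_root, backup_suffix=".orig"):
    """Back up existing entries in b2d_root, then copy them from robust_root."""
    robust_root, b2d_root = Path(robust_root), Path(b2d_root)
    for rel in ITEMS:
        src, dst = robust_root / rel, b2d_root / rel
        if dst.exists():
            backup = dst.with_name(dst.name + backup_suffix)
            if not backup.exists():
                # First run: keep the original Bench2Drive version.
                shutil.move(str(dst), str(backup))
            elif dst.is_dir():
                # A backup already exists: discard the current copy.
                shutil.rmtree(dst)
            else:
                dst.unlink()
        dst.parent.mkdir(parents=True, exist_ok=True)
        if src.is_dir():
            shutil.copytree(src, dst)
        else:
            shutil.copy2(src, dst)


# Usage (hypothetical paths):
# install_robust_files("/path/to/Bench2Drive-Robust", "/path/to/Bench2Drive")
```

Keeping only the first backup means that re-running the script does not overwrite the original Bench2Drive files with an already patched copy.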

🚀 Evaluation

We provide sample evaluation scripts in leaderboard/scripts/. Modify the configuration parameters in the script according to your local setup, then run the script directly with bash.

Sample Scripts

run_evaluation_debug.sh (for debugging the correctness of the team agent)
run_evaluation.sh (basic evaluation)
run_evaluation_multi_simlingo_robust.sh (with perturbations)
run_evaluation_multi_tcp_robust.sh
run_evaluation_multi_uniad_robust.sh
run_evaluation_multi_vad_robust.sh

Running Evaluation

Modify the script to set your paths:
- WORK_DIR — your working directory
- TEAM_AGENT — path to your agent file
- TEAM_CONFIG — path to your model checkpoint
- BASE_ROUTES — path to evaluation routes
- GPU_RANK_LIST — GPUs to use for evaluation
- BASE_CHECKPOINT_ENDPOINT — path to the evaluation checkpoint JSON
- SAVE_PATH — directory for saving outputs and metric files (must end with /)

Run the script:

bash leaderboard/scripts/run_evaluation_multi_simlingo_robust.sh

You can modify the provided scripts to evaluate your own Bench2Drive-compatible agent.

Enable Perturbations

Set the corresponding environment variables in the evaluation script.

# Global switch
export ROBUSTNESS_ENABLE=1
export ROBUSTNESS_SEED=2026

# GPS localization noise
export GPS_DRIFT_ENABLE=1
export GPS_DRIFT_MODE=severe  # medium (5m) or severe (15m)
export GPS_DRIFT_SEED=42

# Inference latency
export INFERENCE_LATENCY_ENABLE=1
export INFERENCE_LATENCY_MS=100  # 100, 200, or 500 ms

# Burst frame drop
export FRAME_DROP_ENABLE=1
export BURST_PROBABILITY=0.01
export BURST_MAX_TICKS=20  # 20 or 60

# Partial observation
export PARTIAL_OBS_ENABLE=1
export PARTIAL_OBS_TYPE=occlusion
export PARTIAL_OBS_RATIO=0.5  # 0.5 or 0.8

# Speed noise
export SPEED_BIAS_ENABLE=1
export SPEED_BIAS_MEAN=0.5  # 0.5 or 0.2
export SPEED_BIAS_STD=0.2
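
When sweeping over several settings, the same variables can be assembled in Python and passed to the evaluation script as its environment. The sketch below is illustrative: the preset names and the `build_env` helper are hypothetical, and it assumes the evaluation script does not itself re-export these variables (values exported inside the script would take precedence).

```python
import os

# Hypothetical preset names; the variable names and values come from the list above.
PRESETS = {
    "gps_severe": {"GPS_DRIFT_ENABLE": "1", "GPS_DRIFT_MODE": "severe"},
    "latency_200ms": {"INFERENCE_LATENCY_ENABLE": "1", "INFERENCE_LATENCY_MS": "200"},
    "burst_60": {
        "FRAME_DROP_ENABLE": "1",
        "BURST_PROBABILITY": "0.01",
        "BURST_MAX_TICKS": "60",
    },
}


def build_env(preset, seed=2026, base_env=None):
    """Return a copy of the environment with one perturbation preset enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["ROBUSTNESS_ENABLE"] = "1"
    env["ROBUSTNESS_SEED"] = str(seed)
    env.update(PRESETS[preset])
    return env


# Usage (not executed here):
# import subprocess
# subprocess.run(
#     ["bash", "leaderboard/scripts/run_evaluation_multi_simlingo_robust.sh"],
#     env=build_env("gps_severe"),
#     check=True,
# )
```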

Main Environment Variables

| Variable | Description | Values |
| --- | --- | --- |
| `ROBUSTNESS_ENABLE` | Enable robustness perturbations | `0`, `1` |
| `ROBUSTNESS_SEED` | Random seed for stochastic perturbations | integer |
| `GPS_DRIFT_ENABLE` | Enable GPS localization noise | `0`, `1` |
| `GPS_DRIFT_MODE` | GPS drift severity mode | `medium` (5m), `severe` (15m) |
| `GPS_DRIFT_SEED` | Random seed for GPS drift | integer |
| `SPEED_BIAS_ENABLE` | Enable multiplicative speed noise | `0`, `1` |
| `SPEED_BIAS_MEAN` | Mean of speed multiplier | `0.5`, `0.2` |
| `SPEED_BIAS_STD` | Standard deviation of speed multiplier | `0.2` |
| `FRAME_DROP_ENABLE` | Enable burst frame drop | `0`, `1` |
| `BURST_PROBABILITY` | Burst trigger probability | e.g., `0.01` |
| `BURST_MAX_TICKS` | Maximum burst duration | `20`, `60` |
| `PARTIAL_OBS_ENABLE` | Enable partial observation perturbation | `0`, `1` |
| `PARTIAL_OBS_TYPE` | Partial observation type | `blur`, `occlusion` |
| `PARTIAL_OBS_RATIO` | Occlusion mask ratio | `0.5`, `0.8` |
| `INFERENCE_LATENCY_ENABLE` | Enable action-side latency | `0`, `1` |
| `INFERENCE_LATENCY_MS` | Fixed inference latency (used in `fixed` mode) | `100`, `200`, `500` |
| `INFERENCE_LATENCY_MODE` | Latency mode: `fixed` uses constant delay, `measured` uses actual inference time | `fixed`, `measured` |
| `WARMUP_STEPS` | Warmup steps before latency injection | e.g., `20` |
| `SIM_RATE` | Simulation/control rate used for latency conversion | e.g., `20` |
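
To make the role of `SIM_RATE` concrete, the sketch below reads the latency-related variables and converts a latency in milliseconds into a number of control ticks. The `LatencyConfig` class, the default values, the rounding rule, and the assumption that `ROBUSTNESS_ENABLE` gates the individual switch are all illustrative; the benchmark's own configuration code may differ.

```python
import os
from dataclasses import dataclass


def _flag(env, name):
    """Interpret an environment variable as a 0/1 switch (unset means 0)."""
    return env.get(name, "0") == "1"


@dataclass(frozen=True)
class LatencyConfig:
    enabled: bool
    latency_ms: float
    sim_rate_hz: float

    @classmethod
    def from_env(cls, env=None):
        env = os.environ if env is None else env
        # Assumption: the global switch must be on for any perturbation to apply.
        enabled = _flag(env, "ROBUSTNESS_ENABLE") and _flag(env, "INFERENCE_LATENCY_ENABLE")
        return cls(
            enabled=enabled,
            latency_ms=float(env.get("INFERENCE_LATENCY_MS", "0")),
            sim_rate_hz=float(env.get("SIM_RATE", "20")),
        )

    def delay_ticks(self):
        """Latency expressed in control ticks (assumed: rounded to the nearest tick)."""
        if not self.enabled:
            return 0
        return round(self.latency_ms / 1000.0 * self.sim_rate_hz)


cfg = LatencyConfig.from_env({
    "ROBUSTNESS_ENABLE": "1",
    "INFERENCE_LATENCY_ENABLE": "1",
    "INFERENCE_LATENCY_MS": "100",
    "SIM_RATE": "20",
})
print(cfg.delay_ticks())  # 2
```

At a 20 Hz control rate, one tick lasts 50 ms, so under this conversion the 100, 200, and 500 ms settings correspond to 2, 4, and 10 ticks.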

🔬 Key Implementation

Sensor Interface: `sensor_interface_with_perturbations/`

We introduce a modular sensor-interface implementation for observation-side perturbation injection.

Main components:

DelayBuffer: frame-based buffering with age tracking
RobustnessProcessor: perturbation injection for camera-stream and ego-state inputs
SensorConfig: centralized configuration through environment variables
Compatibility layer: drop-in replacement for the original Bench2Drive sensor interface

Supported observation-side perturbations include cached burst frame drop, partial observation, GPS localization noise, and multiplicative speed noise.
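
The two camera-stream perturbations can be pictured with the following sketch. It is not the `RobustnessProcessor` code: the class and function names are hypothetical, the burst length is assumed to be drawn uniformly from 1 to the maximum, and the occlusion rectangle is assumed to keep the image's aspect ratio.

```python
import math
import random


class BurstFrameDrop:
    """Replays a cached frame for the duration of a randomly triggered burst."""

    def __init__(self, probability, max_ticks, seed=0):
        self.probability = probability
        self.max_ticks = max_ticks
        self.rng = random.Random(seed)
        self.remaining = 0   # ticks left in the current burst
        self.cached = None   # last frame delivered before the burst

    def step(self, frame):
        """Return the frame the agent should see on this tick."""
        if self.remaining == 0 and self.cached is not None:
            if self.rng.random() < self.probability:
                # Assumption: burst length is uniform in [1, max_ticks].
                self.remaining = self.rng.randint(1, self.max_ticks)
        if self.remaining > 0:
            self.remaining -= 1
            return self.cached  # frozen frame: the agent sees stale data
        self.cached = frame
        return frame


def sample_occlusion_box(height, width, ratio, rng):
    """Sample (top, left, box_h, box_w) covering roughly `ratio` of the image area."""
    scale = math.sqrt(ratio)  # same scale on both axes keeps the aspect ratio
    box_h = max(1, round(height * scale))
    box_w = max(1, round(width * scale))
    top = rng.randint(0, height - box_h)
    left = rng.randint(0, width - box_w)
    return top, left, box_h, box_w


# With a NumPy image, the box could be applied as:
# image[top:top + box_h, left:left + box_w] = 0
```

Because evaluation is closed-loop, a stale or occluded frame changes the action taken on that tick, which in turn changes every later observation.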

Agent Wrapper: `agent_wrapper.py`

We modify the agent wrapper to support action-side inference-latency injection.

Main features:

FIFO action buffering for fixed-latency evaluation
Delayed application of control commands
Measured-latency mode that uses the actual inference time (`INFERENCE_LATENCY_MODE=measured`)
Policy-agnostic perturbation without modifying agent checkpoints or model architectures
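
A fixed delay through FIFO action buffering can be sketched in a few lines. The class name `FifoActionDelay` and the choice of a default action for the first ticks are assumptions for illustration, not the code in `agent_wrapper.py`.

```python
from collections import deque


class FifoActionDelay:
    """Applies each action `delay_ticks` control steps after it was computed."""

    def __init__(self, delay_ticks, default_action):
        # Pre-fill the queue so that the first `delay_ticks` steps
        # apply the default action (assumed, e.g. a neutral control).
        self.buffer = deque([default_action] * delay_ticks)

    def step(self, new_action):
        """Queue the freshly computed action and return the one to apply now."""
        self.buffer.append(new_action)
        return self.buffer.popleft()


delay = FifoActionDelay(delay_ticks=2, default_action="idle")
applied = [delay.step(a) for a in ["a0", "a1", "a2", "a3"]]
print(applied)  # ['idle', 'idle', 'a0', 'a1']
```

The agent still computes an action from the current observation on every tick; only its execution is postponed, so the model and its checkpoint stay untouched.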

SimLingo Integration

The provided team_code/agent_simlingo.py includes minor adaptations for our evaluation setup.

These changes support Bench2Drive-style route parsing, simplified metadata saving, metric-info error handling, and compatibility with our server-side evaluation scripts. They do not modify the model inference logic, sensor processing logic, control generation, or planning/navigation behavior.

📝 License

All assets and code are under the CC-BY-NC-ND license unless specified otherwise.

This work extends the original Bench2Drive benchmark. Please also refer to the original Bench2Drive repository for its licensing terms and usage restrictions.

📜 Citation

If you find this work useful for your research, please cite:

@article{zhang2026b2drobust,
  title={Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations},
  author={Zhang, Zhiyuan and Jin, Zhenghao and Peng, Yanlun and Guo, Xianda and Liu, Haoran and Zhang, Shaofeng and Ma, Xingjun and Wu, Zuxuan and Yan, Junchi and Jia, Xiaosong and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2605.18059},
  year={2026}
}

Please also consider citing the original Bench2Drive paper:

@inproceedings{jia2024bench2drive,
  title={Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving},
  author={Jia, Xiaosong and Yang, Zhenjie and Li, Qifeng and Zhang, Zhiyuan and Yan, Junchi},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}
