Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations
Bench2Drive-Robust is a closed-loop robustness benchmark for end-to-end autonomous driving, evaluating deployment-oriented perturbations including camera-stream failures, ego-state estimation errors, and compute-control latency.
End-to-end autonomous driving (E2E-AD) systems have achieved strong performance under clean closed-loop evaluation settings, but their robustness under deployment-side perturbations remains insufficiently understood.
Bench2Drive-Robust fills this gap by introducing a device-centric robustness benchmark for closed-loop E2E-AD. Instead of only testing external appearance changes such as weather or image corruptions, we evaluate system-level imperfections that can arise from onboard sensing, localization, and computation pipelines.
Bench2Drive-Robust introduces:
- 📷 Camera-stream failures — cached burst frame drop and partial observation
- 📍 Ego-state estimation errors — GPS localization noise and multiplicative speed noise
- ⏱️ Compute-control latency — delayed control execution through FIFO action buffering
- 🔁 Closed-loop robustness evaluation — perturbations affect future states and observations
- 📊 Cross-model robustness analysis — robustness profiles across representative E2E-AD agents
Bench2Drive-Robust maintains 100% compatibility with Bench2Drive:
- Environment Setup — Identical to Bench2Drive (CARLA 0.9.15, Python dependencies)
- Model Integration — Same symlinks and directory structure for Baseline models
- Agent Code — Your existing `agent.py` works without modification
- Evaluation Routes — Same route files and scenarios
To enable perturbations, simply add environment variables to your evaluation script:
export ROBUSTNESS_ENABLE=1
export FRAME_DROP_ENABLE=1
export BURST_PROBABILITY=0.01
# ... run your existing Bench2Drive evaluation

See How to Adapt from Bench2Drive for the 4 files you need to replace.
Bench2Drive-Robust is designed as a model-agnostic extension of Bench2Drive and can be applied to any Bench2Drive-compatible closed-loop E2E-AD agent without modifying the model architecture or checkpoint.
In our benchmark experiments, we evaluate four representative E2E-AD models:
- SimLingo-base — vision-only closed-loop autonomous driving with language-action alignment
- TCP-traj — trajectory-guided control prediction baseline
- UniAD — unified perception-prediction-planning autonomous driving model
- VAD — vectorized scene representation for end-to-end autonomous driving
All evaluated models are tested as fixed policies without robustness finetuning.
| Perturbation | Settings | Description |
|---|---|---|
| GPS Localization Noise | 5 m, 15 m | Gaussian noise on GPS localization input |
| Inference Latency | 100 ms, 200 ms, 500 ms | Delayed control execution through FIFO action buffering |
| Burst Frame Drop | 20 ticks, 60 ticks | Cached frozen-frame camera-stream failure |
| Partial Observation | 50%, 80% | Random rectangular camera occlusion |
| Speed Noise | η ~ N(0.5, 0.2²), η ~ N(0.2, 0.2²) | Multiplicative ego-speed noise |
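As a rough illustration of the two ego-state rows in the table above, the sketch below adds zero-mean Gaussian noise to a 2D position and scales the speed by a multiplier η ~ N(mean, std²). The function names are hypothetical, and treating the 5 m / 15 m settings as a per-axis standard deviation is an assumption; the benchmark code defines the actual noise model.

```python
import random

def perturb_gps(x, y, sigma_m, rng):
    """Add zero-mean Gaussian noise (std sigma_m, assumed per axis) to a 2D position."""
    return x + rng.gauss(0.0, sigma_m), y + rng.gauss(0.0, sigma_m)

def perturb_speed(speed, mean, std, rng):
    """Scale the true speed by a multiplier eta ~ N(mean, std^2)."""
    eta = rng.gauss(mean, std)
    return eta * speed

rng = random.Random(2026)  # seeded so repeated runs draw the same noise
noisy_xy = perturb_gps(10.0, -3.0, sigma_m=5.0, rng=rng)
noisy_speed = perturb_speed(8.0, mean=0.5, std=0.2, rng=rng)
print(noisy_xy, noisy_speed)
```

With a multiplier mean of 0.5, the speed reported to the agent is on average half the true speed, which corresponds to systematic speed underestimation.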
We evaluate these four models (TCP-traj, UniAD, VAD, and SimLingo-base) under camera-stream, ego-state (GPS and speed), and latency perturbations.
Our results show that deployment-side perturbations induce heterogeneous closed-loop degradation patterns. Models with strong clean-driving performance can still degrade substantially under specific system-level perturbations such as severe occlusion, GPS localization noise, speed underestimation, or inference latency.
To quantify robustness, we use Relative Degradation (RD):
RD = (baseline_score - perturbed_score) / baseline_score × 100%
- Positive RD = performance degradation under perturbations
- Lower RD = better robustness (less degradation)
- RD = 0% = no degradation (perfect robustness)
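The RD formula can be checked by hand with a few lines of Python. This is a minimal sketch of the formula only, not the repository's `tools/compute_rd.py`; the function name `relative_degradation` is hypothetical.

```python
def relative_degradation(baseline_score, perturbed_score):
    """Return RD in percent: (baseline - perturbed) / baseline * 100."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (baseline_score - perturbed_score) / baseline_score * 100.0

rd = relative_degradation(75.20, 58.30)
print(f"RD: {rd:.2f}%")  # RD: 22.47%
```

A negative value would mean the perturbed run scored higher than the baseline.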
Compute RD from evaluation results:
# After evaluation, you'll have JSON result files
python tools/compute_rd.py results/baseline.json results/gps_drift_severe.json
# Example output:
# Baseline score: 75.20%
# Perturbed score: 58.30%
# RD: 22.47%

This release requires CARLA 0.9.15 and a Python environment with the required Bench2Drive dependencies.
Please refer to the original Bench2Drive repository for basic environment setup, including CARLA installation, Python dependencies, route files, and evaluation tools.
For agent-specific setup:
- SimLingo: Please refer to SimLingo's official documentation. Clone the SimLingo repository and place the required `team_code` files in this directory.
- TCP-traj, UniAD, VAD: Please refer to the Bench2Drive repository and each agent's original setup instructions.
Your directory structure should look like this:
Bench2Drive-Robust/
├── assets/
├── docs/
├── leaderboard/
│   └── team_code/
│       └── --> Please add your agent HERE
├── scenario_runner/
└── tools/
To adapt your existing Bench2Drive installation to Bench2Drive-Robust, you need to replace these 4 files:
| File | Location | Purpose |
|---|---|---|
| `sensor_interface.py` | `leaderboard/leaderboard/envs/` | Redirects to new robust sensor interface |
| `sensor_interface_with_perturbations/` | `leaderboard/leaderboard/envs/` | New modular robustness implementation |
| `agent_wrapper.py` | `leaderboard/leaderboard/autoagents/` | Adds inference latency support |
| `autonomous_agent.py` | `leaderboard/leaderboard/autoagents/` | Updated with debug logging |
Steps:
- Back up your original files
- Copy the 4 files from Bench2Drive-Robust to your Bench2Drive installation
- Keep all other files unchanged (routes, scenarios, etc.)
- Set `export ROBUSTNESS_ENABLE=0` to run clean (no perturbations)
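The backup and copy steps above could be scripted roughly as follows. This is a sketch under assumptions: the function names and the three root directories passed in are hypothetical, and the relative paths are taken from the table of replaced files.

```python
import shutil
from pathlib import Path

# The four paths from the table, relative to the repository root.
REPLACED_PATHS = [
    "leaderboard/leaderboard/envs/sensor_interface.py",
    "leaderboard/leaderboard/envs/sensor_interface_with_perturbations",
    "leaderboard/leaderboard/autoagents/agent_wrapper.py",
    "leaderboard/leaderboard/autoagents/autonomous_agent.py",
]

def _copy(src, dst):
    """Copy a single file or a whole directory tree from src to dst."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    if src.is_dir():
        shutil.copytree(src, dst, dirs_exist_ok=True)
    else:
        shutil.copy2(src, dst)

def adapt_installation(robust_root, b2d_root, backup_root):
    """Back up the originals, then copy the Bench2Drive-Robust versions over them."""
    robust_root, b2d_root, backup_root = map(Path, (robust_root, b2d_root, backup_root))
    for rel in REPLACED_PATHS:
        target = b2d_root / rel
        if target.exists():  # the perturbation directory is new, so it may be absent
            _copy(target, backup_root / rel)
        _copy(robust_root / rel, target)

# Example call (paths are placeholders for your local checkouts):
# adapt_installation("Bench2Drive-Robust", "Bench2Drive", "Bench2Drive_backup")
```

Restoring the clean setup then amounts to copying the backup directory back over the installation.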
We provide sample evaluation scripts in leaderboard/scripts/. Modify the configuration parameters in the script according to your local setup, then run the script directly with bash.
- `run_evaluation_debug.sh` (for debugging the correctness of the team agent)
- `run_evaluation.sh` (basic evaluation)
- `run_evaluation_multi_simlingo_robust.sh` (w/ perturbations)
- `run_evaluation_multi_tcp_robust.sh`
- `run_evaluation_multi_uniad_robust.sh`
- `run_evaluation_multi_vad_robust.sh`
- Modify the script to set your paths:
  - `WORK_DIR`: your working directory
  - `TEAM_AGENT`: path to your agent file
  - `TEAM_CONFIG`: path to your model checkpoint
  - `BASE_ROUTES`: path to evaluation routes
  - `GPU_RANK_LIST`: GPUs to use for evaluation
  - `BASE_CHECKPOINT_ENDPOINT`: path to the evaluation checkpoint JSON
  - `SAVE_PATH`: directory for saving outputs and metric files (must end with `/`)
- Run the script: `bash leaderboard/scripts/run_evaluation_multi_simlingo_robust.sh`

You can modify the provided scripts to evaluate your own Bench2Drive-compatible agent.
Set the corresponding environment variables in the evaluation script.
# Global switch
export ROBUSTNESS_ENABLE=1
export ROBUSTNESS_SEED=2026
# GPS localization noise
export GPS_DRIFT_ENABLE=1
export GPS_DRIFT_MODE=severe # medium (5m) or severe (15m)
export GPS_DRIFT_SEED=42
# Inference latency
export INFERENCE_LATENCY_ENABLE=1
export INFERENCE_LATENCY_MS=100 # 100, 200, or 500 ms
# Burst frame drop
export FRAME_DROP_ENABLE=1
export BURST_PROBABILITY=0.01
export BURST_MAX_TICKS=20 # 20 or 60
# Partial observation
export PARTIAL_OBS_ENABLE=1
export PARTIAL_OBS_TYPE=occlusion
export PARTIAL_OBS_RATIO=0.5 # 0.5 or 0.8
# Speed noise
export SPEED_BIAS_ENABLE=1
export SPEED_BIAS_MEAN=0.5 # 0.5 or 0.2
export SPEED_BIAS_STD=0.2

| Variable | Description | Values |
|---|---|---|
| `ROBUSTNESS_ENABLE` | Enable robustness perturbations | 0, 1 |
| `ROBUSTNESS_SEED` | Random seed for stochastic perturbations | integer |
| `GPS_DRIFT_ENABLE` | Enable GPS localization noise | 0, 1 |
| `GPS_DRIFT_MODE` | GPS drift severity mode | medium (5m), severe (15m) |
| `GPS_DRIFT_SEED` | Random seed for GPS drift | integer |
| `SPEED_BIAS_ENABLE` | Enable multiplicative speed noise | 0, 1 |
| `SPEED_BIAS_MEAN` | Mean of speed multiplier | 0.5, 0.2 |
| `SPEED_BIAS_STD` | Standard deviation of speed multiplier | 0.2 |
| `FRAME_DROP_ENABLE` | Enable burst frame drop | 0, 1 |
| `BURST_PROBABILITY` | Burst trigger probability | e.g., 0.01 |
| `BURST_MAX_TICKS` | Maximum burst duration | 20, 60 |
| `PARTIAL_OBS_ENABLE` | Enable partial observation perturbation | 0, 1 |
| `PARTIAL_OBS_TYPE` | Partial observation type | blur, occlusion |
| `PARTIAL_OBS_RATIO` | Occlusion mask ratio | 0.5, 0.8 |
| `INFERENCE_LATENCY_ENABLE` | Enable action-side latency | 0, 1 |
| `INFERENCE_LATENCY_MS` | Fixed inference latency (used in fixed mode) | 100, 200, 500 |
| `INFERENCE_LATENCY_MODE` | Latency mode: `fixed` uses constant delay, `measured` uses actual inference time | fixed, measured |
| `WARMUP_STEPS` | Warmup steps before latency injection | e.g., 20 |
| `SIM_RATE` | Simulation/control rate used for latency conversion | e.g., 20 |
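To show how such variables might be consumed, here is a minimal sketch that parses a few of them into a typed config object. The class `PerturbationConfig` is hypothetical and is not the repository's `SensorConfig`; the fallback defaults and the rule that `ROBUSTNESS_ENABLE` gates every individual switch are assumptions.

```python
import os
from dataclasses import dataclass

def _flag(env, name):
    """Treat "1" as enabled; anything else, or unset, as disabled."""
    return env.get(name, "0") == "1"

@dataclass(frozen=True)
class PerturbationConfig:
    frame_drop: bool
    burst_probability: float
    burst_max_ticks: int
    latency: bool
    latency_ms: int

    @classmethod
    def from_env(cls, env=None):
        env = os.environ if env is None else env
        # Assumption: the global switch gates every individual perturbation.
        enabled = _flag(env, "ROBUSTNESS_ENABLE")
        return cls(
            frame_drop=enabled and _flag(env, "FRAME_DROP_ENABLE"),
            burst_probability=float(env.get("BURST_PROBABILITY", "0.01")),
            burst_max_ticks=int(env.get("BURST_MAX_TICKS", "20")),
            latency=enabled and _flag(env, "INFERENCE_LATENCY_ENABLE"),
            latency_ms=int(env.get("INFERENCE_LATENCY_MS", "100")),
        )

# Pass a plain dict instead of os.environ to try a configuration in isolation.
cfg = PerturbationConfig.from_env({
    "ROBUSTNESS_ENABLE": "1",
    "FRAME_DROP_ENABLE": "1",
    "BURST_MAX_TICKS": "60",
})
print(cfg)
```

Reading everything in one place keeps the perturbation code free of scattered environment lookups and makes a run's settings easy to log.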
We introduce a modular sensor-interface implementation for observation-side perturbation injection.
Main components:
- DelayBuffer: frame-based buffering with age tracking
- RobustnessProcessor: perturbation injection for camera-stream and ego-state inputs
- SensorConfig: centralized configuration through environment variables
- Compatibility layer: drop-in replacement for the original Bench2Drive sensor interface
Supported observation-side perturbations include cached burst frame drop, partial observation, GPS localization noise, and multiplicative speed noise.
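The two camera-stream perturbations can be sketched as follows. This is an illustration under assumptions, not the code in `sensor_interface_with_perturbations/`: the names `BurstFrameDrop` and `occlusion_box` are hypothetical, the burst length is assumed to be drawn uniformly from 1 to the maximum, and the occlusion rectangle is assumed to keep the image aspect ratio.

```python
import random

class BurstFrameDrop:
    """Replay the last good frame for a burst of ticks (cached frozen-frame failure)."""

    def __init__(self, burst_probability, burst_max_ticks, seed=None):
        self.p = burst_probability
        self.max_ticks = burst_max_ticks
        self.rng = random.Random(seed)
        self.cached = None   # last frame delivered before the burst
        self.remaining = 0   # frozen ticks left in the current burst

    def step(self, frame):
        """Return the frame the agent sees at this tick."""
        if self.remaining > 0:
            self.remaining -= 1
            return self.cached
        if self.cached is not None and self.rng.random() < self.p:
            # Assumption: burst length is uniform in 1..max_ticks;
            # this tick counts as the first frozen tick.
            self.remaining = self.rng.randint(1, self.max_ticks) - 1
            return self.cached
        self.cached = frame
        return frame

def occlusion_box(width, height, ratio, rng):
    """Pick a random rectangle covering roughly `ratio` of the image area."""
    # Each side scales by sqrt(ratio) so the area scales by ratio.
    box_w = max(1, round(width * ratio ** 0.5))
    box_h = max(1, round(height * ratio ** 0.5))
    x0 = rng.randint(0, width - box_w)
    y0 = rng.randint(0, height - box_h)
    return x0, y0, x0 + box_w, y0 + box_h

dropper = BurstFrameDrop(burst_probability=0.01, burst_max_ticks=20, seed=2026)
seen = [dropper.step(t) for t in range(200)]  # integers stand in for frames
print(occlusion_box(1600, 900, 0.5, random.Random(2026)))
```

Because the evaluation is closed-loop, a frozen frame changes the agent's actions, which in turn changes every later observation.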
We modify the agent wrapper to support action-side inference-latency injection.
Main features:
- FIFO action buffering for fixed-latency evaluation
- delayed application of control commands
- dynamic real-time latency support
- policy-agnostic perturbation without modifying agent checkpoints or model architectures
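Fixed-latency FIFO action buffering can be sketched as below. The names `latency_to_ticks` and `FifoActionBuffer` are hypothetical, rounding the latency to whole ticks is an assumption, and so is applying a default action until the first delayed command becomes available.

```python
from collections import deque

def latency_to_ticks(latency_ms, sim_rate_hz):
    """Convert a latency in milliseconds to whole simulation ticks."""
    return round(latency_ms / 1000.0 * sim_rate_hz)

class FifoActionBuffer:
    """Delay each control command by a fixed number of ticks."""

    def __init__(self, delay_ticks, default_action):
        # Pre-fill so the first `delay_ticks` outputs are the default action.
        self.queue = deque([default_action] * delay_ticks)

    def push_pop(self, action):
        """Queue the newly computed action and return the one due now."""
        self.queue.append(action)
        return self.queue.popleft()

delay = latency_to_ticks(200, 20)  # 200 ms at 20 Hz -> 4 ticks
buf = FifoActionBuffer(delay, default_action="noop")
applied = [buf.push_pop(f"a{t}") for t in range(6)]
print(applied)  # ['noop', 'noop', 'noop', 'noop', 'a0', 'a1']
```

The buffer sits between the agent and the simulator, so the policy and its checkpoint stay untouched; only the tick at which each command takes effect changes.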
The provided team_code/agent_simlingo.py includes minor adaptations for our evaluation setup.
These changes support Bench2Drive-style route parsing, simplified metadata saving, metric-info error handling, and compatibility with our server-side evaluation scripts. They do not modify the model inference logic, sensor processing logic, control generation, or planning/navigation behavior.
All assets and code are under the CC-BY-NC-ND license unless specified otherwise.
This work extends the original Bench2Drive benchmark. Please also refer to the original Bench2Drive repository for its licensing terms and usage restrictions.
If you find this work useful for your research, please cite:
@article{zhang2026b2drobust,
title={Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations},
author={Zhang, Zhiyuan and Jin, Zhenghao and Peng, Yanlun and Guo, Xianda and Liu, Haoran and Zhang, Shaofeng and Ma, Xingjun and Wu, Zuxuan and Yan, Junchi and Jia, Xiaosong and Jiang, Yu-Gang},
journal={arXiv preprint arXiv:2605.18059},
year={2026}
}

Please also consider citing the original Bench2Drive paper:
@inproceedings{jia2024bench2drive,
title={Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving},
author={Jia, Xiaosong and Yang, Zhenjie and Li, Qifeng and Zhang, Zhiyuan and Yan, Junchi},
booktitle={Advances in Neural Information Processing Systems},
year={2024}
}
