Thinklab-SJTU/Bench2Drive-Robust

Bench2Drive-Robust

Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations

📄 Paper | 🚗 Baseline

Bench2Drive-Robust is a closed-loop robustness benchmark for end-to-end autonomous driving, evaluating deployment-oriented perturbations including camera-stream failures, ego-state estimation errors, and compute-control latency.


✨ Overview

End-to-end autonomous driving (E2E-AD) systems have achieved strong performance under clean closed-loop evaluation settings, but their robustness under deployment-side perturbations remains insufficiently understood.

Bench2Drive-Robust fills this gap by introducing a device-centric robustness benchmark for closed-loop E2E-AD. Instead of only testing external appearance changes such as weather or image corruptions, we evaluate system-level imperfections that can arise from onboard sensing, localization, and computation pipelines.

Bench2Drive-Robust introduces:

  • 📷 Camera-stream failures — cached burst frame drop and partial observation
  • 📍 Ego-state estimation errors — GPS localization noise and multiplicative speed noise
  • ⏱️ Compute-control latency — delayed control execution through FIFO action buffering
  • 🔁 Closed-loop robustness evaluation — perturbations affect future states and observations
  • 📊 Cross-model robustness analysis — robustness profiles across representative E2E-AD agents

🔄 Compatibility with Bench2Drive

Bench2Drive-Robust maintains 100% compatibility with Bench2Drive:

  • Environment Setup — Identical to Bench2Drive (CARLA 0.9.15, Python dependencies)
  • Model Integration — Same symlinks and directory structure for Baseline models
  • Agent Code — Your existing agent.py works without modification
  • Evaluation Routes — Same route files and scenarios

To enable perturbations, simply add environment variables to your evaluation script:

export ROBUSTNESS_ENABLE=1
export FRAME_DROP_ENABLE=1
export BURST_PROBABILITY=0.01
# ... run your existing Bench2Drive evaluation

See How to Adapt from Bench2Drive for the 4 files you need to replace.


📊 Benchmark

Evaluated Models

Bench2Drive-Robust is designed as a model-agnostic extension of Bench2Drive and can be applied to any Bench2Drive-compatible closed-loop E2E-AD agent without modifying the model architecture or checkpoint.

In our benchmark experiments, we evaluate four representative E2E-AD models:

  1. SimLingo-base — vision-only closed-loop autonomous driving with language-action alignment
  2. TCP-traj — trajectory-guided control prediction baseline
  3. UniAD — unified perception-prediction-planning autonomous driving model
  4. VAD — vectorized scene representation for end-to-end autonomous driving

All evaluated models are tested as fixed policies without robustness finetuning.

Perturbation Types

| Perturbation | Settings | Description |
|---|---|---|
| GPS Localization Noise | 5 m, 15 m | Gaussian noise on GPS localization input |
| Inference Latency | 100 ms, 200 ms, 500 ms | Delayed control execution through FIFO action buffering |
| Burst Frame Drop | 20 ticks, 60 ticks | Cached frozen-frame camera-stream failure |
| Partial Observation | 50%, 80% | Random rectangular camera occlusion |
| Speed Noise | η ~ N(0.5, 0.2²), η ~ N(0.2, 0.2²) | Multiplicative ego-speed noise |
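The two ego-state perturbations in the table can be illustrated with a short sketch. This is not the repository's implementation: the function names are hypothetical, the 5 m and 15 m settings are assumed to be per-axis standard deviations in metres, and the speed multiplier is assumed to be drawn fresh on every call and left unclipped.

```python
import random


def noisy_gps(x, y, sigma_m, rng):
    """Add independent zero-mean Gaussian noise (std sigma_m, in metres) to a 2D position.

    Assumption: the benchmark's "5 m" / "15 m" settings are read as sigma_m here.
    """
    return x + rng.gauss(0.0, sigma_m), y + rng.gauss(0.0, sigma_m)


def noisy_speed(speed, mean, std, rng):
    """Scale the true speed by a multiplier eta ~ N(mean, std^2)."""
    eta = rng.gauss(mean, std)  # assumption: no clipping of the multiplier
    return eta * speed


rng = random.Random(2026)  # fixed seed so a run is reproducible
print(noisy_gps(100.0, 50.0, sigma_m=5.0, rng=rng))
print(noisy_speed(10.0, mean=0.5, std=0.2, rng=rng))
```

With a mean multiplier of 0.5 or 0.2, the agent sees a speed well below the true value on average, which corresponds to the speed underestimation case discussed in the results.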

Main Results

We evaluate representative E2E-AD models, including TCP-traj, UniAD, VAD, and SimLingo-base, under camera-stream, ego-state, speed, and latency perturbations.

Our results show that deployment-side perturbations induce heterogeneous closed-loop degradation patterns. Models with strong clean-driving performance can still degrade substantially under specific system-level perturbations such as severe occlusion, GPS localization noise, speed underestimation, or inference latency.

Relative Degradation (RD)

To quantify robustness, we use Relative Degradation (RD):

RD = (baseline_score - perturbed_score) / baseline_score × 100%

  • Positive RD = performance degradation under perturbations
  • Lower RD = better robustness (less degradation)
  • RD = 0% = no degradation (perfect robustness)
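The definition can be written as a small helper. This is a minimal sketch with a hypothetical function name, not the code of tools/compute_rd.py; it takes the two scores directly and makes no assumption about the layout of the result files.

```python
def relative_degradation(baseline_score: float, perturbed_score: float) -> float:
    """Return RD in percent: (baseline - perturbed) / baseline * 100."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (baseline_score - perturbed_score) / baseline_score * 100.0


# A baseline of 75.20 and a perturbed score of 58.30 give an RD of about 22.47%.
print(f"RD: {relative_degradation(75.20, 58.30):.2f}%")
```

A negative return value means the perturbed run scored higher than the baseline.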

Compute RD from evaluation results:

# After evaluation, you'll have JSON result files
python tools/compute_rd.py results/baseline.json results/gps_drift_severe.json

# Example output:
# Baseline score:     75.20%
# Perturbed score:    58.30%
# RD:                 22.47%

📦 Setup

1. Environment Setup

This release requires CARLA 0.9.15 and a Python environment with the required Bench2Drive dependencies.

Please refer to the original Bench2Drive repository for basic environment setup, including CARLA installation, Python dependencies, route files, and evaluation tools.

For agent-specific setup:

2. Directory Preparation

Your directory structure should look like this:

Bench2Drive-Robust/
├── assets/
├── docs/
├── leaderboard/
│   └── team_code/    --> Please add your agent HERE
├── scenario_runner/
└── tools/

3. How to Adapt from Bench2Drive

To adapt your existing Bench2Drive installation to Bench2Drive-Robust, replace or add the following four items (three files and one directory):

| File | Location | Purpose |
|---|---|---|
| sensor_interface.py | leaderboard/leaderboard/envs/ | Redirects to new robust sensor interface |
| sensor_interface_with_perturbations/ | leaderboard/leaderboard/envs/ | New modular robustness implementation |
| agent_wrapper.py | leaderboard/leaderboard/autoagents/ | Adds inference latency support |
| autonomous_agent.py | leaderboard/leaderboard/autoagents/ | Updated with debug logging |

Steps:

  1. Back up your original files.
  2. Copy the four items listed above from Bench2Drive-Robust into your Bench2Drive installation.
  3. Keep all other files unchanged (routes, scenarios, etc.).
  4. To run a clean evaluation without perturbations, set export ROBUSTNESS_ENABLE=0.
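The backup and copy steps can be scripted. The sketch below is one possible approach, not a tool shipped with the repository: the function name, the `.orig` backup suffix, and the two root arguments are assumptions, while the relative paths come from the table above.

```python
import shutil
from pathlib import Path

# Relative paths taken from the table above (three files and one directory).
ITEMS = [
    "leaderboard/leaderboard/envs/sensor_interface.py",
    "leaderboard/leaderboard/envs/sensor_interface_with_perturbations",
    "leaderboard/leaderboard/autoagents/agent_wrapper.py",
    "leaderboard/leaderboard/autoagents/autonomous_agent.py",
]


def adapt_installation(robust_root, b2d_root, backup_suffix=".orig"):
    """Back up each existing target, then copy the Bench2Drive-Robust version in."""
    robust_root, b2d_root = Path(robust_root), Path(b2d_root)
    for rel in ITEMS:
        src, dst = robust_root / rel, b2d_root / rel
        if dst.exists():
            backup = dst.with_name(dst.name + backup_suffix)
            if backup.exists():
                # Refuse to overwrite an earlier backup.
                raise FileExistsError(f"backup already exists: {backup}")
            shutil.move(str(dst), str(backup))  # step 1: back up
        dst.parent.mkdir(parents=True, exist_ok=True)
        if src.is_dir():
            shutil.copytree(src, dst)  # step 2: copy the directory
        else:
            shutil.copy2(src, dst)  # step 2: copy the file
```

A call such as `adapt_installation("Bench2Drive-Robust", "Bench2Drive")` would leave the original files next to the new ones with the `.orig` suffix, so reverting is a matter of renaming them back.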

🚀 Evaluation

We provide sample evaluation scripts in leaderboard/scripts/. Modify the configuration parameters in the script according to your local setup, then run the script directly with bash.

Sample Scripts

  • run_evaluation_debug.sh (for debugging the correctness of the team agent)
  • run_evaluation.sh (basic evaluation)
  • run_evaluation_multi_simlingo_robust.sh (w/ perturbations)
  • run_evaluation_multi_tcp_robust.sh
  • run_evaluation_multi_uniad_robust.sh
  • run_evaluation_multi_vad_robust.sh

Running Evaluation

  1. Modify the script to set your paths:

    • WORK_DIR — your working directory
    • TEAM_AGENT — path to your agent file
    • TEAM_CONFIG — path to your model checkpoint
    • BASE_ROUTES — path to evaluation routes
    • GPU_RANK_LIST — GPUs to use for evaluation
    • BASE_CHECKPOINT_ENDPOINT — path to the evaluation checkpoint JSON
    • SAVE_PATH — directory for saving outputs and metric files (must end with /)
  2. Run the script:

bash leaderboard/scripts/run_evaluation_multi_simlingo_robust.sh

You can modify the provided scripts to evaluate your own Bench2Drive-compatible agent.

Enable Perturbations

Set the corresponding environment variables in the evaluation script.

# Global switch
export ROBUSTNESS_ENABLE=1
export ROBUSTNESS_SEED=2026

# GPS localization noise
export GPS_DRIFT_ENABLE=1
export GPS_DRIFT_MODE=severe  # medium (5m) or severe (15m)
export GPS_DRIFT_SEED=42

# Inference latency
export INFERENCE_LATENCY_ENABLE=1
export INFERENCE_LATENCY_MS=100  # 100, 200, or 500 ms

# Burst frame drop
export FRAME_DROP_ENABLE=1
export BURST_PROBABILITY=0.01
export BURST_MAX_TICKS=20  # 20 or 60

# Partial observation
export PARTIAL_OBS_ENABLE=1
export PARTIAL_OBS_TYPE=occlusion
export PARTIAL_OBS_RATIO=0.5  # 0.5 or 0.8

# Speed noise
export SPEED_BIAS_ENABLE=1
export SPEED_BIAS_MEAN=0.5  # 0.5 or 0.2
export SPEED_BIAS_STD=0.2
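When sweeping several settings, the same variables can be passed to the evaluation script from Python. This is a sketch under assumptions: the preset names and both helper functions are hypothetical, and it only has an effect if the evaluation script does not itself export the same variables, since values set inside the script would override the inherited ones.

```python
import os
import subprocess

# Variable names and values come from the list above; the preset names are hypothetical.
PRESETS = {
    "clean": {"ROBUSTNESS_ENABLE": "0"},
    "gps_severe": {
        "ROBUSTNESS_ENABLE": "1",
        "GPS_DRIFT_ENABLE": "1",
        "GPS_DRIFT_MODE": "severe",
    },
    "latency_200ms": {
        "ROBUSTNESS_ENABLE": "1",
        "INFERENCE_LATENCY_ENABLE": "1",
        "INFERENCE_LATENCY_MS": "200",
    },
    "burst_60": {
        "ROBUSTNESS_ENABLE": "1",
        "FRAME_DROP_ENABLE": "1",
        "BURST_PROBABILITY": "0.01",
        "BURST_MAX_TICKS": "60",
    },
}


def build_env(preset, base_env=None):
    """Return a copy of the base environment with the preset's variables applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(PRESETS[preset])
    return env


def run_preset(preset, script="leaderboard/scripts/run_evaluation.sh"):
    """Launch the evaluation script with the preset's environment (not called here)."""
    return subprocess.run(["bash", script], env=build_env(preset), check=True)
```

Running each perturbation as its own preset, with a separate SAVE_PATH per run, keeps the results of different settings apart for the later RD computation.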

Main Environment Variables

| Variable | Description | Values |
|---|---|---|
| ROBUSTNESS_ENABLE | Enable robustness perturbations | 0, 1 |
| ROBUSTNESS_SEED | Random seed for stochastic perturbations | integer |
| GPS_DRIFT_ENABLE | Enable GPS localization noise | 0, 1 |
| GPS_DRIFT_MODE | GPS drift severity mode | medium (5 m), severe (15 m) |
| GPS_DRIFT_SEED | Random seed for GPS drift | integer |
| SPEED_BIAS_ENABLE | Enable multiplicative speed noise | 0, 1 |
| SPEED_BIAS_MEAN | Mean of speed multiplier | 0.5, 0.2 |
| SPEED_BIAS_STD | Standard deviation of speed multiplier | 0.2 |
| FRAME_DROP_ENABLE | Enable burst frame drop | 0, 1 |
| BURST_PROBABILITY | Burst trigger probability | e.g., 0.01 |
| BURST_MAX_TICKS | Maximum burst duration (ticks) | 20, 60 |
| PARTIAL_OBS_ENABLE | Enable partial observation perturbation | 0, 1 |
| PARTIAL_OBS_TYPE | Partial observation type | blur, occlusion |
| PARTIAL_OBS_RATIO | Occlusion mask ratio | 0.5, 0.8 |
| INFERENCE_LATENCY_ENABLE | Enable action-side latency | 0, 1 |
| INFERENCE_LATENCY_MS | Fixed inference latency in ms (used in fixed mode) | 100, 200, 500 |
| INFERENCE_LATENCY_MODE | Latency mode: fixed uses a constant delay, measured uses the actual inference time | fixed, measured |
| WARMUP_STEPS | Warmup steps before latency injection | e.g., 20 |
| SIM_RATE | Simulation/control rate used for latency conversion | e.g., 20 |
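To make the latency-related variables concrete, the following sketch reads them into a typed object and converts the delay from milliseconds to simulation ticks. The class, the loader, the default values, the rounding rule, and the gating on ROBUSTNESS_ENABLE are all assumptions made for illustration; the repository's SensorConfig may behave differently.

```python
import os
from dataclasses import dataclass


def _flag(env, name):
    """Treat "1" as enabled and anything else (including unset) as disabled."""
    return env.get(name, "0") == "1"


@dataclass(frozen=True)
class LatencyConfig:  # hypothetical container, not the repository's class
    enabled: bool
    mode: str
    latency_ms: float
    sim_rate: int
    warmup_steps: int

    @property
    def delay_ticks(self) -> int:
        """Convert milliseconds to ticks: ms / 1000 * ticks per second."""
        return round(self.latency_ms / 1000.0 * self.sim_rate)


def load_latency_config(env=None):
    env = os.environ if env is None else env
    return LatencyConfig(
        # Assumption: the global switch gates every individual perturbation.
        enabled=_flag(env, "ROBUSTNESS_ENABLE") and _flag(env, "INFERENCE_LATENCY_ENABLE"),
        mode=env.get("INFERENCE_LATENCY_MODE", "fixed"),
        latency_ms=float(env.get("INFERENCE_LATENCY_MS", "0")),
        sim_rate=int(env.get("SIM_RATE", "20")),
        warmup_steps=int(env.get("WARMUP_STEPS", "0")),
    )
```

Under this conversion and a SIM_RATE of 20, the 100 ms, 200 ms, and 500 ms settings correspond to 2, 4, and 10 ticks of delay.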

🔬 Key Implementation

Sensor Interface: sensor_interface_with_perturbations/

We introduce a modular sensor-interface implementation for observation-side perturbation injection.

Main components:

  • DelayBuffer: frame-based buffering with age tracking
  • RobustnessProcessor: perturbation injection for camera-stream and ego-state inputs
  • SensorConfig: centralized configuration through environment variables
  • Compatibility layer: drop-in replacement for the original Bench2Drive sensor interface

Supported observation-side perturbations include cached burst frame drop, partial observation, GPS localization noise, and multiplicative speed noise.
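As an illustration of the cached burst frame drop, the sketch below freezes the camera stream on the last delivered frame for the length of a randomly triggered burst. The class name is hypothetical, and two details are assumptions rather than facts about RobustnessProcessor: a burst can start on any tick with probability BURST_PROBABILITY, and its length is drawn uniformly between 1 and BURST_MAX_TICKS.

```python
import random


class BurstFrameDrop:
    """Replay the last good frame while a randomly triggered burst is active."""

    def __init__(self, probability, max_ticks, seed=0):
        self.probability = probability
        self.max_ticks = max_ticks
        self.rng = random.Random(seed)
        self.remaining = 0   # ticks left in the current burst
        self.cached = None   # last frame delivered outside a burst

    def step(self, frame):
        can_start = self.remaining == 0 and self.cached is not None
        if can_start and self.rng.random() < self.probability:
            # Assumption: burst length is uniform in [1, max_ticks].
            self.remaining = self.rng.randint(1, self.max_ticks)
        if self.remaining > 0:
            self.remaining -= 1
            return self.cached  # frozen frame: the agent sees stale input
        self.cached = frame
        return frame


dropper = BurstFrameDrop(probability=0.01, max_ticks=20, seed=2026)
frames = [dropper.step(f"frame_{t}") for t in range(5)]
```

Because the evaluation is closed-loop, the stale frames change the agent's actions and therefore the frames it receives after the burst ends.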

Agent Wrapper: agent_wrapper.py

We modify the agent wrapper to support action-side inference-latency injection.

Main features:

  • FIFO action buffering for fixed-latency evaluation
  • delayed application of control commands
  • measured-latency mode that uses the actual inference time (INFERENCE_LATENCY_MODE=measured)
  • policy-agnostic perturbation without modifying agent checkpoints or model architectures
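The fixed-latency case can be sketched with a queue that releases each control command a fixed number of ticks after it was computed. The class name and the default action emitted while the queue fills are assumptions for illustration, and the warmup behaviour controlled by WARMUP_STEPS is not modelled.

```python
from collections import deque


class FifoActionDelay:
    """Delay every action by a fixed number of ticks using a FIFO queue."""

    def __init__(self, delay_ticks, default_action):
        # Pre-fill the queue so the first delay_ticks outputs are the default action.
        self.queue = deque([default_action] * delay_ticks)

    def step(self, action):
        """Enqueue the newest action and release the oldest one."""
        self.queue.append(action)
        return self.queue.popleft()


buffer = FifoActionDelay(delay_ticks=2, default_action="idle")
applied = [buffer.step(a) for a in ["a0", "a1", "a2", "a3"]]
print(applied)  # ['idle', 'idle', 'a0', 'a1']
```

The wrapper only reorders when actions are applied, so the agent's checkpoint and architecture stay untouched, which matches the policy-agnostic design described above.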

SimLingo Integration

The provided team_code/agent_simlingo.py includes minor adaptations for our evaluation setup.

These changes support Bench2Drive-style route parsing, simplified metadata saving, metric-info error handling, and compatibility with our server-side evaluation scripts. They do not modify the model inference logic, sensor processing logic, control generation, or planning/navigation behavior.


📝 License

All assets and code are under the CC-BY-NC-ND license unless specified otherwise.

This work extends the original Bench2Drive benchmark. Please also refer to the original Bench2Drive repository for its licensing terms and usage restrictions.


📜 Citation

If you find this work useful for your research, please cite:

@article{zhang2026b2drobust,
  title={Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations},
  author={Zhang, Zhiyuan and Jin, Zhenghao and Peng, Yanlun and Guo, Xianda and Liu, Haoran and Zhang, Shaofeng and Ma, Xingjun and Wu, Zuxuan and Yan, Junchi and Jia, Xiaosong and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2605.18059},
  year={2026}
}

Please also consider citing the original Bench2Drive paper:

@inproceedings{jia2024bench2drive,
  title={Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving},
  author={Jia, Xiaosong and Yang, Zhenjie and Li, Qifeng and Zhang, Zhiyuan and Yan, Junchi},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}
