[Benchmark Contribution] RoboGate: 68-Scenario Adversarial Pick-and-Place with 30K Failure Dictionary & 4-Model VLA Leaderboard

## Summary

We'd like to contribute **RoboGate**, an adversarial pick-and-place benchmark suite, to the Isaac Lab-Arena ecosystem. The benchmark is designed to answer one question: *"Is this learned policy safe to deploy on a real production line?"*

**Pull Request:** #506

## What RoboGate Adds

### 68 Adversarial Scenarios (4 Difficulty Tiers)

| Category | Count | Target success rate (SR) | Description |
|----------|-------|-----------|-------------|
| Nominal | 20 | 95-100% | Standard objects, lighting, centered placement |
| Edge Cases | 15 | 70-85% | Small/heavy/edge/occluded/transparent objects |
| Adversarial | 10 | 40-60% | Low light, clutter, slippery, disturbances |
| Domain Randomization | 23 | 85-95% | Lighting/color/position/camera variations |
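
As a quick consistency check on the table above, the tier counts can be tallied and the target ranges averaged by scenario count. This is only an illustrative sketch: the `TIERS` structure and the count-weighted target band are assumptions made for this example, not part of the RoboGate codebase.

```python
# Tier name -> (scenario count, target SR low %, target SR high %),
# copied from the table above. The dict layout itself is hypothetical.
TIERS = {
    "Nominal": (20, 95, 100),
    "Edge Cases": (15, 70, 85),
    "Adversarial": (10, 40, 60),
    "Domain Randomization": (23, 85, 95),
}


def total_scenarios(tiers):
    """Sum the scenario counts across all tiers."""
    return sum(count for count, _, _ in tiers.values())


def weighted_target_band(tiers):
    """Average the low and high target SR, weighted by scenario count."""
    total = total_scenarios(tiers)
    low = sum(count * lo for count, lo, _ in tiers.values()) / total
    high = sum(count * hi for count, _, hi in tiers.values()) / total
    return low, high


if __name__ == "__main__":
    low, high = weighted_target_band(TIERS)
    print(total_scenarios(TIERS))         # 68
    print(f"{low:.1f}% to {high:.1f}%")   # 78.0% to 89.1%
```

Under these assumptions, a policy that hits every per-tier target would land at roughly 78% to 89% SR across the full suite.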

### 5 Safety Metrics + Deployment Confidence Score (0-100)

| Metric | Weight | Threshold |
|--------|--------|-----------|
| Grasp Success Rate | 0.30 | >= 92% |
| Collision Count | 0.25 | == 0 |
| Cycle Time | 0.20 | <= baseline × 1.1 |
| Drop Rate | 0.15 | <= 3% |
| Grasp Miss Rate | 0.10 | <= baseline × 1.2 |
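
The exact formula behind the Deployment Confidence Score is not given in this issue. The sketch below shows one simple way the weights and thresholds in the table could be combined: each metric that meets its threshold contributes its full weight. This pass/fail aggregation is an assumption. It can only produce multiples of 5, so it cannot reproduce leaderboard values such as 76/100, and the actual scoring must be more graded. The names `Metric`, `METRICS`, and `confidence_score` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Metric:
    name: str
    weight: float
    # (measured value, baseline value) -> True if the threshold is met
    passes: Callable[[float, float], bool]


# Weights and thresholds copied from the table; rates are fractions in [0, 1].
METRICS = [
    Metric("grasp_success_rate", 0.30, lambda v, b: v >= 0.92),
    Metric("collision_count", 0.25, lambda v, b: v == 0),
    Metric("cycle_time", 0.20, lambda v, b: v <= b * 1.1),
    Metric("drop_rate", 0.15, lambda v, b: v <= 0.03),
    Metric("grasp_miss_rate", 0.10, lambda v, b: v <= b * 1.2),
]


def confidence_score(measured: Dict[str, float], baseline: Dict[str, float]) -> int:
    """Assumed pass/fail aggregation: sum the weights of all passing metrics."""
    score = 0.0
    for metric in METRICS:
        # Metrics without a baseline ignore the second argument.
        if metric.passes(measured[metric.name], baseline.get(metric.name, 0.0)):
            score += metric.weight
    return round(100 * score)


if __name__ == "__main__":
    baseline = {"cycle_time": 10.0, "grasp_miss_rate": 0.02}
    measured = {
        "grasp_success_rate": 0.95,
        "collision_count": 0,
        "cycle_time": 10.5,
        "drop_rate": 0.05,  # exceeds the 3% threshold
        "grasp_miss_rate": 0.02,
    }
    print(confidence_score(measured, baseline))  # 85
```

In this example every metric passes except the drop rate, so the policy loses that metric's 15 points.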

### 4-Model VLA Leaderboard

| Model | Params | SR | Confidence | Failure Pattern |
|-------|--------|-----|-----------|-----------------|
| Scripted Controller (IK) | — | **100%** (68/68) | 76/100 | — |
| **GR00T N1.6 (NVIDIA)** | 3B | 0% (0/68) | 1/100 | grasp_miss + collision |
| OpenVLA (Stanford + TRI) | 7B | 0% (0/68) | 27/100 | grasp_miss dominant, 0 collision |
| Octo-Base (UC Berkeley) | 93M | 0% (0/68) | 1/100 | grasp_miss 79%, collision 21% |
| Octo-Small (UC Berkeley) | 27M | 0% (0/68) | 1/100 | grasp_miss 79.4%, collision 20.6% |

**Key finding:** All 4 VLA models, including NVIDIA's official GR00T N1.6 (3B), score 0% SR on scenarios that a scripted IK controller solves at 100%. Because models from 27M to 7B parameters fail alike, we attribute the 100-percentage-point SR gap to training-deployment distribution mismatch rather than model capacity.

### 30,000-Experiment Failure Dictionary

- Two-stage adaptive sampling (LHS, then boundary-focused) across an 8-dimensional parameter space
- Franka Panda (7-DOF) + UR5e (6-DOF), 30K total experiments
- Risk model AUC of 0.780, with a closed-form failure boundary equation
- 4 universal danger zones identified across both robot platforms
- Dataset: [liveplex/robogate-failure-dictionary](https://huggingface.co/datasets/liveplex/robogate-failure-dictionary)
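
To illustrate the two-stage sampling idea, here is a minimal pure-Python sketch, not the RoboGate implementation. Stage 1 draws a Latin hypercube sample (LHS) over a unit 8-dimensional cube. Stage 2 places new points near the observed pass/fail boundary by jittering the midpoint between a failing point and its nearest passing neighbour. The `toy_failure` oracle, the midpoint heuristic, and every parameter value are assumptions made for this example.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims with one point per stratum per dimension."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_samples for s in strata])
    # Transpose per-dimension columns into per-point rows.
    return [list(point) for point in zip(*columns)]


def toy_failure(point):
    """Hypothetical stand-in for a simulated trial: fails when the mean parameter exceeds 0.5."""
    return sum(point) / len(point) > 0.5


def boundary_focused(points, labels, n_new, rng, sigma=0.02):
    """Sample near the boundary: jitter midpoints of (failure, nearest pass) pairs."""
    fails = [p for p, failed in zip(points, labels) if failed]
    passes = [p for p, failed in zip(points, labels) if not failed]
    if not fails or not passes:
        return []  # no boundary observed in stage 1
    new_points = []
    for _ in range(n_new):
        f = rng.choice(fails)
        nearest = min(passes, key=lambda p: sum((a - b) ** 2 for a, b in zip(f, p)))
        new_points.append([
            min(max((a + b) / 2 + rng.gauss(0.0, sigma), 0.0), 1.0)
            for a, b in zip(f, nearest)
        ])
    return new_points


if __name__ == "__main__":
    rng = random.Random(0)
    stage1 = latin_hypercube(200, 8, rng)
    labels1 = [toy_failure(p) for p in stage1]
    stage2 = boundary_focused(stage1, labels1, 100, rng)
    print(len(stage1), len(stage2))
```

In a real pipeline the oracle would be a simulated pick-and-place trial, and the unit cube would be rescaled to the physical parameter ranges.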

## Integration with Isaac Lab-Arena

The benchmark integrates with the existing Arena environment builder:

```python
from robogate_benchmark.environments import RoboGateBenchmarkEnvironment

env_def = RoboGateBenchmarkEnvironment()
arena_env = env_def.get_env(args_cli)
```

It also supports a `--mock` mode for CI/CD testing without a GPU.
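
How `--mock` is wired up internally is not described here. The sketch below shows one common pattern for such a flag, using only `argparse`: when the flag is set, the runner returns canned results instead of launching a simulation. The functions `build_parser` and `run_scenarios` and the result format are hypothetical and are not taken from the RoboGate code.

```python
import argparse


def build_parser():
    """Hypothetical CLI parser exposing a --mock switch (illustration only)."""
    parser = argparse.ArgumentParser(description="Illustrative benchmark runner")
    parser.add_argument(
        "--mock",
        action="store_true",
        help="Skip simulation and return canned results (no GPU needed)",
    )
    return parser


def run_scenarios(args, scenario_ids):
    """Return per-scenario results; only the mock path is implemented in this sketch."""
    if args.mock:
        # Canned outcomes let CI exercise the reporting code without a simulator.
        return {sid: {"success": True, "source": "mock"} for sid in scenario_ids}
    raise RuntimeError("Real simulation requires a simulator and GPU; not part of this sketch")


if __name__ == "__main__":
    args = build_parser().parse_args(["--mock"])
    results = run_scenarios(args, ["nominal_01", "nominal_02"])
    print(len(results))  # 2
```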

## Links

- **PR:** #506
- **Paper:** [arXiv:2603.22126](https://arxiv.org/abs/2603.22126)
- **Website & Leaderboard:** [robogate.io/vla](https://robogate.io/vla)
- **GitHub:** [github.com/liveplex-cpu/robogate](https://github.com/liveplex-cpu/robogate)
- **HuggingFace Dataset:** [liveplex/robogate-failure-dictionary](https://huggingface.co/datasets/liveplex/robogate-failure-dictionary)

We welcome feedback on the benchmark design, scenario coverage, or integration approach. Happy to adjust the PR based on maintainer guidance.

— AgentAI Co., Ltd.
