[Benchmark Contribution] RoboGate: 68-Scenario Adversarial Pick-and-Place with 30K Failure Dictionary & 4-Model VLA Leaderboard #507

@liveplex-cpu

Summary

We'd like to contribute RoboGate, an adversarial pick-and-place benchmark suite, to the Isaac Lab-Arena ecosystem. The benchmark is designed to answer one question: "Is this learned policy safe to deploy on a real production line?"

Pull Request: #506

What RoboGate Adds

68 Adversarial Scenarios (4 Difficulty Tiers)

| Category | Count | Target SR (success rate) | Description |
|---|---|---|---|
| Nominal | 20 | 95-100% | Standard objects, lighting, centered placement |
| Edge Cases | 15 | 70-85% | Small/heavy/edge/occluded/transparent objects |
| Adversarial | 10 | 40-60% | Low light, clutter, slippery, disturbances |
| Domain Randomization | 23 | 85-95% | Lighting/color/position/camera variations |

5 Safety Metrics + Deployment Confidence Score (0-100)

| Metric | Weight | Threshold |
|---|---|---|
| Grasp Success Rate | 0.30 | >= 92% |
| Collision Count | 0.25 | == 0 |
| Cycle Time | 0.20 | <= baseline × 1.1 |
| Drop Rate | 0.15 | <= 3% |
| Grasp Miss Rate | 0.10 | <= baseline × 1.2 |
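
To make the weights and thresholds above concrete, here is a minimal sketch of how they could be checked for one evaluation run. The `Metrics` class, the function names, and the binary pass/fail aggregation are assumptions made for illustration; this is not RoboGate's actual scoring formula, and it will not reproduce the confidence values on the leaderboard (a purely binary sum of these weights cannot yield 76, for example).

```python
from dataclasses import dataclass

# Weights copied from the metrics table; they sum to 1.00.
WEIGHTS = {
    "grasp_success_rate": 0.30,
    "collision_count": 0.25,
    "cycle_time": 0.20,
    "drop_rate": 0.15,
    "grasp_miss_rate": 0.10,
}


@dataclass
class Metrics:
    """Hypothetical container for one evaluation run (rates are fractions in 0..1)."""
    grasp_success_rate: float
    collision_count: int
    cycle_time: float  # seconds
    drop_rate: float
    grasp_miss_rate: float


def threshold_checks(run: Metrics, baseline: Metrics) -> dict:
    """Apply the five thresholds from the table; True means the metric passes."""
    return {
        "grasp_success_rate": run.grasp_success_rate >= 0.92,
        "collision_count": run.collision_count == 0,
        "cycle_time": run.cycle_time <= baseline.cycle_time * 1.1,
        "drop_rate": run.drop_rate <= 0.03,
        "grasp_miss_rate": run.grasp_miss_rate <= baseline.grasp_miss_rate * 1.2,
    }


def weighted_pass_score(checks: dict) -> int:
    """ASSUMPTION: sum the weights of passing metrics and scale to 0-100."""
    return round(100 * sum(WEIGHTS[name] for name, ok in checks.items() if ok))


baseline = Metrics(1.0, 0, 10.0, 0.0, 0.05)
clean_run = Metrics(0.95, 0, 10.5, 0.02, 0.05)
one_collision = Metrics(0.95, 1, 10.5, 0.02, 0.05)

print(weighted_pass_score(threshold_checks(clean_run, baseline)))      # 100
print(weighted_pass_score(threshold_checks(one_collision, baseline)))  # 75
```

Under this assumed aggregation, a single collision costs the full 0.25 weight regardless of how well the policy grasps, which matches the intent of a strict `== 0` threshold.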

4-Model VLA Leaderboard

| Model | Params | SR | Confidence | Failure Pattern |
|---|---|---|---|---|
| Scripted Controller (IK) | | 100% (68/68) | 76/100 | |
| GR00T N1.6 (NVIDIA) | 3B | 0% (0/68) | 1/100 | grasp_miss + collision |
| OpenVLA (Stanford + TRI) | 7B | 0% (0/68) | 27/100 | grasp_miss dominant, 0 collision |
| Octo-Base (UC Berkeley) | 93M | 0% (0/68) | 1/100 | grasp_miss 79%, collision 21% |
| Octo-Small (UC Berkeley) | 27M | 0% (0/68) | 1/100 | grasp_miss 79.4%, collision 20.6% |

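Key finding: All 4 VLA models, including NVIDIA's official GR00T N1.6 (3B), score 0% SR on scenarios that a scripted IK controller solves 100% of the time. This 100-percentage-point SR gap (49 to 75 points on the confidence score) is driven by training-deployment distribution mismatch, not model capacity: every model from 27M to 7B parameters fails all 68 scenarios.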

30,000-Experiment Failure Dictionary

  • Two-stage adaptive sampling (LHS + boundary-focused) across 8-dimensional parameter space
  • Franka Panda (7-DOF) + UR5e (6-DOF), 30K total experiments
  • Risk model AUC: 0.780, closed-form failure boundary equation
  • 4 universal danger zones identified across both robot platforms
  • Dataset: liveplex/robogate-failure-dictionary
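
The two-stage sampling idea in the first bullet can be sketched as follows. This is an illustrative toy under stated assumptions, not the RoboGate implementation: the function names, the unit-cube parameter space, the midpoint-plus-jitter rule for the boundary stage, and the `toy_failure` stand-in for a simulated trial are all hypothetical.

```python
import random


def latin_hypercube(n, dims, rng):
    """Stage 1: n points in the unit cube, one per stratum along every axis."""
    columns = []
    for _ in range(dims):
        strata = list(range(n))
        rng.shuffle(strata)
        # Jitter each point inside its own stratum of width 1/n.
        columns.append([(s + rng.random()) / n for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n)]


def boundary_focused(points, failed, n_new, rng, jitter=0.02):
    """Stage 2 (assumed rule): sample near midpoints of success/failure pairs."""
    successes = [p for p, f in zip(points, failed) if not f]
    failures = [p for p, f in zip(points, failed) if f]
    if not successes or not failures:
        return []  # no boundary observed yet
    new_points = []
    for _ in range(n_new):
        a, b = rng.choice(successes), rng.choice(failures)
        new_points.append(tuple(
            min(1.0, max(0.0, (x + y) / 2 + rng.uniform(-jitter, jitter)))
            for x, y in zip(a, b)
        ))
    return new_points


def toy_failure(params):
    """HYPOTHETICAL stand-in for running one simulated pick-and-place trial."""
    return sum(params) / len(params) > 0.5


rng = random.Random(0)
stage1 = latin_hypercube(200, 8, rng)          # broad coverage of 8 parameters
failed1 = [toy_failure(p) for p in stage1]
stage2 = boundary_focused(stage1, failed1, 100, rng)  # densify near the boundary
```

The first stage spreads the budget evenly over all 8 dimensions; the second spends the remaining budget where outcomes change, which is where a failure-boundary model needs the most data.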

Integration with Isaac Lab-Arena

The benchmark integrates with the existing Arena environment builder:

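# Build the RoboGate environment through the Arena environment builder.
from robogate_benchmark.environments import RoboGateBenchmarkEnvironment

env_def = RoboGateBenchmarkEnvironment()
# args_cli is assumed here to be the parsed command-line arguments of the launching script.
arena_env = env_def.get_env(args_cli)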

It also supports a --mock mode for CI/CD testing without a GPU.

Links

We welcome feedback on the benchmark design, scenario coverage, or integration approach. Happy to adjust the PR based on maintainer guidance.

— AgentAI Co., Ltd.
