
Fix: Bump gripper and GetImage timeouts to stabilize CI #614

Closed
JWhitleyWork wants to merge 1 commit into main from fix/ci-flake-gripper-and-getimage-timeouts
Member

@JWhitleyWork JWhitleyWork commented May 7, 2026

Summary

  • Bumps MoveGripperAction timeout from 15s → 30s in close_gripper.xml and open_gripper.xml.
  • Bumps GetImage message_timeout_sec from 5s → 15s in the shared picknik_ur_base_config Segment Image from No Negative Text Prompt Subtree.

Why

CI on main started flaking ~24h after the MuJoCo 3.2.7 → 3.6.0 upgrade in moveit_pro (6eedef88a5, Apr 14). The new constraint solver is heavier per step, so on CI hardware the sim runs slower than realtime and emits "Mujoco model timestep not running in realtime. Increase the model timestep." warnings. Two previously borderline timeouts in lab_sim integration tests then fall over:

| Failing test | ~ frequency | Underlying error |
| --- | --- | --- |
| push_button_with_a_trajectory.xml | ~9/10 | MoveGripperAction Error: ... gripper failed to reach the target position within 15.0s. Current values: position=0.7929 |
| ml_segment_point_cloud.xml | ~4/10 | GetImage Error: Failed to get next message on topic '/wrist_camera/color': Timed out after 5 seconds |
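
To illustrate why a slower-than-realtime sim can push a fixed timeout over the edge, here is a minimal sketch. It assumes the timeout is measured in wall-clock time while the motion needs a fixed amount of simulated time; that assumption, the 10 s motion duration, and the realtime factors are illustrative and not measurements from this CI.

```python
def wall_clock_duration(sim_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to advance `sim_seconds` of simulated time.

    `realtime_factor` is simulated seconds advanced per wall-clock second
    (1.0 = realtime, 0.5 = half speed).
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return sim_seconds / realtime_factor


def exceeds_timeout(sim_seconds: float, realtime_factor: float, timeout_s: float) -> bool:
    """True if the motion is still running when a wall-clock timeout fires."""
    return wall_clock_duration(sim_seconds, realtime_factor) > timeout_s


# Hypothetical value: a gripper motion that needs 10 s of simulated time.
GRIPPER_SIM_SECONDS = 10.0

for rtf in (1.0, 0.5):
    for timeout in (15.0, 30.0):
        late = exceeds_timeout(GRIPPER_SIM_SECONDS, rtf, timeout)
        print(f"realtime factor {rtf}, timeout {timeout}s: times out = {late}")
```

Under these made-up numbers, halving the realtime factor is enough to turn a comfortable 15 s budget into a failure, while a 30 s budget still passes.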

CI history correlates tightly with the upgrade:

| Period | Pass | Fail |
| --- | --- | --- |
| Apr 8–14 (pre-upgrade) | 24 | 0 |
| Apr 15 (day after upgrade) | 4 | 1 |
| Apr 17 | 1 | 7 |
| Apr 20–30 | ~3 | ~37 |
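
The failure rate per period follows directly from the counts above. A small sketch of that arithmetic, treating the approximate Apr 20–30 counts as exact for illustration:

```python
# (period, passes, failures) copied from the CI history table;
# the Apr 20-30 counts are approximate in the source.
CI_HISTORY = [
    ("Apr 8-14 (pre-upgrade)", 24, 0),
    ("Apr 15 (day after upgrade)", 4, 1),
    ("Apr 17", 1, 7),
    ("Apr 20-30", 3, 37),
]


def failure_rate(passes: int, failures: int) -> float:
    """Fraction of runs that failed in a period."""
    total = passes + failures
    if total == 0:
        raise ValueError("no runs recorded for this period")
    return failures / total


for period, passes, failures in CI_HISTORY:
    print(f"{period}: {failure_rate(passes, failures):.1%} of runs failed")
```

The Apr 20–30 row works out to 37 of 40 runs failing, which is where the ~92% figure in the test plan comes from.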

This is a mitigation, not a root-cause fix: moveit_pro#18534 and PR #610's description ("the real test harness infrastructure needs some rethinking ... for physics this tight") call out that the durable answer is rethinking the test harness or coarsening the MuJoCo timestep (currently the default 0.002s / 500Hz; no explicit timestep is set in lab_sim/description/scene.xml). That work belongs in a follow-up.

Test plan

  • Trigger CI workflow on this branch and confirm integration-test-in-studio-container (humble) passes.
  • Re-run a few times to verify the historical ~92% failure rate drops.
  • Manual smoke test: Push Button With a Trajectory, ML Segment Point Cloud, and Close/Open Gripper Objectives still execute correctly on a dev machine.
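
One rough way to judge how many re-runs are "a few": if the failure rate had not changed, the chance of n green runs in a row would be (1 - p)^n. The sketch below assumes runs are independent, which is an assumption (CI flakes are often correlated), and uses the ~92% historical failure rate quoted above.

```python
def prob_all_pass(failure_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass at a given failure rate."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return (1.0 - failure_rate) ** runs


HISTORICAL_FAILURE_RATE = 0.92  # approximate, from the Apr 20-30 CI history

for n in (1, 2, 3):
    p = prob_all_pass(HISTORICAL_FAILURE_RATE, n)
    print(f"{n} consecutive green run(s) by luck alone: {p:.4%}")
```

Under the independence assumption, two consecutive green runs would happen by luck less than 1% of the time at the old failure rate, so even a handful of re-runs is informative.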

🤖 Generated with Claude Code

The MuJoCo 3.2.7 -> 3.6.0 upgrade (moveit_pro@6eedef88a5) made the
constraint solver heavier per step, so on CI hardware the simulator
runs slower than realtime and emits "Mujoco model timestep not running
in realtime" warnings. This pushes two previously-borderline timeouts
in lab_sim integration tests over the edge:

- MoveGripperAction's 15s timeout fires while the Robotiq 2f85 is still
  closing on the push-button laptop, failing
  push_button_with_a_trajectory ~9/10 runs.
- GetImage's 5s wait on /wrist_camera/color times out before the first
  EGL-rendered frame is published, failing ml_segment_point_cloud
  ~4/10 runs.

Bump MoveGripperAction timeout 15s -> 30s in close_gripper.xml and
open_gripper.xml, and GetImage message_timeout_sec 5s -> 15s in the
shared picknik_ur_base_config Segment Image subtree. These are
mitigations; the durable fix (per moveit_pro#18534) is rethinking the
test harness for tighter physics, but this should flip main green
while that work happens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JWhitleyWork JWhitleyWork added this to the 9.3.0 milestone May 7, 2026
@JWhitleyWork JWhitleyWork marked this pull request as ready for review May 7, 2026 21:20

Copilot AI left a comment


Pull request overview

This PR mitigates recent CI flakiness in lab_sim integration tests by increasing timeouts for gripper actuation and image acquisition to better tolerate slower-than-realtime MuJoCo simulation on CI hardware.

Changes:

  • Increased MoveGripperAction timeout from 15s to 30s in lab_sim open/close gripper objectives.
  • Increased GetImage message_timeout_sec from 5s to 15s in the shared UR base config segmentation subtree.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/moveit_pro_ur_configs/picknik_ur_base_config/objectives/segment_image_from_no_negative_text_prompt_subtree.xml | Increases GetImage message timeout to reduce image topic timeout flakes. |
| src/lab_sim/objectives/open_gripper.xml | Increases gripper open timeout to reduce actuation timeout flakes under slow sim. |
| src/lab_sim/objectives/close_gripper.xml | Increases gripper close timeout to reduce actuation timeout flakes under slow sim. |

Member

@nbbrooks nbbrooks left a comment


I agree something probably changed with the MuJoCo upgrade, but I doubt increasing the timeout will help that much; at best it is a band-aid.

I would follow the advice in https://docs.picknik.ai/troubleshooting/Planning%20and%20Control%20Troubleshooting/#movegripperaction-fails-to-reach-target-position to address the core of the issue.

Member

nbbrooks commented May 7, 2026

Having to bump GetImage past 5 sec also seems pretty wild to me. If the sim is lagging so badly that we can't reliably get an image within 5 sec, I don't think we can trust any of the results (who knows how CPU-starved the physics is), so personally I would just make this CI job optional and mostly treat it as noise until this is addressed.

@JWhitleyWork
Member Author

Superseded by an Option-A approach: instead of bumping per-objective timeouts, we'll add an opt-in input to the moveit_pro_ci/workspace_integration_test.yaml reusable workflow that patches the MuJoCo <option timestep> for CI runs only. That keeps the slower-than-realtime warning visible on dev hardware (where it's diagnostic) and removes the need for tolerance band-aids here. New PR coming.
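
As a rough illustration of what patching the MuJoCo `<option timestep>` could look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function name, the example scene, and the 0.004 value are hypothetical; the actual workflow input and patch mechanism are not described in this PR. The sketch also assumes the `<option>` element lives in the file being patched and not in an included file.

```python
import xml.etree.ElementTree as ET


def set_mujoco_timestep(mjcf_xml: str, timestep: float) -> str:
    """Return MJCF text with <option timestep="..."> set to `timestep`.

    Creates the <option> element if the model relies on the default.
    Note: ElementTree drops XML comments and the XML declaration on re-serialization.
    """
    if timestep <= 0:
        raise ValueError("timestep must be positive")
    root = ET.fromstring(mjcf_xml)
    if root.tag != "mujoco":
        raise ValueError("expected a <mujoco> root element")
    option = root.find("option")
    if option is None:
        # No explicit <option>: add one so the timestep override takes effect.
        option = ET.Element("option")
        root.insert(0, option)
    option.set("timestep", f"{timestep:g}")
    return ET.tostring(root, encoding="unicode")


# Hypothetical scene with no explicit timestep (MuJoCo's default is 0.002).
SCENE = """<mujoco model="example">
  <worldbody/>
</mujoco>"""

patched = set_mujoco_timestep(SCENE, 0.004)  # illustrative CI-only value
print(patched)
```

Because the patch is applied only when the workflow opts in, developer machines would keep the unmodified scene and still see the slower-than-realtime warning.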

@JWhitleyWork JWhitleyWork deleted the fix/ci-flake-gripper-and-getimage-timeouts branch May 7, 2026 21:45
3 participants