Fix: Bump gripper and GetImage timeouts to stabilize CI #614
JWhitleyWork wants to merge 1 commit
The MuJoCo 3.2.7 -> 3.6.0 upgrade (moveit_pro@6eedef88a5) made the constraint solver heavier per step, so on CI hardware the simulator runs slower than realtime and emits "Mujoco model timestep not running in realtime" warnings. This pushes two previously-borderline timeouts in lab_sim integration tests over the edge:
- MoveGripperAction's 15s timeout fires while the Robotiq 2f85 is still closing on the push-button laptop, failing push_button_with_a_trajectory ~9/10 runs.
- GetImage's 5s wait on /wrist_camera/color times out before the first EGL-rendered frame is published, failing ml_segment_point_cloud ~4/10 runs.

Bump MoveGripperAction timeout 15s -> 30s in close_gripper.xml and open_gripper.xml, and GetImage message_timeout_sec 5s -> 15s in the shared picknik_ur_base_config Segment Image subtree.

These are mitigations; the durable fix (per moveit_pro#18534) is rethinking the test harness for tighter physics, but this should flip main green while that work happens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR mitigates recent CI flakiness in lab_sim integration tests by increasing timeouts for gripper actuation and image acquisition to better tolerate slower-than-realtime MuJoCo simulation on CI hardware.
Changes:
- Increased `MoveGripperAction` timeout from 15s to 30s in `lab_sim` open/close gripper objectives.
- Increased `GetImage` `message_timeout_sec` from 5s to 15s in the shared UR base config segmentation subtree.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/moveit_pro_ur_configs/picknik_ur_base_config/objectives/segment_image_from_no_negative_text_prompt_subtree.xml | Increases GetImage message timeout to reduce image topic timeout flakes. |
| src/lab_sim/objectives/open_gripper.xml | Increases gripper open timeout to reduce actuation timeout flakes under slow sim. |
| src/lab_sim/objectives/close_gripper.xml | Increases gripper close timeout to reduce actuation timeout flakes under slow sim. |
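For orientation, the edits summarized in the table amount to attribute changes roughly like the fragment below. This is a sketch, not the actual diff. The `timeout` and `message_timeout_sec` names and the new values come from this PR, but the `<Action ID="..."/>` element form and the float formatting of the values are assumptions, and all other ports on each behavior are omitted.

```xml
<!-- close_gripper.xml and open_gripper.xml: was 15 s (other ports omitted; element form assumed) -->
<Action ID="MoveGripperAction" timeout="30.0"/>

<!-- segment_image_from_no_negative_text_prompt_subtree.xml: was 5 s (other ports omitted; element form assumed) -->
<Action ID="GetImage" message_timeout_sec="15.0"/>
```

Because the segmentation subtree lives in the shared `picknik_ur_base_config`, the longer `GetImage` wait would presumably apply to every objective that includes that subtree, not only to the `lab_sim` tests.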
nbbrooks left a comment
I agree something probably changed with the MuJoCo upgrade, but I doubt increasing the timeout will help that much; at best it's a band-aid.
I would follow the advice in https://docs.picknik.ai/troubleshooting/Planning%20and%20Control%20Troubleshooting/#movegripperaction-fails-to-reach-target-position to address the core of the issue.
Having to bump GetImage past 5 sec also seems pretty wild to me. If the sim is lagging so badly that we can't reliably get an image after 5 sec, I don't think we can trust any of the results (who knows how CPU-starved the physics is), so personally I would just make this CI job optional and mostly treat it as noise until this is addressed.
Superseded by an Option-A approach: instead of bumping per-objective timeouts, we'll add an opt-in input to the |
Summary
- Bump `MoveGripperAction` `timeout` from 15s → 30s in `close_gripper.xml` and `open_gripper.xml`.
- Bump `GetImage` `message_timeout_sec` from 5s → 15s in the shared `picknik_ur_base_config` Segment Image from No Negative Text Prompt Subtree.

Why
CI on `main` started flaking ~24h after the MuJoCo `3.2.7 → 3.6.0` upgrade in `moveit_pro` (`6eedef88a5`, Apr 14). The new constraint solver is heavier per step, so on CI hardware the sim runs slower than realtime and emits `Mujoco model timestep not running in realtime. Increase the model timestep.` warnings. Two previously-borderline timeouts in `lab_sim` integration tests then fall over:
- `push_button_with_a_trajectory.xml`: `MoveGripperAction Error: ... gripper failed to reach the target position within 15.0s. Current values: position=0.7929`
- `ml_segment_point_cloud.xml`: `GetImage Error: Failed to get next message on topic '/wrist_camera/color': Timed out after 5 seconds`

CI history correlates tightly with the upgrade.

This is a mitigation, not a root-cause fix: `moveit_pro#18534` and PR #610's description ("the real test harness infrastructure needs some rethinking ... for physics this tight") call out that the durable answer is rethinking the test harness or coarsening the MuJoCo timestep (currently the default 0.002s / 500Hz; no explicit `timestep` is set in `lab_sim/description/scene.xml`). That work belongs in a follow-up.

Test plan
- Run the `CI` workflow on this branch and confirm `integration-test-in-studio-container (humble)` passes.
- Verify `Push Button With a Trajectory`, `ML Segment Point Cloud`, and `Close/Open Gripper` Objectives still execute correctly on a dev machine.

🤖 Generated with Claude Code
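As a rough illustration of the timestep coarsening mentioned above as a possible follow-up, MJCF sets the integration step through the `timestep` attribute of the `<option>` element. The value below is hypothetical and is not proposed by this PR; it simply doubles the 0.002s default. How the element would fit into the existing `scene.xml` is also an assumption.

```xml
<mujoco>
  <!-- Hypothetical value: 0.004 s (250 Hz), twice the MuJoCo default of 0.002 s (500 Hz). -->
  <option timestep="0.004"/>
</mujoco>
```

A larger timestep means fewer physics steps per simulated second, which may help the sim keep up with realtime on CI hardware, but it can also reduce contact accuracy and stability, so any such change would need to be checked against the gripper and push-button tests.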