Fix: Coarsen MuJoCo timestep on CI to stop slower-than-realtime flakes #615
Merged
Pull request overview
Updates the repository CI workflow to reduce MuJoCo integration-test flakiness by overriding the simulator timestep only in CI, giving the heavier MuJoCo 3.6.0 solver more wall-clock budget per step while keeping local development behavior unchanged.
Changes:
- Pin the reusable `workspace_integration_test.yaml` workflow to a newer `moveit_pro_ci` commit that supports the new `mujoco_ci_timestep` input.
- Pass `mujoco_ci_timestep: "0.004"` to run the CI lab simulation at 250 Hz instead of the default 500 Hz.
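The relationship between the timestep values and the rates quoted in this PR can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function names are hypothetical, and the 0.002 s baseline is inferred from the 500 Hz default stated above rather than read from any workflow file.

```python
def step_rate_hz(timestep_s: float) -> float:
    """Simulation steps needed per simulated second for a given timestep."""
    return 1.0 / timestep_s


def budget_ratio(timestep_s: float, baseline_s: float = 0.002) -> float:
    """Wall-clock time available per step, relative to the baseline timestep,
    if the simulation is to keep pace with realtime."""
    return timestep_s / baseline_s


# 0.002 s is the assumed baseline (500 Hz); 0.003 and 0.004 are the values
# mentioned in this PR's commit message and review overview.
for dt in (0.002, 0.003, 0.004):
    print(f"dt={dt:.3f}s -> {step_rate_hz(dt):.0f} Hz, "
          f"{budget_ratio(dt):.1f}x budget per step")
```

A coarser timestep means fewer solver steps per simulated second, so each step may take proportionally longer before the simulation falls behind realtime.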
shaur-k
previously approved these changes
May 8, 2026
The MuJoCo 3.2.7 -> 3.6.0 upgrade in moveit_pro (6eedef88a5) made the constraint solver heavier per step, so the lab_sim scene runs slower than realtime on CI runners. That surfaces as `Mujoco model timestep not running in realtime` warnings and timing-related test failures (MoveGripperAction 15s timeouts, GetImage 5s wrist-camera timeouts, Push Button trajectory F/T threshold trips). CI on main has been ~92% red since Apr 17 as a result.

Three changes:

1. Pin the integration-test reusable workflow to v0.0.9 and pass `mujoco_ci_timestep: "0.003"` -- 333 Hz, ~1.5x the wall-clock budget per step versus the MuJoCo 500 Hz default. Only takes effect on CI; local dev runs the scene unmodified. Helps the other timing-sensitive objectives stay inside their wall-clock budgets even when the runner is contended.

2. Activate the per-test MuJoCo reset fixture by re-exporting reset_simulation_before_test in objectives_integration_test.py. The integration test runs ~117 parametrized objectives against a single shared backend and MuJoCo simulation; pick/place and similar objectives leave residual world state that caused order-dependent failures after the MuJoCo 3.6.0 upgrade.

3. Skip `Push Button With a Trajectory` in CI. This objective uses the Joint Trajectory Admittance Controller (JTAC) with force/torque compliance, which is uniquely sensitive to the simulator running slower than realtime. Across multiple mitigation experiments (timestep coarsening from 0.005 -> 0.003, path tolerance loosening from 0.20 -> 0.30, dedicated picknik-16-amd64-sim runner with taskset CPU pinning and m7i-only family restriction), the test continues to flake at ~50% with the same root cause -- the MuJoCo realtime warning fires *during* the trajectory, indicating structural instability under CI load, not a tolerance/timestep tuning gap. Land it skipped to unblock CI; the JTAC/MuJoCo interaction needs a deeper fix tracked separately.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
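The order dependence described in change 2 can be illustrated with a small sketch in plain Python. It does not use pytest or MuJoCo: the dictionary "world", the two toy objectives, and the `reset_before_each` flag are all hypothetical stand-ins, with the flag playing the role that the autouse `reset_simulation_before_test` fixture is described as playing in the real suite.

```python
from typing import Callable, Dict, List

World = Dict[str, str]


def fresh_world() -> World:
    """Hypothetical initial scene: a single cube resting on the table."""
    return {"cube": "table"}


def pick(world: World) -> bool:
    """Toy objective: succeeds if the cube is on the table, then leaves it
    in the gripper (residual state for whatever runs next)."""
    ok = world["cube"] == "table"
    world["cube"] = "gripper"
    return ok


def inspect_table(world: World) -> bool:
    """Toy objective: succeeds only if the cube is still on the table."""
    return world["cube"] == "table"


def run_suite(objectives: List[Callable[[World], bool]],
              reset_before_each: bool) -> List[bool]:
    """Run objectives against one shared world, optionally resetting first."""
    world = fresh_world()
    results = []
    for objective in objectives:
        if reset_before_each:
            world.clear()
            world.update(fresh_world())
        results.append(objective(world))
    return results


print(run_suite([pick, inspect_table], reset_before_each=False))  # [True, False]
print(run_suite([inspect_table, pick], reset_before_each=False))  # [True, True]
print(run_suite([pick, inspect_table], reset_before_each=True))   # [True, True]
```

Without the reset, the outcome depends on ordering; with it, each objective sees the same starting state regardless of what ran before.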
marioprats
approved these changes
May 13, 2026
JWhitleyWork
added a commit
that referenced
this pull request
May 13, 2026
…imestep-fix Fix: Coarsen MuJoCo timestep on CI to stop slower-than-realtime flakes (backport #615 to v9.3)
Summary
Three changes to stabilise the `lab_sim` integration test after the MuJoCo `3.2.7 → 3.6.0` upgrade in `moveit_pro` (6eedef88a5, Apr 14) made the constraint solver heavier per step:

1. Pin the reusable workflow to `moveit_pro_ci v0.0.9` and pass `mujoco_ci_timestep: "0.003"`. This runs the `lab_sim` scene at 333 Hz instead of MuJoCo's 500 Hz default, ~1.5× the wall-clock budget per step. CI-only; local dev runs the scene unmodified.
2. Activate the per-test MuJoCo reset fixture in `objectives_integration_test.py` (`reset_simulation_before_test` re-export). The integration test runs ~117 parametrised objectives against one shared backend, and the autouse fixture isolates residual world state between objectives so pick/place leftovers don't contaminate downstream tests.
3. Skip `Push Button With a Trajectory` in CI. Documented inline in `skip_objectives` with a pointer to this PR.

Why
CI on `main` has been ~30% green since Apr 17. Failure logs always show `Mujoco model timestep not running in realtime`, and the timing-sensitive failures fall out of that: `MoveGripperAction` 15s timeouts, `GetImage` 5s camera timeouts, `Push Button` F/T threshold trips, MPC pose-tracking misses. Several earlier mitigations on `main` (constraint arena memory, MPC retunes, tolerance loosening, publisher timeout fixes) didn't close the realtime gap.

`mujoco_ci_timestep: "0.003"` closes the gap for every objective except `Push Button With a Trajectory`. That one uses the Joint Trajectory Admittance Controller (JTAC) with force/torque compliance, which is uniquely sensitive to the simulator falling under realtime, and remains structurally flaky even after extensive mitigation work (see Alternatives below). Skipping that single objective is the cheapest path to a green `main`; the underlying JTAC ↔ MuJoCo 3.6.0 interaction needs a separate, deeper fix.

Alternatives considered
- Coarser timestep (larger than 0.003s): fails with `Path tolerance violated` (joint deviations up to 0.292 rad vs. 0.25 tolerance). 0.003s is the largest step that keeps JTAC stable.
- Loosening `path_position_tolerance` (0.25 → 0.30): combined with the 0.003s timestep, got us to ~67% pass rate. Better than `main`, but the residual flake is `Force/Torque threshold exceeded`, not a path-tolerance issue, so further loosening doesn't help. Reverted; tolerance is back at 0.25 in this PR.
- Dedicated `picknik-16-amd64-sim` runner tier with `taskset` CPU pinning and `m7i`-only family restriction: added in PickNikRobotics/moveit_pro_ci_runner_config #2. Restricts the runner to vCPUs `1-7,9-15` (reserving physical core 0 for kubelet/system DaemonSets) and forces `m7i.4xlarge` (Sapphire Rapids) by dropping `m6i`/`c*`. Validated with 5 CI runs: 2/5 pass, no improvement over the general tier (also 2/5 across the same window). Diagnostic: the MuJoCo realtime warning still fires during the failing trajectory, indicating the bottleneck is inside MuJoCo's per-step work for this specific scene + controller combination, not in OS-level scheduling jitter that `taskset` could fix. EKS Auto Mode blocks the strongest CPU-isolation knob (`cpuManagerPolicy: static`), so this tier is as far as we can take Auto Mode-compatible CPU pinning.
- Skip the objective in `skip_objectives`. Unblocks `main` immediately. The flake is reliably one objective, and the tier infra plus the timestep coarsening from this PR keep the other 117 objectives healthy.

Tradeoffs
- We lose CI coverage of `Push Button With a Trajectory`. That path is also exercised (less strictly) by the loop variant in `cancel_objectives`, and JTAC is used as the default controller across the `lab_sim` objectives generally, so we still have some integration coverage of JTAC itself, just not the F/T-threshold edge case that this specific test hits.
- The `picknik-16-amd64-sim` scale-set and NodePool from the runner-config PR were torn down in PickNikRobotics/moveit_pro_ci_runner_config@395b1cf after the 5-run validation showed no improvement. The original commit (1340c1e) can be cherry-picked back if a future workload genuinely benefits from the m7i + taskset pinning combination.

Follow-ups
🤖 Generated with Claude Code
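For reference, the `1-7,9-15` CPU list from the runner-tier alternative can be expanded to see which vCPUs it leaves free. This is a hedged sketch, not code from the runner-config repository: `parse_cpu_list` is a hypothetical helper, and the total of 16 vCPUs is taken from the `m7i.4xlarge` instance size named in the description.

```python
from typing import Set


def parse_cpu_list(spec: str) -> Set[int]:
    """Expand a taskset-style CPU list such as '1-7,9-15' into a set of ids."""
    cpus: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return cpus


TOTAL_VCPUS = 16  # assumed from the m7i.4xlarge instance size

allowed = parse_cpu_list("1-7,9-15")
reserved = sorted(set(range(TOTAL_VCPUS)) - allowed)
print(len(allowed), reserved)  # 14 [0, 8]
```

The two excluded ids, 0 and 8, are consistent with the description's "reserving physical core 0" if vCPU 8 is the sibling hyperthread of vCPU 0; that sibling mapping is an inference here and would need checking on the actual node. On Linux, a set like `allowed` is the shape of argument that `os.sched_setaffinity` accepts.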