Fix: Coarsen MuJoCo timestep on CI to stop slower-than-realtime flakes #615

Merged
JWhitleyWork merged 1 commit into main from fix/ci-mujoco-timestep on May 13, 2026

@JWhitleyWork JWhitleyWork (Member) commented May 7, 2026

Summary

Three changes to stabilise the lab_sim integration test after the MuJoCo 3.2.7 → 3.6.0 upgrade in moveit_pro (6eedef88a5, Apr 14) made the constraint solver heavier per step:

  1. Bump the integration-test reusable workflow pin to moveit_pro_ci v0.0.9 and pass mujoco_ci_timestep: "0.003" — runs the lab_sim scene at 333 Hz instead of MuJoCo's 500 Hz default, ~1.5× the wall-clock budget per step. CI-only; local dev runs the scene unmodified.
  2. Activate the per-test MuJoCo reset fixture in objectives_integration_test.py (reset_simulation_before_test re-export). The integration test runs ~117 parametrised objectives against one shared backend, and the autouse fixture isolates residual world state between objectives so pick/place leftovers don't contaminate downstream tests.
  3. Skip Push Button With a Trajectory in CI. Documented inline in skip_objectives with a pointer to this PR.
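To illustrate changes 2 and 3, here is a minimal sketch of why a per-test reset matters when many objectives share one simulation backend, and how a skip set removes a single objective from the run. Everything in it is a stand-in: `FakeSimulation`, `run_objectives`, and the objective names other than Push Button With a Trajectory are hypothetical, and the real fixture (`reset_simulation_before_test`) and `skip_objectives` list live in the test harness, whose code is not shown in this PR.

```python
class FakeSimulation:
    """Hypothetical stand-in for the shared MuJoCo backend."""

    def __init__(self):
        self.world_objects = []

    def reset(self):
        # Drop whatever earlier objectives left behind in the world.
        self.world_objects.clear()


# Mirrors the idea of skip_objectives: names listed here are not run.
SKIP_OBJECTIVES = {"Push Button With a Trajectory"}


def run_objectives(sim, objectives, reset_before_each):
    """Run objectives in order against one shared sim.

    Returns, per objective, how many leftover objects were already in the
    world when that objective started.
    """
    leftovers_at_start = {}
    for name, spawned_objects in objectives:
        if name in SKIP_OBJECTIVES:
            continue
        if reset_before_each:
            sim.reset()
        leftovers_at_start[name] = len(sim.world_objects)
        # The objective leaves its objects in the world when it finishes.
        sim.world_objects.extend(spawned_objects)
    return leftovers_at_start


# Hypothetical objective list; only the skipped name comes from this PR.
objectives = [
    ("Pick Object", ["cube"]),
    ("Push Button With a Trajectory", []),
    ("Place Object", ["bottle"]),
]

without_reset = run_objectives(FakeSimulation(), objectives, reset_before_each=False)
with_reset = run_objectives(FakeSimulation(), objectives, reset_before_each=True)
```

Without the reset, the second objective starts with the first one's leftover object in the world. With it, every objective starts from an empty world regardless of order.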

Why

CI on main has been ~30% green since Apr 17. Failure logs always show Mujoco model timestep not running in realtime, and the timing-sensitive failures fall out of that — MoveGripperAction 15s timeouts, GetImage 5s camera timeouts, Push Button F/T threshold trips, MPC pose-tracking misses. Several earlier mitigations on main (constraint arena memory, MPC retunes, tolerance loosening, publisher timeout fixes) didn't close the realtime gap.

mujoco_ci_timestep: "0.003" closes the gap for every objective except Push Button With a Trajectory. That one uses the Joint Trajectory Admittance Controller (JTAC) with force/torque compliance — uniquely sensitive to the simulator falling under realtime — and remains structurally flaky even after extensive mitigation work (see Alternatives below). Skipping that single objective is the cheapest path to a green main; the underlying JTAC ↔ MuJoCo 3.6.0 interaction needs a separate, deeper fix.
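The arithmetic behind the timestep change can be sketched as follows. The simulator keeps up with realtime only if the wall-clock cost of one step fits inside the simulated timestep, so a coarser timestep buys more budget per step. The 2.5 ms step cost below is a made-up number chosen for illustration, not a measurement from CI.

```python
MUJOCO_DEFAULT_TIMESTEP_S = 0.002  # 500 Hz, the default the PR compares against
CI_TIMESTEP_S = 0.003              # value passed as mujoco_ci_timestep


def step_rate_hz(timestep_s):
    """Simulation steps per simulated second."""
    return 1.0 / timestep_s


def budget_ratio(timestep_s, baseline_s=MUJOCO_DEFAULT_TIMESTEP_S):
    """Wall-clock budget per step relative to the baseline timestep."""
    return timestep_s / baseline_s


def keeps_realtime(step_cost_s, timestep_s):
    """True if one step computes no slower than the time it simulates."""
    return step_cost_s <= timestep_s


# Hypothetical per-step cost on a contended runner (assumption, not measured).
hypothetical_step_cost_s = 0.0025

ci_rate_hz = step_rate_hz(CI_TIMESTEP_S)      # about 333 Hz
ci_budget = budget_ratio(CI_TIMESTEP_S)       # about 1.5x
default_ok = keeps_realtime(hypothetical_step_cost_s, MUJOCO_DEFAULT_TIMESTEP_S)
ci_ok = keeps_realtime(hypothetical_step_cost_s, CI_TIMESTEP_S)
```

Under that assumed step cost, the 500 Hz default falls behind realtime while the 333 Hz CI setting does not.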

Alternatives considered

  • Coarser timestep (0.005s / 200 Hz) — tried first. Destabilised JTAC's compliance loop: the controller couldn't track the planned Cartesian path and tripped Path tolerance violated (joint deviations up to 0.292 rad vs. 0.25 tolerance). 0.003s is the largest step that keeps JTAC stable.
  • Loosen path_position_tolerance (0.25 → 0.30) — combined with 0.003s timestep, got us to ~67% pass rate. Better than main, but the residual flake is Force/Torque threshold exceeded, not a path-tolerance issue, so further loosening doesn't help. Reverted; tolerance is back at 0.25 in this PR.
  • Dedicated picknik-16-amd64-sim runner tier with taskset CPU pinning and m7i-only family restriction — added in PickNikRobotics/moveit_pro_ci_runner_config #2. Restricts the runner to vCPUs 1-7,9-15 (reserving physical core 0 for kubelet/system DaemonSets) and forces m7i.4xlarge (Sapphire Rapids) by dropping m6i/c*. Validated with 5 CI runs: 2/5 pass, no improvement over the general tier (also 2/5 across the same window). Diagnostic: the MuJoCo realtime warning still fires during the failing trajectory, indicating the bottleneck is inside MuJoCo's per-step work for this specific scene + controller combination, not in OS-level scheduling jitter that taskset could fix. EKS Auto Mode blocks the strongest CPU-isolation knob (cpuManagerPolicy: static), so this tier is as far as we can take Auto Mode-compatible CPU pinning.
  • Skip the test (chosen): a single-line addition to skip_objectives. Unblocks main immediately. The flake is reliably confined to one objective, and the tier infra plus the timestep coarsening from this PR keep the other ~116 objectives healthy.
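For the runner-tier alternative, the vCPU list 1-7,9-15 can be derived with a short sketch. It assumes a 16-vCPU instance with 8 physical cores where the two hyperthreads of core i are numbered i and i + 8. That sibling layout is an assumption that happens to match the list quoted above, and it should be checked against the actual topology on the node. The helper names are hypothetical.

```python
def pinned_cpus(total_vcpus, reserved_core):
    """vCPU ids left for the workload after reserving one physical core.

    Assumes the hyperthread siblings of core i are vCPUs i and
    i + total_vcpus // 2 (an assumption about the instance topology).
    """
    physical_cores = total_vcpus // 2
    reserved = {reserved_core, reserved_core + physical_cores}
    return [cpu for cpu in range(total_vcpus) if cpu not in reserved]


def as_cpu_list(cpus):
    """Collapse sorted vCPU ids into the range syntax taskset accepts."""
    ranges = []
    start = prev = cpus[0]
    for cpu in cpus[1:]:
        if cpu != prev + 1:
            ranges.append((start, prev))
            start = cpu
        prev = cpu
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


cpu_list = as_cpu_list(pinned_cpus(total_vcpus=16, reserved_core=0))
print(cpu_list)  # prints "1-7,9-15"
```

A string in this form is what `taskset --cpu-list` takes as its CPU argument.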

Tradeoffs

  • We lose CI coverage of the JTAC + F/T compliance path through Push Button With a Trajectory. That path is also exercised (less strictly) by the loop variant in cancel_objectives, and JTAC is used as the default controller across the lab_sim objectives generally, so we still have some integration coverage of JTAC itself — just not the F/T-threshold edge case that this specific test hits.
  • The picknik-16-amd64-sim scale-set and NodePool from the runner-config PR were torn down in PickNikRobotics/moveit_pro_ci_runner_config@395b1cf after the 5-run validation showed no improvement. The original commit (1340c1e) can be cherry-picked back if a future workload genuinely benefits from the m7i + taskset pinning combination.

Follow-ups

  • Track the underlying JTAC/MuJoCo 3.6.0 interaction in a separate issue (the long-term fix Shaur called out in fix: push button tolerance too low #610 — test-harness rethink — is the right home for this).
  • If we later find we need the deterministic-runner approach (for a different workload, or with kubelet-level CPU pinning), leaving Auto Mode for a self-managed Karpenter is the next escalation point.

🤖 Generated with Claude Code

@JWhitleyWork JWhitleyWork added this to the 9.3.0 milestone May 7, 2026
@JWhitleyWork JWhitleyWork force-pushed the fix/ci-mujoco-timestep branch from ee5d05a to 7611d04 Compare May 8, 2026 19:58
@JWhitleyWork JWhitleyWork requested review from Copilot and shaur-k May 8, 2026 19:58
@JWhitleyWork JWhitleyWork self-assigned this May 8, 2026
@JWhitleyWork JWhitleyWork marked this pull request as ready for review May 8, 2026 19:58

Copilot AI left a comment

Pull request overview

Updates the repository CI workflow to reduce MuJoCo integration-test flakiness by overriding the simulator timestep only in CI, giving the heavier MuJoCo 3.6.0 solver more wall-clock budget per step while keeping local development behavior unchanged.

Changes:

  • Pin the reusable workspace_integration_test.yaml workflow to a newer moveit_pro_ci commit that supports the new mujoco_ci_timestep input.
  • Pass mujoco_ci_timestep: "0.004" to run the CI lab simulation at 250 Hz instead of the default 500 Hz.

@JWhitleyWork JWhitleyWork force-pushed the fix/ci-mujoco-timestep branch from 7611d04 to c70604d Compare May 8, 2026 20:07
@JWhitleyWork JWhitleyWork enabled auto-merge May 8, 2026 20:07
shaur-k previously approved these changes May 8, 2026
The MuJoCo 3.2.7 -> 3.6.0 upgrade in moveit_pro (6eedef88a5) made the
constraint solver heavier per step, so the lab_sim scene runs slower
than realtime on CI runners. That surfaces as `Mujoco model timestep
not running in realtime` warnings and timing-related test failures
(MoveGripperAction 15s timeouts, GetImage 5s wrist-camera timeouts,
Push Button trajectory F/T threshold trips). CI on main has been ~70%
red since Apr 17 as a result.

Three changes:

1. Pin the integration-test reusable workflow to v0.0.9 and pass
   `mujoco_ci_timestep: "0.003"` -- 333 Hz, ~1.5x the wall-clock
   budget per step versus the MuJoCo 500 Hz default. Only takes
   effect on CI; local dev runs the scene unmodified. Helps the
   other timing-sensitive objectives stay inside their wall-clock
   budgets even when the runner is contended.

2. Activate the per-test MuJoCo reset fixture by re-exporting
   reset_simulation_before_test in objectives_integration_test.py.
   The integration test runs ~117 parametrized objectives against
   a single shared backend and MuJoCo simulation; pick/place and
   similar objectives leave residual world state that caused
   order-dependent failures after the MuJoCo 3.6.0 upgrade.

3. Skip `Push Button With a Trajectory` in CI. This objective uses
   the Joint Trajectory Admittance Controller (JTAC) with force/torque
   compliance, which is uniquely sensitive to the simulator running
   slower than realtime. Across multiple mitigation experiments
   (timestep coarsening to 0.005 and then 0.003, path tolerance
   loosening from 0.25 -> 0.30, dedicated picknik-16-amd64-sim runner
   with taskset CPU pinning and m7i-only family restriction), the test
   continues to flake at ~50% with the same root cause -- the MuJoCo
   realtime warning fires *during* the trajectory, indicating
   structural instability under CI load, not a tolerance/timestep
   tuning gap. Land it skipped to unblock CI; the JTAC/MuJoCo
   interaction needs a deeper fix tracked separately.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@JWhitleyWork JWhitleyWork merged commit 4a84f71 into main May 13, 2026
4 checks passed
@JWhitleyWork JWhitleyWork deleted the fix/ci-mujoco-timestep branch May 13, 2026 15:50
JWhitleyWork added a commit that referenced this pull request May 13, 2026
…imestep-fix

Fix: Coarsen MuJoCo timestep on CI to stop slower-than-realtime flakes (backport #615 to v9.3)