Skip to content

feat: use individual module connections#2049

Open
paul-nechifor wants to merge 1 commit into
mainfrom
paul/feat/simulation-modules
Open

feat: use individual module connections#2049
paul-nechifor wants to merge 1 commit into
mainfrom
paul/feat/simulation-modules

Conversation

@paul-nechifor
Copy link
Copy Markdown
Contributor

@paul-nechifor paul-nechifor commented May 12, 2026

Problem

Closes DIM-XXX

Solution

How to Test

Contributor License Agreement

  • I have read and approved the CLA.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR refactors Go2 and G1 robot connections from a single monolithic class with internal dispatch into separate per-backend Module classes (Go2WebRtcConnection, Go2MujocoConnection, Go2ReplayConnection, G1WebRtcConnection, G1MujocoConnection) registered via a new @connection(robot, backend) decorator. A new Blueprint.with_backend(backend) method leverages this registry to swap connection atoms at blueprint-construction time, replacing the old make_connection() runtime dispatch.

  • Registry + decorator: dimos/robot/connection_registry.py introduces _REGISTRY and @connection(robot, backend) which stamps _connection_tag onto a module class and checks for duplicates.
  • Blueprint.with_backend: walks atoms, finds tagged ones whose backend differs, validates stream-surface parity and kwarg compatibility, then substitutes the target class and rewrites remappings/disabled-module entries.
  • Shared bases extracted: MujocoConnectionBase (Module) and UnitreeWebRtcSession are factored out so robot-specific subclasses only add camera offsets, start/stop hooks, and extra RPC stubs.

Confidence Score: 3/5

The backend-swap machinery is sound but several unresolved defects in connection modules and CLI routing make this risky to merge as-is.

The --replay CLI path bypasses with_backend entirely, the G1 backends have incompatible stream surfaces making with_backend("mujoco") always fail, camera-info threads loop forever blocking teardown, and WebRTC handshakes block indefinitely on unreachable robots.

dimos/robot/cli/dimos.py and dimos/core/global_config.py (replay routing gap), dimos/robot/unitree/go2/connection_webrtc.py and connection_replay.py (non-terminating camera-info threads), dimos/robot/unitree/webrtc_session.py (unbounded handshake block), dimos/robot/unitree/g1/__init__.py (incompatible backend stream surfaces).

Important Files Changed

Filename Overview
dimos/core/coordination/blueprints.py Adds with_backend() method with stream-parity and kwarg-compat checks; _check_kwargs_compat silently skips validation when get_type_hints raises.
dimos/robot/connection_registry.py New @connection(robot, backend) decorator and registry; duplicate-registration guard and get_connection/backends_for helpers look correct.
dimos/core/global_config.py Adds simulator field and effective_simulator property; replay=True still returns None from effective_simulator, leaving --replay CLI path unhandled by the new with_backend routing.
dimos/robot/cli/dimos.py Uses effective_simulator to call with_backend; --replay flag is not covered by effective_simulator, so replay blueprints are never swapped and the WebRTC handshake is attempted instead.
dimos/robot/unitree/go2/connection_webrtc.py New Go2 WebRTC connection module; _publish_camera_info loops with while True and no stop-event check, causing stop() to always time out waiting on that thread.
dimos/robot/unitree/go2/connection_replay.py New Go2 replay connection module; _publish_camera_info has same infinite loop issue; otherwise stream wiring and odom-to-tf plumbing are correct.
dimos/robot/unitree/mujoco_connection.py Refactored from a standalone class to MujocoConnectionBase(Module); _publish_camera_info_loop correctly checks _stop_event, subprocess teardown and stream wiring look intact.
dimos/robot/unitree/go2/fleet_connection.py Refactored to use _FleetMemberClient wrapping UnitreeWebRtcSession; _FleetMemberClient.start() calls session.start() which blocks indefinitely on connection_ready.wait() with no timeout.
dimos/robot/unitree/webrtc_session.py New shared WebRTC session abstraction; connection_ready.wait() in start() has no timeout, blocking indefinitely on unreachable robots.
dimos/robot/unitree/g1/init.py Eagerly imports both G1 backends to populate the registry; G1WebRtcConnection (1 stream) and G1MujocoConnection (5 streams) have incompatible surfaces, so with_backend("mujoco") on a G1 webrtc blueprint raises Stream surface drift.

Reviews (6): Last reviewed commit: "feat: use individual module connections" | Re-trigger Greptile

Comment on lines +213 to +215
for atom in self.blueprints:
tag = getattr(atom.module, "_connection_tag", None)
if tag is None or tag.backend == backend:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 The with_backend method uses getattr(atom.module, "_connection_tag", None) which walks the MRO and picks up inherited tags. Go2FleetConnection inherits _connection_tag = ConnectionTag(robot="go2", backend="webrtc") from Go2WebRtcConnection without being directly decorated. This means a blueprint containing Go2FleetConnection would have with_backend("mujoco") silently swap it to a bare Go2MujocoConnection, discarding all fleet-coordination logic.

Suggested change
for atom in self.blueprints:
tag = getattr(atom.module, "_connection_tag", None)
if tag is None or tag.backend == backend:
for atom in self.blueprints:
tag = vars(atom.module).get("_connection_tag", None)
if tag is None or tag.backend == backend:

Comment on lines +461 to +464
def _publish_camera_info(self) -> None:
while True:
self.camera_info.publish(self.camera_info_static)
time.sleep(1.0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Camera-info thread never terminates, causing stop() to always time-out

_publish_camera_info loops forever with no exit condition. The stop() method joins this thread with timeout=DEFAULT_THREAD_JOIN_TIMEOUT, so every stop() call blocks for the full timeout before continuing teardown. The same pattern is present in Go2MujocoConnection and Go2ReplayConnection. Compare with G1MujocoConnection._publish_camera_info_loop, which checks _stop_event.is_set() and exits cleanly. Adding a similar stop-event check here would make teardown prompt.


self.thread = Thread(target=start_background_loop, daemon=True)
self.thread.start()
self.connection_ready.wait()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Unbounded block on connection handshake

self.connection_ready.wait() is called with no timeout. If the fleet-member robot at the given IP is unreachable or the WebRTC handshake never completes, constructing Go2FleetConnection (which builds one _FleetMemberClient per extra robot) will block the calling thread indefinitely with no error, no warning, and no recovery path.

The same pattern exists in Go2WebRtcConnection.start() (line 167 of connection_webrtc.py) and G1WebRtcConnection.start() (line 98 of g1/connection.py), both of which are new code introduced in this PR. Adding a timeout parameter (e.g. self.connection_ready.wait(timeout=30)) and raising a TimeoutError when it returns False would make all three sites fail fast on bad configuration.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 12, 2026

⚠️ JUnit XML file not found

The CLI was unable to find any JUnit XML files to upload.
For more help, visit our troubleshooting guide.

@paul-nechifor paul-nechifor force-pushed the paul/feat/simulation-modules branch from bab5f66 to 7647f65 Compare May 12, 2026 04:29
Comment on lines +93 to +100
@property
def effective_simulator(self) -> str | None:
"""Resolved simulator backend from --simulator or --simulation."""
if self.simulator:
return self.simulator
if self.simulation:
return "mujoco"
return None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 The effective_simulator property should also cover the replay flag so the CLI's with_backend() call fires for replay mode the same way it does for simulation. Without this, --replay silently stays on Go2WebRtcConnection and blocks on a WebRTC handshake.

Suggested change
@property
def effective_simulator(self) -> str | None:
"""Resolved simulator backend from --simulator or --simulation."""
if self.simulator:
return self.simulator
if self.simulation:
return "mujoco"
return None
@property
def effective_simulator(self) -> str | None:
"""Resolved simulator backend from --simulator, --simulation, or --replay."""
if self.replay:
return "replay"
if self.simulator:
return self.simulator
if self.simulation:
return "mujoco"
return None

@paul-nechifor paul-nechifor force-pushed the paul/feat/simulation-modules branch 2 times, most recently from 113994a to 72080c3 Compare May 12, 2026 04:58
@paul-nechifor paul-nechifor force-pushed the paul/feat/simulation-modules branch from 72080c3 to f82d270 Compare May 12, 2026 05:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant