diff --git a/README.md b/README.md
index c319e41a..ae95d0f2 100644
--- a/README.md
+++ b/README.md
@@ -174,6 +174,10 @@ For a full list of extensions and detailed documentation, visit **[robotcontrols

For full documentation, including advanced installation, modular usage, and API references, please visit: 👉 **[robotcontrolstack.org](https://robotcontrolstack.org)**

Useful quick-reference pages:

- **[RCS Conventions](https://robotcontrolstack.org/user_guide/conventions)** for quaternion order, frames, Euler angles, and gripper semantics
- **[Sim Scene Configuration](https://robotcontrolstack.org/user_guide/scene_configuration)** for `SimEnvCreatorConfig`, scene frames, and example setup patterns

## 🤝 Contribution

We welcome contributions from the robotics and ML community! For contribution guidelines, please check out **[robotcontrolstack.org/contributing](https://robotcontrolstack.org/contributing)**.

diff --git a/docs/index.md b/docs/index.md
index 3c655216..64950871 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -18,6 +18,11 @@

## Documentation

If you are looking for the most common frame, pose, and scene-setup details, start here:

- [RCS Conventions](user_guide/conventions.md)
- [Sim Scene Configuration](user_guide/scene_configuration.md)

```{toctree}
:maxdepth: 2
:caption: User Guide
```

diff --git a/docs/user_guide/conventions.md b/docs/user_guide/conventions.md
new file mode 100644
index 00000000..86fcf4e0
--- /dev/null
+++ b/docs/user_guide/conventions.md
@@ -0,0 +1,221 @@

# RCS Conventions

This page is the quick reference for coordinate frames, pose encodings, gripper semantics, and units used across RCS.

```text
Simple 3D sketch of the right-handed base frame (robot base is at the origin)

      +z (up)
       ^
       |
       o---> +y (left)
      /
     /
    +x (forward)
```

## At a glance

- **Quaternion order in RCS is `xyzw`** (`[qx, qy, qz, qw]`), matching Eigen's coefficient order.
- **MuJoCo free-joint `qpos` uses `wxyz`**, so direct MuJoCo state access needs an explicit reorder.
- **Robot/base frames are right-handed** with `x` forward, `y` left, and `z` up.
- **Gripper commands are normalized**: `0` means closed and `1` means open.
- **Euler angles are `roll`, `pitch`, `yaw`** around the `x`, `y`, and `z` axes.
- **Translations are in meters and angles are in radians** unless explicitly documented otherwise.

If you are working with multi-robot scene composition, also see the [Sim scene configuration guide](scene_configuration.md).

## Frames

### World frame

In simulation, scenes and free objects live in the global **world frame**.

Use world coordinates when you want to place something in the room itself, for example a cube on a table or a fixed camera in the lab.

The core robot API exposes explicit conversions between world and robot coordinates:

- `Robot.get_base_pose_in_world_coordinates()`
- `Robot.to_pose_in_world_coordinates(...)`
- `Robot.to_pose_in_robot_coordinates(...)`

### Robot base frame

Kinematics and low-level robot poses are expressed in the robot's **base frame**.

This is the frame assumed by the kinematics backends: for example, `forward(...)` in `src/rcs/Kinematics.cpp` documents that its input pose is assumed to be in the robot's coordinate frame.

When you interpret or command robot poses, the expected base-frame axis orientation is:

```text
x = forward
y = left
z = up
```

This is also the convention documented for Franka Quest teleoperation in `examples/teleop/README.md`.

### End-effector frame: `attachment_site`

Each robot config defines an `attachment_site`. This is the end-effector frame used by the kinematics stack.
Common examples in the repository are:

- `attachment_site_0` for FR3 / Panda
- `attachment_site` for UR5e / XArm7
- `gripper` for SO101

If you are unsure which frame a robot uses, check its config or the relevant example scene in `python/rcs/envs/configs.py`.

### Tool frame: `tcp_offset`

`tcp_offset` is applied on top of the attachment site to define the actual tool center point (TCP) used by motion commands and IK.

In practice:

- `attachment_site` = frame coming from the robot model
- `tcp_offset` = extra transform from that frame to the tool you actually want to control

A typical example is a wrist-mounted gripper where the tool center point is slightly in front of the model's attachment site.

## Pose representations

RCS uses several pose encodings. The important ones are:

### `Pose`

`rcs.common.Pose` is the main transform type. It supports construction from:

- translation only
- quaternion + translation
- `RPY` + translation
- rotation matrix + translation

### Quaternion order

Within RCS, quaternions are stored in **xyzw** order.

This is visible in two places:

- `src/rcs/Pose.cpp`: `rotation_q()` returns `this->m_rotation.coeffs()`
- `python/tests/test_common.py` checks that the identity quaternion is `[0, 0, 0, 1]`

So the convention is:

```text
[qx, qy, qz, qw]
```

This matches Eigen's quaternion coefficient layout.

### `tquat`

`tquat` means translation plus quaternion and is used by the environment API.

The value layout is:

```text
[x, y, z, qx, qy, qz, qw]
```

This comes directly from `python/rcs/envs/base.py`, where `tquat` is built from `pose.translation()` followed by `pose.rotation_q()`.

### `xyzrpy`

`xyzrpy` is the translation plus roll-pitch-yaw representation used by the environment API.
The value layout is:

```text
[x, y, z, roll, pitch, yaw]
```

The `RPY` type in `python/rcs/_core/common.pyi` exposes the fields in exactly that order:

- `roll`
- `pitch`
- `yaw`

These angles are rotations around the `x`, `y`, and `z` axes, respectively.

### Rotation vector / `rotvec`

Some hardware integrations, notably UR, also use a 6D rotation-vector pose:

```text
[x, y, z, rx, ry, rz]
```

You can see this in `extensions/rcs_ur5e/src/rcs_ur5e/hw.py`, where `common.RotVec(...).as_quaternion_vector()` is converted into an RCS `Pose`, and `Pose.rotvec()` is sent back to the robot.

## Gripper convention

Gripper actions and widths are normalized to the range `[0, 1]`.

The convention used across RCS is:

```text
0 = closed
1 = open
```

This is documented directly in multiple places:

- `python/rcs/envs/base.py`: `# 0 for closed, 1 for open (>=0.5 for open)`
- `src/sim/SimGripper.h`: `// normalized width of the gripper, 0 is closed, 1 is open`
- `extensions/rcs_fr3/src/hw/FrankaHand.h`: `// normalized width of the gripper, 0 is closed, 1 is open`

For binary grippers, the environment wrapper rounds and clips the command, and values `>= 0.5` are treated as open.
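That rounding-and-clipping rule can be sketched in a few lines of plain Python. This is an illustration of the convention only, not the actual wrapper code, and the function name is made up:

```python
def binarize_gripper_command(cmd: float) -> float:
    """Illustrative sketch: clip a normalized gripper command to [0, 1],
    then treat values >= 0.5 as open (1.0) and everything else as closed (0.0)."""
    cmd = min(max(cmd, 0.0), 1.0)      # clip to the normalized range first
    return 1.0 if cmd >= 0.5 else 0.0  # 0 = closed, 1 = open
```

Clipping before thresholding means even out-of-range commands such as `2.0` or `-0.3` resolve to a valid binary state.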
## Units

Unless a specific API says otherwise:

- translations, distances, and Cartesian offsets are in **meters**
- joint angles and Euler angles are in **radians**
- camera depth is stored as **metric depth scaled by 1000** in `uint16`, so values are effectively in **millimeters**

In practice, that means:

- moving `0.01` in `tquat[0]` means moving **1 cm** in `x`
- `np.pi / 2` means **90 degrees**
- a depth value of `723` means about **0.723 m**

The relevant camera path is:

- `python/rcs/camera/sim.py` converts the MuJoCo depth buffer to **meters** when `physical_units=True`
- `python/rcs/camera/interface.py` defines `BaseCameraSet.DEPTH_SCALE = 1000`
- the resulting metric depth is multiplied by that scale factor and stored as `uint16`

The same scaled-metric convention is also used by camera integrations such as RealSense and ZED.

## MuJoCo caveat: free-joint quaternions use `wxyz`

A common source of confusion is that **RCS uses `xyzw`**, but MuJoCo free-joint `qpos` stores the quaternion as **`wxyz`**.

RCS exposes both forms explicitly:

- `Pose.rotation_q()` returns `xyzw`
- `Pose.rotation_q_wxyz()` returns `wxyz`

You can see the `wxyz` form used when writing MuJoCo-facing state in files such as `python/rcs/sim/sim.py` and `python/rcs/envs/tasks.py`.

So:

- **RCS `Pose` / env API**: `xyzw`
- **MuJoCo free-joint state**: `wxyz`

If you manipulate MuJoCo state directly, convert between the two explicitly.

## Practical checklist

When something looks wrong, check these first:

1. Are you working in **world frame** or **robot/base frame**?
2. Does your robot/base frame follow **x forward, y left, z up**?
3. Is your end-effector frame the correct `attachment_site`?
4. Did you apply the right `tcp_offset`?
5. Are your quaternions **xyzw** in RCS?
6. Are you accidentally feeding **MuJoCo `wxyz`** into an RCS API?
7. Are your Euler angles ordered as **roll, pitch, yaw** around **x, y, z**?
8. Are your gripper commands using **0 = closed, 1 = open**?
9. Are your translations in **meters** and your angles in **radians**?

diff --git a/docs/user_guide/index.md b/docs/user_guide/index.md
index 739caecc..f3e5081f 100644
--- a/docs/user_guide/index.md
+++ b/docs/user_guide/index.md
@@ -2,10 +2,17 @@

The User Guide provides in-depth information about the core concepts and components of the Robot Control Stack.

If you want the quickest orientation first, start with:

- [RCS Conventions](conventions.md)
- [Sim Scene Configuration](scene_configuration.md)

```{toctree}
:maxdepth: 2

architecture
gym_interface
low_level_api
conventions
scene_configuration
```

diff --git a/docs/user_guide/scene_configuration.md b/docs/user_guide/scene_configuration.md
new file mode 100644
index 00000000..aeddac88
--- /dev/null
+++ b/docs/user_guide/scene_configuration.md
@@ -0,0 +1,367 @@

# Sim Scene Configuration

This page explains the simulation scene API in `python/rcs/envs/scenes.py` and shows how the example scene configs in `python/rcs/envs/configs.py` fit together.

If you are new to the frame conventions, read [RCS Conventions](conventions.md) first.

```text
world frame
└── root frame
    ├── all composed scene objects live here
    └── shared base frame
        (same kinematic node as root frame)
        (common robot-coordinate frame)
        ├── i-th robot base
        │   └── i-th robot attachment_site
        ├── j-th robot base
        │   └── j-th robot attachment_site
        └── ...
```

## The simple mental model

When building an RCS sim scene, it helps to think about five frames:

1. **World frame**
   - The global MuJoCo frame.
   - This is the outermost reference for the whole scene.

2. **Root frame**
   - The scene-local frame for the composed robot setup.
   - In the usual composed setup, all robot-scene objects are placed relative to it.

3. **Shared base frame**
   - The common coordinate frame used for all robot actions and observations.
   - It is attached to the same kinematic node as the root frame, but represents the common robot coordinate convention.

4. **i-th robot base frame**
   - The base frame of one specific robot.
   - Low-level kinematics and Cartesian commands are expressed here.

5. **i-th robot `attachment_site`**
   - The end-effector mounting frame for that robot, before any optional `tcp_offset`.
   - Use this for wrist-mounted objects, cameras, and tools.

A good rule of thumb is:

- want the outer global reference -> **world frame**
- want the composed scene placement frame -> **root frame**
- want one common coordinate frame for all robots -> **shared base frame**
- want per-robot kinematics -> **i-th robot base frame**
- want wrist or tool mounting -> **i-th robot `attachment_site`**

If needed, extra `world_frame_objects` can still be placed directly in world coordinates.

## How the main placement frames work

For composed robot scenes, `SimEnvCreator.create_model()` combines three scene-placement transforms in this order:

```python
robot2world = root_frame_to_world * shared_base_frame_to_root_frame * robot_to_shared_base_frame[robot_name]
```

In simple terms:

- `root_frame_to_world` places the whole rig into the MuJoCo world
- `shared_base_frame_to_root_frame` defines the shared robot coordinate frame relative to the root frame
- `robot_to_shared_base_frame` places each robot relative to that shared robot frame

### Single-robot intuition

For a simple single-arm scene, all three can often be identity transforms:

```python
root_frame_to_world = rcs.common.Pose()
shared_base_frame_to_root_frame = rcs.common.Pose()
robot_to_shared_base_frame = {"robot": rcs.common.Pose()}
```

That means the robot base, shared base, root frame, and world origin all coincide.
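To build intuition for that chain, here is a translation-only sketch in plain Python. The numeric offsets are hypothetical, and real `rcs.common.Pose` transforms also carry rotations (where composition order matters, unlike for pure translations):

```python
def compose(*translations):
    """Compose translation-only transforms by summing their (x, y, z) offsets."""
    x, y, z = 0.0, 0.0, 0.0
    for tx, ty, tz in translations:
        x, y, z = x + tx, y + ty, z + tz
    return (x, y, z)

root_frame_to_world = (0.0, 0.0, 0.0)              # rig aligned with the world
shared_base_frame_to_root_frame = (0.0, 0.0, 0.4)  # e.g. a shared mount height
robot_to_shared_base_frame = {
    "left": (0.0, 0.3, 0.0),    # hypothetical lateral offsets
    "right": (0.0, -0.3, 0.0),  # from the shared center
}

# robot2world = root_frame_to_world * shared_base_frame_to_root_frame * per-robot offset
robot2world = {
    name: compose(root_frame_to_world, shared_base_frame_to_root_frame, offset)
    for name, offset in robot_to_shared_base_frame.items()
}
```

Changing `root_frame_to_world` shifts every entry of `robot2world` by the same amount, which is why the whole rig can be repositioned with a single transform.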
### Dual-arm intuition

The FR3 duo example in `python/rcs/envs/configs.py` uses these frames more meaningfully:

- `root_frame_to_world` keeps the whole duo rig aligned with the world
- `shared_base_frame_to_root_frame` lifts the shared base to the duo mount height
- `robot_to_shared_base_frame` offsets the left and right robots from the center

That is why the dual-arm example can expose one common action frame while still placing each robot correctly.

## `SimEnvCreatorConfig` keys

This is the main top-level config for the scene API.

| Key | What it controls | Typical use |
| --- | --- | --- |
| `robot_cfgs` | Maps robot names to `SimRobotConfig` | Define one robot or multiple named robots such as `"left"` and `"right"` |
| `sim_cfg` | MuJoCo runtime settings (`SimConfig`) | Realtime, async control, frequency, convergence behavior |
| `control_mode` | Action representation | For example `ControlMode.CARTESIAN_TQuat` |
| `task_cfg` | Optional task-specific config | Add pick/place or other task logic |
| `scene` | Base scene XML path or scene key | Usually from `SCENE_PATHS[...]` |
| `gripper_cfgs` | Optional gripper config per robot | Add one gripper per robot |
| `camera_cfgs` | Optional camera config dictionary | Define resolution, type, and frame rate for named cameras |
| `max_relative_movement` | Relative action limit | Limit per-step Cartesian delta |
| `relative_to` | Relative action reference | Usually `RelativeTo.LAST_STEP` or `RelativeTo.NONE` |
| `robot_to_shared_base_frame` | Per-robot offset relative to the shared base frame | Multi-robot layouts |
| `add_gravcomp` | Add gravity compensation to the composed scene | Often useful for manipulation scenes |
| `wrapper_cfg` | Wrapper behavior flags | Binary gripper, home-on-reset, depth output |
| `headless` | GUI toggle | `True` for no GUI |
| `shared_base_frame_to_root_frame` | Offset from shared base frame to root frame | Move shared command origin inside the rig |
| `root_frame_to_world` | Offset from root frame to MuJoCo world | Place the whole setup in the room |
| `alternative_combined_robot_mjcf` | Use a pre-combined robot MJCF instead of composing robots one by one | Advanced custom scenes |
| `world_frame_objects` | Objects placed directly in world coordinates | Loose props, room-fixed assets |
| `root_frame_objects` | Objects placed in root-frame coordinates | Tables, mounts, fixtures that should move with the rig |
| `robot_frame_objects` | Objects attached in a robot attachment-site frame | Wrist mounts, end-effector payloads |
| `camera_adds` | Cameras to add to the scene | Fixed overhead cameras or wrist cameras |
| `gripper_offsets` | Pose offsets for mounted grippers | Align visual or tool frames |
| `_original_cfg` | Internal helper used after prefixing | Usually ignore this in user code |

## What usually lives inside the nested configs

The scene config mostly wires together three lower-level config types.

### `robot_cfgs: dict[str, SimRobotConfig]`

Each robot entry usually defines things such as:

- robot type
- kinematic model path
- `attachment_site`
- `tcp_offset`
- joint names and actuator names
- base link name
- degrees of freedom, joint limits, and `q_home`

The single-arm and dual-arm examples in `python/rcs/envs/configs.py` are good templates for this.
### `gripper_cfgs: dict[str, SimGripperConfig]`

Each gripper entry usually defines:

- gripper type
- gripper joint names and actuator name
- min/max width or actuator range
- collision geometry settings
- callback timing

### `camera_cfgs: dict[str, SimCameraConfig]`

Each camera entry usually defines:

- camera identifier
- camera type
- resolution
- frame rate

A useful pattern is:

- `camera_cfgs` defines the camera runtime properties
- `camera_adds` defines where that camera is placed in the scene

## The most important nested configs

### `WrapperConfig`

`WrapperConfig` controls behavior of the environment wrappers around the raw simulation.

| Key | Meaning |
| --- | --- |
| `binary_gripper` | If `True`, gripper commands are treated as open/close instead of continuous width |
| `home_on_reset` | If `True`, the robot returns home during reset |
| `include_depth` | If `True`, camera wrappers include depth images. These are metric depth values scaled by `BaseCameraSet.DEPTH_SCALE = 1000` and stored as `uint16`, so they are effectively in millimeters |

### `CameraAdderConfig`

`CameraAdderConfig` describes how a camera is added to the scene.
| Key | Meaning |
| --- | --- |
| `xml_path` | Optional camera XML asset to insert directly |
| `fovy` | Camera field of view, used when creating a camera directly |
| `offset` | Camera pose offset |
| `attachment_site` | Attachment site to use if mounted on a robot |
| `robot_name` | If set, mount the camera on that robot; otherwise add it as a scene camera |

The important frame detail is:

- if `robot_name` is **not** set, `offset` is interpreted in the **root frame** and then moved into world by `root_frame_to_world`
- if `robot_name` **is** set, `offset` is interpreted relative to that robot's **attachment site**

## Easy examples

### Minimal single-robot scene

This is the basic shape used by `EmptyWorldFR3` in `python/rcs/envs/configs.py`:

```python
cfg = SimEnvCreatorConfig(
    robot_cfgs={"robot": robot_cfg},
    sim_cfg=SimConfig(async_control=False, realtime=True, frequency=1),
    control_mode=ControlMode.CARTESIAN_TQuat,
    scene=SCENE_PATHS["empty_world"],
    gripper_cfgs={"robot": gripper_cfg},
    camera_cfgs={"bird_eye": bird_eye_cfg, "wrist": wrist_cfg},
    robot_to_shared_base_frame={"robot": rcs.common.Pose()},
    shared_base_frame_to_root_frame=rcs.common.Pose(),
    root_frame_to_world=rcs.common.Pose(),
)
```

What this means in plain language:

- there is one robot named `robot`
- it uses Cartesian `tquat` actions
- the base scene is the empty world
- there is one gripper on the robot
- there are two cameras
- all high-level frames start at the same origin

### Dual-arm scene

This is the important part of the `EmptyWorldFR3Duo` example:

```python
robot_cfgs = {"left": robot_cfg_left, "right": robot_cfg_right}

robot_to_shared_base_frame = {
    "left": DEFAULT_TRANSFORMS["FR3_DUOMOUNT_LEFT_ROBOT"],
    "right": DEFAULT_TRANSFORMS["FR3_DUOMOUNT_RIGHT_ROBOT"],
}

shared_base_frame_to_root_frame = DEFAULT_TRANSFORMS["FR3_DUOMOUNT_HEIGHT_OFFSET"]
root_frame_to_world = rcs.common.Pose()
```

In plain language:
- the shared base frame sits at the logical center of the duo setup
- the left and right robot bases are offset from that center
- the whole setup can still be moved together by changing `root_frame_to_world`

### Object placement: which dictionary should I use?

#### `world_frame_objects`

Use this when the object should stay fixed in the room.

```python
world_frame_objects = {
    "cube": (OBJECT_PATHS["green_cube"], rcs.common.Pose(translation=np.array([0.5, 0.0, 0.2]))),
}
```

Example meaning: place a cube at a fixed world position.

#### `root_frame_objects`

Use this when the object belongs to the rig and should move together with it.

```python
root_frame_objects = {
    "duo_mount": (OBJECT_PATHS["fr3_duo_mount"], DEFAULT_TRANSFORMS["FR3_DUOMOUNT_BASE"]),
}
```

Example meaning: the duo mount is part of the setup, not a free world object.

#### `robot_frame_objects`

Use this when the object should be attached to one robot's tool frame.

```python
robot_frame_objects = {
    "left": {
        "left_d405_mount": (
            OBJECT_PATHS["robotiq_d405_mount"],
            DEFAULT_TRANSFORMS["FR3_ROBOTIQ_WRIST_D405_MOUNT"],
        )
    }
}
```

Example meaning: attach a wrist mount to the left robot only.
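The difference between the object dictionaries can be made concrete with a translation-only sketch (hypothetical positions, rotations omitted, plain Python rather than RCS code):

```python
def add(a, b):
    """Compose two translation-only transforms."""
    return tuple(x + y for x, y in zip(a, b))

root_frame_to_world = (1.0, 0.0, 0.0)  # move the whole rig 1 m along world x

world_frame_objects = {"cube": (0.5, 0.0, 0.2)}       # fixed in the room
root_frame_objects = {"duo_mount": (0.25, 0.0, 0.0)}  # belongs to the rig

cube_in_world = world_frame_objects["cube"]  # unaffected by the rig placement
mount_in_world = add(root_frame_to_world, root_frame_objects["duo_mount"])
# the mount follows the rig to (1.25, 0.0, 0.0); the cube stays at (0.5, 0.0, 0.2)
```

If you later move the rig by changing `root_frame_to_world`, only the root-frame object's world position changes.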
## Camera depth units

When depth is enabled, the camera wrapper exposes depth images as scaled metric depth:

- sim depth is first converted to **meters** in `python/rcs/camera/sim.py`
- camera frames use `BaseCameraSet.DEPTH_SCALE = 1000`
- depth is then stored as `uint16`

So in practice:

- divide by `1000` to get **meters**
- or read the values directly as **millimeters**

Example:

- `depth[y, x] == 1500` means the point is about **1.5 m** away from the camera

### Camera placement

#### Fixed scene camera

This pattern from `EmptyWorldFR3` adds an overhead camera:

```python
camera_adds = {
    "bird_eye": CameraAdderConfig(
        fovy=60.0,
        offset=rcs.common.Pose(
            translation=np.array([0.271, 0.0, 2.080]),
            quaternion=np.array([0.0060, -0.0060, -0.7067, 0.7074]),
        ),
    )
}
```

Because `robot_name` is not set, this pose is interpreted in the **root frame**.

#### Wrist camera

This pattern mounts a camera to a robot:

```python
camera_adds = {
    "wrist": CameraAdderConfig(
        fovy=60.0,
        offset=some_pose,
        robot_name="robot",
    )
}
```

Because `robot_name` is set, `offset` is interpreted relative to that robot's **attachment site**.

## Common mistakes

1. **Mixing up world and root frame**
   - If the whole rig should move together, use `root_frame_to_world` or `root_frame_objects`, not `world_frame_objects`.

2. **Using the wrong frame for camera offsets**
   - Scene cameras use root-frame offsets.
   - Robot-mounted cameras use attachment-site offsets.

3. **Forgetting matching camera names**
   - If a camera is added without `xml_path`, its name must also exist in `camera_cfgs`.

4. **Putting wrist-mounted assets into world objects**
   - Use `robot_frame_objects` for things that should follow the robot wrist.

5. **Using `alternative_combined_robot_mjcf` without the expected prefixes**
   - The docstring in `scenes.py` requires names like `robot{robot_name}`.
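As a quick sanity check on the scaled-depth convention from the camera-depth section above, the `DEPTH_SCALE = 1000` round trip can be sketched as follows (the helper names here are made up, not RCS API):

```python
DEPTH_SCALE = 1000  # scaled metric depth: uint16 values are effectively millimeters

def depth_to_meters(raw: int) -> float:
    """Convert a stored uint16 depth value back to meters."""
    return raw / DEPTH_SCALE

def meters_to_depth(meters: float) -> int:
    """Scale metric depth to the stored uint16 convention."""
    return int(round(meters * DEPTH_SCALE))
```

So a stored value of `1500` corresponds to 1.5 m, matching the example earlier on this page.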
## A practical workflow

When building a new scene, this usually works well:

1. Start with one robot and identity transforms.
2. Add `root_frame_objects` for mounts or fixtures.
3. Add `world_frame_objects` only for room-fixed props.
4. Add `robot_frame_objects` for wrist payloads.
5. Add cameras with `camera_adds`.
6. Only then introduce non-trivial `shared_base_frame_to_root_frame` and `robot_to_shared_base_frame` offsets.

That order keeps the frame reasoning much easier.