Skip to content

Feat/medkit integration#1

Open
mfaferek93 wants to merge 15 commits into
mainfrom
feat/medkit-integration
Open

Feat/medkit integration#1
mfaferek93 wants to merge 15 commits into
mainfrom
feat/medkit-integration

Conversation

@mfaferek93
Copy link
Copy Markdown

@mfaferek93 mfaferek93 commented May 12, 2026

Summary

Wires ros2_medkit_fault_reporter into manymove_cpp_trees so every BT action node reports operational faults directly to a FaultManager, instead of routing through /diagnostics. This is the native-instrumentation path the medkit upstream design recommends.

What changes

Plumbing

  • fault_codes.hpp (101 lines): catalogue of MANYMOVE_* constants grouped by subsystem.
  • fault_reporting.hpp (110 lines): FaultReporting capability class. Pulls the reporter from the blackboard, same pattern action nodes already use for node_.
  • installFaultReporter helper in main_imports_helper.hpp, wired into all 15 BT executable mains.
  • package.xml + CMakeLists.txt: hard dep on ros2_medkit_fault_reporter and ros2_medkit_msgs. No CMake toggle (fork-only).

Instrumentation (21 action node classes)

Subsystem Sites Notable
planner 4 collision, e-stop, retry attempt (paired with reportFaultPassed), retries exhausted
signals 5 IO set/get, robot NOT_READY (CRITICAL), wait-input timeout
objects 5 add / remove / attach / get-pose / wait-timeout
gripper 4 command + traj operational failures
logic 2 TF lookup, WaitForKeyBool timeout
isaac 2 FoundationPose detection timeouts

Condition and decorator nodes are intentionally left out: their FAILURE return is normal control flow. Programmer-error FAILURE returns (missing inputs, wrong vector sizes) are also skipped.

Tests

  • fake_fault_manager.hpp: matches the existing FakeXxxServer pattern in the test tree.
  • test_fault_reporting.cpp: end-to-end roundtrip covering reporter install, FAILED forwarding, PASSED healing and the silent no-op path when no reporter is present.

Docs + CI

  • docs/FAULT_CODES.md: full catalogue with severity, trigger and pairing semantics.
  • README.md + CHANGELOG.rst entries.
  • CI workflow installs ros2_medkit_* debs from rosdistro so the overlay builds.

Cleanup

  • bt_client_*, collision_test and tutorials get unique node names so source_id is distinguishable per executable.
  • Drop unused topic_based_ros2_control declaration from manymove_cpp_trees (kept where actually used in manymove_bringup).

Notes

  • The pre-existing test_action_nodes_gripper failure (convertFromString registration) is upstream and unrelated to this change.
  • Per-site assertions would require full action server fakes for every BT node; deferred to the demo integration work.

mfaferek93 added 14 commits May 9, 2026 22:14
Declared in package.xml and CMakeLists.txt but never #included anywhere
in src/, include/ or test/. Build is green without it; manymove_bringup
keeps the dep where it is actually used.
All bt_client_*, collision_test and tutorials shared "bt_client_node",
so source_id was indistinguishable across them. Each now matches its
executable; the static logger in tree_helper.cpp and two log-level
remaps in manymove_bringup launch files moved with them.
No CMake option, no conditional <depend>: this fork exists for the
medkit integration. If we ever upstream to pastoriomarco/manymove, put
it behind a toggle so vanilla users do not pull medkit transitively.
…tall helper

Three pieces for BT action node instrumentation: fault_codes.hpp
(MANYMOVE_* constants), fault_reporting.hpp (the FaultReporting
capability class), and installFaultReporter in main_imports_helper.hpp
wired into all 15 BT entry-point mains.

Capability class instead of direct FaultReporter member because BT.CPP
factory constructs nodes with fixed signatures, so we pull the reporter
from the blackboard in the FaultReporting ctor (same pattern as `node_`
already fetched in every action node).
Adds FaultReporting as a second base class to every Action node across
planner/objects/signals/gripper/logic/isaac, plus the matching
ctor init list call. Condition and decorator nodes are intentionally
left out: their FAILURE return is normal control flow.
…austion

Four reportFault sites in MoveManipulatorAction:
 - kPlannerCollisionDetected on the onStart collision branch
 - kPlannerEstopTriggered when stop_execution is set before motion
 - kPlannerRetryAttempt on every failed attempt (soft fault, throttled
   locally by LocalFilter)
 - kPlannerRetriesExhausted once max_tries is reached

reportFaultPassed pairs kPlannerRetryAttempt in the success branch so
the local filter resets when a retry eventually succeeds.

Skipped the input/blackboard-missing FAILURE returns (lines 96, 147,
174 and ResetTrajectories@356); those are programmer errors caught at
startup, not operational faults.
…ait timeout

Five reportFault sites covering SetOutput/GetInput action failures,
CheckRobotState NOT_READY (CRITICAL), WaitForInput server-unavailable
and the wait-timeout path. reportFaultPassed pairs kRobotNotReady on
CheckRobotState success and kSignalWaitInputTimeout on the WaitForInput
desired-value branch so LocalFilter resets cleanly.
…timeout

Five reportFault sites cover the operational failures of
AddCollisionObjectAction, RemoveCollisionObjectAction,
AttachDetachObjectAction, GetObjectPoseAction and WaitForObjectAction.
WaitForObject pairs reportFaultPassed in the success branch.

CheckObjectExistsAction is intentionally not instrumented: its FAILURE
on \"object missing\" is normal control flow used by callers as a
condition, not an operational fault.
Four reportFault sites: GripperCommand server timeout and goal-not-reached,
GripperTraj server unavailable and trajectory-failed result.
PublishJointStateAction throws on bad inputs and never returns FAILURE,
so it has nothing to instrument.
GetLinkPoseAction reports kTfLookupFailed when tf2 raises a
TransformException. WaitForKeyBool reports kWaitKeyTimeout on the
timeout branch and pairs reportFaultPassed in the success path.

Adds kWaitKeyTimeout to fault_codes.hpp.

Configuration-error FAILUREs (missing inputs, wrong vector sizes) are
left out, as are condition-style nodes (CheckPoseDistance,
CheckPoseBounds, CheckKeyBoolValue) where FAILURE is expected control
flow.
Two reportFault sites in FoundationPoseAlignmentNode: detection topic
silent for too long, and no detection passing the target/min_score
filter within the timeout. Both raise kIsaacFoundationPoseFailed.

GetEntityPoseNode/SetEntityPoseNode service-error paths and Isaac TF
timeouts are left for the isaac demo work in Phase 3, where they'll
have specific fault codes once the demo wiring exposes them.
…ager

fake_fault_manager.hpp is a small fixture that hosts the
/fault_manager/report_fault service and records every accepted request,
matching the existing FakeXxxServer pattern in this test directory.

test_fault_reporting.cpp uses it to verify the end-to-end plumbing:
 - installFaultReporter puts a reporter on the blackboard
 - FaultReporting::reportFault forwards FAILED events with the right
   fault_code, severity, description and source_id (FQN of the BT
   client node)
 - reportFaultPassed forwards a PASSED event for healing
 - mixin silently no-ops when no reporter is installed (e.g. unit
   tests building a tree without bt_client scaffolding)

Per-instrumentation-site assertions are out of scope here; those would
need to drive every BT action node through real action server fakes
and belong with the demo work in Phase 3.
- docs/FAULT_CODES.md: full table of MANYMOVE_* codes with severity,
  trigger condition and pairing semantics, grouped by subsystem.
- README.md: short note pointing fork users at the medkit dependency
  and the catalogue.
- CHANGELOG.rst: entry under Forthcoming summarising the BT node
  rename, the hard medkit dep and the dropped topic_based_ros2_control
  declaration.
- ament_uncrustify --reformat run across the modified manymove_cpp_trees
  sources.
- Lint fixes for new files (cpplint header order / memory include /
  guard style; ament_copyright BSD-3 boilerplate match).

The remaining test_action_nodes_gripper failure is a pre-existing
convertFromString registration issue in upstream manymove (verified by
running the test from the parent commit before any medkit changes).
After wiring ros2_medkit_fault_reporter into manymove_cpp_trees the CI
job needs the medkit packages on the runner. Pulling them via apt from
rosdistro keeps the workflow simple (no source overlay clone).
- Wire kRobotResetFailed across ResetRobotStateAction failure branches
  (goal rejection + non-SUCCEEDED result for reset, unload-traj and
  load-traj steps).
- Instrument Isaac TF transform timeouts with kIsaacFoundationPoseFailed
  to match the sibling no-message and no-valid-detection timeouts.
- Replace silent FAILURE in GripperTrajAction::onRunning !goal_sent_
  with BT::RuntimeError, matching the programmer-error policy.
- Add reportFaultPassed on onHalted for WaitForInputAction and
  WaitForObjectAction so abandoned waits do not leave kSignalWaitInputTimeout
  / kObjectWaitTimeout lingering in FaultManager.
- Drop orphan codes kPlannerMissingMoveId (programmer-error path uses
  throw) and kObjectExistsCheckFailed (CheckObjectExists is intentionally
  not instrumented; FAILURE is normal control flow there).
- Pin medkit apt packages to MEDKIT_VERSION=0.4.0-1 so an upstream bump
  does not silently change CI behaviour.
- Drop humble from the CI matrix until ros-humble-ros2-medkit-* debs
  appear on packages.ros.org (rosdistro entry exists; debs do not yet).
- Update docs/FAULT_CODES.md: add kRobotResetFailed, expand the Isaac
  FoundationPose entry to cover TF timeouts, and document the
  reportFaultPassed-on-halt pairing for both wait codes.
Copy link
Copy Markdown

@bburda bburda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integration uses the FaultReporter API correctly and the FaultReporting mixin is a clean abstraction. Inline comments cover a few real concerns: an unthrottled fault flood in ResetRobotStateAction, fault codes that never heal (collision, e-stop, retry on halt), two sites that emit unconditional PASSED events, an unused mixin inheritance on two BT nodes, and a missing override on installFaultReporter. The last point also touches an upstream contradiction between the report_passed docstring and LocalFilter::should_forward_passed behavior that affects the reliability of the pairing pattern used throughout this PR.

name().c_str());
action_result_.success = false;
result_received_ = true;
reportFault(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResetRobotStateAction calls reportFault(kRobotResetFailed, kSeverityError, ...) from 8 separate callback paths (rejection + result-failure across unload-trajectory, reset-robot-state, and load-trajectory). All sites use kSeverityError, which equals the default bypass_severity=2 and therefore bypasses LocalFilter entirely.

A single failed reset can drive up to 6 ReportFault.srv calls into the FaultManager. Because each FAILED event decrements the AUTOSAR-DEM debounce counter, one logical incident pushes the fault aggressively past confirmation_threshold and biases the lifecycle.

Guard with a per-run bool fault_reported_ set on first emission and cleared in onStart, or downgrade these sites to kSeverityWarn so LocalFilter coalesces them within its window.

config().blackboard->set(robot_prefix_ + "collision_detected", false);
config().blackboard->set(robot_prefix_ + "stop_execution", true);

reportFault(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three planner faults never emit a paired reportFaultPassed, leaving them CONFIRMED in the FaultManager after the underlying condition clears on hardware:

  1. kPlannerCollisionDetected (this line) - never healed.
  2. kPlannerEstopTriggered (L137) - never healed even after operator releases the e-stop.
  3. kPlannerRetryAttempt - healed on the onRunning success path (L231) but not in onHalted (L280), so a halted retry loop leaves a lingering soft fault. Compare with WaitForInputAction::onHalted and WaitForObjectAction::onHalted which do emit PASSED on halt.

Emit reportFaultPassed(kPlannerCollisionDetected) / reportFaultPassed(kPlannerEstopTriggered) on the early path of onStart when the corresponding flag reads false, and add reportFaultPassed(kPlannerRetryAttempt) in onHalted.


RCLCPP_INFO(node_->get_logger(), "[MoveManipulatorAction] success => returning SUCCESS");
// Heal the per-attempt soft fault if any retry was reported earlier.
reportFaultPassed(fault_codes::kPlannerRetryAttempt);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two issues at this site:

  1. The PASSED is unconditional. On the first-attempt success path current_try_ == 0, so no kPlannerRetryAttempt was ever emitted, yet this still pushes a PASSED event. Guard with if (current_try_ > 0).

  2. ros2_medkit_fault_reporter::LocalFilter::should_forward_passed applies the same threshold/window logic as FAILED events, even though FaultReporter::report_passed's docstring states PASSED bypasses local filtering. With the default threshold=3, a single PASSED following a single FAILED will not propagate to the FaultManager. The pairing pattern used here and at kRobotNotReady, kSignalWaitInputTimeout, kObjectWaitTimeout, kWaitKeyTimeout only heals reliably under repeated retry cycles. Worth fixing upstream so PASSED truly bypasses the filter, but in the meantime callers should know healing is best-effort.

// mid-wait, the timeout condition no longer holds. Without this,
// any earlier `kSignalWaitInputTimeout` would linger in FaultManager
// until the next successful wait.
reportFaultPassed(fault_codes::kSignalWaitInputTimeout);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This emits reportFaultPassed(kSignalWaitInputTimeout) on every halt, even when no timeout was previously reported. Combined with the PASSED counter behavior in LocalFilter::should_forward_passed, this accumulates spurious PASSED events that bias subsequent healing decisions.

Track a bool timeout_reported_ set at the FAILED emission site (L883) and only fire the PASSED here if it is set.

PublishJointStateAction::PublishJointStateAction(
const std::string & name, const BT::NodeConfiguration & config)
: BT::SyncActionNode(name, config)
: BT::SyncActionNode(name, config), FaultReporting(config.blackboard)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PublishJointStateAction inherits FaultReporting but never calls reportFault (the same applies to ResetTrajectories in action_nodes_planner.cpp:354).

Either drop the inheritance to keep the surface minimal, or add a short comment justifying the wiring (e.g. reserved for a planned future site). As-is, the empty base looks like an oversight at the call site.

// SOVD apps[].ros_binding linkage used by the medkit gateway.
inline void installFaultReporter(BT::Blackboard::Ptr blackboard, rclcpp::Node::SharedPtr node)
{
auto reporter = std::make_shared<ros2_medkit_fault_reporter::FaultReporter>(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FaultReporter's constructor accepts a third service_name argument (defaults to /fault_manager/report_fault, rooted at /). This helper hard-codes that default and offers no override.

For namespaced test rigs or multi-robot setups where the FaultManager runs under a namespace, callers cannot remap without bypassing this helper. Add an optional parameter:

inline void installFaultReporter(
  BT::Blackboard::Ptr blackboard, rclcpp::Node::SharedPtr node,
  const std::string & service_name = "/fault_manager/report_fault")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants