Migrate Nav2 timers & rates to use steady clocks and be sim-time-aware #6063
Migrate Nav2 timers & rates to use steady clocks and be sim-time-aware #6063tonynajjar wants to merge 25 commits intoros-navigation:mainfrom
Conversation
Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
|
To be tested! I opened the PR to see if I find obvious issues in CI first |
Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
… update sleep mechanism for interruptibility Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
|
One risk I see with this PR: The default clock type of By replacing However for I do think that e.g. controller loop should be using RCL_STEADY_TIME i.e. it shouldn't be affected by e.g. NTP changes. What do you think? |
Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
There was a problem hiding this comment.
Especially for the controller server, costmap rates, can you comment on the previous discussions on this subject why we left them the way that they are?
I do think that e.g. controller loop should be using RCL_STEADY_TIME i.e. it shouldn't be affected by e.g. NTP changes. What do you think?
This is a key part of that, I think. As well as for many algorithms that cannot/should not be run at multiples of rate in simulation due to inability to realistically ever keep up - causing incorrect behaviors when running at > 1x speed which are not deterministic at 1x speed.
I mentioned in the ticket that #3325 (comment) is a good summary of the current state
Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
Regarding your previous conclusion here:
I mean it highly depends on the computational complexity of the plugins and the machine's specs no? Without hard numbers yet, but it seems like I can run MPPI on 10 Hz at 5x speed on my Intel(R) Core(TM) Ultra 9 285H.
In case you remember, would be helpful to share more details so I can try to reproduce. The setup I tested so far is quite a simple one with loopback simulation
For this, it's not a blocker, we should just find out a way to switch between "/clock time" and RCL_STEADY_TIME instead of RCL_SYSTEM_TIME (As mentioned, a RCL_ROS_TIME clock will currently choose between "/clock time" and RCL_SYSTEM_TIME according to |
|
I don't recall the specific issues I ran into from that long ago unfortunately. I do remember frustrating number of system tests became flaky though. I'll rerun CI a few times and see if that's still the case here.
On this point for TimedBehavior, ControllerServer, Costmap2DROS, VelocitySmoother, and WaypointFollower: I generally agree. However, I think that the following is too brute force in the application code I don't think we can set the clock type when not simulation time in the NodeOptions unfortunately. Maybe though something to consider adding? The next best thing would be to create a Are there any cases for Rate we want to use NTP corrections? I kind of doubt it. Maybe missed a few?
Given the BT.CPP changes required, can we change this to a draft until that is merged by Davide? |
it's merged already :D We still need to keep BT.CPP in the vendor.repos until next bloom release I presume |
|
btw looks like the Jazzy and Kilted workflows don't build the underlay from vendor.repos? Does that mean that if this get merged the workflows will remain broken? Is that new to you? |
|
Yes, and that is technically expected. I'm on the fence about it, but tentatively going to update the job to use the repos file for now. The main issue is that what happens if changes are made to a key dependency like BT.CPP which are not backward compatible and cannot/will not be updated in Humble or Jazzy. Our CI works for those distributions, but users themselves cannot do it without also updating something like BT.CPP which from v3 to v4 may fundamentally not work with their existing code. That's not an issue today since Davide appears to release v4 up to date in every ROS distribution, so I'll kick this can down the road for another time if this becomes an issue. Here's a PR to resolve: #6069 |
|
ros2/rclcpp#3122 made an issue for rclcpp but will will need to find a local solution in the meantime - I'm working on that |
… nav2::Rate and nav2::create_timer for improved simulation time support Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
…reate_timer and nav2::Rate for improved simulation time handling Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
|
Don't review yet |
|
Actually in terms of changes that's pretty much it so you can take a look already. Just didn't have the time to make a nice description but basically, with the help of some custom Nav2 wrappers of |
There was a problem hiding this comment.
Otherwise the code that's changed LGTM.
I will say though that I need to look at this again with fresh eyes. Does this imply that we should actually update all Rate's or Timers to use this as well? Is there a downside to that in any case? I do like the idea of abstracting even more rclcpp::'s into nav2::'s but only if that's really the right answer.
nav2_behavior_tree/include/nav2_behavior_tree/utils/loop_rate.hpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Updates Nav2 timing primitives to be robust to system clock jumps and simulation-time execution by selecting an appropriate clock (ROS/sim clock when use_sim_time=true, otherwise steady/monotonic time) for rates and timers.
Changes:
- Added
nav2::selectClock()andnav2::WallRateand migrated several rate-limited loops to use it. - Added sim-time-aware
create_wall_timerhelpers (free function +nav2::LifecycleNodeshadow) and migratedLifecycleManagertimers. - Updated BT execution loop (
LoopRate+BehaviorTreeEngine) to use an injected clock and a wakeup-driven wait loop.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/underlay.repos | Pins BehaviorTree.CPP to a specific commit to pick up Tree::wakeUpSignal() support. |
| nav2_waypoint_follower/src/waypoint_follower.cpp | Migrates waypoint follower loop rate to nav2::WallRate. |
| nav2_ros_common/include/nav2_ros_common/wall_rate.hpp | Introduces nav2::selectClock() and nav2::WallRate for sim-time-aware rate limiting. |
| nav2_ros_common/include/nav2_ros_common/lifecycle_node.hpp | Shadows create_wall_timer() to use the selected clock for lifecycle nodes. |
| nav2_ros_common/include/nav2_ros_common/interface_factories.hpp | Adds nav2::create_wall_timer() helper using clock selection. |
| nav2_lifecycle_manager/src/lifecycle_manager.cpp | Migrates timers (init_timer_, bond timers) to nav2::create_wall_timer(). |
| nav2_costmap_2d/src/costmap_2d_ros.cpp | Migrates costmap update loop to nav2::WallRate. |
| nav2_controller/src/controller_server.cpp | Migrates controller compute loop to nav2::WallRate. |
| nav2_behaviors/include/nav2_behaviors/timed_behavior.hpp | Migrates behavior cycle loop to nav2::WallRate. |
| nav2_behavior_tree/test/plugins/action/test_bt_action_node.cpp | Updates test to new LoopRate constructor signature (explicit clock). |
| nav2_behavior_tree/src/behavior_tree_engine.cpp | Injects selected clock into BT loop rate handling. |
| nav2_behavior_tree/include/nav2_behavior_tree/utils/loop_rate.hpp | Reworks sleeping to poll an injected clock using BT wakeup signal. |
| nav2_behavior_tree/include/nav2_behavior_tree/behavior_tree_engine.hpp | Adds rate_clock_ member to support sim/steady timing for BT loop. |
| nav2_amcl/src/amcl_node.cpp | Minor include cleanup and explicitly calls this->create_wall_timer(). |
| while (clock_->now() < next_interval) { | ||
| // Sleep using the wake-up signal directly so we can poll the target clock. | ||
| // tree_->sleep() always waits in wall-clock time, which diverges from sim | ||
| // time. Instead we do short wall-clock waits and re-check our clock. | ||
| if (wake_up->waitFor(poll_interval)) { // Preempted by emitWakeUpSignal() |
There was a problem hiding this comment.
LoopRate::sleep() uses a fixed 10ms wall-clock polling interval (poll_interval) when waiting for next_interval. This can oversleep when the requested period is <10ms (and adds up to 10ms of jitter even for slightly larger periods), which breaks the contract of sleeping until next_interval for steady-clock operation. Consider computing the wait timeout from the remaining time until next_interval (e.g., wait for min(poll_interval, remaining)), and only fall back to a coarse poll interval when the underlying clock is ROS/sim time.
| while (clock_->now() < next_interval) { | |
| // Sleep using the wake-up signal directly so we can poll the target clock. | |
| // tree_->sleep() always waits in wall-clock time, which diverges from sim | |
| // time. Instead we do short wall-clock waits and re-check our clock. | |
| if (wake_up->waitFor(poll_interval)) { // Preempted by emitWakeUpSignal() | |
| const bool is_ros_time = clock_->get_clock_type() == RCL_ROS_TIME; | |
| while ((now = clock_->now()) < next_interval) { | |
| // Sleep using the wake-up signal directly so we can poll the target clock. | |
| // tree_->sleep() always waits in wall-clock time, which diverges from sim | |
| // time. For ROS/sim time, poll in short wall-clock intervals and re-check | |
| // the target clock. For steady/system clocks, wait only for the remaining | |
| // time to avoid oversleeping past next_interval. | |
| const auto remaining = next_interval - now; | |
| const auto remaining_ns = std::chrono::nanoseconds(remaining.nanoseconds()); | |
| const auto wait_duration = | |
| is_ros_time && remaining_ns > poll_interval ? poll_interval : remaining_ns; | |
| if (wake_up->waitFor(wait_duration)) { // Preempted by emitWakeUpSignal() |
|
|
||
| auto start_time = clock_->now(); | ||
| rclcpp::WallRate loop_rate(cycle_frequency_); | ||
| nav2::WallRate loop_rate(node_.lock(), cycle_frequency_); |
There was a problem hiding this comment.
node_ is a WeakPtr, but nav2::WallRate loop_rate(node_.lock(), ...) does not validate the result of lock() before dereferencing it inside selectClock(). If the parent node expires (e.g., during shutdown/transition races), this becomes a null dereference. Prefer locking once into a local auto node = node_.lock(); and handling the null case (similar to plugins/drive_on_heading.hpp).
| nav2::WallRate loop_rate(node_.lock(), cycle_frequency_); | |
| auto node = node_.lock(); | |
| if (!node) { | |
| RCLCPP_ERROR( | |
| logger_, | |
| "Failed to run %s because the parent node is no longer available.", | |
| behavior_name_.c_str()); | |
| result->error_msg = behavior_name_ + " failed: parent node expired"; | |
| result->total_elapsed_time = clock_->now() - start_time; | |
| onActionCompletion(result); | |
| action_server_->terminate_current(result); | |
| return; | |
| } | |
| nav2::WallRate loop_rate(node, cycle_frequency_); |
| template<typename NodeT> | ||
| rclcpp::Clock::SharedPtr selectClock(NodeT node) | ||
| { | ||
| bool use_sim_time = false; | ||
| auto params = node->get_node_parameters_interface(); | ||
| if (params->has_parameter("use_sim_time")) { | ||
| use_sim_time = params->get_parameter("use_sim_time").as_bool(); | ||
| } | ||
| if (use_sim_time) { | ||
| return node->get_clock(); | ||
| } | ||
| return std::make_shared<rclcpp::Clock>(RCL_STEADY_TIME); | ||
| } |
There was a problem hiding this comment.
New clock-selection behavior (selectClock) is core to the PR’s sim-time/steady-time guarantees, but there are no unit tests validating the selection logic (e.g., use_sim_time=true returns the node clock, false returns a steady clock) and that WallRate uses the chosen clock. Adding focused tests in nav2_ros_common/test would help prevent regressions, especially around parameter overrides and lifecycle state transitions.
| * @return The appropriate clock | ||
| */ | ||
| template<typename NodeT> | ||
| rclcpp::Clock::SharedPtr selectClock(NodeT node) |
There was a problem hiding this comment.
Topic Naming: I think we should be specific as to what clock this means / used for. Maybe selectSteadyClock or something? Not sure steady is technically correct for simulation time though
|
Otherise just this topic:
|
I'm going to use the terminology in ros2/rclcpp#3122 (comment) For rates and timers, I think always RCL_ROS_STEADY_TIME is wanted over RCL_ROS_TIME or RCL_SYSTEM_TIME but probably there are a few places, e.g. in tests or maybe rviz panels, where we need RCL_STEADY_TIME (not sim-time aware). I would need to do a deep dive |
|
I think as part of this, we should, since its an easy grep + replace if it turns out there's no real reason to use the other now. |
Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
|
Ok I did another pass and found more
I'm torn because they all have some drawbacks but on the other hand whatever we choose is a (hopefully) temporary option until ros2/rclcpp#3122 in implemented. Because of that, I tend towards option 1 |
|
Are they all tests? If so, then I think we can convert them since NTP jumps don't matter for that really. I think we may be able to also rename the |
Yes, the question is with which method from those listed above |
Hmm so maybe nav2::Rate is our RCL_ROS_STEADY_TIME and nav2::WallRate is RCL_STEADY_TIME. It's diverging quite a bit from rclcpp, I hope it's not too confusing... |
Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
|
|
I mean switch everything to use the nav2 wall rate but change the name to drop "wall" if we're using it globally. Just 1 object. |
…d Rate for consistency Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
Yes but we need a secondary entity to represent RCL_STEADY_TIME I think I found a nice solution, here is the current state Rates
Timers
What Is Used WhereProduction code
Tests
|
Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
Signed-off-by: Tony Najjar <tony.najjar@dexory.com>
These changes make timer and rate-related code more robust, especially when switching between simulation and real time, and help prevent issues related to system clock jumps or simulation time discrepancies.
Basic Info
Description of contribution in a few bullet points
nav2::WallRate(nav2_ros_common/wall_rate.hpp): a drop-in replacement forrclcpp::WallRatethat uses the node's ROS clock (simulation time) whenuse_sim_time=true, andRCL_STEADY_TIME(monotonic clock) whenuse_sim_time=false— avoiding both sim-time blindness and NTP clock-jump issues.nav2::selectClock(node): a helper that returns the node's ROS clock when in simulation, or a steady clock otherwise.nav2::create_wall_timer(node, period, callback): a drop-in replacement forrclcpp::Node::create_wall_timerthat uses the same sim-time-aware clock selection.create_wall_timerinnav2::LifecycleNode: so that all Nav2 lifecycle nodes automatically get sim-time-aware timers without any call-site changes.nav2::WallRateinstead ofrclcpp::WallRate:ControllerServer::computeControl()Costmap2DROS::mapUpdateLoop()TimedBehavior(behaviors)WaypointFollower::followWaypointsHandler()LifecycleManager(init_timer_,bond_timer_,bond_respawn_timer_) to usenav2::create_wall_timer.LoopRate(BT execution loop) to accept an external clock and use a sim-time-aware polling loop viatree->wakeUpSignal()instead oftree->sleep(), which was hardwired to wall-clock time.rate_clock_toBehaviorTreeEngineinitialized vianav2::selectClock()and passed to theLoopRate.Dependencies
Requires BehaviorTree.CPP#1127 (
Tree::wakeUpSignal()getter).Description of documentation updates required from your changes
Description of how this change was tested
Future work that may be required in bullet points
For Maintainers:
backport-*.