When using large maps, the lifecycle node kills amcl_node because of missing heartbeat #469
Comments
I'm still digging into this issue.
The issue is in Beluga, not in the Lifecycle manager nor the bond_core library.

The reason the lifecycle manager node kills beluga regardless of the […] That happens because the Beluga side of the bond connection hardcodes the […]

Because of that, the heartbeat timeout timer in Beluga is triggered after 4 seconds, while Beluga is still busy generating the likelihood map. When this activation gets processed right after the freeze ends, it kills the Beluga side of the bond connection. Beluga then stops sending […]

Multiple Beluga MCL variants are affected (e.g. NDT). This will affect any system that takes longer than 4 seconds to generate the likelihood map, which in my experience does not require a really large map: probably about 150m x 150m on my i9, and probably a lot less on a lesser machine.

Solutions:
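The timing described in this comment can be illustrated with a toy model of a single-threaded executor. This is only a sketch under stated assumptions, not the bond_core implementation: the function name, the 1-second heartbeat period, and the rule that an expired timer is serviced before queued heartbeats are all assumptions made for illustration. The 4-second timeout and the 22-second initialization time are the figures quoted in this issue.

```python
def time_bond_breaks(blocking_work_s, heartbeat_timeout_s,
                     heartbeat_period_s=1.0, horizon_s=60.0):
    """Toy model: return the time (s) at which the local side of the bond
    breaks, or None if it survives until horizon_s.

    Hypothetical helper, not part of bond_core, Nav2 or Beluga.
    """
    # The timeout timer is armed when the bond is created (t = 0).
    deadline = heartbeat_timeout_s
    # Arrival time of the next heartbeat sent by the peer (assumed period).
    next_heartbeat = heartbeat_period_s

    # The executor is blocked (e.g. building the likelihood map) until here,
    # so nothing at all is processed before this instant.
    now = blocking_work_s
    while now <= horizon_s:
        # Assumption: an expired timer is serviced before queued heartbeats.
        if deadline <= now:
            return now
        if next_heartbeat <= now:
            # Drain every heartbeat that has arrived so far, then re-arm.
            while next_heartbeat <= now:
                next_heartbeat += heartbeat_period_s
            deadline = now + heartbeat_timeout_s
        # Jump to the next event: a heartbeat arrival or the timer deadline.
        now = min(next_heartbeat, deadline)
    return None


print(time_bond_breaks(22.0, 4.0))   # 22.0: breaks as soon as the freeze ends
print(time_bond_breaks(2.0, 4.0))    # None: the freeze is shorter than the timeout
print(time_bond_breaks(22.0, 30.0))  # None: the local timeout outlasts the freeze
```

In this model only the local heartbeat timeout matters. Raising bond_timeout on the lifecycle manager side changes nothing, because the node breaks its own side of the bond first.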
This potential problem was attenuated by the fix to #468, which improved the performance of the algorithm that calculates the likelihood map from the occupancy grid of the environment. The performance depends on the map, however, and for some maps the time it takes to build the likelihood map will still be much greater than the 4-second timeout identified above as the problem in this ticket. This is true for both Beluga and Nav2, since both use the same hard-coded timeout value. An example of a map that would trigger this problem in both Beluga and Nav2 is a tiled map.

After discussing this, we agreed that the best solution is to make the timeout value a ROS parameter. This is a WIP.
Addresses #469 Signed-off-by: Gerardo Puga <[email protected]> Co-authored-by: Michel Hidalgo <[email protected]>
Bug description
Related to #468, which discusses the beluga_amcl node becoming unresponsive for long periods of time when receiving large maps.

This issue is about a complementary problem that seems to be triggered by that behavior: because the node becomes unresponsive, the Nav2 Lifecycle manager loses track of the connection to the node, kills the node after bond_timeout seconds (a parameter of the lifecycle manager), and proceeds to restart the beluga amcl node.

Notice that this happens regardless of the value of bond_timeout. Setting this to a very large value (e.g. 100 seconds) and providing the node with a 600m x 200m map at 0.05 m/pixel (which takes 22 seconds to initialize), the following sequence will happen.

bond_timeout seconds after the creation of the bond, lifecycle manager complains about "[lifecycle_manager]: CRITICAL FAILURE: SERVER amcl IS DOWN after not receiving a heartbeat for NNNN ms"

The fact that the lifecycle manager does not detect the amcl node returning back to operational status may not be an issue within beluga_amcl, and might be in the lifecycle manager itself, but beluga_amcl certainly triggers this failure. The same failure mode does not happen if everything else is the same but nav2_amcl is used instead of beluga_amcl.

See below, the problem is in Beluga itself.
Platform (please complete the following information):
Beluga version: 2.0.2
How to reproduce
Run any system using beluga and a large map. 200mx600m at 0.05 m/pixel resolution is large enough.
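For a sense of scale, the number of grid cells in the reproduction map follows directly from its extent and resolution. A minimal sketch (the function name is only illustrative):

```python
def grid_size(width_m, height_m, resolution_m_per_px):
    """Return (columns, rows, total cells) of an occupancy grid."""
    # round() guards against floating-point noise in the division.
    cols = round(width_m / resolution_m_per_px)
    rows = round(height_m / resolution_m_per_px)
    return cols, rows, cols * rows


cols, rows, cells = grid_size(600.0, 200.0, 0.05)
print(cols, rows, cells)  # 12000 4000 48000000
```

That is 48 million cells for which a likelihood value has to be computed before the node becomes responsive again.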
Expected behavior
The node should not go unresponsive for so long.
Actual behavior
Node freezes, lifecycle node complains and actually fails.
Additional context