When using large maps, the lifecycle node kills amcl_node because of missing heartbeat #469

Open
glpuga opened this issue Feb 4, 2025 · 3 comments

glpuga commented Feb 4, 2025

Bug description

Related to #468, which discusses the beluga_amcl node becoming unresponsive for long periods of time when receiving large maps.

This issue covers a complementary problem that seems to be triggered by that behavior: because the node becomes unresponsive, the Nav2 lifecycle manager loses track of its connection to the node, kills it after bond_timeout seconds (a lifecycle manager parameter), and then restarts the beluga_amcl node.

Notice that this happens regardless of the value of bond_timeout. With bond_timeout set to a very large value (e.g. 100 seconds) and the node given a 600 m x 200 m map at 0.05 m/pixel (which takes 22 seconds to initialize), the following sequence occurs:

  • The lifecycle manager and the amcl node get initialized.
  • The lifecycle manager and the amcl node connect to each other with bond.
  • amcl receives the map and becomes unresponsive for approx. 22 seconds.
  • amcl completes the initialization after processing the likelihood map and becomes fully operational.
  • bond_timeout seconds after the creation of the bond, the lifecycle manager complains with
    "[lifecycle_manager]: CRITICAL FAILURE: SERVER amcl IS DOWN after not receiving a heartbeat for NNNN ms"
  • The lifecycle manager restarts the amcl node.

The fact that the lifecycle manager does not detect the amcl node returning back to operational status may not be an issue within beluga_amcl, and might be in the lifecycle manager itself, but beluga_amcl certainly triggers this failure. The same failure mode does not happen if everything else is the same but nav2_amcl is used instead of beluga_amcl.

See below, the problem is in Beluga itself.

Platform (please complete the following information):

  • OS: Seen in ROS Humble.
  • Beluga version: 2.0.2

How to reproduce

Run any system using Beluga with a large map. A 200 m x 600 m map at 0.05 m/pixel resolution is large enough.
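To give a sense of scale, the grid dimensions of that map follow directly from the stated size and resolution. The helper name `grid_cells` below is only illustrative and is not part of Beluga or Nav2.

```python
def grid_cells(width_m: float, height_m: float, resolution_m_per_px: float):
    """Return (columns, rows, total cells) of an occupancy grid.

    Illustrative helper; the name and signature are assumptions,
    not an API of Beluga or Nav2.
    """
    cols = round(width_m / resolution_m_per_px)
    rows = round(height_m / resolution_m_per_px)
    return cols, rows, cols * rows


cols, rows, cells = grid_cells(600.0, 200.0, 0.05)
print(f"{cols} x {rows} cells = {cells} cells in total")
```

That is 12000 x 4000 cells, i.e. 48 million cells for which a likelihood value has to be computed.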

Expected behavior

The node should not go unresponsive for so long.

Actual behavior

The node freezes, and the lifecycle manager complains and kills the node.

Additional context

@glpuga glpuga added the bug Something isn't working label Feb 4, 2025
@glpuga glpuga self-assigned this Feb 4, 2025

glpuga commented Feb 4, 2025

I'm still digging into this issue.


glpuga commented Feb 7, 2025

The issue is in Beluga, not in the lifecycle manager nor in the bond_core library.

The lifecycle manager node kills Beluga regardless of the configured heartbeat_timeout value because, after freezing during likelihood map creation, Beluga stops publishing messages on the /bond topic. This remains true even after Beluga becomes responsive again and starts working normally.

That happens because the Beluga side of the bond connection hardcodes the heartbeat_timeout to just 4 seconds. While Beluga is frozen it is obviously not sending Status messages through /bond, but more importantly it is also not processing the heartbeat messages received through the same topic from the lifecycle manager node.

Because of that, the heartbeat timeout timer in Beluga is triggered after 4 seconds, while Beluga is still busy generating the likelihood map. When this timeout gets processed right after the freeze ends, it kills the Beluga side of the bond connection.

Beluga then stops sending Status messages, despite being active, and the lifecycle manager node eventually times out as well (depending on the value of its heartbeat_timeout parameter) and kills the node.
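The sequence described above can be sketched as a small timeline model. This is a deliberately simplified illustration of the behavior reported in this comment, not the bond_core implementation: the function and field names are hypothetical, and it assumes the last heartbeat in each direction is exchanged at the moment the freeze begins.

```python
from typing import NamedTuple, Optional


class BondOutcome(NamedTuple):
    node_side_broken_at: Optional[float]       # when the node drops its side of the bond
    manager_declares_down_at: Optional[float]  # when the manager reports the node as down


def simulate_bond(freeze_start: float, freeze_duration: float,
                  node_timeout: float, manager_timeout: float) -> BondOutcome:
    """Simplified model of the failure (all names here are hypothetical).

    While frozen, the node neither sends nor processes heartbeats.
    """
    freeze_end = freeze_start + freeze_duration

    # The node-side timer expires during the freeze, but its callback only
    # runs once the node is responsive again, i.e. at the end of the freeze.
    node_broken_at = freeze_end if freeze_duration > node_timeout else None

    # Once the node side is broken it never resumes sending Status messages,
    # so the manager times out counting from the last heartbeat it received.
    manager_down_at = None
    if node_broken_at is not None or freeze_duration > manager_timeout:
        manager_down_at = freeze_start + manager_timeout

    return BondOutcome(node_broken_at, manager_down_at)


# Values reported in this issue: 22 s freeze, 4 s node-side timeout,
# 100 s timeout on the lifecycle manager side.
print(simulate_bond(freeze_start=0.0, freeze_duration=22.0,
                    node_timeout=4.0, manager_timeout=100.0))
```

In this model a larger manager-side timeout only delays the failure; the bond survives only if the freeze is shorter than the node-side timeout.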

Multiple Beluga MCL variants are affected (e.g. NDT).

This will affect any system that takes longer than 4 seconds to generate the likelihood map, which in my experience does not require a really large map: probably about 150m x 150m on my i9, and probably a lot less on a lesser machine.
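As a rough cross-check of that estimate, one can scale from the measurement in the bug description (22 seconds for a 600m x 200m map at 0.05 m/pixel). The sketch below assumes that build time grows linearly with the number of cells, which is an assumption made here for illustration and is not something measured in this issue; in practice the time also depends on the map contents and the machine.

```python
import math

# Measurement taken from the bug description.
REFERENCE_CELLS = 12000 * 4000   # 600 m x 200 m at 0.05 m/pixel
REFERENCE_SECONDS = 22.0


def square_side_for_budget(budget_s: float, resolution_m_per_px: float) -> float:
    """Side length (m) of a square map whose build time matches budget_s.

    Assumes build time is proportional to the cell count (illustrative only).
    """
    cells = REFERENCE_CELLS * budget_s / REFERENCE_SECONDS
    return math.sqrt(cells) * resolution_m_per_px


side = square_side_for_budget(budget_s=4.0, resolution_m_per_px=0.05)
print(f"4 s budget is reached at roughly {side:.0f} m x {side:.0f} m")
```

Under that assumption the 4 second limit is reached at a bit under 150 m per side, which is in line with the figure quoted above.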

Solutions:


glpuga commented Feb 13, 2025

This potential problem was attenuated by the fix to #468, which improved the performance of the algorithm that calculates the likelihood map from the occupancy grid of the environment.

The performance depends on the map, however, and for some maps the time it takes to build the likelihood map will still be much greater than the 4-second timeout identified above as the problem in this ticket.

This is true for both Beluga and Nav2, since both of them use the same hard-coded timeout value.

An example map which would cause both Beluga and Nav2 to trigger this problem is a tiled map like this:
[Image: example tiled map]

After discussing this, we agreed that the best solution is to make the timeout value a ROS parameter. This is a WIP.

glpuga added a commit that referenced this issue Feb 17, 2025
Addresses #469 

Signed-off-by: Gerardo Puga <[email protected]>
Co-authored-by: Michel Hidalgo <[email protected]>