When using large maps, the lifecycle node kills amcl_node because of missing heartbeat #469

glpuga · 2025-02-04T19:25:13Z

Bug description

Related to #468, which discusses the beluga_amcl node becoming unresponsive for long periods of time when receiving large maps.

This issue covers a complementary problem triggered by that behavior: because the node becomes unresponsive, the Nav2 lifecycle manager loses track of its connection to the node, kills it after bond_timeout seconds (a lifecycle manager parameter), and then restarts the beluga_amcl node.

Notice that this happens regardless of the value of bond_timeout. With bond_timeout set to a very large value (e.g. 100 seconds) and a 600m x 200m map at 0.05 m/pixel (which takes 22 seconds to initialize), the following sequence happens:

The lifecycle manager and the amcl node get initialized.
The lifecycle manager and the amcl node connect to each other with bond.
amcl receives the map and becomes unresponsive for approx. 22 seconds.
amcl completes the initialization after processing the likelihood map and becomes fully operational.
bond_timeout seconds after the creation of the bond, the lifecycle manager complains:
"[lifecycle_manager]: CRITICAL FAILURE: SERVER amcl IS DOWN after not receiving a heartbeat for NNNN ms"
The lifecycle manager restarts the amcl node.

The fact that the lifecycle manager does not detect the amcl node returning back to operational status may not be an issue within beluga_amcl, and might be in the lifecycle manager itself, but beluga_amcl certainly triggers this failure. The same failure mode does not happen if everything else is the same but nav2_amcl is used instead of beluga_amcl.

See below, the problem is in Beluga itself.

Platform (please complete the following information):

OS: Seen in ROS Humble.
Beluga version: 2.0.2

How to reproduce

Run any system using Beluga and a large map. A 200m x 600m map at 0.05 m/pixel resolution is large enough.
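To give a sense of scale, the map dimensions above can be converted into grid cells. This is only a back-of-the-envelope sketch; the helper name grid_cells is made up for illustration and is not part of Beluga or Nav2.

```python
def grid_cells(width_m: float, height_m: float, resolution_m_per_px: float):
    """Return (columns, rows, total cells) of an occupancy grid.

    Hypothetical helper for illustration only.
    """
    cols = round(width_m / resolution_m_per_px)
    rows = round(height_m / resolution_m_per_px)
    return cols, rows, cols * rows


# The map from this report: 600 m x 200 m at 0.05 m/pixel.
cols, rows, cells = grid_cells(600.0, 200.0, 0.05)
print(cols, rows, cells)  # 12000 4000 48000000
```

That is 48 million cells for which a likelihood value has to be computed while the node is unresponsive.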

Expected behavior

The node should not go unresponsive for so long.

Actual behavior

The node freezes; the lifecycle manager reports a critical failure and restarts it.

Additional context


glpuga · 2025-02-04T19:26:55Z

I'm still digging into this issue.

glpuga · 2025-02-07T15:10:48Z

The issue is in Beluga, not in the Lifecycle manager nor the bond_core library.

The lifecycle manager node kills Beluga regardless of the configured heartbeat_timeout value because, after freezing during likelihood map creation, Beluga stops publishing messages through the /bond topic. This remains true even after Beluga becomes responsive again and starts working normally.

That happens because the Beluga side of the bond connection hardcodes the heartbeat_timeout to just 4 seconds. While Beluga is frozen it is obviously not sending Status messages through /bond, but more importantly it is also not processing the heartbeat messages received through the same topic from the Lifecycle manager node.

Because of that, the heartbeat timeout timer in Beluga is triggered after 4 seconds, while Beluga is still busy generating the likelihood map. When this timer expiration gets processed right after the freeze ends, it kills the Beluga side of the bond connection.

Beluga then stops sending Status messages, despite being active, and the Lifecycle manager node eventually times out as well (depending on the value of its heartbeat_timeout parameter) and kills the node.
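The failure mode described above can be sketched as a small timeline model. This is a simplified illustration under stated assumptions, not the bond_core implementation: it ignores the heartbeat period and message latency, and the names BondOutcome and simulate_bond are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BondOutcome:
    node_bond_broken_at: Optional[float]       # when the node side gives up
    manager_declares_down_at: Optional[float]  # when the manager gives up


def simulate_bond(freeze_start: float, freeze_duration: float,
                  node_timeout: float, manager_timeout: float) -> BondOutcome:
    """Simplified model of both bond sides around a node freeze (seconds)."""
    freeze_end = freeze_start + freeze_duration

    # Node side: its heartbeat timer expires during the freeze if the freeze
    # outlasts the timeout; the expiration is handled once the freeze ends.
    node_broken = freeze_end if freeze_duration > node_timeout else None

    # Manager side: the last heartbeat it heard was sent just before the
    # freeze. If the node side broke its bond it never publishes again, so
    # the manager times out no matter how large its own timeout is.
    if node_broken is not None or freeze_duration > manager_timeout:
        manager_down = freeze_start + manager_timeout
    else:
        manager_down = None
    return BondOutcome(node_broken, manager_down)


# Values from this report: 22 s freeze, 4 s node timeout, 100 s manager timeout.
print(simulate_bond(0.0, 22.0, 4.0, 100.0))
# A 30 s node timeout survives the freeze if the manager timeout is large enough...
print(simulate_bond(0.0, 22.0, 30.0, 100.0))
# ...but not if the manager timeout is shorter than the freeze.
print(simulate_bond(0.0, 22.0, 30.0, 10.0))
```

In this model the first case reproduces the reported behavior: the node side breaks its bond as soon as the freeze ends, and the manager declares the node down once its own timeout elapses, however large that timeout is.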

Multiple Beluga MCL variants are affected (e.g. NDT).

This will affect any system that takes longer than 4 seconds to generate the likelihood map. In my experience that does not require a really large map: probably about 150m x 150m on my i9, and probably a lot less on a lesser machine.
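As a rough sanity check of that estimate, the 22 seconds reported for the 600m x 200m map can be scaled by cell count. This sketch assumes the build time grows linearly with the number of cells, the same 0.05 m/pixel resolution, and the same machine; none of that is guaranteed, since the actual time depends on the map contents. The function name is hypothetical.

```python
# Reference data point from this issue: 600 m x 200 m at 0.05 m/pixel took ~22 s.
REFERENCE_CELLS = 12000 * 4000
REFERENCE_SECONDS = 22.0


def estimated_build_seconds(width_m: float, height_m: float,
                            resolution_m_per_px: float = 0.05) -> float:
    """Estimate likelihood map build time, assuming linear scaling in cells."""
    cells = round(width_m / resolution_m_per_px) * round(height_m / resolution_m_per_px)
    return REFERENCE_SECONDS * cells / REFERENCE_CELLS


print(estimated_build_seconds(150.0, 150.0))  # 4.125
```

Under these assumptions a 150m x 150m map lands right around the 4 second threshold, which agrees with the estimate above.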

Solutions:

Short term, raising the default timeout in Beluga to 30 seconds. The timeout will still have to be raised in the lifecycle manager configuration for things to work, but this at least makes the problem fixable by the user without a Beluga rebuild (currently it is not).
Long term, fixing #468 (When using large maps, the likelihood map constructions takes very long time).

glpuga · 2025-02-13T21:06:53Z

This potential problem was attenuated by the fix to #468, which improved the performance of the algorithm that calculates the likelihood map from the occupancy grid of the environment.

The performance depends on the map, however, and for some maps building the likelihood map will still take much longer than the 4-second timeout identified above as the cause of this ticket.

This is true for both Beluga and Nav2, since both of them use the same hard-coded timeout value.

An example map which would cause both Beluga and Nav2 to trigger this problem is a tiled map like this:

After discussing this, we agreed that the best solution is to make the timeout value a ROS parameter. This is a WIP.


glpuga added the bug label Feb 4, 2025

glpuga self-assigned this Feb 4, 2025

glpuga mentioned this issue Feb 4, 2025

When using large maps, the likelihood map constructions takes very long time #468

Closed

glpuga mentioned this issue Feb 12, 2025

When there are no obstacles in the map, make_likelihood_field() turns the map into one large block of obstacles #472

Closed

glpuga mentioned this issue Feb 13, 2025

Add ros parameter for bond timeout #473

Merged


glpuga added a commit that referenced this issue Feb 17, 2025

Add ros parameter for bond timeout (#473)

7856463

Addresses #469. Signed-off-by: Gerardo Puga. Co-authored-by: Michel Hidalgo.
