Fix race condition on Publisher shutdown #2812

iche033 · 2020-08-04T21:21:59Z

The race condition occurs when the Publisher::OnPublishComplete callback is invoked from a different thread after the Publisher object has been destroyed, causing a crash. There have been a few attempts to fix this race condition (see this commit and this PR), but the issue persists and is just harder to reproduce. When I do reproduce it, I get an assertion error on this line when locking the mutex.

The fix is to bind the callback function to a shared pointer instead of the raw this pointer, so that the object is kept alive until the callback has completed. In the case of Publisher, this should be safe to do because the OnPublishComplete function mainly does bookkeeping work. Note that I added a PublisherPrivate class and stored instances of this class in a static variable to avoid breaking ABI.
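For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the idea, not the actual Gazebo code. The names PubState, Pub, MakeCallback, Watch and CallbackSurvivesOwner are hypothetical and exist only for this illustration. It shows that a callback bound to a shared pointer keeps the bookkeeping state alive after its owner has been destroyed:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the state that the completion callback touches.
class PubState
{
public:
  void OnPublishComplete(unsigned int /*_id*/)
  {
    // Locking a mutex inside a destroyed object is what crashed before.
    std::lock_guard<std::mutex> lock(this->mutex);
    ++this->completed;
  }

  int Completed()
  {
    std::lock_guard<std::mutex> lock(this->mutex);
    return this->completed;
  }

private:
  std::mutex mutex;
  int completed = 0;
};

// Hypothetical owner, playing the role of the Publisher.
class Pub
{
public:
  Pub() : state(std::make_shared<PubState>()) {}

  // Bind to a copy of the shared pointer rather than to the raw `this`.
  std::function<void(unsigned int)> MakeCallback() const
  {
    return std::bind(&PubState::OnPublishComplete, this->state,
        std::placeholders::_1);
  }

  // Lets the demo observe the lifetime of the state without extending it.
  std::weak_ptr<PubState> Watch() const { return this->state; }

private:
  std::shared_ptr<PubState> state;
};

// Returns true if the state outlives its owner for as long as the callback
// exists, and is released once the callback is dropped.
bool CallbackSurvivesOwner()
{
  std::function<void(unsigned int)> cb;
  std::weak_ptr<PubState> watch;
  {
    Pub pub;
    cb = pub.MakeCallback();
    watch = pub.Watch();
  }  // pub is destroyed here, as in the shutdown race.

  assert(!watch.expired());  // The bound shared_ptr keeps the state alive.
  cb(7);                     // Safe: the mutex being locked still exists.
  const bool counted = watch.lock()->Completed() == 1;

  cb = nullptr;              // Dropping the callback releases the state.
  return counted && watch.expired();
}
```

Calling CallbackSurvivesOwner() from a test or from main exercises the late-callback case in a single thread. With a raw this pointer, the equivalent call after destruction would be undefined behavior.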

To reproduce this issue:

Save this model as box.sdf file:

<?xml version="1.0" ?>
<sdf version="1.6">
  <model name="box">
    <pose>0 0 0.5 0 0 0</pose>
    <link name="link">
      <collision name="collision">
        <geometry>
          <box>
            <size>1 1 1</size>
          </box>
        </geometry>
      </collision>
      <visual name="visual">
        <geometry>
          <box>
            <size>1 1 1</size>
          </box>
        </geometry>
      </visual>
    </link>
  </model>
</sdf>

Launch Gazebo using roslaunch:

roslaunch gazebo_ros empty_world.launch verbose:=true

Run the following script that spawns and deletes the box model 500 times:

#!/bin/bash
for i in {1..500}; do
    echo "$i"
    rosparam set pizza --textfile box.sdf
    rosrun gazebo_ros spawn_model -param pizza -sdf -model box -x 0 -y 0.5 -z 0
    sleep 0.5
    rosservice call gazebo/delete_model '{model_name: box}'
done

Without the changes in this PR, the crash occurs on spawn after a couple hundred iterations for me.

Signed-off-by: Ian Chen [email protected]

chapulina

Wow, nice ABI gymnastics!

The fix is to bind the callback function along with a shared pointer instead of raw pointer this so that the object is kept alive for a little longer

Interesting, in this case I guess the global static map helped, because if you had used a regular dataPtr as unique_ptr, it would have died together with the Publisher.

chapulina · 2020-08-07T20:54:18Z

gazebo/transport/Publisher.cc

@@ -188,12 +250,14 @@ void Publisher::SendMessage()
    {
      // Send the latest message.
      int result = this->publication->Publish(*iter,
-          boost::bind(&Publisher::OnPublishComplete, this, _1), *pubIter);
+          boost::bind(&PublisherPrivate::OnPublishComplete,


Could use the opportunity to change to std::bind

done. c1e40a5

chapulina · 2020-08-07T21:06:56Z

gazebo/transport/Publisher.cc

@@ -268,6 +308,9 @@ void Publisher::Fini()
    TopicManager::Instance()->Unadvertise(this->topic, this->id);

  this->node.reset();
+
+  std::lock_guard<std::mutex> lock(pubMapMutex);
+  publisherMap.erase(this->id);


I remember having a reason to destroy all publishers before the node, but I can't remember exactly why. Looking through the code, this pattern seems common:

this->pub.reset(); this->node->Fini(); this->node.reset();

If you don't have a strong reason for terminating the publisher after the node, I'd recommend switching order here.

moved erase before resetting node c1e40a5

iche033 · 2020-08-07T22:44:16Z

Interesting, in this case I guess the global static map helped, because if you had used a regular dataPtr as unique_ptr, it would have died together with the Publisher.

Yep, the PublisherPrivate object has to be held by a shared pointer in a static map. I added a comment in ce2b9df
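A rough sketch of why the static map matters, under stated assumptions: pubMapMutex, publisherMap and the erase on shutdown mirror the names in the diff above, while PubPrivateSketch, PubSketch, MakeCallback, MapHasEntry and EraseKeepsInFlightState are hypothetical names made up for this illustration. The public class gains no new data members, and erasing the map entry does not destroy the private object while a bound callback still holds a copy of the shared pointer:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical private data class that owns the completion callback.
struct PubPrivateSketch
{
  int completed = 0;
  void OnPublishComplete(unsigned int /*_id*/) { ++this->completed; }
};

// File-scope storage keyed by publisher id, so the public class layout
// (and therefore the ABI) stays unchanged.
static std::mutex pubMapMutex;
static std::map<unsigned int, std::shared_ptr<PubPrivateSketch>> publisherMap;

// Hypothetical public class; `id` stands for a member that already existed.
class PubSketch
{
public:
  explicit PubSketch(unsigned int _id) : id(_id)
  {
    std::lock_guard<std::mutex> lock(pubMapMutex);
    publisherMap[this->id] = std::make_shared<PubPrivateSketch>();
  }

  ~PubSketch() { this->Fini(); }

  // Returns a callback bound to the shared private object, or an empty
  // std::function if the entry has already been erased.
  std::function<void(unsigned int)> MakeCallback() const
  {
    std::lock_guard<std::mutex> lock(pubMapMutex);
    auto it = publisherMap.find(this->id);
    if (it == publisherMap.end())
      return {};
    return std::bind(&PubPrivateSketch::OnPublishComplete, it->second,
        std::placeholders::_1);
  }

  // Drops the map's reference; in-flight callbacks keep their own copy.
  void Fini()
  {
    std::lock_guard<std::mutex> lock(pubMapMutex);
    publisherMap.erase(this->id);
  }

private:
  unsigned int id;
};

bool MapHasEntry(unsigned int _id)
{
  std::lock_guard<std::mutex> lock(pubMapMutex);
  return publisherMap.count(_id) > 0;
}

// Returns true if a callback created before shutdown can still run after
// the publisher and its map entry are gone.
bool EraseKeepsInFlightState()
{
  std::function<void(unsigned int)> cb;
  {
    PubSketch pub(1);
    assert(MapHasEntry(1));
    cb = pub.MakeCallback();
  }  // The destructor calls Fini(), which erases the map entry.

  if (MapHasEntry(1) || !cb)
    return false;
  cb(1);  // Still valid: the callback owns a reference to the private data.
  return true;
}
```

Had the private data been held in a unique_ptr member instead, it would have been destroyed together with the public object, which is the point raised in the comment quoted above.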

chapulina

The latest updates look good too 👍

add publisher private class to hold OnPublishComplete callback

4bbe5eb

Signed-off-by: Ian Chen <[email protected]>

iche033 requested review from mjcarroll and scpeters August 4, 2020 21:21

chapulina added the 9 Gazebo 9 label Aug 6, 2020

chapulina approved these changes Aug 7, 2020

iche033 added 2 commits August 7, 2020 15:31

std bind and rearrage pub node fini

c1e40a5

Signed-off-by: Ian Chen <[email protected]>

add comment on PublisherPrivate

ce2b9df

Signed-off-by: Ian Chen <[email protected]>

chapulina approved these changes Aug 7, 2020

iche033 merged commit 5881f82 into gazebo9 Aug 7, 2020

iche033 deleted the publisher_complete branch August 7, 2020 23:05

scpeters mentioned this pull request Jan 25, 2021

Merge 9.14.0 -> 10 #2925

Closed
