-
Notifications
You must be signed in to change notification settings - Fork 497
Fix race condition on Publisher shutdown #2812
Conversation
Signed-off-by: Ian Chen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, nice ABI gymnastics!
The fix is to bind the callback function along with a shared pointer instead of raw pointer this so that the object is kept alive for a little longer
Interesting, in this case I guess the global static map helped, because if you had used a regular dataPtr
as unique_ptr
, it would have died together with the Publisher
.
gazebo/transport/Publisher.cc
Outdated
@@ -188,12 +250,14 @@ void Publisher::SendMessage() | |||
{ | |||
// Send the latest message. | |||
int result = this->publication->Publish(*iter, | |||
boost::bind(&Publisher::OnPublishComplete, this, _1), *pubIter); | |||
boost::bind(&PublisherPrivate::OnPublishComplete, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use the opportunity to change to std::bind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. c1e40a5
gazebo/transport/Publisher.cc
Outdated
@@ -268,6 +308,9 @@ void Publisher::Fini() | |||
TopicManager::Instance()->Unadvertise(this->topic, this->id); | |||
|
|||
this->node.reset(); | |||
|
|||
std::lock_guard<std::mutex> lock(pubMapMutex); | |||
publisherMap.erase(this->id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember having a reason to destroy all publishers before the node, but I can't remember exactly why. Looking through the code, this pattern seems common:
this->pub.reset();
this->node->Fini();
this->node.reset();
If you don't have a strong reason for terminating the publisher after the node, I'd recommend switching order here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved erase before resetting node c1e40a5
Signed-off-by: Ian Chen <[email protected]>
Signed-off-by: Ian Chen <[email protected]>
yep the publisher object has to be a shared ptr in a static map. I added a comment in ce2b9df |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest updates look good too 👍
The race condition problem occurs when the
Publisher::OnPublishComplete
callback is invoked from a different thread after thePublisher
object is destroyed, causing a crash. There's been a few attempts to fixing this race condition, see this commit and this PR, but the issue persists and it is just harder to reproduce. When successfully reproduced, I get a assertion error in this line when locking the mutex.The fix is to bind the callback function along with a shared pointer instead of raw pointer
this
so that the object is kept alive for a little longer until after the callback is complete. In the case of Publisher, this should be safe to do as theOnPublishComplete
function is mainly just doing some bookkeeping work. Note that I added aPublisherPrivate
class and stored instances of this class in a static variable to avoid breaking ABI.To reproduce this issue:
box.sdf
file:Signed-off-by: Ian Chen [email protected]