Lots of tests failing on Windows #174

Closed
ivanpauno opened this issue Apr 29, 2020 · 18 comments · Fixed by #176
Labels
bug Something isn't working

Comments

@ivanpauno
Member

See for example:

https://ci.ros2.org/view/nightly/job/nightly_win_rel/1535/#showFailuresLink

@ivanpauno ivanpauno added the bug Something isn't working label Apr 29, 2020
@ivanpauno
Member Author

@eboasson FYI.

I will try to narrow down what commit in rmw_cyclonedds or cyclonedds generated this.

@eboasson
Collaborator

Thanks @ivanpauno!

@dirk-thomas
Member

This is unfortunately not new but was introduced last week between Wednesday and Friday in two stages.

@ivanpauno
Member Author

After reverting #162 and #145 it started working again, so #145 seems to have introduced the error.
I did run full CI on that PR before merging, so I don't know why the failures didn't appear.

Probably because those jobs were using a .repos file, and in that case CI doesn't automerge with master before testing (which it does do when a branch is specified and a custom .repos file is not used), so the error wasn't triggered.

@ivanpauno
Member Author

@eboasson I'm trying to fix the issue, but if you have any clues, help is welcome.

@eboasson
Collaborator

@ivanpauno @dirk-thomas: a few hours later ... rmw_cyclonedds_cpp currently shuts down everything in the rmw_context_impl_t destructor but it appears that Windows "helpfully" kills all spawned threads before running the global destructors. This causes the destructor to hang somewhere in deleting a DDS entity because things that must happen suddenly no longer happen. The best solution is obviously to stop supporting Windows altogether.

I guess second best would be starting and stopping Cyclone on node creation/destruction instead of on context creation/destruction: that's when it used to do it before #145, so in practice that should work well enough. Apart from having to rework some of the RMW implementation, there's no downside to that change.
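
For illustration, a minimal sketch of the hazard being described (not rmw_cyclonedds code; the names and the condition-variable handshake are made up). A destructor of an object with static lifetime waits for a worker thread to acknowledge shutdown; if the loader has already terminated that thread, as Windows does for globals torn down at process/DLL exit, the wait never completes:

#include <condition_variable>
#include <mutex>
#include <thread>

struct GlobalContext
{
  GlobalContext() : worker_([this] {run();}) {}

  ~GlobalContext()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    stop_ = true;
    cv_.notify_all();
    // Wait for the worker to confirm it has shut down. If the thread was
    // already killed before the global destructors ran, this never returns.
    cv_.wait(lock, [this] {return stopped_;});
    lock.unlock();
    worker_.join();
  }

  void run()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] {return stop_;});
    stopped_ = true;
    cv_.notify_all();
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  bool stop_ = false;
  bool stopped_ = false;
  std::thread worker_;
};

// Destroyed during static destruction, i.e. potentially after the worker
// thread has already been terminated.
static GlobalContext g_context;

int main() {return 0;}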

@ivanpauno
Member Author

@ivanpauno @dirk-thomas: a few hours later ... rmw_cyclonedds_cpp currently shuts down everything in the rmw_context_impl_t destructor but it appears that Windows "helpfully" kills all spawned threads before running the global destructors

lovely ...

This causes the destructor to hang somewhere in deleting a DDS entity because things that must happen suddenly no longer happen. The best solution is obviously to stop supporting Windows altogether.

That doesn't sound like something we would want to do.

I guess second best would be starting and stopping Cyclone on node creation/destruction instead of on context creation/destruction: that's when it used to do it before #145, so in practice that should work well enough. Apart from having to rework some of the RMW implementation, there's no downside to that change.

The approach taken in rmw_fastrtps_cpp is similar to that one:

  • One participant is created per context.
  • It's created after the first Node is created in that Context.
  • It's destroyed after the last Node in that Context is destroyed.

Doing the same thing should avoid the problem.
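
For illustration, a hypothetical sketch of that scheme; the names (ContextImpl, init_participant, fini_participant) are made up and this is not the actual rmw_fastrtps_cpp or rmw_cyclonedds_cpp code:

#include <cstddef>
#include <mutex>

struct ContextImpl
{
  // Called from node creation: bring the participant (and graph pub/sub,
  // discovery thread, ...) up when the first node in this context appears.
  void add_node()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    if (node_count_ == 0) {
      init_participant();
    }
    ++node_count_;
  }

  // Called from node destruction: tear everything down when the last node
  // goes away, i.e. while the process is still running normally and before
  // Windows starts killing threads at exit.
  void remove_node()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    if (--node_count_ == 0) {
      fini_participant();
    }
  }

private:
  void init_participant() {/* create participant, graph pub/sub, ... */}
  void fini_participant() {/* delete them in reverse order */}

  std::mutex mutex_;
  std::size_t node_count_ = 0;
};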

@ivanpauno
Member Author

ivanpauno commented Apr 30, 2020

@eboasson I was able to reproduce the problem (hanging on ctrl-c).
I think I will be able to propose a fix in a couple of hours.

@rotu
Collaborator

rotu commented Apr 30, 2020

I'm sorry I'm coming to the party late. ~rmw_context_impl_t seems overly complicated, reaching in to destroy the rmw_dds_common::Context object. Cleaning up its properties (.graph_guard_condition, .pub, .sub, stopping the discovery thread) should be done in ~rmw_dds_common::Context and we should not be acquiring a lock in a destructor.

https://github.com/eboasson/rmw_cyclonedds/blob/dda187a972686f338dd0518698b6adcf7d46d3f4/rmw_cyclonedds_cpp/src/rmw_node.cpp#L265

@ivanpauno
Member Author

There's a fix proposed in #176.

@ivanpauno
Member Author

ivanpauno commented Apr 30, 2020

I'm sorry I'm coming to the party late. ~rmw_context_impl_t seems overly complicated, reaching in to destroy the rmw_dds_common::Context object. Cleaning up its properties (.graph_guard_condition, .pub, .sub, stopping the discovery thread) should be done in ~rmw_dds_common::Context and we should not be acquiring a lock in a destructor.

That can't be done in ~rmw_dds_common::Context, because e.g. you don't have a definition of rmw_destroy_guard_condition when rmw_dds_common is built.

@rotu
Collaborator

rotu commented Apr 30, 2020

I'm sorry I'm coming to the party late. ~rmw_context_impl_t seems overly complicated, reaching in to destroy the rmw_dds_common::Context object. Cleaning up its properties (.graph_guard_condition, .pub, .sub, stopping the discovery thread) should be done in ~rmw_dds_common::Context and we should not be acquiring a lock in a destructor.

That can't be done in ~rmw_dds_common::Context, because e.g. you don't have a definition of rmw_destroy_guard_condition when rmw_dds_common is built.

It can be done if you implement it in a header:

// context.hpp
struct Context
{
  ~Context();
  rmw_guard_condition_t * graph_guard_condition;
  // ...
};

#include "context_impl.hpp"

// context_impl.hpp -- included at the end of context.hpp, so the destructor
// is defined in the header and the call to rmw_destroy_guard_condition is
// resolved by the rmw implementation that links it, not by rmw_dds_common.
extern rmw_ret_t rmw_destroy_guard_condition(rmw_guard_condition_t * guard_condition);
inline Context::~Context()
{
  rmw_destroy_guard_condition(graph_guard_condition);
  // ...
}

This will make it obvious (as I think you've discovered) that the current design has a chicken-and-egg problem with Contexts versus Nodes. That is, rmw_init is supposed to create the rmw_dds_common::Context but in order to start the publisher and subscriber in the Context constructor, you need to first create a Node (but this should have been created after the Context).

I think the design should be revised to either allow publishers/subscribers to live outside a node, or to create a node whose sole purpose is discovery.

@ivanpauno
Member Author

This will make it obvious (as I think you've discovered) that the current design has a chicken-and-egg problem with Contexts versus Nodes. That is, rmw_init is supposed to create the rmw_dds_common::Context but in order to start the publisher and subscriber in the Context constructor, you need to first create a Node (but this should have been created after the Context).

No, it doesn't work like that. Please read the implementation.

It can be done if you implement it in a header:

That's true. I considered doing that, but in the end I didn't.

@rotu
Collaborator

rotu commented Apr 30, 2020

No, it doesn't work like that. Please read the implementation.

I did. It has to use functions like create_publisher(), which is not part of the public interface, to circumvent rmw_create_publisher because a node is not ready.

It seems a pretty core structural feature of the RMW API that subscribers, publishers, and guard conditions exist within the context (and lifetime) of a node. Allowing them to outlive nodes is asking for trouble.

I think this is really unsound and rmw_dds_common needs to be modified to live on top of the RMW layer (which might also need adjustments) instead of being so tightly coupled with each RMW's implementation details.

@rotu
Collaborator

rotu commented Apr 30, 2020

Also it occurs to me that maybe some of the thread weirdness has to do with parts of the RMW layer outliving the c++ runtime.

@ivanpauno
Member Author

No, it doesn't work like that. Please read the implementation.

I did.

Sorry, my comment was unnecessary.

I think this is really unsound and rmw_dds_common needs to be modified to live on top of the RMW layer (which might also need adjustments) instead of being so tightly coupled with each RMW's implementation details.

That's a good point and it was discussed before starting the implementation.
Implementing this in the rmw implementation is a LOT more painful than doing it above rmw (i.e. in rcl).
The decision was that each rmw implementation should choose how to map entities, and that rcl should make no assumptions about it.
e.g. if a non-DDS-based rmw implementation already has a lightweight entity that can be mapped to a node, it could choose to do that directly (the whole motivation for mapping participants to contexts is the bad performance of creating many of them).

It has to use functions like create_publisher(), which is not part of the public interface, to circumvent rmw_create_publisher because a node is not ready.

I don't see where the issue is.
Our API only allows creating a publisher for a node, and if that was done, the publisher should be destroyed before the node.
DDS allows creating a publisher from a Participant, and that's what we're doing here.
It's true that we're using the abstraction provided in rmw, but that was just done to allow sharing common code in rmw_dds_common (which I finally didn't add there).

It seems a pretty core structural feature of the RMW API that subscribers, publishers, and guard conditions exist within the context (and lifetime) of a node. Allowing them to outlive nodes is asking for trouble.

As commented above, DDS allows creating all those entities from a Participant, and that's what we're doing.
In some cases we're using the rmw abstractions internally, but they don't need the node handle to do what they do (and in any case, we could choose to use the equivalent DDS entity directly instead of the rmw abstraction).
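
To illustrate that point, a sketch using the Cyclone DDS C API (the HelloWorldData_Msg type and its idlc-generated descriptor are borrowed from the Cyclone DDS examples; the topic name is made up): a writer is created directly from the participant, with no node involved at the DDS level.

#include "dds/dds.h"
#include "HelloWorldData.h"  // generated by idlc from HelloWorldData.idl

int main(void)
{
  dds_entity_t participant = dds_create_participant(DDS_DOMAIN_DEFAULT, NULL, NULL);
  dds_entity_t topic = dds_create_topic(
    participant, &HelloWorldData_Msg_desc, "graph_info_example", NULL, NULL);
  dds_entity_t writer = dds_create_writer(participant, topic, NULL, NULL);
  (void) writer;

  // ... publish graph information on the writer ...

  // Deleting the participant recursively deletes the topic and the writer.
  dds_delete(participant);
  return 0;
}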

Also it occurs to me that maybe some of the thread weirdness has to do with parts of the RMW layer outliving the c++ runtime.

I'm not sure what you're referring to.
If that's happening, it can be corrected perfectly with the current approach (though, it's true that things might be harder to review).

@rotu
Collaborator

rotu commented Apr 30, 2020

Sorry, my comment was unnecessary.

It's all good. I get carried away sometimes too, especially when I feel like my hard work is being dismissed. I came in with criticism, and I don't mean to trivialize your valuable contributions!

The decision was that each rmw implementation should choose how to map entities, and that rcl should make no assumptions about it.

There was also no requirement beforehand that each node needs to perform its own discovery. A node is just a logical grouping in ROS and need not have an equivalent in the DDS layer.

It has to use functions like create_publisher(), which is not part of the public interface, to circumvent rmw_create_publisher because a node is not ready.

I don't see where the issue is.
Our API only allows creating a publisher for a node, and if that was done, the publisher should be destroyed before the node.

This is an issue because previously you could associate all the bookkeeping with a publisher/subscriber/guard condition to a node. Now there has to be special handling for these entities with no associated node.

It also means that by implementing the RMW interface, you no longer get ROS2 compatibility "for free". It's harder to understand what middleware must do in order to be considered "correct" and how to test the public interface.

Some of this can be recovered pretty easily - I think much of RMW layer abstraction can be regained by adding an implicit node created when the context is created and destroyed when it is deleted. Is there some reason that wasn't what was done?
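
For concreteness, a rough sketch of that implicit-node idea (the struct and field names are illustrative; this is not an actual rmw API or implementation):

#include "rmw/rmw.h"

// Hypothetical context implementation: the context owns a hidden node that
// exists only to host the graph publisher/subscription, so those entities
// keep the usual node-scoped lifetime.
struct context_impl_sketch
{
  rmw_node_t * graph_node;         // created during rmw_init()
  rmw_publisher_t * graph_pub;     // created via rmw_create_publisher(graph_node, ...)
  rmw_subscription_t * graph_sub;  // created via rmw_create_subscription(graph_node, ...)
};
// rmw_shutdown() would then destroy graph_pub and graph_sub, and finally
// graph_node, preserving the invariant that publishers and subscriptions
// never outlive a node.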

I'm not sure what you're referring to.
If that's happening, it can be corrected perfectly with the current approach (though, it's true that things might be harder to review).

This is wild speculation. I thought maybe it could explain @eboasson's observation that "Windows 'helpfully' kills all spawned threads before running the global destructors" if the threading is part of the C++ runtime but the destructor is called from C after the C++ runtime is torn down and so threading is unavailable.

@ivanpauno
Member Author

There was also no requirement beforehand that each node needs to perform its own discovery. A node is just a logical grouping in ROS and need not have an equivalent in the DDS layer.

Yeah, but there are a lot of graph-related functions that allow listing all the publishers of a node, etc.
Without this extra information, that couldn't be recovered.

P.S.: I think there should be an API to disable/enable this extra graph information, as it's mainly used for debugging.

This is an issue because previously you could associate all the bookkeeping with a publisher/subscriber/guard condition to a node. Now there has to be special handling for these entities with no associated node.

It also means that by implementing the RMW interface, you no longer get ROS2 compatibility "for free". It's harder to understand what middleware must do in order to be considered "correct" and how to test the public interface.

The requirements of the rmw interface haven't changed; if an implementation reuses some of the abstractions in an "unclean" way, that's up to the implementation itself.
I took the approach that seemed easiest for a complex refactor, but that doesn't mean it can't be improved. If you have suggested improvements, I can help review them.

Some of this can be recovered pretty easily - I think much of RMW layer abstraction can be regained by adding an implicit node created when the context is created and destroyed when it is deleted. Is there some reason that wasn't what was done?

I don't see a big advantage of that, but it doesn't sound bad either.

This is wild speculation. I thought maybe it could explain @eboasson's observation that "Windows 'helpfully' kills all spawned threads before running the global destructors" if the threading is part of the C++ runtime but the destructor is called from C after the C++ runtime is torn down and so threading is unavailable.

It's not wild speculation.
The problem was that in rclcpp we have a "default context".
The destruction order of objects with static lifetime in different translation units can't be controlled ... and there begins the destruction-order nightmare.
Before all this change, if you tried to create a global node you could run into the same kind of problem.
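
For illustration, a minimal sketch of the "global node" variant of that problem (ordinary rclcpp usage; nothing specific to rmw_cyclonedds):

#include <memory>
#include "rclcpp/rclcpp.hpp"

// Destroyed during static destruction, in an unspecified order relative to
// statics in other translation units (e.g. the default context), and on
// Windows possibly after worker threads have already been terminated.
static std::shared_ptr<rclcpp::Node> g_node;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  g_node = std::make_shared<rclcpp::Node>("global_node");
  rclcpp::shutdown();
  return 0;  // g_node is destroyed later, among the global destructors
}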


Using a topic is not the only available mechanism to provide this information.
As discussed in ros2/design#250 (comment), the UserData of writers/readers/participants could be used instead of a custom topic.

That's probably the cleanest solution, but I didn't implement it because some DDS vendors lack support for it.
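
For illustration, a sketch of that alternative using the Cyclone DDS C API: the node information is attached to the participant's USER_DATA QoS instead of being published on a custom topic (the payload format here is made up):

#include <string.h>
#include "dds/dds.h"

int main(void)
{
  const char node_info[] = "node_namespace=/;node_name=talker";

  dds_qos_t * qos = dds_create_qos();
  dds_qset_userdata(qos, node_info, strlen(node_info));

  dds_entity_t participant = dds_create_participant(DDS_DOMAIN_DEFAULT, qos, NULL);
  dds_delete_qos(qos);

  // Remote participants see this payload in the user_data field of the
  // DCPSParticipant built-in topic during discovery.

  dds_delete(participant);
  return 0;
}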
