
Correct rcl entity lifecycles and fix spurious test failures #386

Merged · 20 commits · Apr 5, 2024

Conversation

@mxgrey mxgrey commented Apr 4, 2024

For some time now we've noticed spurious segmentation faults and other test failures in CI. This PR aims to eliminate those entirely. I'll break down the problems we observed, followed by how they are being solved by this PR.

Segmentation Faults

Our initial suspicion was that the segmentation faults were being caused by flaws in the drop order of various rcl entities (e.g. rcl_context_t, rcl_node_t, rcl_publisher_t, etc.). This suspicion had a lot of merit, since rcl specifies that all nodes and middleware primitives must be cleaned up (fini'd) before the rcl_context_t that they came from gets fini'd. However, the way the rcl objects' lifecycles were being managed, if a Node struct got dropped before its associated primitives (Subscription, Publisher, Service, Client), then its rcl_context_t could get dropped while the rcl_node_t and the various primitive rcl objects were still alive. This violates the contract that rcl stipulates and seemed like a very reasonable explanation for the spurious segfaults.

Initial solution

To address this, I introduced a [_]Handle struct specifically to manage the lifecycle of each rcl binding object that gets used by rclrs. For example we now have ContextHandle, NodeHandle, PublisherHandle, etc. We already had a SubscriptionHandle which served a similar purpose, and now all the other middleware entity types have a similar handle for themselves. The handle has two roles:

  1. Ensure that the rcl object that it's responsible for is dropped correctly by implementing the Drop trait such that it cleans up the rcl object correctly.
  2. Ensure that any lifecycle dependencies (e.g. primitives must not outlive their nodes, and nodes must not outlive their contexts) are guaranteed by the [_]Handle struct via RAII.
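As a sketch of that RAII dependency (the struct contents below are simplified stand-ins, not the actual rclrs types): each handle holds an Arc to the handle it depends on, so Rust's drop order guarantees the dependency's rcl object is finalized only after every dependent has been finalized.

```rust
use std::sync::{Arc, Mutex};

// Simplified stand-in for rclrs's ContextHandle; in the real code the
// Drop impl calls rcl_shutdown() / rcl_context_fini().
struct ContextHandle {
    log: Arc<Mutex<Vec<&'static str>>>,
}

impl Drop for ContextHandle {
    fn drop(&mut self) {
        self.log.lock().unwrap().push("context fini");
    }
}

// Simplified stand-in for NodeHandle; holding the Arc keeps the context
// alive for as long as this node exists.
struct NodeHandle {
    log: Arc<Mutex<Vec<&'static str>>>,
    _context: Arc<ContextHandle>,
}

impl Drop for NodeHandle {
    fn drop(&mut self) {
        // In rclrs, rcl_node_fini() runs here, before the context's fini.
        self.log.lock().unwrap().push("node fini");
    }
}

fn drop_order() -> Vec<&'static str> {
    let log = Arc::new(Mutex::new(Vec::new()));
    let context = Arc::new(ContextHandle { log: Arc::clone(&log) });
    let node = NodeHandle {
        log: Arc::clone(&log),
        _context: Arc::clone(&context),
    };
    drop(context); // only drops one strong reference; the node still keeps the context alive
    drop(node);    // node fini runs first, then the context's fini
    let order = log.lock().unwrap().clone();
    order
}

fn main() {
    assert_eq!(drop_order(), ["node fini", "context fini"]);
}
```

Even if a user drops their Context before their Node, the context's rcl object outlives the node's, which is exactly the ordering rcl requires.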

Applying this pattern consistently to all types in rclrs gives us a sound and easily maintained way to ensure that we always respect the documented lifecycle contracts of rcl. If our initial suspicion about drop-order flaws had been correct, this would have completely resolved the segmentation faults we were experiencing. So with bated breath I ran the tests through an aggressive loop and... it still segfaulted.

The actual cause

While the Initial Solution was a worthwhile and legitimate thing to fix, it ultimately turned out to not be the cause of the segfaults we were observing (although it does fix other segfaults that could have happened in other scenarios). Ultimately the conclusive cause of the segfaults was this:

rcutils and (much more so) FastDDS perform unprotected mutations to global variables inside of a variety of entity initialization and cleanup functions.

rcutils has some ongoing efforts to weed out its use of unprotected mutations of global variables, but (1) the direction of those efforts is controversial, so there's no telling when they'll be merged, if ever, and (2) the vast majority of the segmentation faults originate in FastDDS, which suggests it makes far more extensive use of global variables and is unlikely to change that aspect of its implementation any time soon. Taking all of that into account, I am left to conclude that we must accommodate the assumption that any call to an rcl entity initialization or cleanup function may be unsafe to run concurrently with any other, regardless of their arguments or their relationship to each other.

To deal with this, I've introduced a global (but thread-safe) static ENTITY_LIFECYCLE_MUTEX which must be locked while any entity initialization or cleanup function is running. This ensures that no matter what unprotected global variables might be getting modified by rcutils or the rmw implementation, there will not be any data races or race conditions between those calls.
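A minimal sketch of the pattern (the static's name matches the PR description; the init function is a hypothetical stand-in for the real unsafe FFI calls):

```rust
use std::sync::Mutex;

// Since Rust 1.63, Mutex::new is a const fn, so a static mutex needs no
// lazy_static or once_cell.
static ENTITY_LIFECYCLE_MUTEX: Mutex<()> = Mutex::new(());

// Hypothetical stand-in for creating an rcl entity; in rclrs the body
// would be an unsafe FFI call such as rcl_publisher_init.
fn create_entity() -> &'static str {
    // Held for the duration of the init call, so any unprotected globals
    // in rcutils or the rmw implementation are only ever mutated by one
    // thread at a time.
    let _lifecycle_lock = ENTITY_LIFECYCLE_MUTEX.lock().unwrap();
    "entity"
} // the guard drops here, releasing the lock as soon as init finishes

fn main() {
    assert_eq!(create_entity(), "entity");
}
```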

Since initialization and cleanup are both extremely low-frequency occurrences and also do not block for any substantial amount of time, I think it is reasonable to have a single global lock for them. The alternative, which is to risk segmentation faults depending on the choice of rmw implementation and the mood that rcutils happens to be in, is something I would consider unacceptable.

Empty Graph Tests

Besides segmentation faults, we also observed sporadic failures in the test_graph_empty test case. This is easily explained by two facts:

  1. cargo test by default will run as many tests at once as possible, according to the number of threads available on the current system
  2. test_graph_empty assumes that the system's node graph will be empty because there are no nodes, topics, etc being constructed within the scope of the test

The assumption made by (2) is trivially violated by (1) if test_graph_empty ever happens to be run at the same time as any other test that does happen to create nodes or middleware primitives.

Prior to this PR, the only way to change the domain ID of a context programmatically within an rclrs application was to set the environment variable ROS_DOMAIN_ID using std::env::set_var. That would not help in this case because changing the environment variable anywhere within the application would modify it across all new contexts within the application, and all the tests being run in parallel are all being run within the same application.

To solve this, I've introduced an API for providing an InitOptions struct while creating a new Context. The InitOptions struct is the rclrs equivalent to rcl_init_options_t, which allows users to specify a domain ID for the context that may be different than what is specified by the ROS_DOMAIN_ID environment variable. Taking advantage of this, we can set the domain ID of the context used by test_graph_empty to be something different from the rest of the tests.

I've arbitrarily chosen 99, but that means tests may fail if the system running the tests happens to have its ROS_DOMAIN_ID set to 99. We could consider putting in a check and remedy for this if anyone is concerned about making sure our tests have absolutely no possible edge cases. Edit: If the ROS_DOMAIN_ID environment variable is unset or set to anything besides 99, then test_graph_empty will use 99 as its domain ID. But if ROS_DOMAIN_ID happens to be set to 99 then we will use 98 for the domain ID of test_graph_empty to avoid conflicts with any other tests that may run simultaneously.
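The fallback logic can be sketched as follows (the helper name is hypothetical, not the actual rclrs test code):

```rust
// Choose the domain ID for test_graph_empty from the value of the
// ROS_DOMAIN_ID environment variable: default to 99, but fall back to
// 98 if the host environment already uses 99, so the test never shares
// a domain with the other tests running in parallel.
fn empty_graph_domain_id(ros_domain_id: Option<&str>) -> usize {
    match ros_domain_id {
        Some("99") => 98,
        _ => 99,
    }
}

fn main() {
    // In practice the argument would come from
    // std::env::var("ROS_DOMAIN_ID").ok().as_deref()
    assert_eq!(empty_graph_domain_id(None), 99);
    assert_eq!(empty_graph_domain_id(Some("42")), 99);
    assert_eq!(empty_graph_domain_id(Some("99")), 98);
}
```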

jhdcs previously approved these changes Apr 4, 2024

@jhdcs jhdcs left a comment
An extensive and thorough refactor. Thank you for this!

I find it unfortunate that we have to resort to a global mutex, as well as adding a dependency on another crate, but I'm also not seeing an alternative at present. Perhaps in the future we'll find a more elegant solution, but for now this should allow ROS 2 Rust programs to be memory-safe again.

Signed-off-by: Michael X. Grey <[email protected]>
mxgrey commented Apr 5, 2024

as well as adding in a dependency on another crate

Fair point that we should try to minimize our dependencies. I decided to look into this more, and I found that the guidance I originally followed for using lazy_static was out of date. Recent changes to stable Rust have made it obsolete, and we can take advantage of newer language features (const fn new() for Mutex) to get everything we need.

I've updated the PR to remove the use of lazy_static.

mxgrey commented Apr 5, 2024

As for the use of a global mutex, I agree that it's icky, but it should be extremely low-impact since it's only used very briefly when initializing or dropping a middleware entity, which in almost all applications is extremely seldom, only happens during wind-up/wind-down, and will almost never happen more than once at a time.

There's definitely no realistic application where simultaneous locks of that mutex will happen at a higher frequency than our rclrs test suite, and I've seen no qualitative difference in how fast the tests run after adding the global mutex.

maspe36 previously approved these changes Apr 5, 2024

@maspe36 maspe36 left a comment

To quote your comment from the matrix chat,

A major selling point of Rust is fearless parallelism, so I don't want us to be in a position where we put implicit restrictions on users about what they can do across threads

I completely agree with this. We do have a higher bar of correctness simply because of the language. With that said, this problem can be mitigated, but ultimately I don't believe it is solvable unless our dependencies protect the mutation of these globals.

Long term, it is not unreasonable to think that consumers of rclrs may have multiple packages in a workspace. Rust packages will still be built as independent crates with cargo under the hood, which means we'll have multiple downloads and potentially multiple versions of rclrs installed, but still linking to the single version of rcl or rmw found in the (ROS 2) workspace. A classic example of a configuration management problem.

This "global" lock may end up being a package level lock, which is unfortunate for some features, like enabling colcon test in colcon-ros-cargo.

I think this is about as good as we can do for now. Long term, I think it is in our best interest to raise issues (and potentially contribute) upstream.

(A little late for me, apologies for any incoherence)

/// This is locked whenever initializing or dropping any middleware entity
/// because we have found issues in RCL and some RMW implementations that
/// make it unsafe to simultaneously initialize and/or drop various types of
/// entities. It seems these C and C++ based libraries will regularly use
Collaborator:

What types of entities do we need to be careful around besides logging (rcl) and middleware (rmw)? If it truly is all entities we can initialize/drop from our FFI deps, then perhaps we can be more direct and say something to the effect of ... and/or drop entities from these libraries

Collaborator Author:

As far as tests indicate, it's only what is mentioned here: middleware entities from RCL and the RMW implementation libraries (not rmw itself). There are some RCL data structures that we do not need to worry about, like rcl_init_options_t, because its initialization and cleanup do not touch any global variables (as far as I can tell).

I will probably participate in a discussion with rcl maintainers to improve the documentation around the use of global variables within rcl functions, and if that effort is fruitful then we'll be able to say with more confidence which exact functions to be concerned about.

Collaborator Author:

I've updated the documentation to clarify the full situation a bit: 0efa831

mxgrey commented Apr 5, 2024

This "global" lock may end up being a package level lock, which is unfortunate for some features, like enabling colcon test in colcon-ros-cargo.

What you're describing here wouldn't happen with this implementation. The "global" lock is only global within a single executable because it lives in static program memory, which is unique per executable. What you're describing would only be an issue if the lock were in shared memory, which is not the case. The global lock only needs to protect against unprotected use of global variables in static program memory which are being performed by rcutils and potentially by any RMW implementation that happens to get used. If there were any unprotected use of shared memory then that would be a critical bug in whatever shared memory library is being used, and that would need to be handled upstream, not by a ROS client library.

That being said, even if it were shared between all the programs on a system that are based on rclrs, it would only have a very minor impact on performance as evidenced by its negligible effect on the rclrs test suite, which you could think of as 48 rclrs programs running simultaneously and sharing this global lock.

Regarding your feedback on the documentation, I'll try to make the changes you've asked for later today.

@maspe36 maspe36 left a comment

This LGTM, Grey. Thank you for all your hard work and pursuit of excellence in rclrs :)

// * This function does not store the ephemeral init_options and c_args pointers.
// * Passing in a zero-initialized rcl_context is mandatory.
// * The entity lifecycle mutex is locked to protect against the risk of global variables
// in the rmw implementation being unsafely modified during initialization.
Collaborator:

nit: This would actually be in rcl, not an rmw implementation, right?

Collaborator Author:

No, the comment is correct as-is. To be more concrete about it: most of the segfaults we were seeing come from FastDDS via rmw_fastrtps_cpp, which is an RMW implementation.

The segfaults happen because that particular RMW implementation makes unsafe use of global variables. That problem would exist whether or not RCL is thread-safe.

Note that "RMW implementation" =/= "the implementation of the rmw library", although the human language aspect of all this is admittedly very confusing.

Collaborator:

But in the unsafe block below, we are calling rcl_init(), not an RMW implementation function. I'm not as familiar with the internals of rcl as I should be, but perhaps rcl_init() is then calling an RMW implementation function?

Collaborator Author:

but perhaps rcl_init() is then calling an RMW implementation function?

Correct, client libraries like rclrs do not ever call any rmw functions directly. Instead rcl is the abstraction layer that all client libraries are supposed to interface with. Many (but not all) rcl functions will then call rmw functions as needed, and those rmw functions will hook the calls into whatever RMW implementation has been loaded.

As a rule of thumb, any rcl function that involves sending information out over a middleware (including forming connections and performing discovery) will call into the RMW implementation. But this is something I would recommend that rcl officially document.

// * The rcl_context is kept alive by the ContextHandle because it is a dependency of the node.
// * The entity lifecycle mutex is locked to protect against the risk of
// global variables in the rmw implementation being unsafely modified during cleanup.
let _lifecycle_lock = ENTITY_LIFECYCLE_MUTEX.lock().unwrap();
Collaborator:

It would be nice to be consistent about when we acquire this lock. Sometimes, when we have what would be a one-liner unsafe function, we acquire the lock inside the unsafe scope. Functionally this doesn't change anything, but it does increase the amount of unsafe code we have.

But maybe this is just a broader refactor we can do after this PR is merged.

Collaborator Author:

We can do that here; we just need to put a fresh scope around the entity lock and the unsafe block, to avoid holding the entity lock longer than what's actually needed.
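A sketch of that scoping, with a hypothetical stand-in for the rcl fini call (the real one is an unsafe FFI call like rcl_node_fini):

```rust
use std::sync::Mutex;

static ENTITY_LIFECYCLE_MUTEX: Mutex<()> = Mutex::new(());

// Hypothetical stand-in for an rcl cleanup function.
fn mock_fini(fini_count: &mut u32) {
    *fini_count += 1;
}

fn cleanup() -> u32 {
    let mut fini_count = 0;
    // A fresh block bounds both the lock and the fini call, so the
    // global lock is not held during any follow-up work.
    {
        let _lifecycle_lock = ENTITY_LIFECYCLE_MUTEX.lock().unwrap();
        mock_fini(&mut fini_count); // in rclrs: unsafe { rcl_node_fini(...) }
    } // lock released here
    // ...follow-up work runs without the entity lifecycle mutex held...
    fini_count
}

fn main() {
    assert_eq!(cleanup(), 1);
}
```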

@maspe36 maspe36 merged commit 646869b into ros2-rust:main Apr 5, 2024
3 checks passed
maspe36 commented Apr 5, 2024

This implementation is at a good point right now. I think any future work (i.e. moving the locks out of the unsafe blocks in a few cases) should be done in a follow up PR.

Thanks again for the great contributions
