
Correct rcl entity lifecycles and fix spurious test failures #386

Merged · 20 commits · Apr 5, 2024

Conversation

@mxgrey mxgrey commented Apr 4, 2024

For some time now we've noticed spurious segmentation faults and other test failures in CI. This PR aims to eliminate those entirely. I'll break down the problems we observed, followed by how they are being solved by this PR.

Segmentation Faults

Our initial suspicion was that the segmentation faults were being caused by flaws in the drop order of various rcl entities (e.g. rcl_context_t, rcl_node_t, rcl_publisher_t, etc.). This suspicion had a lot of merit, since rcl specifies that all nodes and middleware primitives must be cleaned up (fini'd) before the rcl_context_t that they came from gets fini'd. However, the way the rcl objects' lifecycles were being managed, if a Node struct got dropped before its associated primitives (Subscription, Publisher, Service, Client), then its rcl_context_t could get dropped while the rcl_node_t and the various primitive rcl objects were still alive. This violates the contract that rcl stipulates and seemed like a very reasonable explanation for the spurious segfaults.

Initial solution

To address this, I introduced a [_]Handle struct specifically to manage the lifecycle of each rcl binding object that gets used by rclrs. For example we now have ContextHandle, NodeHandle, PublisherHandle, etc. We already had a SubscriptionHandle which served a similar purpose, and now all the other middleware entity types have a similar handle for themselves. The handle has two roles:

  1. Ensure that the rcl object that it's responsible for is dropped correctly by implementing the Drop trait such that it cleans up the rcl object correctly.
  2. Ensure that any lifecycle dependencies (e.g. primitives must not outlive their nodes, and nodes must not outlive their contexts) are guaranteed by the [_]Handle struct via RAII.
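As a sketch of that RAII dependency (the struct contents below are simplified stand-ins, not the actual rclrs types): each handle holds an Arc to the handle it depends on, so Rust's drop order guarantees the dependency's rcl object is finalized only after every dependent has been finalized.

```rust
use std::sync::{Arc, Mutex};

// Simplified stand-in for rclrs's ContextHandle; in the real code the
// Drop impl calls rcl_shutdown() / rcl_context_fini().
struct ContextHandle {
    log: Arc<Mutex<Vec<&'static str>>>,
}

impl Drop for ContextHandle {
    fn drop(&mut self) {
        self.log.lock().unwrap().push("context fini");
    }
}

// Simplified stand-in for NodeHandle; holding the Arc keeps the context
// alive for as long as this node exists.
struct NodeHandle {
    log: Arc<Mutex<Vec<&'static str>>>,
    _context: Arc<ContextHandle>,
}

impl Drop for NodeHandle {
    fn drop(&mut self) {
        // In rclrs, rcl_node_fini() runs here, before the context's fini.
        self.log.lock().unwrap().push("node fini");
    }
}

fn drop_order() -> Vec<&'static str> {
    let log = Arc::new(Mutex::new(Vec::new()));
    let context = Arc::new(ContextHandle { log: Arc::clone(&log) });
    let node = NodeHandle {
        log: Arc::clone(&log),
        _context: Arc::clone(&context),
    };
    drop(context); // only drops one strong reference; the node still keeps the context alive
    drop(node);    // node fini runs first, then the context's fini
    let order = log.lock().unwrap().clone();
    order
}

fn main() {
    assert_eq!(drop_order(), ["node fini", "context fini"]);
}
```

Even if a user drops their Context before their Node, the context's rcl object outlives the node's, which is exactly the ordering rcl requires.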

Applying this pattern consistently to all types in rclrs gives us a sound and easily maintained way to ensure that we always respect the documented lifecycle contracts of rcl. If our initial suspicion about drop-order flaws had been correct, this would have completely resolved the segmentation faults we were experiencing. So with bated breath I ran the tests through an aggressive loop and... it still segfaulted.

The actual cause

While the Initial Solution was a worthwhile and legitimate thing to fix, it ultimately turned out to not be the cause of the segfaults we were observing (although it does fix other segfaults that could have happened in other scenarios). Ultimately the conclusive cause of the segfaults was this:

rcutils and (much more so) FastDDS perform unprotected mutations to global variables inside of a variety of entity initialization and cleanup functions.

rcutils has some ongoing efforts to weed out its use of unprotected mutations of global variables, but (1) the direction of those efforts is controversial, so there's no telling when they'll be merged, if ever, and (2) the vast majority of the segmentation faults originate in FastDDS, which suggests it makes far more extensive use of global variables and is unlikely to change that aspect of its implementation any time soon. Taking all of that into account, I am left to conclude that we must accommodate the assumption that any call to an rcl entity initialization or cleanup function may be unsafe to run concurrently with any other, regardless of their arguments or their relationship to each other.

To deal with this, I've introduced a global (but thread-safe) static ENTITY_LIFECYCLE_MUTEX which must be locked while any entity initialization or cleanup function is running. This ensures that no matter what unprotected global variables might be getting modified by rcutils or the rmw implementation, there will not be any data races or race conditions between those calls.
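A minimal sketch of the pattern (the static's name matches the PR description; the init function is a hypothetical stand-in for the real unsafe FFI calls):

```rust
use std::sync::Mutex;

// Since Rust 1.63, Mutex::new is a const fn, so a static mutex needs no
// lazy_static or once_cell.
static ENTITY_LIFECYCLE_MUTEX: Mutex<()> = Mutex::new(());

// Hypothetical stand-in for creating an rcl entity; in rclrs the body
// would be an unsafe FFI call such as rcl_publisher_init.
fn create_entity() -> &'static str {
    // Held for the duration of the init call, so any unprotected globals
    // in rcutils or the rmw implementation are only ever mutated by one
    // thread at a time.
    let _lifecycle_lock = ENTITY_LIFECYCLE_MUTEX.lock().unwrap();
    "entity"
} // the guard drops here, releasing the lock as soon as init finishes

fn main() {
    assert_eq!(create_entity(), "entity");
}
```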

Since initialization and cleanup are both extremely low-frequency occurrences and also do not block for any substantial amount of time, I think it is reasonable to have a single global lock for them. The alternative, which is to risk segmentation faults depending on the choice of rmw implementation and the mood that rcutils happens to be in, is something I would consider unacceptable.

Empty Graph Tests

Besides segmentation faults, we also observed sporadic failures in the test_graph_empty test case. This is easily explained by two facts:

  1. cargo test by default will run as many tests at once as possible, according to the number of threads available on the current system
  2. test_graph_empty assumes that the system's node graph will be empty because there are no nodes, topics, etc being constructed within the scope of the test

The assumption made by (2) is trivially violated by (1) if test_graph_empty ever happens to be run at the same time as any other test that does happen to create nodes or middleware primitives.

Prior to this PR, the only way to change the domain ID of a context programmatically within an rclrs application was to set the environment variable ROS_DOMAIN_ID using std::env::set_var. That would not help in this case because changing the environment variable anywhere within the application would modify it across all new contexts within the application, and all the tests being run in parallel are all being run within the same application.

To solve this, I've introduced an API for providing an InitOptions struct while creating a new Context. The InitOptions struct is the rclrs equivalent to rcl_init_options_t, which allows users to specify a domain ID for the context that may be different than what is specified by the ROS_DOMAIN_ID environment variable. Taking advantage of this, we can set the domain ID of the context used by test_graph_empty to be something different from the rest of the tests.

I've arbitrarily chosen 99, but that means tests may fail if the system running the tests happens to have its ROS_DOMAIN_ID set to 99. We could consider putting in a check and remedy for this if anyone is concerned about making sure our tests have absolutely no possible edge cases. Edit: If the ROS_DOMAIN_ID environment variable is unset or set to anything besides 99, then test_graph_empty will use 99 as its domain ID. But if ROS_DOMAIN_ID happens to be set to 99 then we will use 98 for the domain ID of test_graph_empty to avoid conflicts with any other tests that may run simultaneously.
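The fallback logic can be sketched as follows (the helper name is hypothetical, not the actual rclrs test code):

```rust
// Choose the domain ID for test_graph_empty from the value of the
// ROS_DOMAIN_ID environment variable: default to 99, but fall back to
// 98 if the host environment already uses 99, so the test never shares
// a domain with the other tests running in parallel.
fn empty_graph_domain_id(ros_domain_id: Option<&str>) -> usize {
    match ros_domain_id {
        Some("99") => 98,
        _ => 99,
    }
}

fn main() {
    // In practice the argument would come from
    // std::env::var("ROS_DOMAIN_ID").ok().as_deref()
    assert_eq!(empty_graph_domain_id(None), 99);
    assert_eq!(empty_graph_domain_id(Some("42")), 99);
    assert_eq!(empty_graph_domain_id(Some("99")), 98);
}
```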

jhdcs previously approved these changes Apr 4, 2024

@jhdcs jhdcs left a comment
An extensive and thorough refactor. Thank you for this!

I find it unfortunate that we have to resort to a global mutex, as well as adding a dependency on another crate, but I'm also not seeing an alternative at present. Perhaps in the future we'll find a more elegant solution, but for now this should allow ROS 2 Rust programs to be memory-safe again.

Signed-off-by: Michael X. Grey <[email protected]>
mxgrey commented Apr 5, 2024

as well as adding in a dependency on another crate

Fair point that we should try to minimize our dependencies. I decided to look into this more, and I found that the guidance I originally followed for using lazy_static was out of date. Recent changes to stable Rust have made it obsolete, and we can take advantage of newer language features (const fn new() for Mutex) to get everything we need.

I've updated the PR to remove the use of lazy_static.

mxgrey commented Apr 5, 2024

As for the use of a global mutex, I agree that it's icky, but it should be extremely low-impact since it's only used very briefly when initializing or dropping a middleware entity, which in almost all applications is extremely seldom, only happens during wind-up/wind-down, and will almost never happen more than once at a time.

There's definitely no realistic application where simultaneous locks of that mutex will happen at a higher frequency than our rclrs test suite, and I've seen no qualitative difference in how fast the tests run after adding the global mutex.

maspe36 previously approved these changes Apr 5, 2024

@maspe36 maspe36 left a comment

To quote your comment from the matrix chat,

A major selling point of Rust is fearless parallelism, so I don't want us to be in a position where we put implicit restrictions on users about what they can do across threads

I completely agree with this. We do have a higher bar of correctness simply because of the language. With that said, this problem can be mitigated, but ultimately I don't believe it is solvable unless our dependencies protect the mutation of these globals.

Long term, it is not unreasonable to think that consumers of rclrs may have multiple packages in a workspace. Rust packages will still be built as independent crates with cargo under the hood, which means we'll have multiple downloads and potentially multiple versions of rclrs installed, but still linking to the single version of rcl or rmw found in the (ROS 2) workspace. A classic example of a configuration management problem.

This "global" lock may end up being a package level lock, which is unfortunate for some features, like enabling colcon test in colcon-ros-cargo.

I think this is about as good as we can do for now. Long term, I think it is in our best interest to raise issues (and potentially contribute) upstream.

(A little late for me, apologies for any incoherence)

/// This is locked whenever initializing or dropping any middleware entity
/// because we have found issues in RCL and some RMW implementations that
/// make it unsafe to simultaneously initialize and/or drop various types of
/// entities. It seems these C and C++ based libraries will regularly use
Collaborator:

What types of entities do we need to be careful around besides logging (rcl) and middleware (rmw)? If it truly is all entities we can initialize/drop from our FFI deps, then perhaps we can be more direct and say something to the effect of ... and/or drop entities from these libraries

Collaborator Author:

As far as tests indicate, it's only what is mentioned here: middleware entities from RCL and the RMW implementation libraries (not rmw itself). There are some RCL data structures that we do not need to worry about, like rcl_init_options_t, because its initialization and cleanup do not touch any global variables (as far as I can tell).

I will probably participate in a discussion with rcl maintainers to improve the documentation around the use of global variables within rcl functions, and if that effort is fruitful then we'll be able to say with more confidence which exact functions to be concerned about.

Collaborator Author:

I've updated the documentation to clarify the full situation a bit: 0efa831

mxgrey commented Apr 5, 2024

This "global" lock may end up being a package level lock, which is unfortunate for some features, like enabling colcon test in colcon-ros-cargo.

What you're describing here wouldn't happen with this implementation. The "global" lock is only global within a single executable because it lives in static program memory, which is unique per executable. What you're describing would only be an issue if the lock were in shared memory, which is not the case. The global lock only needs to protect against unprotected use of global variables in static program memory which are being performed by rcutils and potentially by any RMW implementation that happens to get used. If there were any unprotected use of shared memory then that would be a critical bug in whatever shared memory library is being used, and that would need to be handled upstream, not by a ROS client library.

That being said, even if it were shared between all the programs on a system that are based on rclrs, it would only have a very minor impact on performance as evidenced by its negligible effect on the rclrs test suite, which you could think of as 48 rclrs programs running simultaneously and sharing this global lock.

Regarding your feedback on the documentation, I'll try to make the changes you've asked for later today.

@maspe36 maspe36 left a comment

This LGTM, Grey. Thank you for all your hard work and pursuit of excellence in rclrs :)

// * This function does not store the ephemeral init_options and c_args pointers.
// * Passing in a zero-initialized rcl_context is mandatory.
// * The entity lifecycle mutex is locked to protect against the risk of global variables
// in the rmw implementation being unsafely modified during initialization.
Collaborator:

nit: This would actually be in rcl, not an rmw implementation, right?

Collaborator Author:

No, the comment is correct as-is. To be more concrete about it: most of the segfaults we were seeing come from FastDDS via rmw_fastrtps_cpp, which is an RMW implementation.

The segfaults happen because that particular RMW implementation makes unsafe use of global variables. That problem would exist whether or not RCL is thread-safe.

Note that "RMW implementation" =/= "the implementation of the rmw library", although the human language aspect of all this is admittedly very confusing.

Collaborator:

But in the unsafe block below, we are calling rcl_init(), not an RMW implementation function. I'm not as familiar with the internals of rcl as I should be, but perhaps rcl_init() is then calling an RMW implementation function?

Collaborator Author:

but perhaps rcl_init() is then calling an RMW implementation function?

Correct, client libraries like rclrs do not ever call any rmw functions directly. Instead rcl is the abstraction layer that all client libraries are supposed to interface with. Many (but not all) rcl functions will then call rmw functions as needed, and those rmw functions will hook the calls into whatever RMW implementation has been loaded.

As a rule of thumb, any rcl function that involves sending information out over a middleware (including forming connections and performing discovery) will call into the RMW implementation. But this is something I would recommend that rcl officially document.

// * The rcl_context is kept alive by the ContextHandle because it is a dependency of the node.
// * The entity lifecycle mutex is locked to protect against the risk of
// global variables in the rmw implementation being unsafely modified during cleanup.
let _lifecycle_lock = ENTITY_LIFECYCLE_MUTEX.lock().unwrap();
Collaborator:

It would be nice to be consistent about when we acquire this lock. Sometimes, when we have what would be a one-liner unsafe function, we acquire the lock inside the unsafe scope. Functionally this doesn't change anything, but it does increase the amount of unsafe code we have.

But maybe this is just a broader refactor we can do after this PR is merged.

Collaborator Author:

We can do that here; we just need to put a fresh scope around the entity lock and the unsafe block, to avoid holding the entity lock longer than what's actually needed.
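A sketch of that scoping, with a hypothetical stand-in for the rcl fini call (the real one is an unsafe FFI call like rcl_node_fini):

```rust
use std::sync::Mutex;

static ENTITY_LIFECYCLE_MUTEX: Mutex<()> = Mutex::new(());

// Hypothetical stand-in for an rcl cleanup function.
fn mock_fini(fini_count: &mut u32) {
    *fini_count += 1;
}

fn cleanup() -> u32 {
    let mut fini_count = 0;
    // A fresh block bounds both the lock and the fini call, so the
    // global lock is not held during any follow-up work.
    {
        let _lifecycle_lock = ENTITY_LIFECYCLE_MUTEX.lock().unwrap();
        mock_fini(&mut fini_count); // in rclrs: unsafe { rcl_node_fini(...) }
    } // lock released here
    // ...follow-up work runs without the entity lifecycle mutex held...
    fini_count
}

fn main() {
    assert_eq!(cleanup(), 1);
}
```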

@maspe36 maspe36 merged commit 646869b into ros2-rust:main Apr 5, 2024
3 checks passed
maspe36 commented Apr 5, 2024

This implementation is at a good point right now. I think any future work (i.e. moving the locks out of the unsafe blocks in a few cases) should be done in a follow up PR.

Thanks again for the great contributions
