Redesigning the MultiThreadedExecutor #1276

audrow · 2020-08-12T22:50:50Z

Adds a document for discussing how the MultiThreadedExecutor should be redesigned to avoid calling execute more than one time on one executable object. This discussion was prompted by #1241 and #1275.

Signed-off-by: Audrow <[email protected]>

rclcpp/doc/multi_threaded_executor_redesign.rst

Signed-off-by: Audrow Nash <[email protected]> Co-authored-by: Geoffrey Biggs <[email protected]>

ivanpauno

Thanks for triggering the discussion @audrow !

rclcpp/doc/multi_threaded_executor_redesign.rst

ivanpauno · 2020-08-19T20:01:04Z

rclcpp/doc/multi_threaded_executor_redesign.rst

+        }
+
+   :Advantages:
+     * no spurious wake ups


As commented above, this may avoid spurious wake ups.
But for example, it doesn't fix the problem for timers (though, it does fix the problem in all waitables were "taking data" has a meaning).

I still like the idea of somehow "taking the data" early.
I don't think that "taking data" is conceptually correct, more on that in a different comment.

For the entities where "taking data" doesn't' have sense, the problem can be fixed by making them mutually exclusive with themselves (one thread cannot wait or execute the entity if another thread is executing it).

For the entities where "taking data" doesn't' have sense, the problem can be fixed by making them mutually exclusive with themselves (one thread cannot wait or execute the entity if another thread is executing it).

Should this be another solution listed? Would this work for all entity types? If so, could you briefly describe its implementation?

Should this be another solution listed?

Distinguishing between entities that can be executed concurrently (e.g. you can take data from a subscription from different threads) from the ones that cannot sounds like right to me.

Would this work for all entity types?

In entities like timer, waiting on it to be ready on one thread while scheduling it for execution in another thread is a problem.
For other entities (e.g. subscriptions), that would only worsen performance.

If so, could you briefly describe its implementation?

We could have a flag or something that differentiates between "concurrent" and "non-concurrent" entities.

We could create a mutually exclusive callback group for each timer instead of putting them in the default callback group.
Currently when an entity on a ME callback group is being executed, all the entities in the group (including the one scheduled for execution) cannot be waited on in another thread neither be executed.

We could create a mutually exclusive callback group for each timer instead of putting them in the default callback group.

How would that help? It breaks the existing expectation of users which is that callbacks are not called concurrently unless they put them in different callback groups (the same as ROS 1).

How would that help? It breaks the existing expectation of users which is that callbacks are not called concurrently unless they put them in different callback groups (the same as ROS 1).

AFAIU:

Callbacks from different callback groups can be executed concurrently.

Callbacks from the same Reentrant Callback Group (RCG) can be executed concurrently.

Callbacks from the same Mutually Exclusive Callback Group (MECG) cannot be executed concurrently.

For entities that aren't reentrant (e.g. timers) we could do the following:

By default each timer is added to a new MECG that only contains that timer, and that MECG will not be used for other entities later.

The user can pass a custom MECG. Passing a RCG is not allowed.

I think that doesn't break user assumptions. If it does break them, I'm probably miss-understanding how callback groups work.

This might not be the ideal way of implementing non-reentrant entities, maybe having an ad-hoc flag is a better idea.

Why are timers not reentrant? You should be able to have a timer in a reentrant callback group and have timer callbacks being called concurrently, if that's what you desire.

If you put each timer in its own MECG, then it can be run concurrently with subscription callbacks (assuming the user did not specify a custom CG for either), which is different from ROS 1, where the callbacks are mutually exclusive by default (even between subscriptions and timers).

Putting ROS 1 aside, if you right now create a timer and a subscription, and then spin in a multi-threaded executor they are mutually exclusive, your proposed change would change that behavior, breaking their expectation.

Either way, I don't see how this fixes the problem, but that's likely due to me loosing context on the conversation...

Why are timers not reentrant? You should be able to have a timer in a reentrant callback group and have timer callbacks being called concurrently, if that's what you desire.

We have a race condition when trying to wait on timers in a thread while another thread has scheduled its execution, and sometimes the timer is executed twice in a row (here a failure in CI, but it's almost impossible to reproduce the failure, it's extremely flaky).
There's a patch in the executor trying to avoid the race condition, but I think the patch suffers from a race condition too as explained here.
The approach proposed in #1241 doesn't solve the problem and I proposed a custom workaround in #1275.

I think that having the concept of entities that cannot be executed reentrantly could also solve the problem, and I think that's particular suitable for timers.

If you put each timer in its own MECG, then it can be run concurrently with subscription callbacks (assuming the user did not specify a custom CG for either), which is different from ROS 1, where the callbacks are mutually exclusive by default (even between subscriptions and timers).

Sorry, I thought that the default callback group was reentrant and not ME (code here). The change I proposed wouldn't be acceptable.

ivanpauno · 2020-08-19T20:37:01Z

rclcpp/doc/multi_threaded_executor_redesign.rst

+        void take_data(std::shared_ptr<void> & data) {
+          data = buffer->consume_data();
+        }
+
+        void execute(std::shared_ptr<void> & data) {
+          run_any_callback(data_);
+        }


My main concern is how data is type erased here.
I don't think that's a good idea for a user facing API.

I think it's a better idea to have something like:

std::function<void()> get_handle() { return [data=buffer->consume_data()]() { run_any_callback(data); }; }

In that way, data is not type erased.

IIRC @wjwwood raised the concern that this approach forces entities to have a way of "handling" the data, and it would be nice in the future to be able to create, for example, a Subscription where you can directly take the data.

Though that would be nice, all our current API is based on callbacks.
i.e.: you need to actually pass a callback when creating a Subscription/etc.

If we want to add support to a WaitSet+take approach (which I think it would be nice for some users), we will actually need to split entities e.g. Subscription and ExecutableSubscription.

In that approach, I think that waitables actually don't need to have the concept of "taking data" neither.
You only need to be able to check if the waitable is ready, and then you can execute whatever code is needed.

Longer term:

I think we should separate the concept of Waitable/Condition from the concept of Executable.
The current design mixes both, and they aren't necessarily the same.

I would like to write how I think that can be done, but I don't think that's the goal of this discussion.

If we don't erase the type in the interface, I'm convinced that #1275 is a good idea (each time I read the PR I'm a bit confused, but I think I'm convinced now 😂).
Without that, I still don't like the proposal much (sorry for being a bit hardheaded).

Maybe there's a different way of avoiding erasing the type, instead of my get_handle idea. I would like to hear of other alternatives!

I've added the get_handle method in with an example. Let me know what you think.

I'm still convinced #1275 is not the right solution, also sorry for being hardheaded, and I don't have an issue with the type erasure, since the root cause is that we have to deal with type erased data at the the rcl/rmw layer anyways.

If we added the "handle" which contains a type erased pointer, which the callback could re-type internally, I really don't see that issue with what was proposed here, w.r.t. the std::shared_ptr<void> & data.

If we added the "handle" which contains a type erased pointer, which the callback could re-type internally, I really don't see that issue with what was proposed here, w.r.t. the std::shared_ptr & data.

The difference is that you don't expose to the user a void *.

rclcpp/doc/multi_threaded_executor_redesign.rst

ivanpauno · 2020-08-19T20:40:58Z

rclcpp/doc/multi_threaded_executor_redesign.rst

+
+* Add a preparation method to get the event after :code:`take_data`
+* Take a lambda that can be executed later
+* Store conditions in the waitset (like DDS)


I think that having the concept of conditions (like DDS) is nice, but I'm not sure what this item means exactly.

We spoke about it in our meeting. I'm not sure that I understand it well enough to explain.

It was something like, we could do what DDS does, which is use conditions that can be peaked at or read. Once a condition is read, it is removed from the queue, so they don't come up again. I believe this would avoid spurious wakeups. @wjwwood, is this correct?

Currently in a ROS 2 wait set you can add different kind of entities (subscription, service, ...).
In DDS, you can only add conditions to a WaitSet.
Conditions API is quite simple, you can only check if them are ready or dispatch an attached callback.

There's no concept of "reading" or "peaking" a condition AFAIU.
I think that having something similar to DDS conditions will simplify our custom handling of different cases in wait set and executor code a lot (and the abstractions looks nice too 😃 ).

Signed-off-by: Audrow Nash <[email protected]>

fujitatomoya

to avoid spurious wake-up is really nice and cost effective. this can also fix the ros2/ros2#1035 and #1399.

audrow · 2023-12-06T19:24:26Z

I'm going to close this because there hasn't been movement on this in a few years. Feel free to open this if it becomes relevant again.

audrow added 2 commits August 12, 2020 15:35

Add document

9f10f30

Signed-off-by: Audrow <[email protected]>

Fix indents

8d787b4

Signed-off-by: Audrow <[email protected]>

audrow added the documentation label Aug 12, 2020

audrow self-assigned this Aug 12, 2020

Reword first sentences

63017ee

Signed-off-by: Audrow <[email protected]>

gbiggs reviewed Aug 12, 2020

View reviewed changes

rclcpp/doc/multi_threaded_executor_redesign.rst Outdated Show resolved Hide resolved

Make wording clearer

e46398f

Signed-off-by: Audrow Nash <[email protected]> Co-authored-by: Geoffrey Biggs <[email protected]>

audrow force-pushed the audrow/multi-threaded-executor-redesign-doc branch from 15cb8cf to e46398f Compare August 13, 2020 15:58

audrow mentioned this pull request Aug 18, 2020

Add take_data to Waitable and data to AnyExecutable #1241

Merged

ivanpauno reviewed Aug 19, 2020

View reviewed changes

audrow added 4 commits September 15, 2020 12:11

Remove spaces at ends of lines

e17e1f7

Signed-off-by: Audrow Nash <[email protected]>

Add clarification that the second method doesn't fix timers

21b7628

Signed-off-by: Audrow Nash <[email protected]>

Add an example for returning a lambda function

7315b38

Signed-off-by: Audrow Nash <[email protected]>

Add an example for the preparation idea

f280c2c

Signed-off-by: Audrow Nash <[email protected]>

audrow requested review from gbiggs and ivanpauno September 15, 2020 19:25

fujitatomoya reviewed Oct 26, 2020

View reviewed changes

audrow changed the base branch from master to rolling June 28, 2022 14:21

audrow closed this Dec 6, 2023

audrow deleted the audrow/multi-threaded-executor-redesign-doc branch December 6, 2023 19:24

thomasmoore-torc mentioned this pull request Jun 26, 2024

Timer callbacks can be delayed when using simulation time #2535

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redesigning the MultiThreadedExecutor #1276

Redesigning the MultiThreadedExecutor #1276

audrow commented Aug 12, 2020

ivanpauno left a comment

ivanpauno Aug 19, 2020

audrow Sep 15, 2020

ivanpauno Sep 16, 2020

wjwwood Sep 16, 2020

ivanpauno Sep 17, 2020

wjwwood Sep 17, 2020

ivanpauno Sep 17, 2020

ivanpauno Aug 19, 2020

ivanpauno Aug 19, 2020

audrow Sep 15, 2020

ivanpauno Sep 17, 2020

wjwwood Nov 3, 2020

ivanpauno Nov 4, 2020

ivanpauno Aug 19, 2020

audrow Sep 15, 2020

ivanpauno Sep 16, 2020

fujitatomoya left a comment

audrow commented Dec 6, 2023

Redesigning the MultiThreadedExecutor #1276

Redesigning the MultiThreadedExecutor #1276

Conversation

audrow commented Aug 12, 2020

ivanpauno left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fujitatomoya left a comment

Choose a reason for hiding this comment

audrow commented Dec 6, 2023