Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesigning the MultiThreadedExecutor #1276

Closed
wants to merge 8 commits into from

Conversation

audrow
Copy link
Member

@audrow audrow commented Aug 12, 2020

Adds a document for discussing how the MultiThreadedExecutor should be redesigned to avoid calling execute more than one time on one executable object. This discussion was prompted by #1241 and #1275.

Signed-off-by: Audrow <[email protected]>
Signed-off-by: Audrow <[email protected]>
@audrow audrow self-assigned this Aug 12, 2020
Signed-off-by: Audrow Nash <[email protected]>

Co-authored-by: Geoffrey Biggs <[email protected]>
Copy link
Member

@ivanpauno ivanpauno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for triggering the discussion @audrow !

rclcpp/doc/multi_threaded_executor_redesign.rst Outdated Show resolved Hide resolved
}

:Advantages:
* no spurious wake ups
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented above, this may avoid spurious wake ups.
But for example, it doesn't fix the problem for timers (though, it does fix the problem in all waitables were "taking data" has a meaning).

I still like the idea of somehow "taking the data" early.
I don't think that "taking data" is conceptually correct, more on that in a different comment.

For the entities where "taking data" doesn't' have sense, the problem can be fixed by making them mutually exclusive with themselves (one thread cannot wait or execute the entity if another thread is executing it).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the entities where "taking data" doesn't' have sense, the problem can be fixed by making them mutually exclusive with themselves (one thread cannot wait or execute the entity if another thread is executing it).

Should this be another solution listed? Would this work for all entity types? If so, could you briefly describe its implementation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be another solution listed?

Distinguishing between entities that can be executed concurrently (e.g. you can take data from a subscription from different threads) from the ones that cannot sounds like right to me.

Would this work for all entity types?

In entities like timer, waiting on it to be ready on one thread while scheduling it for execution in another thread is a problem.
For other entities (e.g. subscriptions), that would only worsen performance.

If so, could you briefly describe its implementation?

We could have a flag or something that differentiates between "concurrent" and "non-concurrent" entities.

We could create a mutually exclusive callback group for each timer instead of putting them in the default callback group.
Currently when an entity on a ME callback group is being executed, all the entities in the group (including the one scheduled for execution) cannot be waited on in another thread neither be executed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could create a mutually exclusive callback group for each timer instead of putting them in the default callback group.

How would that help? It breaks the existing expectation of users which is that callbacks are not called concurrently unless they put them in different callback groups (the same as ROS 1).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would that help? It breaks the existing expectation of users which is that callbacks are not called concurrently unless they put them in different callback groups (the same as ROS 1).

AFAIU:

  • Callbacks from different callback groups can be executed concurrently.
  • Callbacks from the same Reentrant Callback Group (RCG) can be executed concurrently.
  • Callbacks from the same Mutually Exclusive Callback Group (MECG) cannot be executed concurrently.

For entities that aren't reentrant (e.g. timers) we could do the following:

  • By default each timer is added to a new MECG that only contains that timer, and that MECG will not be used for other entities later.
  • The user can pass a custom MECG. Passing a RCG is not allowed.

I think that doesn't break user assumptions. If it does break them, I'm probably miss-understanding how callback groups work.

This might not be the ideal way of implementing non-reentrant entities, maybe having an ad-hoc flag is a better idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are timers not reentrant? You should be able to have a timer in a reentrant callback group and have timer callbacks being called concurrently, if that's what you desire.

If you put each timer in its own MECG, then it can be run concurrently with subscription callbacks (assuming the user did not specify a custom CG for either), which is different from ROS 1, where the callbacks are mutually exclusive by default (even between subscriptions and timers).

Putting ROS 1 aside, if you right now create a timer and a subscription, and then spin in a multi-threaded executor they are mutually exclusive, your proposed change would change that behavior, breaking their expectation.

Either way, I don't see how this fixes the problem, but that's likely due to me loosing context on the conversation...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are timers not reentrant? You should be able to have a timer in a reentrant callback group and have timer callbacks being called concurrently, if that's what you desire.

We have a race condition when trying to wait on timers in a thread while another thread has scheduled its execution, and sometimes the timer is executed twice in a row (here a failure in CI, but it's almost impossible to reproduce the failure, it's extremely flaky).
There's a patch in the executor trying to avoid the race condition, but I think the patch suffers from a race condition too as explained here.
The approach proposed in #1241 doesn't solve the problem and I proposed a custom workaround in #1275.

I think that having the concept of entities that cannot be executed reentrantly could also solve the problem, and I think that's particular suitable for timers.

If you put each timer in its own MECG, then it can be run concurrently with subscription callbacks (assuming the user did not specify a custom CG for either), which is different from ROS 1, where the callbacks are mutually exclusive by default (even between subscriptions and timers).

Sorry, I thought that the default callback group was reentrant and not ME (code here). The change I proposed wouldn't be acceptable.

Comment on lines 42 to 48
void take_data(std::shared_ptr<void> & data) {
data = buffer->consume_data();
}

void execute(std::shared_ptr<void> & data) {
run_any_callback(data_);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern is how data is type erased here.
I don't think that's a good idea for a user facing API.

I think it's a better idea to have something like:

std::function<void()> get_handle() {
  return [data=buffer->consume_data()]() {
    run_any_callback(data);
  };
}

In that way, data is not type erased.

IIRC @wjwwood raised the concern that this approach forces entities to have a way of "handling" the data, and it would be nice in the future to be able to create, for example, a Subscription where you can directly take the data.

Though that would be nice, all our current API is based on callbacks.
i.e.: you need to actually pass a callback when creating a Subscription/etc.

If we want to add support to a WaitSet+take approach (which I think it would be nice for some users), we will actually need to split entities e.g. Subscription and ExecutableSubscription.

In that approach, I think that waitables actually don't need to have the concept of "taking data" neither.
You only need to be able to check if the waitable is ready, and then you can execute whatever code is needed.


Longer term:

I think we should separate the concept of Waitable/Condition from the concept of Executable.
The current design mixes both, and they aren't necessarily the same.

I would like to write how I think that can be done, but I don't think that's the goal of this discussion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't erase the type in the interface, I'm convinced that #1275 is a good idea (each time I read the PR I'm a bit confused, but I think I'm convinced now 😂).
Without that, I still don't like the proposal much (sorry for being a bit hardheaded).

Maybe there's a different way of avoiding erasing the type, instead of my get_handle idea. I would like to hear of other alternatives!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the get_handle method in with an example. Let me know what you think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still convinced #1275 is not the right solution, also sorry for being hardheaded, and I don't have an issue with the type erasure, since the root cause is that we have to deal with type erased data at the the rcl/rmw layer anyways.

If we added the "handle" which contains a type erased pointer, which the callback could re-type internally, I really don't see that issue with what was proposed here, w.r.t. the std::shared_ptr<void> & data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we added the "handle" which contains a type erased pointer, which the callback could re-type internally, I really don't see that issue with what was proposed here, w.r.t. the std::shared_ptr & data.

The difference is that you don't expose to the user a void *.

rclcpp/doc/multi_threaded_executor_redesign.rst Outdated Show resolved Hide resolved

* Add a preparation method to get the event after :code:`take_data`
* Take a lambda that can be executed later
* Store conditions in the waitset (like DDS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that having the concept of conditions (like DDS) is nice, but I'm not sure what this item means exactly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We spoke about it in our meeting. I'm not sure that I understand it well enough to explain.

It was something like, we could do what DDS does, which is use conditions that can be peaked at or read. Once a condition is read, it is removed from the queue, so they don't come up again. I believe this would avoid spurious wakeups. @wjwwood, is this correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently in a ROS 2 wait set you can add different kind of entities (subscription, service, ...).
In DDS, you can only add conditions to a WaitSet.
Conditions API is quite simple, you can only check if them are ready or dispatch an attached callback.

There's no concept of "reading" or "peaking" a condition AFAIU.
I think that having something similar to DDS conditions will simplify our custom handling of different cases in wait set and executor code a lot (and the abstractions looks nice too 😃 ).

Copy link
Collaborator

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to avoid spurious wake-up is really nice and cost effective. this can also fix the ros2/ros2#1035 and #1399.

@audrow audrow changed the base branch from master to rolling June 28, 2022 14:21
@audrow
Copy link
Member Author

audrow commented Dec 6, 2023

I'm going to close this because there hasn't been movement on this in a few years. Feel free to open this if it becomes relevant again.

@audrow audrow closed this Dec 6, 2023
@audrow audrow deleted the audrow/multi-threaded-executor-redesign-doc branch December 6, 2023 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants