[core] Use 1 thread for all fibers for an actor scheduling queue. #37949

Merged
jjyao merged 19 commits into ray-project:master from single-thread-for-fibers-per-actor
Aug 9, 2023

Conversation

@rynewang rynewang commented Jul 31, 2023

Today we have one thread per submitter worker per actor to handle the fibers that submit the actor tasks. Because we lack a way to stop boost fibers, we never stop these threads, so the number of threads in an actor process is unbounded.

This PR makes all fibers for an actor run in the same thread, which bounds the number of threads in an actor process.
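The idea can be illustrated without boost fibers. The following is only a minimal sketch, assuming plain callables in place of fibers: `SharedExecutor`, `CallerQueue`, and `DistinctThreadsUsed` are hypothetical names for illustration and are not the classes touched by this PR. Each per-caller queue borrows one shared executor instead of owning a thread, so the thread count stays at one no matter how many callers appear.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <set>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the shared fiber thread: a single OS thread
// drains work posted by any number of per-caller queues.
class SharedExecutor {
 public:
  SharedExecutor() : worker_([this] { Run(); }) {}

  ~SharedExecutor() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopping_ = true;
    }
    cv_.notify_one();
    worker_.join();  // Run() drains the remaining tasks before returning
  }

  void Post(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // stopping and nothing left to run
        task = std::move(tasks_.front());
        tasks_.pop_front();
      }
      task();  // run outside the lock so tasks may post more work
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> tasks_;
  bool stopping_ = false;
  std::thread worker_;  // declared last so the other members exist first
};

// Hypothetical per-caller queue: it borrows the shared executor rather
// than owning a thread of its own.
class CallerQueue {
 public:
  explicit CallerQueue(SharedExecutor &executor) : executor_(executor) {}
  void Submit(std::function<void()> task) { executor_.Post(std::move(task)); }

 private:
  SharedExecutor &executor_;
};

// Submits one task per caller and reports how many distinct threads ran them.
std::size_t DistinctThreadsUsed(int num_callers) {
  std::set<std::thread::id> ids;
  {
    SharedExecutor executor;
    std::vector<CallerQueue> queues;
    for (int i = 0; i < num_callers; ++i) queues.emplace_back(executor);
    for (auto &queue : queues) {
      queue.Submit([&ids] { ids.insert(std::this_thread::get_id()); });
    }
  }  // the executor destructor drains the queue and joins the thread
  return ids.size();
}
```

Calling `DistinctThreadsUsed(8)` runs eight callers' tasks on one thread. With a thread per `CallerQueue`, the count would instead grow with the number of callers.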

Also changed fiber_stopped_event to std::condition_variable and std::mutex.
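For readers unfamiliar with the pattern, a one-shot event built from `std::condition_variable` and `std::mutex` might look like the sketch below. `StopEvent` and `RunAndWaitForStop` are hypothetical names; this is not the code from the PR, only an illustration of the two primitives it mentions.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical one-shot event built only from std::mutex and
// std::condition_variable.
class StopEvent {
 public:
  void Notify() {
    // Notify while holding the lock so a waiter cannot observe the flag
    // and return before notify_all() has been called.
    std::lock_guard<std::mutex> lock(mu_);
    notified_ = true;
    cv_.notify_all();
  }

  void Wait() {
    std::unique_lock<std::mutex> lock(mu_);
    // The predicate guards against spurious wakeups and against
    // Notify() having run before Wait() started.
    cv_.wait(lock, [this] { return notified_; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool notified_ = false;
};

// Starts a worker, blocks until it signals the event, and reports whether
// the worker's write was visible after the wait.
bool RunAndWaitForStop() {
  StopEvent stopped;
  bool work_done = false;
  std::thread worker([&] {
    work_done = true;  // written before Notify() takes the mutex
    stopped.Notify();
  });
  stopped.Wait();
  bool observed = work_done;  // ordered after the write through mu_
  worker.join();              // join before `stopped` goes out of scope
  return observed;
}
```

The event here lives on the stack and the worker is joined before it is destroyed. If the notifying thread can outlive the owner, the event needs shared ownership instead.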

Fixes #33957.
Fixes #38240.

@rynewang rynewang changed the title Use 1 thread for all fibers for an actor scheduling queue. [core] Use 1 thread for all fibers for an actor scheduling queue. Jul 31, 2023
@@ -95,7 +94,6 @@ void ConcurrencyGroupManager<ExecutorType>::Stop() {
}
}

template class ConcurrencyGroupManager<FiberState>;
Collaborator

Am I missing anything? Shouldn't we only have 1 fiber state (1 thread) per concurrency group? What do you mean by 1 thread per submitter worker?

Collaborator

Oh, we have an actor scheduling queue per caller worker.

@rynewang
Contributor Author

rynewang commented Aug 2, 2023 via email

Contributor

@rkooo567 rkooo567 left a comment

Can we add a test in the Python layer with the repro script?

@rynewang rynewang force-pushed the single-thread-for-fibers-per-actor branch from 6ed925b to 30738e2 Compare August 2, 2023 18:24
Collaborator

@jjyao jjyao left a comment

Nice.

@@ -18,6 +18,8 @@
#include <chrono>

#include "ray/util/logging.h"
#include "ray/util/macros.h"
Collaborator

Is this needed?

Contributor Author

Needed for RAY_UNUSED. Not sure why it did not complain before.
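As background, a macro of this kind usually just discards the value of an expression. The sketch below uses a hypothetical `EXAMPLE_UNUSED`, whose name and definition are assumptions and are not copied from `ray/util/macros.h`, together with a hypothetical `[[nodiscard]]` function (C++17). A file that uses such a macro without including its header still compiles if another include pulls the header in transitively, which is one possible reason a missing include goes unnoticed.

```cpp
// Hypothetical macro with the same intent as RAY_UNUSED. The name and
// definition are assumptions for illustration, not Ray's actual header.
#define EXAMPLE_UNUSED(expr) (void)(expr)

// Hypothetical function whose result callers must not silently drop.
[[nodiscard]] inline bool TryStop() { return true; }

inline int StopAndIgnoreResult() {
  // Casting to void marks the result as deliberately ignored, which
  // silences the nodiscard warning at this call site.
  EXAMPLE_UNUSED(TryStop());
  return 0;
}
```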

@rynewang rynewang force-pushed the single-thread-for-fibers-per-actor branch from 16a6333 to 3dcf8ad Compare August 7, 2023 14:44
Signed-off-by: Ruiyang Wang <[email protected]>
@rynewang
Contributor Author

rynewang commented Aug 7, 2023

There's an ASAN & TSAN error that I believe originates from boostorg/fiber#214. Trying to add flags to solve it...

@rynewang
Contributor Author

rynewang commented Aug 8, 2023

Turns out the ASAN error comes not from boost but from our own variable lifetime management. Fixed.

@rynewang rynewang force-pushed the single-thread-for-fibers-per-actor branch from 18bd679 to 88c30d0 Compare August 9, 2023 00:49
@@ -124,7 +134,7 @@ class FiberState {
 // no fibers can run after this point as we don't yield here.
 // This makes sure this thread won't accidentally
 // access being destructed core worker.
-fiber_stopped_event_.Notify();
+fiber_stopped_event->Notify();
Collaborator

Let's add a fiber_stopped_event->clear() to explicitly free the pointer.

Contributor Author

Can't do clear because the lambda captures the shared_ptr as const. We could mark the lambda mutable, but I guess that's overkill.
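The language rule behind this exchange can be shown in isolation. A lambda's by-value captures are const inside its call operator unless the lambda is declared `mutable`, and `std::shared_ptr` releases its reference through `reset()` (it has no `clear()`). The sketch below is only an illustration, assuming an `int` as a stand-in for the event; `ReleaseInsideMutableLambda` is a hypothetical helper, not code from the PR.

```cpp
#include <memory>

// Returns true if the object stays alive while the lambda holds its copy
// and is destroyed once the lambda resets that copy.
inline bool ReleaseInsideMutableLambda() {
  auto event = std::make_shared<int>(42);  // stand-in for the stop event
  std::weak_ptr<int> observer = event;     // watches without owning

  // Without `mutable`, operator() is const, the captured copy is const,
  // and `event.reset()` inside the body would fail to compile.
  auto task = [event]() mutable {
    event.reset();  // releases the lambda's own reference
  };

  event.reset();  // drop the caller's reference; the lambda's copy remains
  bool alive_before = !observer.expired();
  task();
  return alive_before && observer.expired();
}
```

Leaving the lambda non-mutable, as in the PR, means the captured pointer is released when the lambda itself is destroyed rather than at an explicit point inside the body.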

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
@jjyao jjyao merged commit 0d6126f into ray-project:master Aug 9, 2023
@rynewang rynewang deleted the single-thread-for-fibers-per-actor branch August 9, 2023 20:53
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…y-project#37949)

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: NripeshN <[email protected]>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…y-project#37949)

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: e428265 <[email protected]>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…y-project#37949)

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Victor <[email protected]>