Stealing jobs from an existing pool / scope #1052

emilio · 2023-05-30T20:01:39Z

In Firefox, we're using scope_fifo to distribute a bunch of work into a ThreadPool. We've found that stopping our current thread to do other work in the thread pool is a regression in a variety of heterogeneous devices, because the scheduler will schedule the thread-pool in small cores for a while (see here for an analysis).

Luckily, @rocallahan and @cuviper added in_place_scope / in_place_scope_fifo (#844 and related), which (if I'm reading correctly) should allow us to keep the main thread busy at least for a while (so, spawn work into the threadpool, but keep doing work in the main thread).

It seems, however, there would be cases where the main thread will eventually run out of work to do, and would go to sleep. It'd be nice if the main thread could then keep doing work by stealing work posted by the thread pool until the scope is gone.

It seems that should be doable, at a glance, but it might be significant work... I'm not familiar with Rayon internals all that much but I'd be willing to give it a shot with some guidance.

@cuviper does this seem like a reasonable use case? Do you know how hard would it be to provide a way of doing this?

Thanks!

cc: @bholley @jrmuizel @smaug----

The text was updated successfully, but these errors were encountered:

jrmuizel · 2023-05-31T15:04:34Z

This image shows some of the motivation:

The main thread of Firefox "Web Content" is running on CPU 5 (a performance core) at full speed. Just before 3.8ms, we dispatch some work to rayon. It is only scheduled on the efficient cores (CPU 0-3). These cores have previously been sitting idle so it takes a while for them to ramp up to full speed. In the meantime CPU 5 sits idle and gradually drops in frequency from 2.4Ghz down to 1.4GHz. When the rayon work is done, CPU 5 resumes but takes 6ms to get back up to 2.4Ghz.

If the main thread had been kept busy instead of sleeping, things would be better for a number of reasons:

We'd be using 5 cores instead of 4
We'd be using a performance core instead of just the efficient ones.
The performance core would not have clocked down. Thus when the rayon work is complete, it would be running at full speed instead of 1.4Ghz.

cuviper · 2023-06-02T22:03:42Z

@cuviper does this seem like a reasonable use case? Do you know how hard would it be to provide a way of doing this?

I get the motivation, but I'm not really sure how hard it will be. I mostly worry about how to get the completion notifications working for jobs that were queued within the pool, but now have to deal with an "external" thread.

... we dispatch some work to rayon. It is only scheduled on the efficient cores (CPU 0-3).

Is that by chance, because they were idle, or are they deliberately pinned to E cores? Because if you're pinning, I wonder if you could pin one of the rayon threads to the same as the "Web Content", so they naturally exchange the spun-up CPU at the OS level.

Wild idea: what if "Web Content" just ran on a rayon thread in the first place, so it naturally participated in work-stealing? You'd want to make sure that's not sitting on top of any other work, but since you control the application, you should be able to ensure that's the first thing into the pool.

cuviper · 2023-06-03T16:12:01Z

Wild idea: what if "Web Content" just ran on a rayon thread in the first place, so it naturally participated in work-stealing?

It might also be interesting to make that an explicit opt-in during pool creation, e.g. ThreadPoolBuilder::use_current(). That would add the current thread as part of the pool without sending it into the pool's main-loop. We have some experience with that in the wasm fallback (#1019), and that seems to work fine.

Let me know if this approach sounds at all palatable to you!

emilio · 2023-06-04T09:15:34Z

... we dispatch some work to rayon. It is only scheduled on the efficient cores (CPU 0-3).

Is that by chance, because they were idle, or are they deliberately pinned to E cores? Because if you're pinning, I wonder if you could pin one of the rayon threads to the same as the "Web Content", so they naturally exchange the spun-up CPU at the OS level.

That seems by chance. We're experimenting with pinning / affinity tweaks, which seem like it could help as well indeed.

Wild idea: what if "Web Content" just ran on a rayon thread in the first place, so it naturally participated in work-stealing? You'd want to make sure that's not sitting on top of any other work, but since you control the application, you should be able to ensure that's the first thing into the pool.

So, "web content" is the process' "main" thread. In the parent process that'd be the UI thread as well, so it seems like a rather big change.

It might also be interesting to make that an explicit opt-in during pool creation, e.g. ThreadPoolBuilder::use_current(). That would add the current thread as part of the pool without sending it into the pool's main-loop. We have some experience with that in the wasm fallback (#1019), and that seems to work fine.

That seems like an interesting approach that should work indeed! One potential issue is how do spawn() calls from the main thread behave if that thread never ends up yielding to rayon. AIUI they'd end up in the main thread's local queue, but assuming they execute in other rayon threads just fine, that seems worth a try, we'd just need to make sure that there's at least one other thread able to pick up work in the pool.

Would you be able to post a rayon patch with that use_current() method? If so that'd be great, otherwise I can try to give it a shot sometime this week.

cuviper · 2023-06-05T23:57:19Z

Here's a proof of concept for use_current:
master...cuviper:rayon:use_current

emilio · 2023-06-06T16:04:30Z

@cuviper I gave that patch a shot in https://bugzilla.mozilla.org/show_bug.cgi?id=1836948 and it seems to work rather nicely!

cuviper · 2023-06-06T16:31:12Z

@emilio I'm not familiar with your perf system - which result should I be looking at there? I'd be interested in a new image like @jrmuizel showed too, if that can be recorded.

One caution that I want to put in the docs is that the pool-building thread can only involve itself in use_current for a single pool, and I'll need to put in an error for that case. So that's akin to a global resource, and I would generally not recommend a library crate to create pools this way -- but stylo-in-firefox can probably operate with broad context.

There are also possible footguns here, like spawn that's never yielded to (like you mentioned), or the possibility of the use_current thread exiting entirely. But as long as none of these are safety problems, "just" deadlocks or lack of forward progress, then I think we only need to make sure those are problems the user can fix themselves.

emilio · 2023-06-09T20:01:47Z

@cuviper: @jrmuizel posted some screenshots in the bug linked above. There are some parameters to tweak how much work the main thread should retain for itself, but this screenshot for example seems a lot better than the one above: the "Web Content" thread can wake up after running out of work again, and keeps the fast core running. Actually, benchmark results look promising even on desktop.

I want to do some more profiling on Android, and specially try to understand why sometimes the main thread doesn't wake up when the workers aren't getting scheduled (my theory is that the workers are mid-processing some work, such that there's no pending work in the thread pool, but the kernel isn't scheduling them so there's no progress temporarily). But that seems like a side-issue that we can tweak, and over-all use_current() seems useful as implemented above.

nical · 2023-06-15T16:50:21Z

Would it be possible to create a rayon thread pool with a fixed number of extra workers that would be explicitly used by one thread at a time?

Something along the lines of:

let (pool, workers) = ThreadPoolBuilder().with_extra_workers(3);

let worker = wrokers.pop().unwrap();
worker.scope(|| {
      items.par_iter().map( //...etc
});

The API I came up on the spot isn't particularly nice (if sound at all), but more generally it would be great to be able to let multiple threads get the benefits of what use_current provides, and maybe the scoped API or some other trick would allow lifting the limitation of having a single thread pool doing use_current per thread.

cuviper · 2023-06-15T22:01:31Z

@nical I think that's getting into the realm where it might be better with lower-level building blocks. Like maybe with a spawn_handler, instead of calling ThreadBuilder::run there might be something like ThreadBuilder::init(self). So you could arrange to send a few ThreadBuilders to existing threads that only init, while the rest spawn new dedicated threads that run the main pool loop, in whatever balance you see fit. The use_current PoC is basically doing that init on the current thread for index 0 and letting the rest spawn+run normally.

Are you also imagining that the "worker" in your design could be returned and usable elsewhere after its scope?

nical · 2023-06-16T13:15:13Z

Are you also imagining that the "worker" in your design could be returned and usable elsewhere after its scope?

Yes, but I haven't looked into how feasible/expensive it would be to install/uninstall the thread pool during each scope.

I feel that a crate could want to rely on the performance characteristics of having its thread(s) be able to steal jobs (in my experience that's very significant) but it's a difficult choice to make because the side effect of having only a single installed pool for any given thread is not innocuous.

I'm being a bit nit-picky. My main motivation is that in another part of Firefox (WebRender) we use rayon thread pools and schedule work from multiple threads. I'd like to have them benefit from a "use_current"-like thing and that does not require moving the workers between threads in between scopes or having several pool register themselves on the same thread.

cuviper · 2023-06-20T23:49:03Z

Interesting -- I think we could potentially have an API around "floating" WorkerThreads, pre-allocated in the pool and then assigned to TLS during some scope. We'd have to decide some semantic questions, like:

What if you're already in a different pool -- error? leave as-is? save-and-restore to apply the new temp pool?
Do these count in current_num_threads? Do they participate in broadcast calls?
- We've hedged in documentation that these may be dynamic, but I still think that would be hard to accomplish.
What happens to work left in its local queue when one of these threads is unassigned?

emilio · 2023-06-21T15:18:54Z

@nical / @cuviper: Given it's not much work to implement (it's basically already there, and useful), and that it helps us fix the scheduling issues we've found, wdyt about getting use_current that landed and filing a separate issue about the floating worker thread APIs? They seem separate enough and with potentially a lot more unsolved questions...

On that note, if that seems reasonable and the current branch is not landable as-is, what's left? I'd be happy to help moving it forward if possible. Thanks.

cuviper · 2023-06-23T20:10:49Z

If there's agreement that it's good enough for now, and maintainable in the long term (seems ok) -- the function should probably be called use_current_thread (:bike: :house:), and it needs more documentation, and tests as well.

Closes rayon-rs#1052

As per suggestion in rayon-rs#1052.

See discussion in rayon-rs#1052. Closes rayon-rs#1052.