What problem does this solve or what need does it fill?
Parallel tasks are currently bottlenecked by thread wake-up times. Even if we can schedule tasks from the parallel system scheduler or from parallel iteration, it can take quite a while to wake all of the threads in the TaskPools up. This isn't particularly an issue for IO or AsyncCompute tasks, but it has a measurable impact on compute-oriented tasks.
For example, see this run of `propagate_transforms` in Tracy. There is a 90.43µs lag between when the first task starts and when the last task starts. This significantly impacts the system's total runtime.

There are likely multiple factors at work here:
- `futures_lite::block_on` uses `parking` internally, which yields the thread back to the OS when there is no work to be done, in an attempt to conserve CPU power. Waking threads up from this has notable latency (see the sketch after this list).
- `async_executor` conservatively wakes up only one thread at a time to avoid contention on the global and local queues. This forces threads to wake up in a cascading fashion. This seems to be the norm in both `async_executor` and tokio.
- Our current use of `Executor::try_tick` in a hot loop seems to be increasing contention on the global task queue.
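
To make the first point concrete, here is a minimal sketch of a parking-based `block_on` (an illustration of the pattern, not `futures_lite`'s actual implementation). When the future is pending, the thread parks and must be rescheduled by the OS before it can poll again, which is where the wake-up latency comes from:

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        // Unparking goes through the OS scheduler, which adds latency.
        self.0.unpark();
    }
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            // No work to do: yield the thread back to the OS.
            Poll::Pending => thread::park(),
        }
    }
}

fn main() {
    assert_eq!(block_on(async { 21 * 2 }), 42);
}
```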
What solution would you like?
This will need some investigation. We could switch away from `futures_lite::block_on` to our own implementation that minimizes yielding, keeping cores hotter by effectively spin-waiting for new tasks. This is likely a non-starter for battery-bound platforms like mobile, but it might net some improvements here, at the cost of higher reported idle CPU usage.
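
As a hedged sketch of that direction: the loop below spins on an atomic "notified" flag set by the waker for a bounded number of iterations before falling back to parking. `SPIN_LIMIT` and the surrounding names are illustrative assumptions, not an existing API, and a real implementation would need tuning and likely a backoff strategy:

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// Hypothetical tuning knob: how long to spin before yielding to the OS.
const SPIN_LIMIT: u32 = 1_000;

struct SpinWaker {
    thread: Thread,
    notified: AtomicBool,
}

impl Wake for SpinWaker {
    fn wake(self: Arc<Self>) {
        self.notified.store(true, Ordering::Release);
        // Still unpark in case the blocked thread already gave up spinning.
        self.thread.unpark();
    }
}

fn block_on_spin<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let state = Arc::new(SpinWaker {
        thread: thread::current(),
        notified: AtomicBool::new(false),
    });
    let waker = Waker::from(state.clone());
    let mut cx = Context::from_waker(&waker);

    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        // Spin-wait briefly: a wake arriving here is observed without an
        // OS round trip, keeping the core hot at the cost of idle CPU.
        let mut spins = 0;
        while !state.notified.swap(false, Ordering::Acquire) {
            spins += 1;
            if spins > SPIN_LIMIT {
                // Give up and yield to the OS, as the parking version does.
                thread::park();
                break;
            }
            std::hint::spin_loop();
        }
    }
}

fn main() {
    assert_eq!(block_on_spin(async { 7 }), 7);
}
```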
We could alternatively try to upstream or fork a change to `async_executor` with a different thread wake-up strategy. Our compute workloads tend to batch-spawn tasks all at once, so it may be worth scheduling tasks in batches directly inside the executor and waking up an appropriate number of threads simultaneously instead of in a cascade, as sketched below.
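
This is a sketch of the batch wake-up idea using plain std primitives, not `async_executor`'s internals; the queue, `spawn_batch`, and `worker` here are hypothetical. The point is to push a whole batch of jobs under one lock and then wake multiple sleeping workers at once, rather than waking one thread that then cascades to the next:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send>;

struct Shared {
    queue: Mutex<VecDeque<Job>>,
    available: Condvar,
}

fn spawn_batch(shared: &Arc<Shared>, jobs: Vec<Job>) {
    let n = jobs.len();
    let mut queue = shared.queue.lock().unwrap();
    queue.extend(jobs);
    drop(queue);
    // Wake sleepers all at once rather than one-by-one. With Condvar the
    // closest tool is notify_all; a real executor would instead unpark
    // exactly min(n, idle_workers) threads to bound queue contention.
    if n > 1 {
        shared.available.notify_all();
    } else {
        shared.available.notify_one();
    }
}

fn worker(shared: Arc<Shared>) {
    loop {
        let mut queue = shared.queue.lock().unwrap();
        let job = loop {
            if let Some(job) = queue.pop_front() {
                break job;
            }
            queue = shared.available.wait(queue).unwrap();
        };
        drop(queue);
        job();
    }
}

fn main() {
    let shared = Arc::new(Shared {
        queue: Mutex::new(VecDeque::new()),
        available: Condvar::new(),
    });
    for _ in 0..4 {
        let shared = shared.clone();
        thread::spawn(move || worker(shared));
    }
    let jobs: Vec<Job> = (0..8)
        .map(|i| Box::new(move || println!("job {i}")) as Job)
        .collect();
    spawn_batch(&shared, jobs);
    // Crude synchronization so the detached workers get to run.
    thread::sleep(std::time::Duration::from_millis(100));
}
```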
Avoiding additional contention on the global executor by removing `Executor::try_tick` is done in #6503, which may reduce this lag.
What alternative(s) have you considered?
Eating the perf cost. This really only affects systems with large core counts. However, such machines are increasingly becoming the norm, as seen in Steam's hardware survey, where over 40% of desktop users now have access to an 8+ core machine.
Another alternative is just to parallelize everything we can. Bevy's internal systems aren't very parallel right now, with many systems easily bottlenecking further execution. By keeping threads busier, the executor won't have the opportunity to yield to the OS as often, which naturally eliminates the overhead of yielding and waking threads.