
Conversation

RobinMorisset
Contributor

Profiling some services on many-core machines showed very high lock contention in the scheduler, specifically in the task-stealing mechanism.
This PR has a series of commits that both shrink the related critical sections and reduce contention on these locks.

Testing:

  • make emulator_test ARGS="-suite process_SUITE"
  • Tested in production on a many-core server: it showed very large improvements at low to medium load (on the order of 30% vs 40% CPU utilization) and small but measurable improvements at high load (about 1-2% CPU utilization). Lock counting confirmed a drop in contention on the runqueue locks.

When a scheduler runqueue is empty, that scheduler attempts to steal
work from another (hopefully non-empty) runqueue. Currently, the
sequence of events looks like this:
- Notice that there is no work to do
- Set a flag saying "don't steal work from me, I'm busy stealing"
- Remove a process from some other runqueue
- Put it back on your own runqueue
- goto the beginning of the scheduler loop
- Notice that there is now work to do
- Take the process from your runqueue (that you just added it to)
- Unset the flag
- Execute that process
The new sequence looks like this:
- Notice that there is no work to do
- Remove a process from some other runqueue
- Execute that process

There is no more need for a flag to protect our runqueue from stealing,
since the process we stole never spends time in it (and so is never at
risk of being stolen back from us, which could lead to a kind of
livelock).
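
As a rough illustration, here is a minimal C sketch of the new control flow; every name in it is a hypothetical stand-in, not the actual ERTS code:

```c
/* Minimal sketch of the new flow; all names below are hypothetical. */
typedef struct process Process;
typedef struct run_queue RunQueue;

Process *dequeue_local(RunQueue *rq);              /* hypothetical helper */
Process *steal_from_other_runqueue(RunQueue *rq);  /* hypothetical helper */
void     execute(Process *p);                      /* hypothetical helper */

void schedule_once(RunQueue *rq)
{
    Process *p = dequeue_local(rq);
    if (!p) {
        /* Empty runqueue: take a process straight from a victim and run
         * it, without ever putting it on our own runqueue, so no
         * "stealing in progress" flag is needed any more. */
        p = steal_from_other_runqueue(rq);
    }
    if (p)
        execute(p);
}
```
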
Currently, the scheduler uses work-stealing and steals a single task at
a time. This means that as soon as it is done with that one task, it
needs to steal another one, and another, etc...
All of this task stealing needs to take locks and can be a bottleneck.
As a first step to remove this bottleneck, this patch changes the
scheduler to steal half the tasks in a queue at once.
Instead of unlocking at the last moment and locking again right after,
we unlock at the beginning of try_steal_task, and only re-lock at the
end of try_steal_task.

The only access to rq in that time is rq->ix, which is only written to
once in erts_init_scheduling, so there should be no new race.
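
A sketch of the shape of this change, with hypothetical lock helpers; the point is that our own runqueue lock is released for the whole stealing attempt, during which only the write-once rq->ix field is read:

```c
typedef struct proc Proc;
typedef struct run_queue {
    int ix;  /* runqueue index, written once in erts_init_scheduling */
} RunQueue;

void  runq_lock(RunQueue *rq);    /* hypothetical lock helpers */
void  runq_unlock(RunQueue *rq);
Proc *scan_victims(int my_ix);    /* hypothetical stealing scan */

Proc *try_steal_task_sketch(RunQueue *rq)
{
    Proc *stolen;
    runq_unlock(rq);                /* drop our own lock for the whole attempt */
    stolen = scan_victims(rq->ix);  /* rq->ix is the only field read here */
    runq_lock(rq);                  /* re-acquire only at the end */
    return stolen;
}
```
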
We now steal up to 100 tasks at once. But we were updating the victim
runqueue length by repeated decrements, and updating the beneficiary
runqueue length by repeated increments. This is not just extra work: it
also increased the length of the critical sections (since these updates
were done while holding the relevant locks).

This change batches the increments/decrements, doing a single
addition/subtraction.
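
A toy sketch of the batching idea, with hypothetical types: count the processes while unlinking them under the victim's lock, then adjust the queue length once rather than once per process:

```c
#include <stddef.h>

typedef struct proc { struct proc *next; } Proc;   /* hypothetical */
typedef struct { Proc *first; int len; } Queue;    /* hypothetical */

/* Unlink up to `max` processes from the front of `victim`; return the
 * chain and report how many were actually taken via *n_stolen. */
Proc *take_front(Queue *victim, int max, int *n_stolen)
{
    Proc *head = victim->first, *p = head, *last = NULL;
    int n = 0;
    while (p && n < max) { last = p; p = p->next; n++; }
    victim->first = p;
    if (last)
        last->next = NULL;
    victim->len -= n;      /* one subtraction instead of n decrements */
    *n_stolen = n;
    return last ? head : NULL;
}
```
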
Currently we steal half the tasks by alternating between taking a
process and leaving it behind. With this commit we instead compute how
many tasks to steal, and then steal that many without skipping any
non-bound processes.
This should halve the number of pointers dereferenced during task
stealing, and thus also halve the number of cache misses, making that
(highly contended) critical section faster.
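
For illustration, the count computation could look roughly like this (hypothetical helper; the cap of 100 is the one mentioned above):

```c
/* Hypothetical helper: how many processes to take from a victim queue of
 * length `victim_len`, given a hard cap (100 in this PR). */
int n_to_steal(int victim_len, int cap)
{
    int half = (victim_len + 1) / 2;  /* round up, so a queue of 1 yields 1 */
    return half < cap ? half : cap;
}
```
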
Instead of using a stack to store the tasks being stolen, and turning
them back into a linked list at the end, we can store them as a linked
list throughout. The main benefit is shrinking the critical section
where we re-enqueue them.
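
A sketch of this idea with hypothetical types: the stolen processes are carried as a ready-made chain, so putting them on our own runqueue is a single splice and a single length update inside the critical section:

```c
#include <stddef.h>

typedef struct proc { struct proc *next; } Proc;              /* hypothetical */
typedef struct { Proc *first; Proc *last; int len; } Queue;   /* hypothetical */
typedef struct { Proc *first; Proc *last; int len; } Chain;   /* stolen chain */

/* Append the stolen chain to the beneficiary queue in O(1). */
void splice_into(Queue *q, const Chain *c)
{
    if (!c->first)
        return;
    if (q->last)
        q->last->next = c->first;
    else
        q->first = c->first;
    q->last = c->last;
    q->len += c->len;   /* batched increment, as in the previous commit */
}
```
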
If we cannot immediately take the lock required to steal a task, we skip
that runqueue for the moment and retry it only once we have looked at
all other runqueues. This mitigates the performance hit
we see where dozens of threads try to steal from the same runqueue at
once, by spreading the load.
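
A self-contained sketch of the two-pass idea using plain pthreads; the real code uses the ERTS runqueue locks, and steal_locked here is a hypothetical placeholder:

```c
#include <pthread.h>
#include <stdbool.h>

#define N_RUNQUEUES 64

typedef struct {
    pthread_mutex_t lock;
    /* ... queue contents ... */
} RunQueue;

extern RunQueue runqueues[N_RUNQUEUES];
bool steal_locked(RunQueue *victim);   /* hypothetical: caller holds the lock */

bool try_steal(int self)
{
    bool skipped[N_RUNQUEUES] = { false };

    /* First pass: only steal from victims whose lock is free right now. */
    for (int i = 0; i < N_RUNQUEUES; i++) {
        if (i == self)
            continue;
        if (pthread_mutex_trylock(&runqueues[i].lock) != 0) {
            skipped[i] = true;          /* contended: retry later */
            continue;
        }
        bool ok = steal_locked(&runqueues[i]);
        pthread_mutex_unlock(&runqueues[i].lock);
        if (ok)
            return true;
    }

    /* Second pass: now block on the victims that were skipped. */
    for (int i = 0; i < N_RUNQUEUES; i++) {
        if (!skipped[i])
            continue;
        pthread_mutex_lock(&runqueues[i].lock);
        bool ok = steal_locked(&runqueues[i]);
        pthread_mutex_unlock(&runqueues[i].lock);
        if (ok)
            return true;
    }
    return false;
}
```
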
Currently we check it on every iteration of the fast loops (and this
shows up in profiling), but not at all in the final loop (which is the
only one that can be slow enough to need it). This just fixes a small
point I had forgotten in the last commit.
Contributor

github-actions bot commented Mar 17, 2025

CT Test Results

38 files, 1 001 suites, 7h 31m 43s ⏱️
12 369 tests: 11 661 ✅, 702 💤, 6 ❌
26 755 runs: 24 760 ✅, 1 989 💤, 6 ❌

For more details on these failures, see this check.

Results for commit 1b3c8c1.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.

Artifacts

// Erlang/OTP Github Action Bot

The code was structured as follows:
```
if (is_normal_sched) {
  ..
} else {
  ..
  qbit = ..
}
..
if (!is_normal_sched)
  clear_proc_dirty_queue_bit(p, rq, qbit);
```
and it resulted in an error deep in clear_proc_dirty_queue_bit (probably
because of some inlining): "error: ‘qbit’ may be used uninitialized in this
function [-Werror=maybe-uninitialized]"

This issue was previously hidden because of the control flow leading to
this part of the code being a bit different, but that changed with the
commit to "directly execute stolen processes", revealing this limitation
of the compiler warning.

This commit fixes it, simply by pushing the initialization of qbit
outside of the "if (is_normal_sched)" condition.
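
For illustration, a compilable mini-example of the pattern and of one way to read that fix; all names are hypothetical stand-ins for the real scheduler variables:

```c
#include <stdio.h>

static void clear_bit_like_helper(unsigned qbit) { printf("clearing %u\n", qbit); }

void demo(int is_normal_sched)
{
    unsigned qbit = 0;            /* the fix: assigned on every path */
    if (is_normal_sched) {
        /* ... normal-scheduler work, qbit not needed ... */
    } else {
        /* ... dirty-scheduler work ... */
        qbit = 1u << 3;           /* hypothetical value */
    }
    /* ... */
    if (!is_normal_sched)
        clear_bit_like_helper(qbit);
}
```
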
@RobinMorisset
Contributor Author

I could not reproduce the failure of one test in monitor_SUITE locally. Is there some way to access detailed logs of what is going wrong? And is this test known to be flaky?

@garazdawi
Contributor

I could not reproduce the failure of one test in monitor_SUITE locally. And is this test known to be flaky?

We have a couple of failures of that testcase in our lab, so yes it can be a bit flaky. I re-triggered the tests so that we can see if it fails again.

@garazdawi added the team:VM (Assigned to OTP team VM) and full-build-and-check (Run a more thorough testing of this PR) labels on Mar 20, 2025
@garazdawi
Contributor

I restarted the CI jobs again in order for this PR to run the tests in all applications, not just emulator.

@garazdawi
Contributor

You can have a look here to see which tests currently fail on a clean master build: https://erlang.org/github-pr/master/ct_logs/

@garazdawi
Contributor

I restarted the CI jobs again in order for this PR to run the tests in all applications, not just emulator.

Apparently that did not work; it should work the next time you push a fix to the PR, though.

@RobinMorisset
Contributor Author

The tests are all green this time, so I think this PR is ready for review.
Thanks for restarting these tests, and for confirming that the failing test was known to be flaky.

@garazdawi added the testing (currently being tested, tag is used by OTP internal CI) label on Mar 21, 2025
continue;
}
rpq = &vrq->procs.prio[prio_q];
// Steal at least one task, even if there is a single one
Contributor


Nitpick: please use /* .. */ for comments, not //.


while (proc) {
if (erts_try_change_runq_proc(proc, rq)) {
while (proc && max_processes_to_steal) {
Contributor


In an earlier commit you stole every other process, instead of taking the first 100. I assume this was due to fairness considerations. Did you see a performance increase when changing the strategy to take the first 100 instead?

Contributor Author


Yes, I remember that there were weird performance pathologies when trying to steal every other process from a very large queue (e.g. 5k or 10k processes), and those disappeared when I put that limit of maximum 100.
I unfortunately don't have the precise numbers at hand anymore.

Contributor


Ah, I did not realize that the first version stole every other process without any bound. I suppose you have not tried whether stealing every other process with a bound also solves the issue? The reason I ask is that by stealing the first 100, you are basically down-prioritizing the last 50 of those processes, as the processes left in the queue will be run before them.

Contributor Author

@RobinMorisset commented Mar 27, 2025


Ah right, there were three versions:

  • every other with no bounds (not committed, had issues)
  • every other, max 100 (the first commit)
  • the first half, max 100 (rest of the PR)

I've not observed any issues in practice with going from 2 to 3, and it significantly shortens the critical section (since we only have to walk half as many processes of the linked list). I don't think of it as de-prioritizing the processes that were near the back of those stolen, since they would not have executed any sooner without stealing; instead I see it as accelerating the processes that were not stolen, as their turn to execute comes up sooner.

Michał suggested this change to me; the idea is that if there is a
single task in a queue, stealing it is not balancing the load, it is
just exchanging which scheduler is starved for work.
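
A minimal sketch of the resulting check (hypothetical names): a victim with a single runnable process is left alone, since taking it would only move the starvation to the other scheduler.

```c
#include <stdbool.h>

/* Hypothetical helper: is this victim runqueue worth stealing from? */
bool worth_stealing_from(int victim_queue_len)
{
    /* With a single task, stealing does not balance the load; it only
     * swaps which scheduler sits idle. */
    return victim_queue_len > 1;
}
```
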
@RobinMorisset force-pushed the upstream_sched_optims branch from 698a164 to d86d313 on March 26, 2025 16:43
It appears unused.

A comment next to it suggested that the sign of rq->waiting indicates
whether this condition variable is used. It appears that this sign is
always positive (contrary to the comment), so I also removed the code
for the negative side, and the comment as well.
@tomas-abrahamsson
Contributor

This looks very interesting. I took the 8ac8af9f82 "Silence (spurious) compiler warning" commit for a run on one of our servers, using the commit that this branch is based on, 57cee85703, as the baseline for comparison. I captured some lock-count profiles; I don't know if this is interesting to anyone, but anyway, here goes.

The baseline 28.0 rc was already about 11% faster than 27.3 for our workload, very nice, and this PR adds another 11%. Very, very nice! :) For the run_queue lock:

  • 'time [us]' is reduced by a factor 10.5
  • '#collisions' is reduced by a factor 4.5
  • '#tries' is reduced by a factor 3.2

Baseline commit 57cee85703, the first few lines:

(csim-gte-2363519-19560@somehost3003)1> lcnt:start(),
                                        lcnt:collect(),
                                        timer:sleep(3_000),
                                        lcnt:conflicts(),
                                        lcnt:stop().
Command is taking a long time, type Ctrl+G, then enter 'i' to interrupt
                       lock     id    #tries  #collisions  collisions [%]  time [us]  duration [%]
                      -----    ---   ------- ------------ --------------- ---------- -------------
                  run_queue     82 377823485     38788195         10.2662   91615112       32.0361
                  proc_main 247945 106692940      5139515          4.8171    4230013        1.4792
                   pix_lock   1024     28682           66          0.2301    2756360        0.9638
        export_staging_lock      1    777325           37          0.0048    1320580        0.4618
                  proc_msgq 247945 178526735       727267          0.4074     946128        0.3308

The 8ac8af9f82 "Silence (spurious) compiler warning" commit:

                       lock     id    #tries  #collisions  collisions [%]  time [us]  duration [%]
                      -----    ---   ------- ------------ --------------- ---------- -------------
                  run_queue     82 118712446      8677443          7.3096    8749167        4.9924
                  proc_main 248331  38812418      1889950          4.8694    2984591        1.7030
                   pix_lock   1024     28195           71          0.2518    2679418        1.5289
        export_staging_lock      1    447291           33          0.0074    1023174        0.5838
                  proc_msgq 248331  63679742       309627          0.4862     824230        0.4703

The workload is a mixture of communication: a lot of Erlang processes, around 250k, that communicate both via direct message passing, and also via many TCP and SCTP connections. The hardware in this case was a server with 2×Xeon Gold 6148 with 20 cores each, in total 80 logical CPUs (2 threads per core). We have some newer machines too with more cores, both dual and single socket servers, but I haven't checked those.

This is a tiny change that should improve cache locality a bit during
task stealing.
@rickard-green rickard-green self-assigned this Mar 26, 2025
@RobinMorisset
Contributor Author

This looks very interesting. I took the 8ac8af9f82 "Silence (spurious) compiler warning" commit for a run on one of our servers, using the commit that this branch is based on, 57cee85703, as the baseline for comparison. I captured some lock-count profiles; I don't know if this is interesting to anyone, but anyway, here goes.

The baseline 28.0 rc was already about 11% faster than 27.3 for our workload, very nice, and this PR adds another 11%. Very, very nice! :) For the run_queue lock:

Thanks a lot for these detailed numbers, they are very encouraging! Do you remember roughly what the CPU utilisation was when you took these measurements? We sadly observed smaller wins when the CPU is above 90% (since there is much less task stealing going on, this is not very surprising).

@tomas-abrahamsson
Contributor

OK, that's useful to know, and it makes sense. The CPU load varied across the logical CPUs: for maybe 55-60 of the 80 it was around 80-90%, while for the remaining ones it was lower, much lower in a few cases. (I think we have some more bottlenecks to find in our code.)

Contributor

@rickard-green left a comment


Thanks, very nice!

@rickard-green merged commit 80829a6 into erlang:master on Apr 9, 2025
56 of 60 checks passed
keynslug pushed a commit to emqx/otp that referenced this pull request Aug 29, 2025