
Conversation

RobinMorisset
Contributor

Profiling some services on many-core machines showed very high lock contention in the scheduler, specifically in the task-stealing mechanism.
This PR has a series of commits that both shrink the related critical sections and reduce contention on these locks.

Testing:

  • make emulator_test ARGS="-suite process_SUITE"
  • Tested in production on a many-core server: it showed very large improvements at low to medium load (on the order of 30% vs 40% CPU utilization) and small but measurable improvements at high load (about 1-2% CPU utilization). Lock counting confirmed a drop in contention on the runqueue locks.

When a scheduler runqueue is empty, that scheduler attempts to steal
work from another (hopefully non-empty) runqueue. Currently, the
sequence of events looks like this:
- Notice that there is no work to do
- Set a flag saying "don't steal work from me, I'm busy stealing"
- Remove a process from some other runqueue
- Put it back on your own runqueue
- goto the beginning of the scheduler loop
- Notice that there is now work to do
- Take the process from your runqueue (that you just added it to)
- Unset the flag
- Execute that process
The new sequence looks like this:
- Notice that there is no work to do
- Remove a process from some other runqueue
- Execute that process

There is no more need for a flag to protect our runqueue from stealing,
since the process we stole never spends time in it (and so is never at
risk of being stolen back from us, which could lead to a kind of
livelock).
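
As a rough illustration, here is a minimal C sketch of the new control flow; every name in it is a hypothetical stand-in, not the actual ERTS code:

```c
/* Minimal sketch of the new flow; all names below are hypothetical. */
typedef struct process Process;
typedef struct run_queue RunQueue;

Process *dequeue_local(RunQueue *rq);              /* hypothetical helper */
Process *steal_from_other_runqueue(RunQueue *rq);  /* hypothetical helper */
void     execute(Process *p);                      /* hypothetical helper */

void schedule_once(RunQueue *rq)
{
    Process *p = dequeue_local(rq);
    if (!p) {
        /* Empty runqueue: take a process straight from a victim and run
         * it, without ever putting it on our own runqueue, so no
         * "stealing in progress" flag is needed any more. */
        p = steal_from_other_runqueue(rq);
    }
    if (p)
        execute(p);
}
```
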
Currently, the scheduler uses work-stealing and steals a single task at
a time. This means that as soon as it is done with that one task, it
needs to steal another one, and another, etc...
All of this task stealing needs to take locks and can be a bottleneck.
As a first step to remove this bottleneck, this patch changes the
scheduler to steal half the tasks in a queue at once.
Instead of unlocking at the last moment and locking again right after,
we unlock at the beginning of try_steal_task, and only re-lock at the
end of try_steal_task.

The only access to rq in that time is rq->ix, which is only written to
once in erts_init_scheduling, so there should be no new race.
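
A sketch of the shape of this change, with hypothetical lock helpers; the point is that our own runqueue lock is released for the whole stealing attempt, during which only the write-once rq->ix field is read:

```c
typedef struct proc Proc;
typedef struct run_queue {
    int ix;  /* runqueue index, written once in erts_init_scheduling */
} RunQueue;

void  runq_lock(RunQueue *rq);    /* hypothetical lock helpers */
void  runq_unlock(RunQueue *rq);
Proc *scan_victims(int my_ix);    /* hypothetical stealing scan */

Proc *try_steal_task_sketch(RunQueue *rq)
{
    Proc *stolen;
    runq_unlock(rq);                /* drop our own lock for the whole attempt */
    stolen = scan_victims(rq->ix);  /* rq->ix is the only field read here */
    runq_lock(rq);                  /* re-acquire only at the end */
    return stolen;
}
```
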
We now steal up to 100 tasks at once. But we were updating the victim
runqueue length by repeated decrements, and updating the beneficiary
runqueue length by repeated increments. This is not just extra work: it
also increased the length of the critical sections (since these updates
were done while holding the relevant locks).

This change batches the increments/decrements, doing a single
addition/subtraction.
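
A toy sketch of the batching idea, with hypothetical types: count the processes while unlinking them under the victim's lock, then adjust the queue length once rather than once per process:

```c
#include <stddef.h>

typedef struct proc { struct proc *next; } Proc;   /* hypothetical */
typedef struct { Proc *first; int len; } Queue;    /* hypothetical */

/* Unlink up to `max` processes from the front of `victim`; return the
 * chain and report how many were actually taken via *n_stolen. */
Proc *take_front(Queue *victim, int max, int *n_stolen)
{
    Proc *head = victim->first, *p = head, *last = NULL;
    int n = 0;
    while (p && n < max) { last = p; p = p->next; n++; }
    victim->first = p;
    if (last)
        last->next = NULL;
    victim->len -= n;      /* one subtraction instead of n decrements */
    *n_stolen = n;
    return last ? head : NULL;
}
```
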
Currently we steal half the tasks by alternating between taking a
process and leaving it behind. With this commit we instead compute how
many tasks to steal, and then steal that many without skipping any
non-bound processes.
This should halve the number of pointers dereferenced during task
stealing, and thus also halve the number of cache misses, making that
(highly contended) critical section faster.
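
For illustration, the count computation could look roughly like this (hypothetical helper; the cap of 100 is the one mentioned above):

```c
/* Hypothetical helper: how many processes to take from a victim queue of
 * length `victim_len`, given a hard cap (100 in this PR). */
int n_to_steal(int victim_len, int cap)
{
    int half = (victim_len + 1) / 2;  /* round up, so a queue of 1 yields 1 */
    return half < cap ? half : cap;
}
```
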
Instead of using a stack to store the tasks being stolen, and turning
them back into a linked list at the end, we can store them as a linked
list throughout. The main benefit is shrinking the critical section
where we re-enqueue them.
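
A sketch of this idea with hypothetical types: the stolen processes are carried as a ready-made chain, so putting them on our own runqueue is a single splice and a single length update inside the critical section:

```c
#include <stddef.h>

typedef struct proc { struct proc *next; } Proc;              /* hypothetical */
typedef struct { Proc *first; Proc *last; int len; } Queue;   /* hypothetical */
typedef struct { Proc *first; Proc *last; int len; } Chain;   /* stolen chain */

/* Append the stolen chain to the beneficiary queue in O(1). */
void splice_into(Queue *q, const Chain *c)
{
    if (!c->first)
        return;
    if (q->last)
        q->last->next = c->first;
    else
        q->first = c->first;
    q->last = c->last;
    q->len += c->len;   /* batched increment, as in the previous commit */
}
```
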
If we cannot immediately take the lock required to steal a task, we skip
that runqueue for the moment and retry it only once we have looked at
all other runqueues. This mitigates the performance hit
we see where dozens of threads try to steal from the same runqueue at
once, by spreading the load.
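
A self-contained sketch of the two-pass idea using plain pthreads; the real code uses the ERTS runqueue locks, and steal_locked here is a hypothetical placeholder:

```c
#include <pthread.h>
#include <stdbool.h>

#define N_RUNQUEUES 64

typedef struct {
    pthread_mutex_t lock;
    /* ... queue contents ... */
} RunQueue;

extern RunQueue runqueues[N_RUNQUEUES];
bool steal_locked(RunQueue *victim);   /* hypothetical: caller holds the lock */

bool try_steal(int self)
{
    bool skipped[N_RUNQUEUES] = { false };

    /* First pass: only steal from victims whose lock is free right now. */
    for (int i = 0; i < N_RUNQUEUES; i++) {
        if (i == self)
            continue;
        if (pthread_mutex_trylock(&runqueues[i].lock) != 0) {
            skipped[i] = true;          /* contended: retry later */
            continue;
        }
        bool ok = steal_locked(&runqueues[i]);
        pthread_mutex_unlock(&runqueues[i].lock);
        if (ok)
            return true;
    }

    /* Second pass: now block on the victims that were skipped. */
    for (int i = 0; i < N_RUNQUEUES; i++) {
        if (!skipped[i])
            continue;
        pthread_mutex_lock(&runqueues[i].lock);
        bool ok = steal_locked(&runqueues[i]);
        pthread_mutex_unlock(&runqueues[i].lock);
        if (ok)
            return true;
    }
    return false;
}
```
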
Currently we check it on every iteration of the fast loops (and this
shows up in profiling), but not at all in the final loop (which is the
only one that can be slow enough to need it). This just fixes a small
point I had forgotten in the last commit.
Contributor

github-actions bot commented Mar 17, 2025

CT Test Results

38 files, 1 001 suites, 7h 31m 43s ⏱️
12 369 tests: 11 661 ✅, 702 💤, 6 ❌
26 755 runs: 24 760 ✅, 1 989 💤, 6 ❌

For more details on these failures, see this check.

Results for commit 1b3c8c1.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.

Artifacts

// Erlang/OTP Github Action Bot

The code was structured as follows:
```
if (is_normal_sched) {
  ..
} else {
  ..
  qbit = ..
}
..
if (!is_normal_sched)
  clear_proc_dirty_queue_bit(p, rq, qbit);
```
and it resulted in an error deep in clear_proc_dirty_queue_bit (probably
because of some inlining): "error: ‘qbit’ may be used uninitialized in this
function [-Werror=maybe-uninitialized]"

This issue was previously hidden because of the control flow leading to
this part of the code being a bit different, but that changed with the
commit to "directly execute stolen processes", revealing this limitation
of the compiler warning.

This commit fixes it, simply by pushing the initialization of qbit
outside of the "if (is_normal_sched)" condition.
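
For illustration, a compilable mini-example of the pattern and of one way to read that fix; all names are hypothetical stand-ins for the real scheduler variables:

```c
#include <stdio.h>

static void clear_bit_like_helper(unsigned qbit) { printf("clearing %u\n", qbit); }

void demo(int is_normal_sched)
{
    unsigned qbit = 0;            /* the fix: assigned on every path */
    if (is_normal_sched) {
        /* ... normal-scheduler work, qbit not needed ... */
    } else {
        /* ... dirty-scheduler work ... */
        qbit = 1u << 3;           /* hypothetical value */
    }
    /* ... */
    if (!is_normal_sched)
        clear_bit_like_helper(qbit);
}
```
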
@RobinMorisset
Contributor Author

I could not reproduce the failure of one test in monitor_SUITE locally. Is there some way to access detailed logs of what is going wrong? And is this test known to be flaky?

@garazdawi
Contributor

I could not reproduce the failure of one test in monitor_SUITE locally. And is this test known to be flaky?

We have a couple of failures of that testcase in our lab, so yes it can be a bit flaky. I re-triggered the tests so that we can see if it fails again.

@garazdawi added the team:VM (Assigned to OTP team VM) and full-build-and-check (Run a more thorough testing of this PR) labels on Mar 20, 2025
@garazdawi
Contributor

I restarted the CI jobs again in order for this PR to run the tests in all applications, not just emulator.

@garazdawi
Contributor

You can have a look here to see which tests currently fail on a clean master build: https://erlang.org/github-pr/master/ct_logs/

@garazdawi
Contributor

I restarted the CI jobs again in order for this PR to run the tests in all applications, not just emulator.

Apparently that did not work; it should work the next time you push a fix to the PR, though.

@RobinMorisset
Contributor Author

The tests are all green this time, so I think this PR is ready for review.
Thanks for restarting these tests, and for confirming that the failing test was known to be flaky.

@garazdawi added the testing (currently being tested, tag is used by OTP internal CI) label on Mar 21, 2025
continue;
}
rpq = &vrq->procs.prio[prio_q];
// Steal at least one task, even if there is a single one
Contributor


Nitpick: please use /* .. */ for comments, not //.


while (proc) {
if (erts_try_change_runq_proc(proc, rq)) {
while (proc && max_processes_to_steal) {
Contributor


In an earlier commit you stole every other process, instead of taking the first 100. I assume this was due to fairness considerations. Did you see a performance increase when changing the strategy to take the first 100 instead?

Contributor Author


Yes, I remember that there were weird performance pathologies when trying to steal every other process from a very large queue (e.g. 5k or 10k processes), and those disappeared when I put that limit of maximum 100.
I unfortunately don't have the precise numbers at hand anymore.

Contributor


Ah, I did not realize that the first version stole every other process without any bound. I suppose you have not tried whether stealing every other process with a bound also solves the issue? The reason I ask is that by stealing the first 100, you are basically down-prioritizing the last 50 of those processes, as the processes left in the queue will be run before them.

Contributor Author

@RobinMorisset commented Mar 27, 2025


Ah right, there were three versions:

  • every other with no bounds (not committed, had issues)
  • every other, max 100 (the first commit)
  • the first half, max 100 (rest of the PR)

I've not observed any issues in practice with going from 2 to 3, and it significantly shortens the critical section (since we only have to walk half as many processes of the linked list). I don't think of it as de-prioritizing the processes that were near the back of those stolen, since they would not have executed any sooner without stealing; instead I see it as accelerating the processes that were not stolen, as their turn to execute comes up sooner.

Michał suggested this change to me; the idea is that if there is a
single task in a queue, stealing it is not balancing the load, it is
just exchanging which scheduler is starved for work.
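
A minimal sketch of the resulting check (hypothetical names): a victim with a single runnable process is left alone, since taking it would only move the starvation to the other scheduler.

```c
#include <stdbool.h>

/* Hypothetical helper: is this victim runqueue worth stealing from? */
bool worth_stealing_from(int victim_queue_len)
{
    /* With a single task, stealing does not balance the load; it only
     * swaps which scheduler sits idle. */
    return victim_queue_len > 1;
}
```
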
@RobinMorisset force-pushed the upstream_sched_optims branch from 698a164 to d86d313 on March 26, 2025 16:43
It appears unused.

A comment next to it suggested that the sign of rq->waiting indicates
whether this condition variable is used. It appears that this sign is
always positive (contrary to the comment), so I also removed the code
for the negative side, and the comment as well.
@tomas-abrahamsson
Contributor

This looks very interesting. I took the 8ac8af9f82 "Silence (spurious) compiler warning" commit for a run on one of our servers, using the commit that this branch is based on, 57cee85703, as the baseline for comparison. I captured some lock-count profiles; I don't know if this is interesting to anyone, but anyway, here goes.

The baseline 28.0 rc was already about 11% faster than 27.3 for our workload, very nice, and this PR adds another 11%. Very, very nice! :) For the run_queue lock:

  • 'time [us]' is reduced by a factor 10.5
  • '#collisions' is reduced by a factor 4.5
  • '#tries' is reduced by a factor 3.2

Baseline commit 57cee85703, the first few lines:

(csim-gte-2363519-19560@somehost3003)1> lcnt:start(),
                                        lcnt:collect(),
                                        timer:sleep(3_000),
                                        lcnt:conflicts(),
                                        lcnt:stop().
Command is taking a long time, type Ctrl+G, then enter 'i' to interrupt
                       lock     id    #tries  #collisions  collisions [%]  time [us]  duration [%]
                      -----    ---   ------- ------------ --------------- ---------- -------------
                  run_queue     82 377823485     38788195         10.2662   91615112       32.0361
                  proc_main 247945 106692940      5139515          4.8171    4230013        1.4792
                   pix_lock   1024     28682           66          0.2301    2756360        0.9638
        export_staging_lock      1    777325           37          0.0048    1320580        0.4618
                  proc_msgq 247945 178526735       727267          0.4074     946128        0.3308

The 8ac8af9f82 "Silence (spurious) compiler warning" commit:

                       lock     id    #tries  #collisions  collisions [%]  time [us]  duration [%]
                      -----    ---   ------- ------------ --------------- ---------- -------------
                  run_queue     82 118712446      8677443          7.3096    8749167        4.9924
                  proc_main 248331  38812418      1889950          4.8694    2984591        1.7030
                   pix_lock   1024     28195           71          0.2518    2679418        1.5289
        export_staging_lock      1    447291           33          0.0074    1023174        0.5838
                  proc_msgq 248331  63679742       309627          0.4862     824230        0.4703

The workload is a mixture of communication: a lot of Erlang processes, around 250k, that communicate both via direct message passing, and also via many TCP and SCTP connections. The hardware in this case was a server with 2×Xeon Gold 6148 with 20 cores each, in total 80 logical CPUs (2 threads per core). We have some newer machines too with more cores, both dual and single socket servers, but I haven't checked those.

This is a tiny change that should improve cache locality a bit during
task stealing.
@rickard-green rickard-green self-assigned this Mar 26, 2025
@RobinMorisset
Contributor Author

This looks very interesting. I took the 8ac8af9f82 "Silence (spurious) compiler warning" commit for a run on one of our servers, using the commit that this branch is based on, 57cee85703, as the baseline for comparison. I captured some lock-count profiles; I don't know if this is interesting to anyone, but anyway, here goes.

The baseline 28.0 rc was already about 11% faster than 27.3 for our workload, very nice, and this PR adds another 11%. Very, very nice! :) For the run_queue lock:

Thanks a lot for these detailed numbers, they are very encouraging! Do you remember roughly what the CPU utilisation was when you took these measurements? We sadly observed smaller wins when the CPU is above 90% (since there is much less task stealing going on, this is not very surprising).

@tomas-abrahamsson
Contributor

OK, that's useful to know, and it makes sense. The CPU load varied across the logical CPUs: for maybe 55-60 of the 80 it was around 80-90%, while for the remaining ones it was lower, much lower in a few cases. (I think we have some more bottlenecks to find in our code.)

Contributor

@rickard-green left a comment


Thanks, very nice!

@rickard-green merged commit 80829a6 into erlang:master on Apr 9, 2025
56 of 60 checks passed
keynslug pushed a commit to emqx/otp that referenced this pull request Aug 29, 2025