[scheduler] Various optimizations to task stealing #9594
Conversation
When a scheduler runqueue is empty, that scheduler attempts to steal work from another (hopefully non-empty) runqueue. Currently, the sequence of events looks like this:

- Notice that there is no work to do
- Set a flag saying "don't steal work from me, I'm busy stealing"
- Remove a process from some other runqueue
- Put it back on your own runqueue
- goto the beginning of the scheduler loop
- Notice that there is now work to do
- Take the process from your runqueue (that you just added it to)
- Unset the flag
- Execute that process

The new sequence looks like this:

- Notice that there is no work to do
- Remove a process from some other runqueue
- Execute that process

There is no more need for a flag to protect our runqueue from stealing, since the process we stole never spends time in it (and so is never at risk of being stolen back from us, which could lead to some kind of livelock).
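The difference can be sketched with a toy runqueue. All types and names here (`run_queue`, `task`, `dequeue`, `enqueue`) are illustrative stand-ins, not the actual erl_process.c structures:

```c
#include <assert.h>
#include <stddef.h>

typedef struct task { struct task *next; int id; } task;
typedef struct { task *first; int len; int ops; } run_queue;

static task *dequeue(run_queue *q) {
    task *t = q->first;
    if (t) { q->first = t->next; q->len--; q->ops++; }
    return t;
}

static void enqueue(run_queue *q, task *t) {
    t->next = q->first; q->first = t; q->len++; q->ops++;
}

/* Old flow: steal onto our own runqueue, notice the work, take it back
 * out, and only then execute it (extra queue operations per steal). */
static task *steal_old(run_queue *mine, run_queue *victim) {
    task *t = dequeue(victim);
    if (t) enqueue(mine, t);
    return dequeue(mine);
}

/* New flow: execute the stolen process directly; it never touches our
 * runqueue, so no flag is needed to protect it from being stolen back. */
static task *steal_new(run_queue *mine, run_queue *victim) {
    (void)mine;
    return dequeue(victim);
}
```

The `ops` counter makes the saving visible: the old flow performs two extra operations on our own runqueue per stolen task, the new flow none.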
Currently, the scheduler uses work-stealing and steals a single task at a time. This means that as soon as it is done with that one task, it needs to steal another one, and another, etc... All of this task stealing needs to take locks and can be a bottleneck. As a first step to remove this bottleneck, this patch changes the scheduler to steal half the tasks in a queue at once.
Instead of unlocking at the last moment and locking again right after, we unlock at the beginning of try_steal_task, and only re-lock at the end of try_steal_task. The only access to rq in that time is rq->ix, which is only written to once in erts_init_scheduling, so there should be no new race.
It is always NULL
We now steal up to 100 tasks at once. But we were updating the victim runqueue length by repeated decrements, and updating the beneficiary runqueue length by repeated increments. This was not just extra work: it also increased the length of the critical sections (since these updates were done while holding the relevant locks). This change batches the increments/decrements, doing a single addition/subtraction.
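The batching amounts to the following sketch (hypothetical names; in the real code these are runqueue-length fields updated under the runqueue locks):

```c
#include <assert.h>

/* Count the stolen tasks while walking the list, then apply a single
 * subtraction/addition instead of one decrement/increment per task.
 * This shortens the time spent holding the runqueue locks. */
static void update_lengths_batched(int *victim_len, int *our_len, int stolen) {
    *victim_len -= stolen;  /* one subtraction, not `stolen` decrements */
    *our_len    += stolen;  /* one addition, not `stolen` increments    */
}
```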
Currently we steal half the tasks by alternating between taking a process and leaving it behind. With this commit we instead compute how many tasks to steal, and then steal that many without skipping any non-bound processes. This should halve the number of pointer dereferences during task stealing, and thus also halve the number of cache misses, making that (highly contended) critical section faster.
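A sketch of the count computation; the cap of 100 comes from the discussion in this PR, and the exact rounding/clamping in erl_process.c may differ:

```c
#include <assert.h>

#define MAX_PROCESSES_TO_STEAL 100  /* cap discussed in this PR */

/* Take half of the victim's queue, rounded up so that a queue of one
 * still yields one candidate, capped at 100. Illustrative sketch. */
static int steal_count(int victim_qlen) {
    int n = (victim_qlen + 1) / 2;          /* half, rounded up */
    if (n > MAX_PROCESSES_TO_STEAL)
        n = MAX_PROCESSES_TO_STEAL;
    return n;
}
```

With the count known up front, the steal loop only has to walk the first `n` list nodes instead of every other node over the whole queue.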
Instead of using a stack to store the tasks being stolen, and turn them back into a linked list at the end, we can store them as a linked-list throughout. The main benefit is shrinking the critical section where we re-enqueue them.
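Keeping the batch as a linked list makes the re-enqueue an O(1) splice. A sketch with illustrative types (not the actual ErtsRunQueue layout):

```c
#include <assert.h>
#include <stddef.h>

typedef struct task { struct task *next; int id; } task;
typedef struct { task *first; task *last; int len; } run_queue;

/* Detach the first n tasks from the victim as a ready-made sublist,
 * returning its head and (via out_last) its tail. */
static task *detach_first_n(run_queue *victim, int n, task **out_last) {
    task *first = victim->first, *t = first;
    int taken = 1;
    if (!t || n < 1) return NULL;
    while (taken < n && t->next) { t = t->next; taken++; }
    victim->first = t->next;
    if (!victim->first) victim->last = NULL;
    t->next = NULL;
    victim->len -= taken;
    *out_last = t;
    return first;
}

/* Splice the detached sublist onto our runqueue in O(1); the critical
 * section no longer re-pushes the stolen tasks one by one. */
static void splice(run_queue *rq, task *first, task *last, int n) {
    if (rq->last) rq->last->next = first; else rq->first = first;
    rq->last = last;
    rq->len += n;
}
```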
If we cannot immediately take the lock required to steal a task, we skip that runqueue for the moment and retry it only once we have looked at all other runqueues. This mitigates the performance hit we see where dozens of threads try to steal from the same runqueue at once, by spreading the load.
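The two-pass strategy can be sketched as a toy model (names and layout are illustrative; the real code operates on scheduler runqueues, not this struct):

```c
#include <assert.h>
#include <pthread.h>

enum { NQUEUES = 4 };

typedef struct {
    pthread_mutex_t lock;
    int qlen;
} run_queue;

/* Steal one task if the victim has more than one; caller holds lock. */
static int try_steal_locked(run_queue *q) {
    if (q->qlen > 1) { q->qlen--; return 1; }
    return 0;
}

/* Returns the index of the victim we stole from, or -1. */
static int steal_task(run_queue qs[NQUEUES]) {
    int skipped[NQUEUES], nskipped = 0, i, ok;

    for (i = 0; i < NQUEUES; i++) {          /* pass 1: trylock only */
        if (pthread_mutex_trylock(&qs[i].lock) != 0) {
            skipped[nskipped++] = i;         /* contended: retry later */
            continue;
        }
        ok = try_steal_locked(&qs[i]);
        pthread_mutex_unlock(&qs[i].lock);
        if (ok) return i;
    }
    for (i = 0; i < nskipped; i++) {         /* pass 2: willing to wait */
        run_queue *q = &qs[skipped[i]];
        pthread_mutex_lock(&q->lock);
        ok = try_steal_locked(q);
        pthread_mutex_unlock(&q->lock);
        if (ok) return skipped[i];
    }
    return -1;
}
```

When dozens of schedulers all target the same hot victim, the trylock in pass 1 makes most of them move on to other victims instead of queueing up on one lock.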
Currently we're checking it in every iteration in the fast loops (and this shows in profiling), and not at all in the final loop (which is the only one that can be slow enough to deserve it). This just fixes this small point I had forgotten in the last commit.
CT Test Results: 38 files, 1 001 suites, 7h 31m 43s ⏱️ For more details on these failures, see this check. Results for commit 1b3c8c1. ♻️ This comment has been updated with latest results. To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass. See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally. — Erlang/OTP Github Action Bot
The code was structured as follows:

```
if (is_normal_sched) {
    ..
} else {
    ..
    qbit = ..
}
..
if (!is_normal_sched)
    clear_proc_dirty_queue_bit(p, rq, qbit);
```

and it resulted in an error deep in clear_proc_dirty_queue_bit (probably because of some inlining): "error: ‘qbit’ may be used uninitialized in this function [-Werror=maybe-uninitialized]". This issue was previously hidden because the control flow leading to this part of the code was a bit different, but that changed with the commit to "directly execute stolen processes", revealing this limitation of the compiler warning. This commit fixes it, simply by pushing the initialization of qbit outside of the "if (is_normal_sched)" condition.
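A minimal reproduction of the pattern and the fix; `clear_proc_dirty_queue_bit` here is a stand-in, and the concrete values are invented for illustration:

```c
#include <assert.h>

static int last_cleared = -1;

/* Stand-in for the real clear_proc_dirty_queue_bit. */
static void clear_proc_dirty_queue_bit(int qbit) { last_cleared = qbit; }

/* gcc cannot prove that qbit is only read on the same !is_normal_sched
 * paths that wrote it, so -Wmaybe-uninitialized fires once inlining
 * exposes the read. Initializing qbit unconditionally silences the
 * warning; the value is never used on the normal-scheduler path. */
static int schedule_step(int is_normal_sched) {
    int qbit = 0;                    /* the fix: hoisted initialization */
    if (is_normal_sched) {
        /* normal scheduler: qbit unused */
    } else {
        qbit = 4;                    /* dirty scheduler computes qbit */
    }
    if (!is_normal_sched)
        clear_proc_dirty_queue_bit(qbit);
    return qbit;
}
```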
I could not reproduce the failure of one test in monitor_SUITE locally. Is there some way to access detailed logs of what is going wrong? And is this test known to be flaky?
We have a couple of failures of that testcase in our lab, so yes it can be a bit flaky. I re-triggered the tests so that we can see if it fails again.
I restarted the CI jobs again in order for this PR to run the tests in all applications, not just emulator. |
You can have a look here to see which tests currently fail on a clean master build: https://erlang.org/github-pr/master/ct_logs/ |
Apparently that did not work; it should work next time you push a fix to the PR, though.
The tests are all green this time, so I think this PR is ready for review. |
erts/emulator/beam/erl_process.c (Outdated)

        continue;
    }
    rpq = &vrq->procs.prio[prio_q];
    // Steal at least one task, even if there is a single one
Nitpick: please use /* .. */ for comments, not //.
    while (proc) {
        if (erts_try_change_runq_proc(proc, rq)) {
    while (proc && max_processes_to_steal) {
In an earlier commit you stole every other process, instead of taking the first 100. I assume this was due to fairness considerations. Did you see a performance increase when changing strategy to take the first 100 instead?
Yes, I remember that there were weird performance pathologies when trying to steal every other process from a very large queue (e.g. 5k or 10k processes), and those disappeared when I put that limit of maximum 100.
I unfortunately don't have the precise numbers at hand anymore.
Ah, I did not realize that the first version stole every other process without any bound. I suppose you have not tried whether stealing every other process with a bound also solves the issue? The reason I ask is that by stealing the first 100, you are basically down-prioritizing the last 50 processes, as the processes left in the queue will be run before them.
Ah right, there were three versions:
- every other with no bounds (not committed, had issues)
- every other, max 100 (the first commit)
- the first half, max 100 (rest of the PR)
I've not observed any issues in practice with going from 2 to 3, and it significantly shortens the critical section (since we only have to walk half as many nodes of the linked list). I think of it not as de-prioritizing the processes near the back of the stolen batch, since they would not have executed any faster without stealing; instead I see it as accelerating the processes that were not stolen, as their turn to execute comes up faster.
To follow a request by @garazdawi at erlang#9594.
Michał suggested this change to me, the idea is that if there is a single task in a queue, stealing it is not balancing the load, it is just exchanging which scheduler is starved for work.
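In code terms, the suggested guard is just the following (an illustrative sketch, not the exact erl_process.c condition):

```c
#include <assert.h>

/* If the victim has a single task, stealing it merely moves the
 * starvation from one scheduler to another, so skip that victim. */
static int worth_stealing_from(int victim_qlen) {
    return victim_qlen > 1;
}
```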
Force-pushed from 698a164 to d86d313.
It appears unused. A comment next to it suggested that the sign of rq->waiting indicates whether this condition variable is used. That sign appears to always be positive (contrary to the comment), so I also removed the code for the negative side, and the comment as well.
This looks very interesting. I took the 8ac8af9f82 "Silence (spurious) compiler warning" commit for a run on one of our servers, using the commit that this branch is based on, 57cee85703, as the baseline for comparison. I captured some lock-count profilings; I don't know if this is interesting to anyone, but anyway, here goes. The baseline 28.0 rc was already about 11% faster than 27.3 for our workload, very nice, and this PR adds another 11%. Very, very nice! :)

For the baseline commit 57cee85703, the first few lines:

For the 8ac8af9f82 "Silence (spurious) compiler warning" commit:
The workload is a mixture of communication: a lot of Erlang processes, around 250k, that communicate both via direct message passing, and also via many TCP and SCTP connections. The hardware in this case was a server with 2×Xeon Gold 6148 with 20 cores each, in total 80 logical CPUs (2 threads per core). We have some newer machines too with more cores, both dual and single socket servers, but I haven't checked those.
This is a tiny change that should improve cache locality a bit during task stealing.
Thanks a lot for these detailed numbers, they are very encouraging! Do you remember roughly what the CPU utilisation was when you took these measurements? We sadly observed smaller wins when the CPU is above 90% (since there is much less task stealing going on, this is not very surprising).
Ok, that's useful to know, and makes sense. The CPU load varied over the logical CPUs. For maybe 55-60 of the 80, it was around 80-90%, while for the remaining it was lower, much lower in a few cases. (I think we have some more bottlenecks to find in our code.)
Thanks, very nice!
To follow a request by @garazdawi at erlang#9594.
Profiling some services on many-core machines showed very high lock contention in the scheduler, specifically in the task-stealing mechanism.
This PR has a bunch of commits that both shrink the related critical sections, and reduce contention on these locks.
Testing: