forked from erlang/otp
-
Notifications
You must be signed in to change notification settings - Fork 7
backport: various optimizations to task stealing #75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
keynslug
wants to merge
15
commits into
emqx-OTP-27.3.4.2
Choose a base branch
from
backport/task-steal-opt-otp-28
base: emqx-OTP-27.3.4.2
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
When a scheduler runqueue is empty, that scheduler attempts to steal work from another (hopefully non-empty) runqueue. Currently, the sequence of events looks like this: - Notice that there is no work to do - Set a flag saying "don't steal work from me, I'm busy stealing" - Remove a process from some other runqueue - Put it back on your own runqueue - goto the beginning of the scheduler loop - Notice that there is now work to do - Take the process from your runqueue (that you just added it to) - Unset the flag - Execute that process The new sequence looks like this: - Notice that there is no work to do - Remove a process from some other runqueue - Execute that process There is no more need for a flag to protect our runqueue from stealing, since the process we stole never spends time in it (and so is never at risk of being stolen back from us which could lead to some kind of livelock)
Currently, the scheduler uses work-stealing and steals a single task at a time. This means that as soon as it is done with that one task, it needs to steal another one, and another, etc... All of this task stealing needs to take locks and can be a bottleneck. As a first step to remove this bottleneck, this patch changes the scheduler to steal half the tasks in a queue at once.
Instead of unlocking at the last moment and locking again right after, we unlock at the beginning of try_steal_task, and only re-lock at the end of try_steal_task. The only access to rq in that time is rq->ix, which is only written to once in erts_init_scheduling, so there should be no new race.
It is always NULL
We now steal up to 100 tasks at once. But we were updating the victim runqueue length by repeated decrements, and updating the beneficiary runqueue length by repeated increments. This is not just extra work: it also increased the length of the critical sections (since these updates were done while holding the relevant locks). This change batches the increments/decrements, doing a single addition/substraction.
Currently we steal half the tasks by alternating between taking a process and leaving it behind. With this commit we instead compute how many tasks to steal, and then steal that many without skipping any non-bound processes. This should halve the number of pointer dereferenced during task stealing, and thus also halve the number of cache misses, making that (highly contended) critical section faster
Instead of using a stack to store the tasks being stolen, and turn them back into a linked list at the end, we can store them as a linked-list throughout. The main benefit is shrinking the critical section where we re-enqueue them.
If we cannot immediately take the lock required to steal a task, we skip that runqueue for the moment and retry it only once we have looked at all other runqueues. This mitigates the performance hit we see where dozens of threads try to steal from the same runqueue at once, by spreading the load.
Currently we're checking it in every iteration in the fast loops (and this shows in profiling), and not at all in the final loop (which is the only one that can be slow enough to deserve it). This just fixes this small point I had forgotten in the last commit.
The code was structured as follows: ``` if (is_normal_sched) { .. } else { .. qbit = .. } .. if (!is_normal_sched) clear_proc_dirty_queue_bit(p, rq, qbit); ``` and it resulted in an error deep in clear_proc_dirty_queue_bit (probably because of some inlining): "error: ‘qbit’ may be used uninitialized in this function [-Werror=maybe-uninitialized]" This issue was previously hidden because of the control flow leading to this part of the code being a bit different, but that changed with the commit to "directly execute stolen processes", revealing this limitation of the compiler warning. This commit fixes it, simply by pushing the initialization of qbit outside of the "if (is_normal_sched)" condition.
To follow a request by @garazdawi at erlang#9594.
Michał suggested this change to me, the idea is that if there is a single task in a queue, stealing it is not balancing the load, it is just exchanging which scheduler is starved for work.
It appears unused. A comment next to it suggested that the sign of rq->waiting indicates whether this condition variable is used. It appears that this sign is always positive (against the comment), so I also removed the code for the negative side, and the comment as well.
This is a tiny change that should improve cache locality a bit during task stealing.
maybe focus on building emqx on otp 28 instead. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
See erlang#9594.