keynslug

When a scheduler runqueue is empty, that scheduler attempts to steal
work from another (hopefully non-empty) runqueue. Currently, the
sequence of events looks like this:
- Notice that there is no work to do
- Set a flag saying "don't steal work from me, I'm busy stealing"
- Remove a process from some other runqueue
- Put it back on your own runqueue
- goto the beginning of the scheduler loop
- Notice that there is now work to do
- Take the process from your runqueue (that you just added it to)
- Unset the flag
- Execute that process
The new sequence looks like this:
- Notice that there is no work to do
- Remove a process from some other runqueue
- Execute that process

There is no longer any need for a flag to protect our runqueue from
stealing, since the stolen process never spends time in it (and so is
never at risk of being stolen back from us, which could lead to a kind
of livelock).
Currently, the scheduler uses work-stealing and steals a single task at
a time. This means that as soon as it is done with that one task, it
needs to steal another one, and another, etc...
All of this task stealing needs to take locks and can be a bottleneck.
As a first step to remove this bottleneck, this patch changes the
scheduler to steal half the tasks in a queue at once.
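
A rough sketch of the batched steal, again with hypothetical names and
types rather than the real code:

```
typedef struct process { struct process *next; } Process;
typedef struct run_queue { int len; } RunQueue;

void runq_lock(RunQueue *rq);
void runq_unlock(RunQueue *rq);
Process *runq_dequeue(RunQueue *rq);

/* Detach roughly half of the victim's tasks in one locked section and
 * return them as a chain; the caller runs one and keeps the rest. */
static Process *steal_half(RunQueue *victim)
{
    Process *stolen = NULL, *p;
    int n;

    runq_lock(victim);
    n = victim->len / 2;
    while (n-- > 0 && (p = runq_dequeue(victim)) != NULL) {
        p->next = stolen;
        stolen = p;
    }
    runq_unlock(victim);
    return stolen;
}
```

One lock round-trip then yields work for many scheduling iterations,
instead of one steal (and one lock round-trip) per executed task.
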
Instead of unlocking at the last moment and locking again right after,
we unlock at the beginning of try_steal_task, and only re-lock at the
end of try_steal_task.

The only access to rq in that time is rq->ix, which is only written to
once in erts_init_scheduling, so there should be no new race.
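
With the same hypothetical names, the new locking shape is roughly:

```
typedef struct process Process;
typedef struct run_queue { int ix; } RunQueue;

void runq_lock(RunQueue *rq);
void runq_unlock(RunQueue *rq);
Process *scan_other_runqueues(int my_ix);

/* Called with rq's lock held; we give it up for the whole victim scan
 * and re-take it just before returning.  rq->ix is the only field of
 * rq touched while unlocked, and it is written exactly once, in
 * erts_init_scheduling, so reading it here introduces no new race. */
static Process *try_steal_task(RunQueue *rq)
{
    Process *p;

    runq_unlock(rq);
    p = scan_other_runqueues(rq->ix);
    runq_lock(rq);
    return p;
}
```
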
We now steal up to 100 tasks at once. But we were updating the victim
runqueue length by repeated decrements, and updating the beneficiary
runqueue length by repeated increments. This was not just extra work: it
also lengthened the critical sections (since these updates were done
while holding the relevant locks).

This change batches the increments/decrements into a single
addition/subtraction.
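
A sketch of the batched accounting on the victim side (hypothetical
names; runq_detach_first is assumed to unlink a process without touching
the length field):

```
typedef struct process { struct process *next; } Process;
typedef struct run_queue { int len; } RunQueue;

void runq_lock(RunQueue *rq);
void runq_unlock(RunQueue *rq);
Process *runq_detach_first(RunQueue *rq);   /* unlinks, does not touch len */

/* Take up to n tasks from the victim.  The queue length is updated with
 * one subtraction at the end of the batch, not one decrement per task,
 * so the critical section stays short. */
static Process *steal_batch(RunQueue *victim, int n)
{
    Process *stolen = NULL, *p;
    int taken = 0;

    runq_lock(victim);
    while (taken < n && (p = runq_detach_first(victim)) != NULL) {
        p->next = stolen;
        stolen = p;
        taken++;
    }
    victim->len -= taken;     /* single subtraction for the whole batch */
    runq_unlock(victim);
    return stolen;
}
```

The beneficiary side gets the mirror-image treatment: a single addition
to its length while its own lock is held.
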
Currently we steal half the tasks by alternating between taking a
process and leaving it behind. With this commit we instead compute how
many tasks to steal, and then steal that many without skipping any
non-bound processes.
This should halve the number of pointer dereferences during task
stealing, and thus also halve the number of cache misses, making that
(highly contended) critical section faster.
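
A sketch of the new selection, with hypothetical names and a made-up
bound flag standing in for the real bound-process check:

```
typedef struct process { struct process *next; int bound; } Process;
typedef struct run_queue { Process *first; int len; } RunQueue;

void runq_detach(RunQueue *rq, Process *p);   /* unlink p from the queue */
void collect(Process *p);                     /* add p to the stolen set */

/* Take the first n_steal non-bound processes in a row, instead of
 * walking the whole list and taking every other one. */
static void take_consecutive(RunQueue *victim, int n_steal)
{
    Process *p = victim->first;

    while (p && n_steal > 0) {
        Process *next = p->next;
        if (!p->bound) {            /* bound processes must stay put */
            runq_detach(victim, p);
            collect(p);
            n_steal--;
        }
        p = next;
    }
}
```
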
Instead of using a stack to store the tasks being stolen and turning
them back into a linked list at the end, we can store them as a linked list
throughout. The main benefit is shrinking the critical section where we
re-enqueue them.
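
Sketched with hypothetical names, the stolen tasks stay chained
head-to-tail while scanning, so handing them to our own runqueue becomes
a single splice under the lock:

```
typedef struct process { struct process *next; } Process;
typedef struct run_queue RunQueue;

void runq_lock(RunQueue *rq);
void runq_unlock(RunQueue *rq);
void runq_splice(RunQueue *rq, Process *head, Process *tail, int n);

/* Append p to the stolen chain, preserving steal order. */
static void chain(Process **head, Process **tail, Process *p)
{
    p->next = NULL;
    if (*tail)
        (*tail)->next = p;
    else
        *head = p;
    *tail = p;
}

/* With the chain already built, re-enqueueing is one splice under the
 * lock instead of n separate insertions. */
static void enqueue_stolen(RunQueue *own, Process *head, Process *tail, int n)
{
    runq_lock(own);
    runq_splice(own, head, tail, n);
    runq_unlock(own);
}
```
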
If we cannot immediately take the lock required to steal a task, we skip
that runqueue for the moment and retry it only once we have looked at
all other runqueues. This mitigates the performance hit
we see where dozens of threads try to steal from the same runqueue at
once, by spreading the load.
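
One way to sketch the two-pass scan (hypothetical names; the fixed-size
skipped array is an assumption made for brevity):

```
typedef struct process Process;
typedef struct run_queue RunQueue;

int  runq_trylock(RunQueue *rq);     /* non-zero on success */
void runq_lock(RunQueue *rq);
void runq_unlock(RunQueue *rq);
Process *steal_from(RunQueue *victim);

extern RunQueue *runqueues[];
extern int n_runqueues;

static Process *scan_victims(int my_ix)
{
    RunQueue *skipped[64];           /* assumed upper bound, for the sketch */
    int i, n_skipped = 0;
    Process *p;

    /* First pass: never block; contended victims are deferred. */
    for (i = 1; i < n_runqueues; i++) {
        RunQueue *victim = runqueues[(my_ix + i) % n_runqueues];
        if (!runq_trylock(victim)) {
            skipped[n_skipped++] = victim;
            continue;
        }
        p = steal_from(victim);
        runq_unlock(victim);
        if (p)
            return p;
    }

    /* Second pass: only now are we willing to wait on a lock. */
    for (i = 0; i < n_skipped; i++) {
        runq_lock(skipped[i]);
        p = steal_from(skipped[i]);
        runq_unlock(skipped[i]);
        if (p)
            return p;
    }
    return NULL;
}
```
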
Currently we're checking it in every iteration of the fast loops (and
this shows in profiling), and not at all in the final loop (which is
the only one that can be slow enough to deserve it). This fixes a
small point I had forgotten in the last commit.

The code was structured as follows:
```
if (is_normal_sched) {
  ..
} else {
  ..
  qbit = ..
}
..
if (!is_normal_sched)
  clear_proc_dirty_queue_bit(p, rq, qbit);
```
and it resulted in an error deep in clear_proc_dirty_queue_bit (probably
because of some inlining): "error: ‘qbit’ may be used uninitialized in this
function [-Werror=maybe-uninitialized]"

This issue was previously hidden because of the control flow leading to
this part of the code being a bit different, but that changed with the
commit to "directly execute stolen processes", revealing this limitation
of the compiler warning.

This commit fixes it, simply by pushing the initialization of qbit
outside of the "if (is_normal_sched)" condition.
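
The fixed structure looks roughly like this (same placeholders as above,
not the exact code):

```
qbit = ..;                    /* now initialized unconditionally */
if (is_normal_sched) {
  ..
} else {
  ..
}
..
if (!is_normal_sched)
  clear_proc_dirty_queue_bit(p, rq, qbit);
```
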
Michał suggested this change to me; the idea is that if there is a
single task in a queue, stealing it does not balance the load, it just
changes which scheduler is starved for work.
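
In sketch form, with a hypothetical length field, the check amounts to
skipping any victim that holds at most one task:

```
typedef struct run_queue { int len; } RunQueue;

/* Stealing the only task on a runqueue would just move the starvation
 * from one scheduler to another, so such victims are skipped. */
static int worth_stealing_from(RunQueue *victim)
{
    return victim->len > 1;
}
```
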
It appears unused.

A comment next to it suggested that the sign of rq->waiting indicates
whether this condition variable is used. That sign appears to always be
positive (contrary to the comment), so I also removed the code for the
negative case, along with the comment.

This is a tiny change that should improve cache locality a bit during
task stealing.
zmstone (Member) commented Sep 18, 2025:

maybe focus on building emqx on otp 28 instead.
