keynslug

When a scheduler runqueue is empty, that scheduler attempts to steal
work from another (hopefully non-empty) runqueue. Currently, the
sequence of events looks like this:
- Notice that there is no work to do
- Set a flag saying "don't steal work from me, I'm busy stealing"
- Remove a process from some other runqueue
- Put it back on your own runqueue
- goto the beginning of the scheduler loop
- Notice that there is now work to do
- Take the process from your runqueue (that you just added it to)
- Unset the flag
- Execute that process
The new sequence looks like this:
- Notice that there is no work to do
- Remove a process from some other runqueue
- Execute that process

There is no longer any need for a flag to protect our runqueue from
stealing, since the stolen process never spends time in it (and so is
never at risk of being stolen back from us, which could lead to a kind
of livelock).
Currently, the scheduler uses work-stealing and steals a single task at
a time. This means that as soon as it is done with that one task, it
needs to steal another one, and another, etc...
All of this task stealing needs to take locks and can be a bottleneck.
As a first step to remove this bottleneck, this patch changes the
scheduler to steal half the tasks in a queue at once.
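
A rough sketch of the batched steal, again with hypothetical names and
types rather than the real code:

```
typedef struct process { struct process *next; } Process;
typedef struct run_queue { int len; } RunQueue;

void runq_lock(RunQueue *rq);
void runq_unlock(RunQueue *rq);
Process *runq_dequeue(RunQueue *rq);

/* Detach roughly half of the victim's tasks in one locked section and
 * return them as a chain; the caller runs one and keeps the rest. */
static Process *steal_half(RunQueue *victim)
{
    Process *stolen = NULL, *p;
    int n;

    runq_lock(victim);
    n = victim->len / 2;
    while (n-- > 0 && (p = runq_dequeue(victim)) != NULL) {
        p->next = stolen;
        stolen = p;
    }
    runq_unlock(victim);
    return stolen;
}
```

One lock round-trip then yields work for many scheduling iterations,
instead of one steal (and one lock round-trip) per executed task.
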
Instead of unlocking at the last moment and locking again right after,
we unlock at the beginning of try_steal_task, and only re-lock at the
end of try_steal_task.

The only access to rq in that time is rq->ix, which is only written to
once in erts_init_scheduling, so there should be no new race.
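
With the same hypothetical names, the new locking shape is roughly:

```
typedef struct process Process;
typedef struct run_queue { int ix; } RunQueue;

void runq_lock(RunQueue *rq);
void runq_unlock(RunQueue *rq);
Process *scan_other_runqueues(int my_ix);

/* Called with rq's lock held; we give it up for the whole victim scan
 * and re-take it just before returning.  rq->ix is the only field of
 * rq touched while unlocked, and it is written exactly once, in
 * erts_init_scheduling, so reading it here introduces no new race. */
static Process *try_steal_task(RunQueue *rq)
{
    Process *p;

    runq_unlock(rq);
    p = scan_other_runqueues(rq->ix);
    runq_lock(rq);
    return p;
}
```
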
We now steal up to 100 tasks at once. But we were updating the victim
runqueue length by repeated decrements, and updating the beneficiary
runqueue length by repeated increments. This was not just extra work: it
also lengthened the critical sections (since these updates were done
while holding the relevant locks).

This change batches the increments/decrements into a single
addition/subtraction.
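
A sketch of the batched accounting on the victim side (hypothetical
names; runq_detach_first is assumed to unlink a process without touching
the length field):

```
typedef struct process { struct process *next; } Process;
typedef struct run_queue { int len; } RunQueue;

void runq_lock(RunQueue *rq);
void runq_unlock(RunQueue *rq);
Process *runq_detach_first(RunQueue *rq);   /* unlinks, does not touch len */

/* Take up to n tasks from the victim.  The queue length is updated with
 * one subtraction at the end of the batch, not one decrement per task,
 * so the critical section stays short. */
static Process *steal_batch(RunQueue *victim, int n)
{
    Process *stolen = NULL, *p;
    int taken = 0;

    runq_lock(victim);
    while (taken < n && (p = runq_detach_first(victim)) != NULL) {
        p->next = stolen;
        stolen = p;
        taken++;
    }
    victim->len -= taken;     /* single subtraction for the whole batch */
    runq_unlock(victim);
    return stolen;
}
```

The beneficiary side gets the mirror-image treatment: a single addition
to its length while its own lock is held.
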
Currently we steal half the tasks by alternating between taking a
process and leaving it behind. With this commit we instead compute how
many tasks to steal, and then steal that many without skipping any
non-bound processes.
This should halve the number of pointer dereferences during task
stealing, and thus also halve the number of cache misses, making that
(highly contended) critical section faster.
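
A sketch of the new selection, with hypothetical names and a made-up
bound flag standing in for the real bound-process check:

```
typedef struct process { struct process *next; int bound; } Process;
typedef struct run_queue { Process *first; int len; } RunQueue;

void runq_detach(RunQueue *rq, Process *p);   /* unlink p from the queue */
void collect(Process *p);                     /* add p to the stolen set */

/* Take the first n_steal non-bound processes in a row, instead of
 * walking the whole list and taking every other one. */
static void take_consecutive(RunQueue *victim, int n_steal)
{
    Process *p = victim->first;

    while (p && n_steal > 0) {
        Process *next = p->next;
        if (!p->bound) {            /* bound processes must stay put */
            runq_detach(victim, p);
            collect(p);
            n_steal--;
        }
        p = next;
    }
}
```
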
Instead of using a stack to store the tasks being stolen and turning
them back into a linked list at the end, we can store them as a linked list
throughout. The main benefit is shrinking the critical section where we
re-enqueue them.
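
Sketched with hypothetical names, the stolen tasks stay chained
head-to-tail while scanning, so handing them to our own runqueue becomes
a single splice under the lock:

```
typedef struct process { struct process *next; } Process;
typedef struct run_queue RunQueue;

void runq_lock(RunQueue *rq);
void runq_unlock(RunQueue *rq);
void runq_splice(RunQueue *rq, Process *head, Process *tail, int n);

/* Append p to the stolen chain, preserving steal order. */
static void chain(Process **head, Process **tail, Process *p)
{
    p->next = NULL;
    if (*tail)
        (*tail)->next = p;
    else
        *head = p;
    *tail = p;
}

/* With the chain already built, re-enqueueing is one splice under the
 * lock instead of n separate insertions. */
static void enqueue_stolen(RunQueue *own, Process *head, Process *tail, int n)
{
    runq_lock(own);
    runq_splice(own, head, tail, n);
    runq_unlock(own);
}
```
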
If we cannot immediately take the lock required to steal a task, we skip
that runqueue for the moment and retry it only once we have looked at
all other runqueues. This mitigates the performance hit
we see where dozens of threads try to steal from the same runqueue at
once, by spreading the load.
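
One way to sketch the two-pass scan (hypothetical names; the fixed-size
skipped array is an assumption made for brevity):

```
typedef struct process Process;
typedef struct run_queue RunQueue;

int  runq_trylock(RunQueue *rq);     /* non-zero on success */
void runq_lock(RunQueue *rq);
void runq_unlock(RunQueue *rq);
Process *steal_from(RunQueue *victim);

extern RunQueue *runqueues[];
extern int n_runqueues;

static Process *scan_victims(int my_ix)
{
    RunQueue *skipped[64];           /* assumed upper bound, for the sketch */
    int i, n_skipped = 0;
    Process *p;

    /* First pass: never block; contended victims are deferred. */
    for (i = 1; i < n_runqueues; i++) {
        RunQueue *victim = runqueues[(my_ix + i) % n_runqueues];
        if (!runq_trylock(victim)) {
            skipped[n_skipped++] = victim;
            continue;
        }
        p = steal_from(victim);
        runq_unlock(victim);
        if (p)
            return p;
    }

    /* Second pass: only now are we willing to wait on a lock. */
    for (i = 0; i < n_skipped; i++) {
        runq_lock(skipped[i]);
        p = steal_from(skipped[i]);
        runq_unlock(skipped[i]);
        if (p)
            return p;
    }
    return NULL;
}
```
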
Currently we're checking it in every iteration of the fast loops (and
this shows in profiling), and not at all in the final loop (which is
the only one that can be slow enough to deserve it). This fixes a
small point I had forgotten in the last commit.

The code was structured as follows:
```
if (is_normal_sched) {
  ..
} else {
  ..
  qbit = ..
}
..
if (!is_normal_sched)
  clear_proc_dirty_queue_bit(p, rq, qbit);
```
and it resulted in an error deep in clear_proc_dirty_queue_bit (probably
because of some inlining): "error: ‘qbit’ may be used uninitialized in this
function [-Werror=maybe-uninitialized]"

This issue was previously hidden because of the control flow leading to
this part of the code being a bit different, but that changed with the
commit to "directly execute stolen processes", revealing this limitation
of the compiler warning.

This commit fixes it, simply by pushing the initialization of qbit
outside of the "if (is_normal_sched)" condition.
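
The fixed structure looks roughly like this (same placeholders as above,
not the exact code):

```
qbit = ..;                    /* now initialized unconditionally */
if (is_normal_sched) {
  ..
} else {
  ..
}
..
if (!is_normal_sched)
  clear_proc_dirty_queue_bit(p, rq, qbit);
```
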
Michał suggested this change to me; the idea is that if there is a
single task in a queue, stealing it does not balance the load, it just
changes which scheduler is starved for work.
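
In sketch form, with a hypothetical length field, the check amounts to
skipping any victim that holds at most one task:

```
typedef struct run_queue { int len; } RunQueue;

/* Stealing the only task on a runqueue would just move the starvation
 * from one scheduler to another, so such victims are skipped. */
static int worth_stealing_from(RunQueue *victim)
{
    return victim->len > 1;
}
```
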
It appears unused.

A comment next to it suggested that the sign of rq->waiting indicates
whether this condition variable is used. That sign appears to always be
positive (contrary to the comment), so I also removed the code for the
negative case, along with the comment.

This is a tiny change that should improve cache locality a bit during
task stealing.
zmstone (Member) commented Sep 18, 2025:

maybe focus on building emqx on otp 28 instead.
