lonely is a no_std + alloc + no-TLS task executor for the Rust async/await ecosystem, focused on minimizing CPU overhead, especially the overhead of cross-core synchronization.

lonely is mainly inspired by DPDK and aims to minimize CPU overhead at the cost of dynamism in work scheduling. The core of lonely is Exec: a single-threaded task executor that runs zero atomic operations as long as there is work to do. A ring buffer implementation inspired by DPDK's is used to send and receive tasks across thread boundaries.
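
The idea of touching shared state only when local work runs out can be sketched roughly as follows. This is not lonely's implementation: `Worker`, `Task`, and `batch` are hypothetical names, tasks are plain closures rather than futures, and `std::sync::mpsc` stands in for the DPDK-style ring buffer (so this sketch needs std, unlike the crate itself).

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Stand-in for a task; a real executor would store futures instead.
pub type Task = Box<dyn FnOnce() -> u32 + Send>;

/// Hypothetical worker: a thread-local queue plus a dedicated inbox.
pub struct Worker {
    local: VecDeque<Task>,
    inbox: Receiver<Task>,
    batch: usize,
}

impl Worker {
    /// Creates a worker and the sending half that producers use to feed it.
    pub fn new(batch: usize) -> (Sender<Task>, Worker) {
        let (tx, rx) = channel();
        let worker = Worker { local: VecDeque::new(), inbox: rx, batch };
        (tx, worker)
    }

    /// Runs until both queues are empty and returns the sum of task results.
    pub fn run(&mut self) -> u32 {
        let mut total = 0;
        loop {
            // Hot path: drain local work without touching shared state.
            while let Some(task) = self.local.pop_front() {
                total += task();
            }
            // Cold path: only now pay for cross-thread synchronization,
            // moving at most `batch` tasks from the inbox to the local queue.
            for task in self.inbox.try_iter().take(self.batch) {
                self.local.push_back(task);
            }
            if self.local.is_empty() {
                return total;
            }
        }
    }
}
```

A producer would call `tx.send(Box::new(|| 1))` from any thread, and the worker pays for synchronization once per batch instead of once per task.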

Popular task executors like tokio and async-std generally use a work-stealing approach to keep threads busy, which I will call a "pull model" for contrast: when a worker is out of tasks, it tries to grab work from either a global queue or the task queues of other threads. This works well when workloads are uneven across workers, because work gets spread across the whole pool of threads very efficiently.
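
For contrast, a minimal sketch of the pull model might look like the code below. It assumes a hypothetical `Pool` type, represents tasks as plain `u32` IDs, and uses `Mutex<VecDeque>` in place of the lock-free queues that production executors typically use; it is not how tokio or async-std are implemented.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Hypothetical pool: one shared global queue and one queue per worker.
pub struct Pool {
    global: Mutex<VecDeque<u32>>,
    locals: Vec<Mutex<VecDeque<u32>>>,
}

impl Pool {
    pub fn new(workers: usize) -> Pool {
        Pool {
            global: Mutex::new(VecDeque::new()),
            locals: (0..workers).map(|_| Mutex::new(VecDeque::new())).collect(),
        }
    }

    pub fn push_local(&self, worker: usize, task: u32) {
        self.locals[worker].lock().unwrap().push_back(task);
    }

    pub fn push_global(&self, task: u32) {
        self.global.lock().unwrap().push_back(task);
    }

    /// Finds the next task for worker `me`: own queue, then the global
    /// queue, then the back of every other worker's queue.
    pub fn next_task(&self, me: usize) -> Option<u32> {
        if let Some(task) = self.locals[me].lock().unwrap().pop_front() {
            return Some(task);
        }
        if let Some(task) = self.global.lock().unwrap().pop_front() {
            return Some(task);
        }
        // Stealing: in the worst case this synchronizes with every peer.
        for (i, queue) in self.locals.iter().enumerate() {
            if i != me {
                if let Some(task) = queue.lock().unwrap().pop_back() {
                    return Some(task);
                }
            }
        }
        None
    }
}
```

Note that even the local queue sits behind a synchronization primitive here, because peers must be able to steal from it at any time.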

lonely implements a "push model" for sending work across thread boundaries: when a worker is out of tasks, it tries to pop a batch of tasks from a work queue dedicated to that worker, so it shares the queue only with the threads that send work to it, the task producers. To distribute work across many workers, lonely provides a "worker group" abstraction with pluggable load-balancing algorithms: ExecGroup. The default load-balancing approach is round robin, where task producers rotate through the worker queues when sending tasks. Users of lonely can implement custom work distribution schemes, which can be very useful for certain applications. Examples include hashing a key of the task to a specific worker ID, as in connection or database handling, or limiting the depth of worker task queues to introduce backpressure for task producers.
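
The two distribution schemes mentioned above, round robin and key hashing, could look something like this. The `Balance` trait, `RoundRobin`, and `KeyAffinity` are hypothetical names chosen for illustration and are not ExecGroup's actual interface; the sketch also assumes `workers` is greater than zero.

```rust
/// Hypothetical load-balancing interface; not lonely's actual API.
pub trait Balance {
    /// Returns the index of the worker queue that should receive the task.
    /// `workers` must be greater than zero.
    fn pick(&mut self, key: u64, workers: usize) -> usize;
}

/// Round robin: ignore the key and rotate through the worker queues.
#[derive(Default)]
pub struct RoundRobin {
    next: usize,
}

impl Balance for RoundRobin {
    fn pick(&mut self, _key: u64, workers: usize) -> usize {
        let chosen = self.next % workers;
        self.next = (chosen + 1) % workers;
        chosen
    }
}

/// Key affinity: the same key always maps to the same worker, so all
/// tasks for one connection (for example) run on one core.
pub struct KeyAffinity;

impl Balance for KeyAffinity {
    fn pick(&mut self, key: u64, workers: usize) -> usize {
        // Multiplicative hashing spreads nearby keys across workers;
        // the upper bits are the best mixed, so use those.
        let mixed = key.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32;
        (mixed as usize) % workers
    }
}
```

A task producer would hold one balancer per worker group and call `pick` once per task it sends. Backpressure could be layered on top by having the producer refuse to enqueue when the chosen queue is full.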

Comparing these models, the pull model is a very good general approach when the task workload is fairly uneven: a worker with too much to do has tasks stolen from its backlog, which improves utilization across the whole worker pool. I recommend using a pull model executor for general workloads. lonely aims to fill a very specific niche, and you should only use it when you know you need it to meet your performance goals.

The pull model implies that the local work queue must support work stealing, so a worker may incur cross-core synchronization and cache misses every time it reads its next task. Work stealing also inherently scales fairly poorly: CPU overhead grows with the number of workers in the pool, since a worker generally has to synchronize with every other worker to find potential work when stealing or grabbing from a global queue.

Use a push model executor when your tasks are relatively uniform in duration across a worker group and you really need to squeeze performance out of your system. DPDK, the main inspiration, is generally used to implement software like load balancers, packet routers and firewalls, where every cycle matters. If every cycle matters to you, you can figure out how to distribute your tasks properly over a worker group, and you want to use async/await to cooperatively multiplex work onto a single core, maybe lonely is for you. lonely is very explicit about where work is executed and can be a good alternative when you want full control over this, perhaps to guarantee that you handle a network request on the same core that performs the disk IO for that request.

Before std::task::Waker was stabilized, LocalWaker was removed in this PR based on reasoning in a blog post by withoutboats. LocalWaker was !Send + !Sync and could be converted into a Waker (which is Send + Sync) with a single function call. I could say a lot about the linked blog post (and you can DM me about it if you'd like), but the conclusion is that removing LocalWaker from the stabilized API makes it impossible to soundly implement lonely's design in stable Rust. The blog post proposes solutions that require either thread-local storage (which lonely explicitly aims to avoid) or atomic instructions in the hot task polling loop. I have opted to do neither, so this crate is unsound and should not be used until LocalWaker is re-introduced. Or you could, as a user, just try to avoid sending a Waker across a thread boundary and hope nothing breaks.
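
The root of the problem is visible in the type system. The small check below, using a hypothetical helper named `assert_send_sync`, compiles only because std::task::Waker is Send + Sync. That means safe code is always allowed to move a Waker to another thread, so an executor whose wake-up path uses neither atomics nor thread-local storage cannot rule out a data race.

```rust
use std::task::Waker;

/// Compiles only if `T` may be moved to and shared between threads.
fn assert_send_sync<T: Send + Sync>() {}

/// Returns true; the interesting part is that this function compiles at all.
pub fn waker_is_send_sync() -> bool {
    assert_send_sync::<Waker>();
    true
}
```

Swapping `Waker` for a type such as `std::rc::Rc<()>` makes the same check fail to compile, which is the kind of guarantee a !Send + !Sync LocalWaker would have given a single-threaded executor.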