Here is an overview of items and features considered for future improvement, with sketches of solutions:

User-facing features

Receiving tasks from any thread

The task queues are SPMC queues, hence other threads can't enqueue new jobs.
See #6

Solutions:
- A global MPMC queue (which risks reintroducing contention)
- Modify the Chase-Lev deque to allow pushing at the front or the back, or change the data structure
- Give each worker an extra MPSC queue to receive tasks from other threads (which risks latency, as those tasks are unstealable); see the sketch below
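A minimal sketch of the third option, assuming a simple lock-based inbox per worker (Inbox, post and tryDrain are illustrative names, not taskpools APIs; a production version would likely use a lock-free MPSC intrusive list):

```nim
import std/[locks, deques]

type
  Inbox*[T] = object
    ## Minimal lock-based MPSC "inbox": any thread may post,
    ## only the owning worker drains it into its own deque.
    lock: Lock
    items: Deque[T]

proc init*[T](inbox: var Inbox[T]) =
  initLock inbox.lock
  inbox.items = initDeque[T]()

proc post*[T](inbox: var Inbox[T], item: sink T) =
  ## Called by any producer thread.
  acquire inbox.lock
  inbox.items.addLast item
  release inbox.lock

proc tryDrain*[T](inbox: var Inbox[T], dst: var seq[T]) =
  ## Called by the owning worker at the top of its scheduling loop,
  ## before it pops from its Chase-Lev deque.
  acquire inbox.lock
  while inbox.items.len > 0:
    dst.add inbox.items.popFirst()
  release inbox.lock
```

The owner drains the inbox into its regular deque at the top of its scheduling loop, so external tasks become stealable after one scheduling round of latency.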
Scheduling, Load balancing and task priority
Users might want to have access to task priorities.
In particular, for Proof-of-Stake blockchains, validation and block proposal messages have higher priority than anything else.
The Chase-Lev deques, which are LIFO, can be replaced by multiqueues, which are relaxed FIFO priority queues:
- Concurrent Priority Queues on a Shared-Memory Platform, which gives an overview from a third-party author who has no incentive to push a particular implementation forward.
Note: supporting priorities has a heavy impact on the runtime (a sketch of one coarse approach follows this list):
- LIFO scheduling has better throughput, while FIFO has more fairness and reduces latency.
- A computational job often spawns many tasks; LIFO maximizes the chance that their data stays hot in cache while the job is being processed.
- With a deque, the owner competes with potential thieves only for the last task; with a FIFO queue, everyone competes at the tail.
- A priority queue is probably incompatible with batch stealing ("stealHalf"), which significantly reduces overhead.

see also: mratsim/weave#22
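A minimal sketch of a coarse compromise that keeps the existing LIFO deques but segregates them by priority (Priority, PrioritizedQueues and their procs are illustrative, not taskpools APIs):

```nim
type
  Priority* = enum
    prioHigh, prioNormal, prioLow

  PrioritizedQueues*[T] = object
    ## One queue per priority level; a real implementation would keep
    ## one Chase-Lev deque (or multiqueue) per level instead of a seq.
    queues: array[Priority, seq[T]]

proc push*[T](q: var PrioritizedQueues[T], prio: Priority, task: sink T) =
  q.queues[prio].add task

proc tryPop*[T](q: var PrioritizedQueues[T], task: var T): bool =
  ## Serve higher priorities first. This keeps LIFO throughput within a
  ## level but gives only coarse, non-preemptive priorities, and thieves
  ## must scan a victim's levels in the same order.
  for prio in Priority:
    if q.queues[prio].len > 0:
      task = q.queues[prio].pop()
      return true
  false
```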
Structured Parallelism

See spec nim-lang/RFCs#347 (comment)

This would allow the following:

```nim
tp.syncScope:
  tp.spawn fn0()
  tp.spawn fn1()
let a = ...
```
and ensure that all tasks are finished before exiting the scope, without waiting for tasks spawned outside of this scope (as syncAll does). Also, any thread can call syncScope, unlike syncAll. It would also avoid having to wrap procs coming from dependencies so that they return a dummy bool just so we can await them.
Implementation is a straightforward Atomic[int] and the same code as forceFuture, except that we wait until we have no descendant tasks.
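A minimal sketch of that counter, assuming the runtime increments it on every spawn inside the scope and decrements it on task completion (ScopedCounter and its procs are illustrative, not the actual taskpools internals):

```nim
import std/atomics

type
  ScopedCounter* = object
    ## Tracks tasks spawned inside a syncScope that haven't completed yet.
    pending: Atomic[int]

proc taskSpawned*(sc: var ScopedCounter) =
  sc.pending.atomicInc()

proc taskCompleted*(sc: var ScopedCounter) =
  sc.pending.atomicDec()

proc waitScope*(sc: var ScopedCounter) =
  ## Same loop shape as forceFuture: keep working while descendants remain,
  ## instead of spinning idly.
  while sc.pending.load(moAcquire) > 0:
    discard # the real runtime would execute one pending/stolen task here
```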
Event system / Dataflow parallelism

This allows tasks to be scheduled only when an event is triggered (aka task graphs with precise in/out dependencies).
Implementation is standalone and does not touch the internals of taskpools.
https://github.com/mratsim/weave/blob/71dc2d7/weave/cross_thread_com/flow_events.nim
Design and competing designs have been heavily discussed here: mratsim/weave#92 (comment)

Dealing with blocked threads

Assume a 4-core CPU like a Raspberry Pi: a user might want to spawn blocking operations, like reading from stdin, on a threadpool because that's a common pattern. A long-running computation can also be considered blocking from the point of view of the user interface.
It is then possible for all cores to be blocked, so that no progress is made even though there is work to do.
Solutions:
- Document that users can create dedicated blocking threadpools alongside a main one for non-IO tasks (see the sketch after this list).
- Have a dedicated proc spawnBlocking for IO that can block, and potentially a numBlockingThread parameter in Taskpool.new(...) for dedicated blocking threads. This requires users to know their workload.
- Have some detection logic similar to the Go scheduler to detect long-running or blocked threads and create a new thread. See the Go presentation: https://assets.ctfassets.net/oxjq45e8ilak/48lwQdnyDJr2O64KUsUB5V/5d8343da0119045c4b26eb65a83e786f/100545_516729073_DMITRII_VIUKOV_Go_scheduler_Implementing_language_with_lightweight_concurrency.pdf
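A minimal sketch of the first option, using only the existing public taskpools API (the pool sizes and the stdin workload are illustrative):

```nim
import taskpools

proc readUserInput(): string =
  stdin.readLine()  # blocks until the user types something

proc main() =
  # One small pool reserved for blocking IO, one sized for compute,
  # so blocked IO tasks can never starve the compute workers.
  var ioPool = Taskpool.new(numThreads = 2)
  var computePool = Taskpool.new(numThreads = 4)

  let line = ioPool.spawn readUserInput()
  # ... submit compute tasks to computePool here ...
  echo sync(line)

  ioPool.shutdown()
  computePool.shutdown()

main()
```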
Implementation features

Reducing allocations

Currently, spawning will cause the following allocations:
- 1 in Task, for storing the function call arguments. This is inherent to std/tasks.
- 1 for the TaskNode
- 1 for the Flowvar / inner Channel (needs an UncheckedArray)
Solution:
- For the upstream problem, we can patch upstream to have a buffer (std/tasks: "small functions" optimization nim-lang/RFCs#443). The inner fields take 24 bytes; a 40-byte buffer would cover many use cases without allocation. For example, a function with 2 openArrays requires at minimum 32 bytes, which leaves 8 bytes for a pointer to a result or for a callback (see the sketch after this list).
- The TaskNode can be made intrusive to the Flowvar
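A minimal sketch of the small-buffer idea (BufferedTask is a hypothetical layout for illustration, not the std/tasks object): argument payloads of up to 40 bytes are stored inline, larger ones fall back to a heap allocation.

```nim
const InlineSize = 40

type
  BufferedTask = object
    invoke: proc (env: pointer) {.nimcall.}  # type-erased call thunk
    case onHeap: bool
    of false:
      inline: array[InlineSize, byte]  # small argument payloads live inline
    of true:
      env: pointer                     # large payloads fall back to the heap

proc run(t: var BufferedTask) =
  # Dispatch through the thunk with whichever environment storage is active.
  if t.onHeap:
    t.invoke(t.env)
  else:
    t.invoke(addr t.inline[0])
```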
Overhead with the event notifier / putting threads to sleep and waking them up

See #5
Memory reclamation

Currently, when the task deques grow, the old buffers aren't collected. We might want to consider a memory reclamation scheme or a "garbage collector" just for the runtime, for example Epoch-Based Reclamation.
Alternatively, if we decide to use a priority queue (a list-based data structure), there is no memory to reclaim anymore.
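A minimal sketch of Epoch-Based Reclamation for the old deque buffers, assuming a fixed maximum worker count, buffers allocated with allocShared, and a single reclaiming thread (all names are illustrative; a production version needs per-thread retire lists and more careful fencing):

```nim
import std/atomics

const
  MaxWorkers = 16
  Quiescent = -1

var
  globalEpoch: Atomic[int]
  # Epoch a worker was pinned to when it entered the runtime,
  # or Quiescent when it isn't touching shared structures.
  localEpoch: array[MaxWorkers, Atomic[int]]
  # Buffers retired during epoch e go to bucket e mod 3.
  retired: array[3, seq[pointer]]

proc initEbr() =
  for i in 0 ..< MaxWorkers:
    localEpoch[i].store(Quiescent, moRelaxed)

proc enter(id: int) =
  ## Pin the worker to the current epoch before touching shared memory.
  localEpoch[id].store(globalEpoch.load(moAcquire), moRelease)

proc exit(id: int) =
  localEpoch[id].store(Quiescent, moRelease)

proc retire(p: pointer) =
  ## Defer freeing of an old deque buffer to a later epoch.
  retired[globalEpoch.load(moAcquire) mod 3].add p

proc tryAdvance() =
  ## If every pinned worker has observed the current epoch, advance it
  ## and free the bucket retired two epochs ago: no worker can still
  ## hold a reference to those buffers.
  var e = globalEpoch.load(moAcquire)
  for id in 0 ..< MaxWorkers:
    let le = localEpoch[id].load(moAcquire)
    if le != Quiescent and le != e:
      return
  if globalEpoch.compareExchange(e, e + 1, moAcquireRelease):
    for p in retired[(e + 1) mod 3]:
      deallocShared p  # assumes buffers came from allocShared
    retired[(e + 1) mod 3].setLen 0
```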
Allow awaiting threads to sleep
When awaiting, threads don't sleep, as there is no way to wake up a specific thread when the task it was waiting for is finished (taskpools/taskpools.nim, lines 337 to 345 at 12752b9):
```nim
if tp.eventNotifier.getParked() == tp.numThreads - 1:
  # 2.2.1 all threads besides the current are parked
  debugTermination:
    log("Worker %2d: syncAll 2.2.1 - termination, all other threads sleeping\n", ctx.id)
  foreignThreadsParked = true
else:
  # 2.2.2 We don't park as there is no notif for task completion
  cpuRelax()
```
Busy loops waste energy.
About not processing all tasks in our queue when awaiting
Note: astute readers will remark that when awaiting, a worker will skip its own unrelated tasks (i.e. non-children of the awaited task) and instead directly steal from other workers (taskpools/taskpools.nim, lines 274 to 277 at b31b891).

Why?
- If we're awaiting a future and have no work to do, that future depends on computation in other workers' queues. (A dependency has to be spawned after our own task, so if it isn't at the tail of our deque, it's in another worker's queue.)
- This decreases latency for the awaited task (at the expense of throughput, since a worker with work and bandwidth isn't using it), and once the awaited Flowvar is resolved, the unblocked thread can enqueue work earlier, avoiding starvation of other threads.
- Also, the worker might otherwise pick up an unrelated task that is certain not to help the awaited task make progress, and that takes an indefinite amount of time to complete.
The technique is called leapfrogging, see https://cseweb.ucsd.edu/~calder/papers/PPoPP-93.pdf
Note: regarding "Scheduling, Load balancing and task priority", leapfrogging is "unfair" since user-awaited tasks get priority.
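A toy sketch of the leapfrogging policy (all types and procs are illustrative; the real technique tracks which worker stole the awaited task and steals back from that thief first):

```nim
type
  Task = object
    id: int

  Worker = ref object
    deque: seq[Task]  # stand-in for the worker's Chase-Lev deque
    thiefId: int      # worker believed to have stolen the awaited task

proc trySteal(victim: Worker): int =
  ## Steal one task from the victim's queue; -1 means failure.
  if victim.deque.len > 0: victim.deque.pop().id else: -1

proc awaitWithLeapfrog(workers: seq[Worker], me: Worker,
                       resolved: proc (): bool) =
  ## While the awaited future is unresolved, prefer stealing from the
  ## thief that took the awaited task: its queue is the most likely to
  ## hold that task's children, so we help it finish sooner.
  while not resolved():
    var t = trySteal(workers[me.thiefId])  # leapfrog onto the thief
    if t == -1:
      for w in workers:                    # fall back: any victim
        t = trySteal(w)
        if t != -1: break
    if t != -1:
      discard t  # the real runtime would execute the stolen task here
```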
Backoff strategies
Alternatively, if parking threads is actually inadequate:
- As sleep/wake is very expensive, we can consider backoff solutions, see https://github.com/mratsim/weave/blob/71dc2d7/weave/cross_thread_com/event_notifiers_and_backoff.md. This is an active area of research, in particular for WiFi and Bluetooth Low Energy: when a station is backed off it can't receive messages, so there are advanced techniques to keep a station up when a message arrives, reducing latency and retries.
- Also, as condition variables and futexes don't allow precise control of which thread to wake when multiple are waiting, we would need multiple CVs or futexes, 1 per thread. Assuming we want the number of threads to vary when some become blocked (see the "Dealing with blocked threads" paragraph), we need a thread-safe way to keep track of all active threads' CVs or futexes, with insertion and removal on thread creation and destruction (see the sketch below).
- However, does Windows have an equivalent to usleep with microsecond resolution?
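A minimal sketch of such a registry, with one condition variable per worker (all names are illustrative; a futex-based variant would replace the per-slot lock/cond pair, and a shared heap, e.g. --mm:arc, is assumed):

```nim
import std/[locks, tables]

type
  WakeSlot = ref object
    lock: Lock
    cond: Cond
    signaled: bool

var
  registryLock: Lock
  slots: Table[int, WakeSlot]  # worker id -> its private wakeup slot

initLock registryLock

proc registerThread(id: int): WakeSlot =
  ## Called on worker creation (also when threads are added dynamically).
  result = WakeSlot()
  initLock result.lock
  initCond result.cond
  withLock registryLock:
    slots[id] = result

proc unregisterThread(id: int) =
  withLock registryLock:
    slots.del id

proc park(s: WakeSlot) =
  ## Called by a worker that is awaiting and has nothing to run.
  acquire s.lock
  while not s.signaled:  # guard against spurious wakeups
    wait(s.cond, s.lock)
  s.signaled = false
  release s.lock

proc wake(id: int) =
  ## Wake exactly the worker awaiting a task that just completed.
  var s: WakeSlot = nil
  withLock registryLock:
    s = slots.getOrDefault(id)
  if s != nil:
    acquire s.lock
    s.signaled = true
    signal s.cond
    release s.lock
```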