[WIP] Dataflow graph parallelism #92
A higher-level analysis of the current Promise design and the design space.

Storage at node-level

First, let's get how promises and their associated tasks are stored out of the way. Currently we have a poor man's minimum working example:

weave/weave/datatypes/context_thread_local.nim
Lines 18 to 61 in 7950c30
This doesn't support multiple dependencies per delayed task or a mix between loop and normal dependencies. Instead we should at least have richer per-task dependency storage, as sketched below.
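For instance, a hypothetical sketch of that richer storage, assuming a countdown of unfulfilled dependencies (none of these names are Weave's; it only illustrates the shape):

type
  DependencyKind = enum
    dkSingle      # a plain spawnDelayed dependency
    dkLoopIter    # a dependency on a specific loop iteration

  Dependency = object
    case kind: DependencyKind
    of dkSingle:
      promiseID: int32          # unique promise identifier
    of dkLoopIter:
      loopID: int32             # identifies the loop's promise
      iteration: int32          # which iteration must be fulfilled

  DelayedTask = object
    task: pointer               # the task to schedule once ready
    deps: seq[Dependency]       # multiple dependencies per task
    missing: int32              # countdown of unfulfilled dependencies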
On the communication patterns

Current design

At a higher level, what is implemented has the following characteristics:
==> In distributed speak, we have a pubsub system.

On the consumer side:
==> This is similar to a pull from an intermediate message broker on the consumer side.

Alternative design

Alternatively, we could do something like this:
We have a gossipsub pattern. The interesting parts are:
The potentially dead-on-arrival parts are:
Comparison

The main theoretical overhead fight becomes:
Unaddressed part

In both cases, we have to prevent a thread from discarding a promise and then requiring it after the fact. |
Let's address the unaddressed part.

Example 1: a simple producer task and a consumer task (pseudocode)

proc foo(p: Promise, A, B: Matrix) =
  A[0, 0] = 123
  p.fulfill()

proc bar(A: Matrix, i, j: int) =
  assert A[0, 0] == 123 # This would fail if bar is executed before foo
  A[i, j] = A[j, i]

proc main(A, B: Matrix) =
  var p = newPromise()
  spawn foo(p, A, B)
  # foo() may be finished before the next line
  spawnDelayed p, bar(A, 1, 2)
  # main can exit immediately
  # either p is a heap reference
  # or it is a unique ID captured by value

init(Weave)
main()
exit(Weave)

In this example, the main thread schedules a task, then a delayed task, and exits immediately.

In the pubsub case

For the Promise to be valid, it needs to be heap-allocated in a pubsub implementation, as channels are created on an as-needed basis. As soon as it's passed via spawn, the refcount is incremented, so even if the main thread exits we are OK.

In the gossipsub case

A promise is just a hash and can be captured by value, so we don't need memory management here; the main thread can exit the proc. However, what can happen here with:
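Going back to the pubsub case, here is a minimal sketch of the heap-allocated, refcounted promise; the MPSC channel of subscribers is elided and all names are hypothetical, not Weave's actual code:

import std/atomics

type
  PromiseObj = object
    fulfilled: Atomic[bool]
    refCount: Atomic[int32]
    # + an MPSC channel of subscribed delayed tasks (omitted)

  Promise = ptr PromiseObj      # heap-allocated, outlives its creator

proc newPromise(): Promise =
  result = createShared(PromiseObj)
  result.refCount.store(1, moRelaxed)

proc incRef(p: Promise) =
  # Called whenever the promise is passed via spawn/spawnDelayed
  discard p.refCount.fetchAdd(1, moRelaxed)

proc decRef(p: Promise) =
  # The last owner, producer or consumer, frees the promise,
  # which is why the main thread may exit before foo/bar run.
  if p.refCount.fetchSub(1, moAcquireRelease) == 1:
    deallocShared(p)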
Can we do better?

proc main(A, B: Matrix) =
  var p = newPromise()
  # foo() cannot be finished before the next line
  spawnDelayed p, bar(A, 1, 2)
  spawn foo(p, A, B)
  # main can exit immediately
  # either p is a heap reference
  # or it is a unique ID captured by value

This however still seems quite brittle: users won't get an error message saying "please declare consumers before producers", it leaks abstraction constraints, and it may not handle nested producer-consumer relationships (or at least it's not obvious to me). |
Some ideas on reliable request-reply from ZeroMQ: http://zguide.zeromq.org/page%3aall#reliable-request-reply |
Another idea: let's colocate the Promise and the dependent task on the same worker.

Case 1: delayed task scheduled before the promise is fulfilled

A promise is an ephemeral MPSC channel. On creation it has no owner. When a worker encounters a delayed task dependent on the promise, it sends the task to the promise channel. When a worker fulfills a promise (a producer), it becomes the owner of the promise channel: it takes the tasks that are in the channel and schedules them.

Case 2: promise fulfilled before the delayed task is scheduled

When a producer fulfills a promise, it needs to set a flag "fulfilled" or "closed" on the channel; new consumers can check it and schedule the tasks themselves (see the sketch after case 4).

Case 3: memory reclamation

Assume an interleaving of Consumer 1 (C1), producer (P) and Consumer 2 (C2). Similar to the pubsub protocol, the Promise is refcounted, so when only one reference remains, whether it's held by a producer or a consumer, that holder can delete the channel.

Case 4: a task depends on 2 promises

When a task depends on 2 promises we need a tiebreaker for which one will receive the task, for example the channel with the lowest byte representation. The task is sent to that first promise (P1). 2 cases:
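A minimal sketch of cases 1 and 2, assuming an MPSC channel with hypothetical send/tryRecv/schedule helpers (stubbed out here; the names are illustrative, not Weave's API):

import std/atomics

type
  Task = pointer                   # stand-in for Weave's task type
  PromiseChannel = object
    fulfilled: Atomic[bool]
    # + the actual MPSC queue storage (omitted)

proc schedule(t: Task) = discard   # hand the task to the scheduler
proc send(p: var PromiseChannel, t: Task) = discard
proc tryRecv(p: var PromiseChannel, t: var Task): bool = discard

proc onDelayedTask(p: var PromiseChannel, t: Task) =
  # Consumer side
  if p.fulfilled.load(moAcquire):
    schedule(t)   # Case 2: already fulfilled, schedule it ourselves
  else:
    p.send(t)     # Case 1: park the task in the promise channel
    # Recheck to avoid a lost task if the producer fulfilled the
    # promise between our check and our send.
    if p.fulfilled.load(moAcquire):
      var pending: Task
      while p.tryRecv(pending):
        schedule(pending)

proc fulfill(p: var PromiseChannel) =
  # Producer side: become the owner of the channel and
  # schedule the parked tasks (Case 1).
  p.fulfilled.store(true, moRelease)
  var pending: Task
  while p.tryRecv(pending):
    schedule(pending)

The recheck after sending is needed because the fulfilled flag and the channel are two separate atomic locations; without it, a task parked concurrently with fulfillment could be stranded.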
Analysis

Latency

Compared to the other 2 schemes, the delayed tasks always enter the normal scheduling cycle ASAP, i.e. as soon as the producer finishes or as soon as the task is created.

Overhead
Ergonomics
Slight bonus
Unsure

Loops: does that mean 1 channel per loop iteration? "Memory is cheap" they said. |
closed by #94 |
Heavy work-in-progress, ideas welcome.
This is research work on implementing data flow parallelism (also called stream parallelism, pipeline parallelism, data-driven task parallelism).
Needs, research, and other runtime approaches are detailed starting from the following comment: #31 (comment).
The practical direct goal is to be able to call Weave's matrix multiplication from a parallel loop: in many cases we have a batch of small matrices, say 64 images of size 224x224 (the base image size in the ImageNet dataset), and it would be more efficient to find parallelism at the batch level instead of within each matrix multiplication. This is currently impossible in Weave (or OpenMP for that matter).
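Concretely, the desired usage would look something like this sketch, where Matrix, gemm and batchedGemm are placeholders and only parallelFor/captures are Weave's actual API:

import weave

type Matrix = object    # placeholder for one 224x224 matrix

proc gemm(C, A, B: ptr Matrix) =
  discard # one small matmul, itself parallelized with Weave inside

proc batchedGemm(A, B, C: ptr UncheckedArray[Matrix], batch: int) =
  # Desired: one task per image, with each gemm free to spawn
  # its own tasks. This nesting is what the barrier forbids today.
  parallelFor b in 0 ..< batch:
    captures: {A, B, C}
    gemm(C[b].addr, A[b].addr, B[b].addr)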
The reason is that the current implementation requires a barrier after the outer parallel-for to represent the data dependencies:
weave/benchmarks/matmul_gemm_blas/gemm_pure_nim/gemm_weave.nim
Lines 166 to 182 in 5d90172
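Roughly, the shape of that code is as follows (a simplified stand-in for illustration, not the actual gemm_weave.nim contents):

import weave

proc computePanels(buf: ptr UncheckedArray[float32], n: int) =
  # Outer parallel loop over panels
  parallelFor i in 0 ..< n:
    captures: {buf}
    buf[i] = buf[i] * 2'f32   # stand-in for pack + micro-kernel work

  # The barrier below is the problem: it is only legal from the
  # root task on the main thread, so computePanels cannot itself
  # be nested inside another parallel region.
  syncRoot(Weave)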
But that barrier can only be called from the root task / main thread, preventing nesting inside another parallel region.
Furthermore, precise barriers are fundamentally incompatible with work-stealing: there is no way to know which threads may execute a given code path, so some threads may never hit the barrier and we would deadlock.
Instead, we can properly tell the runtime the actual dependencies: what data is needed to continue the computation.
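In the vocabulary of this thread, that would look something like the following pseudocode, where Promise and spawnDelayed are the proposed primitives (not existing Weave API) and PanelBuffer is a placeholder:

proc packPanel(done: Promise, buf: PanelBuffer) =
  # ... pack the panel into buf ...
  done.fulfill()

proc microKernel(buf: PanelBuffer) =
  discard # consume the packed panel

# Instead of "spawn all packs; barrier; spawn all kernels",
# declare the actual data dependency:
let packed = newPromise()
spawn packPanel(packed, buf)           # producer
spawnDelayed packed, microKernel(buf)  # runs only once packed is fulfilled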
Note that this is pretty much uncharted territory at the moment; I'm already worried about the overhead, and it will require several refinements.