
[Load Balancing - Backoff] Waking up children with loop tasks #76

Closed
mratsim opened this issue Dec 28, 2019 · 1 comment · Fixed by #77
Labels
bug 🪲 Something isn't working

Comments

@mratsim
Owner

mratsim commented Dec 28, 2019

From #68 (comment)

Another interesting log. When a child backs off, the parent will wake it up and send it work when it finds some (i.e. a lifeline, see Saraswat et al., Lifeline-based Global Load Balancing, http://www.cs.columbia.edu/~martha/courses/4130/au12/p201-saraswat.pdf).
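A minimal sketch of the lifeline idea in Nim, with made-up names (Worker, onChildBackoff, onWorkFound are illustrative only, not Weave's API): the parent remembers which child parked itself and, as soon as it finds work, feeds that child first instead of keeping everything.

type Worker = object
  sleepingChild: int     # -1 when no child is parked on its lifeline
  backlog: seq[int]      # work found while no child is waiting

proc onChildBackoff(w: var Worker; childID: int) =
  ## A child failed to find work anywhere and parks itself;
  ## the parent records the lifeline.
  w.sleepingChild = childID

proc onWorkFound(w: var Worker; item: int) =
  ## When the parent later finds work, the lifeline is served first:
  ## wake the sleeping child and hand it the work item.
  if w.sleepingChild != -1:
    echo "waking up child ", w.sleepingChild, " with work item ", item
    w.sleepingChild = -1
  else:
    w.backlog.add item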

This is what happens in this log: the worker even tries to split tasks for its grandchildren but ultimately only sends work to its direct child (a quick breakdown of the numbers follows the log).

Worker  9: has 1 steal requests
Worker  9: found 6 steal requests addressed to its child 19 and grandchildren
Worker  9: 14 steps left (start: 0, current: 256, stop: 2000, stride: 128, 7 thieves)
Worker  9: Sending [1792, 2000) to worker 19
Worker  9: sending 1 tasks (task.fn 0x121e0a6c) to Worker 19
Worker  9: Continuing with [256, 1792)
Worker  9: waking up child 19
Matrix[1024, 128] (thread 9)
Matrix[1024, 256] (thread 9)
Matrix[1024, 384] (thread 9)
Matrix[1024, 512] (thread 9)
Matrix[1024, 640] (thread 9)
Matrix[1024, 768] (thread 9)
Matrix[1024, 896] (thread 9)
Matrix[1024, 1024] (thread 9)
Matrix[1024, 1152] (thread 9)
Matrix[1024, 1280] (thread 9)
Matrix[1024, 1408] (thread 9)
Matrix[1024, 1536] (thread 9)
Matrix[1024, 1664] (thread 9)
Worker 19: received a task with function address 0x121e0a6c (Channel 0x1c000b60)
Worker 19: running task.fn 0x121e0a6c
Matrix[1024, 1792] (thread 19)
Worker  9: sending own steal request to  0 (Channel 0x1333ce00)
Worker 15: sends state passively WAITING to its parent worker 7
Matrix[1024, 1920] (thread 19)
Worker 19: sending own steal request to  9 (Channel 0x1333d700)
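Reading the numbers in the log (my arithmetic, not output of the scheduler): with start 256, stop 2000 and stride 128 there are ceil((2000 - 256) / 128) = 14 iterations left, to be shared between worker 9 and the 7 thieves it counted, so one fair share is about 2 iterations. The [1792, 2000) range sent to worker 19 is exactly that single share: worker 19 gets work for itself only, worker 9 keeps the other 12 iterations, and the 6 requests addressed to the grandchildren are left unserved.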

Looking into the code, in all state machines we have shareWork followed by handleThieves:

weave/weave/victims.nim

Lines 249 to 305 in a1d862b

proc distributeWork(req: sink StealRequest): bool =
  ## Handle incoming steal request
  ## Returns true if we found work
  ## false otherwise

  # Send independent task(s) if possible
  if not myWorker().deque.isEmpty():
    req.dispatchElseDecline()
    return true
    # TODO - the control flow is entangled here
    # since we have a non-empty deque we will never take
    # the branch that leads to termination
    # and would logically return true

  # Otherwise try to split the current one
  if myTask().isSplittable():
    if req.thiefID != myID():
      myTask().splitAndSend(req)
      return true
    else:
      req.forget()
      return false

  if req.state == Waiting:
    # Only children can send us a failed state.
    # Request should be saved by caller and
    # worker tree updates should be done by caller as well
    # TODO: disantangle control-flow and sink the request
    postCondition: req.thiefID == myWorker().left or req.thiefID == myWorker().right
  else:
    decline(req)

  return false

proc shareWork*() {.inline.} =
  ## Distribute work to all the idle children workers
  ## if we can
  while not myWorker().workSharingRequests.isEmpty():
    # Only dequeue if we find work
    let req = myWorker().workSharingRequests.peek()
    ascertain: req.thiefID == myWorker().left or req.thiefID == myWorker.right
    if distributeWork(req): # Shouldn't this need a copy?
      if req.thiefID == myWorker().left:
        ascertain: myWorker().leftIsWaiting
        myWorker().leftIsWaiting = false
      else:
        ascertain: myWorker().rightIsWaiting
        myWorker().rightIsWaiting = false
      Backoff:
        wakeup(req.thiefID)

      # Now we can dequeue as we found work
      # We cannot access the steal request anymore or
      # we would have a race with the child worker recycling it.
      discard myWorker().workSharingRequests.dequeue()
    else:
      break

shareWork will wake up a thread if distributeWork is successful. distributeWork first looks for plain tasks and then at the current splittable task. When evaluating the amount of work to send, since the child is sleeping, it takes the whole idle subtree into account and therefore sends only a small number of iterations to the child when plenty of steal requests are pending.

Normally the remainder would be sent in handleThieves, but by the time handleThieves runs the child is awake, so that worker's subtree is not checked again and the remaining tasks are never sent, leading to load imbalance.
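A minimal sketch of that accounting in Nim (fairShare is a hypothetical helper, not Weave's actual splitting code, and the real rounding may differ), using the numbers from the log above:

proc fairShare(stepsLeft, thieves: int): int =
  ## One worker's share when the remaining iterations are spread over
  ## the victim and every thief counted in the idle subtree (ceiling division).
  (stepsLeft + thieves) div (thieves + 1)

when isMainModule:
  let stepsLeft = 14  # (2000 - 256) div 128, rounded up
  let thieves = 7     # 1 direct request + 6 addressed to child 19 and its subtree
  echo fairShare(stepsLeft, thieves)  # 2 -> matches the [1792, 2000) chunk sent to worker 19
  # The other 12 iterations stay on worker 9; they were meant to reach the
  # grandchildren via handleThieves, which never revisits the now-awake subtree.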

@mratsim mratsim added the bug 🪲 Something isn't working label Dec 28, 2019
@mratsim
Owner Author

mratsim commented Dec 28, 2019

Another potential issue is that when a child is woken up with a task, it tries to wake up its own children and then runs the task immediately.

In the case of a for loop, the next scheduling point is the implicit loadBalance at the end of each iteration:

template parallelForWrapper(
    idx: untyped{ident},
    prologue, loopBody, epilogue,
    remoteAccum, resultTy,
    returnStmt: untyped): untyped =
  ## To be called within a loop task
  ## Gets the loop bounds and iterate the over them
  ## Also poll steal requests in-between iterations
  ##
  ## Loop prologue, epilogue,
  ## remoteAccum, resultTy and returnStmt
  ## are unused
  block:
    let this = myTask()
    ascertain: this.isLoop
    ascertain: this.start == this.cur

    var idx {.inject.} = this.start
    this.cur += this.stride
    while idx < this.stop:
      loopBody
      idx += this.stride
      this.cur += this.stride
      loadBalance(Weave)

Should we add an extra loadBalance on entry for just-stolen tasks?
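A minimal sketch of what that could look like, reusing the wrapper above (the placement of the extra loadBalance is an assumption; whether it should be limited to just-stolen tasks is exactly the open question):

template parallelForWrapper(
    idx: untyped{ident},
    prologue, loopBody, epilogue,
    remoteAccum, resultTy,
    returnStmt: untyped): untyped =
  ## Same wrapper as above, with one extra scheduling point on entry.
  block:
    let this = myTask()
    ascertain: this.isLoop
    ascertain: this.start == this.cur

    loadBalance(Weave)   # hypothetical placement: give a freshly woken worker
                         # a chance to serve its subtree's pending steal
                         # requests before it starts iterating

    var idx {.inject.} = this.start
    this.cur += this.stride
    while idx < this.stop:
      loopBody
      idx += this.stride
      this.cur += this.stride
      loadBalance(Weave)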

mratsim added a commit that referenced this issue Dec 29, 2019
* Address #76, send enough loop tasks to woken up children

* Fix idle subtree worker count

* Fix sending too much tasks to subtree

* in the worksharing case we shouldn't add one to the number of thieves estimation, it's already done in the proxy proc

* loadBalance on loop tasks entry to address #76 (comment)

* Fix sending the next iteration to children