runner: static runners accept multiple jobs in parallel #3300
Conversation
This modifies `internal/runner` to support accepting multiple jobs in parallel. Not much work here since we always designed the runner struct from the beginning to support this, so there are no data races.

This modifies `internal/cli` so that non-ODR runners run in parallel mode by default. It doesn't make sense for ODR runners to have any parallelism since they always run exactly one job. Non-ODR runners typically ONLY launch ODR tasks, which are highly IO-bound, so we default concurrency to a multiple above the CPU count.

This is a necessary prerequisite for pipelines, since they'll likely perform blocking jobs on the static runners to "watch" tasks. Today, tasks are launched and stopped but not watched, so this is not an issue.
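For context, a minimal sketch of the shape described above. The names (`defaultConcurrency`, `acceptParallel`) and the x3 multiplier are assumptions for illustration, not the actual waypoint code:

```go
// Sketch only: an approximation of "multiple accept loops, sized above CPU
// count for static runners", not the real internal/runner implementation.
package sketch

import (
	"context"
	"runtime"
	"sync"
)

// defaultConcurrency picks the worker count: ODR runners always run exactly
// one job; static runners mostly launch IO-bound ODR tasks, so oversubscribe
// relative to CPU count (the multiplier here is illustrative).
func defaultConcurrency(odr bool) int {
	if odr {
		return 1
	}
	return runtime.NumCPU() * 3
}

// acceptParallel starts count goroutines that each accept jobs in a loop
// until the context is canceled or an accept call fails.
func acceptParallel(ctx context.Context, count int, acceptOne func(context.Context) error) {
	var wg sync.WaitGroup
	wg.Add(count)
	for i := 0; i < count; i++ {
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				if err := acceptOne(ctx); err != nil {
					return
				}
			}
		}()
	}
	wg.Wait()
}
```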
Wohoo!
Looks great!
Looks good, just one question about the semantics of canceling in the worker goroutine.
wg.Add(count)
for i := 0; i < count; i++ {
    go func() {
        defer cancel()
Just saying this out loud, should the exit of ANY of the goroutines cause them all to exit?
So actually, yes, I think this needs to do slightly better, although it's edge case-y. Here is the thought:

1. If the user cancels the context, then this is effectively a no-op and all of the goroutines are canceled anyway. No issue.
2. If a goroutine has an error, it's likely the error will impact all of them, because there is considerable retry logic already in each `Accept` call -- including reconnection -- so if it actually errors, it is likely unrecoverable. So we DO want to exit all the goroutines.

However, for #2, right now we're canceling the context, which just causes a cascade effect to cancel ASAP. I think we can do better by letting each existing job try to finish gracefully, and then saying "don't accept any more jobs thereafter."
I'll work on this Monday.
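For illustration, a minimal sketch of the graceful variant described above. The names and the `atomic.Bool` flag are assumptions, not the actual runner implementation; the point is that a failing worker signals the others to stop accepting new jobs rather than canceling their in-flight work:

```go
// Sketch only (assumed names): when one worker hits an unrecoverable error,
// flip a "stop accepting" flag so the other workers finish their current job
// and then return, instead of canceling the context and interrupting jobs.
package sketch

import (
	"context"
	"sync"
	"sync/atomic"
)

func acceptUntilBroken(ctx context.Context, count int, acceptOne func(context.Context) error) {
	var stop atomic.Bool // set when any worker decides accepting should stop

	var wg sync.WaitGroup
	wg.Add(count)
	for i := 0; i < count; i++ {
		go func() {
			defer wg.Done()
			for ctx.Err() == nil && !stop.Load() {
				// acceptOne blocks until a single job has been accepted and
				// run to completion, so returning here never abandons a job.
				if err := acceptOne(ctx); err != nil {
					stop.Store(true)
					return
				}
			}
		}()
	}
	wg.Wait()
}
```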
Actually, I think this is okay for now. It's a bit non-trivial to get this fix in, and looking at the possible reasons for a return from `AcceptMany`, I do think things are really broken if they exit, so canceling all is okay for now. We can improve this later. I've added a TODO to note it.