-
Notifications
You must be signed in to change notification settings - Fork 330
feature: add label-based runner targeting for operations #3037
Conversation
5d88b16
to
558feba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good overall! Only had a few comments before trying this out 😄
8b125b0
to
bb312fa
Compare
remote, err := remoteOpPreferred(ctx, client, project, runnerCfgs, log) | ||
require.Nil(err) | ||
require.True(remote) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have a good test around checking that TargetLabels
is a proper subset of the runner labels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do the positive case with a few key,vals in each, and then a bad case for when it's not a subset. That should cover it I think!
if val, ok := r.Labels[k]; ok && v == val { | ||
n -= 1 | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@XX and I paired on this - We can simplify this check for looking at when TargetLabels
has to be a subset of Runners
with a bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commentary was done in Zoom.
} | ||
targeted := false | ||
for _, r := range runners.Runners { | ||
n := len(configRunner.TargetLabels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor feedback: The decreasing n
approach works, though it feels slightly less clear than doing something like the following, which is a more typical pattern we see for this kind of check.
matches := true
for k, v := range configRunner.TargetLabels {
if runner-doesnt-have-label {
matches = false
break
}
}
}, | ||
}, | ||
} | ||
targeted = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you break
after this?
Bigger picture: if I'm understanding this logic right, it'll always select the last matching runner. Is that correct? And I think ListRunners
doesn't guarantee any ordering (including random) so this could cause all label-targeted to always pick the exact same runner. Guaranteeing some sort of loaded fairness is a hard problem and we should probably solve this by just randomizing the order.
@@ -202,6 +202,7 @@ func (c *Project) setupLocalJobSystem(ctx context.Context) (isLocal bool, newCtx | |||
// The receiver must be careful to not block sending to mon as it will | |||
// block the job state processing loop. | |||
func (c *Project) doJobMonitored(ctx context.Context, job *pb.Job, ui terminal.UI, monCh chan pb.Job_State) (*pb.Job_Result, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This causes the "scheduling" logic to leak into the client side. I don't think we should put this here. Historically, all scheduling for jobs happens on the server and I think we should keep it that way.
I saw your message in the PR of why you did this, I'm not worried about server load, I'm worried about this being logically incorrect in relatively simple cases since we can reasonably expect jobs to be queued and for runners to come in and out.
I also think that while this was an "easy" approach, it complicates the overall logic because we've now split scheduling to two locations rather than a centralized location. It's also now split between time and space. Before, scheduling happened at the same time in the same place in the code. Now it happens in two places in the code at two different times.
See the examples below.
Example bad case using T
to represent time:
- T=0: user submits a new job, this searches for matching labels, targets a runner A.
- T=1 The job gets queued because A is busy.
- T=2: A disappears (for whatever reason: crash, operator takes it down, etc.)
- T=3: Job remains queued because A is gone.
- T=4: Operator introduces runner B, C, etc. with matching labels.
- T=infinity: Job remains queued forever because it never reschedules.
Example bad case #2:
- T=0: user submits a new job, this finds runner A with matching labels.
- T=1: The job gets queued because A is busy.
- T=2: Runner A is moved and changes labels, it no longer matches. (either maliciously or purposefully)
- T=3: Job runs on runner A, which broke a constraint. At best, the job works anyways. At worst, this was a big security issue.
The solution to this is to introduce a new Ref
type that refs by label, not by name. See Ref.Runner
which already supports a couple ways. This should add labels as well.
After discussion, client-side targeting is not the right approach, especially because it wouldn't work with polling. Closing this PR and making 3 separate incremental PRs instead for a server-side implementation. |
Now users can also target runners by runner labels. In the Waypoint configs, they can specify:
or
It also adds a "labels" column in
runner list
.Some notes on implementation choices:
Profile takes precedence over labels, so if someone has both a runner profile and label defined, we ignore the labels. I also chose to ignore the project-level runner stanza completely if app-level runner configs are defined, instead of parsing and piecing together labels and profiles. We could do that... but the current product doesn't support targeting distinct runners for individual apps anyway, so we could cross that bridge when we get there? It also felt a little too "nice". heh.
I also opted to put the runner look-up and targeting logic on the client-side instead of server-side (differing from the RFC); i.e. looking up a list of all runners and finding the one with the right labels, as opposed to adding a "by label" lookup in the db. Because.... the infrastructure is already there and it's easy. But listing all the runners might bite us in the butt a few years down the line with database load... but also how many runners can a project have anyway?! I was also debating whether labels are something we want to store in the db, since they feel ephemeral (as in the English word)... but we are storing adoption state, which might have the same degree of permanence. So anyway I'm happy to take the server-side approach if we feel it's better.