Skip to content
This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

feature: add label-based runner targeting for operations #3037

Closed
wants to merge 12 commits into from

Conversation

xiaolin-ninja
Copy link
Contributor

@xiaolin-ninja xiaolin-ninja commented Feb 22, 2022

Now users can also target runners by runner labels. In the Waypoint configs, they can specify:

runner {
  target_labels = {
    "env" = {
        "default" = "test"
        "prod" = "prod"
    }[workspace.name]
  }
}

or

app {
  runner {
    target_labels = {
      "env" = {
        "default" = "test"
        "prod" = "prod"
      }[workspace.name]
    }
  }
}

It also adds a "labels" column in runner list.

❯ waypoint-dev runner list
      ID      |   STATE    |  KIND  |  LABELS   | LAST REGISTERED  
--------------+------------+--------+-----------+------------------
  test        | preadopted | remote | env:test  | 1 second ago     

Some notes on implementation choices:

Profile takes precedence over labels, so if someone has both a runner profile and label defined, we ignore the labels. I also chose to ignore the project-level runner stanza completely if app-level runner configs are defined, instead of parsing and piecing together labels and profiles. We could do that... but the current product doesn't support targeting distinct runners for individual apps anyway, so we could cross that bridge when we get there? It also felt a little too "nice". heh.

I also opted to put the runner look-up and targeting logic on the client-side instead of server-side (differing from the RFC); i.e. looking up a list of all runners and finding the one with the right labels, as opposed to adding a "by label" lookup in the db. Because.... the infrastructure is already there and it's easy. But listing all the runners might bite us in the butt a few years down the line with database load... but also how many runners can a project have anyway?! I was also debating whether labels are something we want to store in the db, since they feel ephemeral (as in the English word)... but we are storing adoption state, which might have the same degree of permanence. So anyway I'm happy to take the server-side approach if we feel it's better.

Copy link
Member

@briancain briancain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good overall! Only had a few comments before trying this out 😄

internal/client/job.go Outdated Show resolved Hide resolved
internal/server/singleprocess/service_job.go Outdated Show resolved Hide resolved
remote, err := remoteOpPreferred(ctx, client, project, runnerCfgs, log)
require.Nil(err)
require.True(remote)
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have a good test around checking that TargetLabels is a proper subset of the runner labels

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do the positive case with a few key,vals in each, and then a bad case for when it's not a subset. That should cover it I think!

if val, ok := r.Labels[k]; ok && v == val {
n -= 1
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XX and I paired on this - We can simplify this check for looking at when TargetLabels has to be a subset of Runners with a bool

Copy link
Contributor

@mitchellh mitchellh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commentary was done in Zoom.

}
targeted := false
for _, r := range runners.Runners {
n := len(configRunner.TargetLabels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor feedback: The decreasing n approach works, though it feels slightly less clear than doing something like the following, which is a more typical pattern we see for this kind of check.

matches := true
for k, v := range configRunner.TargetLabels {
  if runner-doesnt-have-label {
    matches = false
    break
  }
}

},
},
}
targeted = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you break after this?

Bigger picture: if I'm understanding this logic right, it'll always select the last matching runner. Is that correct? And I think ListRunners doesn't guarantee any ordering (including random) so this could cause all label-targeted to always pick the exact same runner. Guaranteeing some sort of loaded fairness is a hard problem and we should probably solve this by just randomizing the order.

@@ -202,6 +202,7 @@ func (c *Project) setupLocalJobSystem(ctx context.Context) (isLocal bool, newCtx
// The receiver must be careful to not block sending to mon as it will
// block the job state processing loop.
func (c *Project) doJobMonitored(ctx context.Context, job *pb.Job, ui terminal.UI, monCh chan pb.Job_State) (*pb.Job_Result, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes the "scheduling" logic to leak into the client side. I don't think we should put this here. Historically, all scheduling for jobs happens on the server and I think we should keep it that way.

I saw your message in the PR of why you did this, I'm not worried about server load, I'm worried about this being logically incorrect in relatively simple cases since we can reasonably expect jobs to be queued and for runners to come in and out.

I also think that while this was an "easy" approach, it complicates the overall logic because we've now split scheduling to two locations rather than a centralized location. It's also now split between time and space. Before, scheduling happened at the same time in the same place in the code. Now it happens in two places in the code at two different times.

See the examples below.

Example bad case using T to represent time:

  • T=0: user submits a new job, this searches for matching labels, targets a runner A.
  • T=1 The job gets queued because A is busy.
  • T=2: A disappears (for whatever reason: crash, operator takes it down, etc.)
  • T=3: Job remains queued because A is gone.
  • T=4: Operator introduces runner B, C, etc. with matching labels.
  • T=infinity: Job remains queued forever because it never reschedules.

Example bad case #2:

  • T=0: user submits a new job, this finds runner A with matching labels.
  • T=1: The job gets queued because A is busy.
  • T=2: Runner A is moved and changes labels, it no longer matches. (either maliciously or purposefully)
  • T=3: Job runs on runner A, which broke a constraint. At best, the job works anyways. At worst, this was a big security issue.

The solution to this is to introduce a new Ref type that refs by label, not by name. See Ref.Runner which already supports a couple ways. This should add labels as well.

@xiaolin-ninja
Copy link
Contributor Author

After discussion, client-side targeting is not the right approach, especially because it wouldn't work with polling. Closing this PR and making 3 separate incremental PRs instead for a server-side implementation.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants