Add ability to prioritize GitHub Action runners #1665

Open
gajus opened this issue Feb 7, 2022 · 40 comments
Labels
enhancement New feature or request future Feature work that we haven't prioritized

Comments

@gajus

gajus commented Feb 7, 2022

Describe the enhancement

Self-hosted GitHub Actions runners should have an attribute (weight) that allows them to be prioritized, i.e. if there are multiple idle runners with matching labels, then the weight attribute would determine which runner to use first, e.g. prioritized in ascending order.

Additional information

For context, this is needed because the current implementation randomly picks an available runner. However, imagine that you are scaling runners up and down depending on how long they have been idle. With a random allocation mechanism, there is no way to determine (efficiently) how long a runner has not been in use. As a result, we have a large portion of VMs running that are not in use most of the time.

Prioritization would allow more efficient resource packing.

@gajus gajus added the enhancement New feature or request label Feb 7, 2022
@ruvceskistefan
Contributor

Hi @gajus,
Thanks for the reported issue and idea. I'm adding a future label, so that we can work on developing this in the near future.

@ruvceskistefan ruvceskistefan added the future Feature work that we haven't prioritized label Feb 8, 2022
@gajus
Author

gajus commented Feb 23, 2022

@ruvceskistefan Do you have any update on this? This severely impacts our ability to scale GitHub Runners, costing literally tens of thousands monthly.

@ethomson
Contributor

Hi @gajus - can you tell me more about this? It sounds like you're rolling your own autoscaling solution? Does the ephemeral runner option not give you enough control here?

@gajus
Author

gajus commented Feb 24, 2022

The original issue text already includes a description of how we orchestrate runners.

Whether we use the ephemeral option or not, the problem is that there is no way to prioritize which runners will be picked up first. This means that if we have 100 idle runners and we have 20 jobs, then we have no way to say that these 20 idle runners should be used first. Without a weight or something similar to sort by, a large number of idle runners just sit waiting for jobs because they keep getting random jobs (which they would otherwise not get if there was an order assigned).

@ethomson
Contributor

ethomson commented Feb 24, 2022

What I don't understand yet is whether these 20 runners in your example are different in some meaningful way that you really want to route to these 20? In other words, does it matter which 20 runners are getting jobs, or do you just want to scale down to any 20 runners? You mention "more efficient resource packing" but I don't feel like I have the full picture yet.

@al2114

al2114 commented Feb 26, 2022

Chiming in that we have a pool of machines that we use as runners, though some machines run significantly faster than others (reducing build time significantly). Ideally we want to be able to prioritize allocating jobs to the faster machines to reduce build times, but want to keep the slower machines active so they can pick up jobs while the faster ones are busy. Prioritization would be very beneficial here.

@ethomson
Contributor

Thanks @al2114 - are you running static runners? I can understand the need for some more advanced routing there, but I'm trying to better understand the need for weights when auto scaling, or when using some sort of control plane.

@gajus
Author

gajus commented Feb 28, 2022

What I don't understand yet is whether these 20 runners in your example are different in some meaningful way that you really want to route to these 20?

No. All machines are identical.

Here is a simple task. You have 100 machines. You have 20 jobs that start every minute and complete in a minute.

What happens in the current setup?

Every minute a random 20 machines will get picked from the pool.

Why is that bad?

Machines that have not been used for 10 minutes are automatically removed from the pool. If resources are randomly assigned, then machines that otherwise would not need to have been used are being used. Therefore, you will always have 100 machines running even though 20 would suffice.

What's desired?

A way to prioritize which machines should get picked first. This way, the oldest machines (as an example) will always get used first and the rest will soon timeout and disconnect from the pool.
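
The scenario above can be tried out as a toy simulation. This is a sketch under stated assumptions, not a model of GitHub's actual scheduler: `simulate`, `pick_random`, and `pick_by_weight` are hypothetical names, time advances in one-minute ticks, every job takes exactly one tick, and a machine is removed once it has been idle for `idle_timeout` ticks.

```python
import random


def simulate(pick, machines=100, jobs_per_minute=20, idle_timeout=10,
             minutes=120, seed=0):
    """Return how many machines remain in the pool after `minutes` ticks."""
    rng = random.Random(seed)
    # machine id -> minutes since it last ran a job
    idle = {m: 0 for m in range(machines)}
    for _ in range(minutes):
        k = min(jobs_per_minute, len(idle))
        chosen = set(pick(sorted(idle), k, rng))
        for m in list(idle):
            idle[m] = 0 if m in chosen else idle[m] + 1
            if idle[m] >= idle_timeout:
                del idle[m]  # the autoscaler removes a machine idle for too long
    return len(idle)


def pick_random(ids, k, rng):
    """Assumed current behaviour: any idle machine may get the job."""
    return rng.sample(ids, k)


def pick_by_weight(ids, k, rng):
    """Requested behaviour: lowest weight first (here the id is the weight)."""
    return ids[:k]


if __name__ == "__main__":
    print("random:  ", simulate(pick_random))
    print("weighted:", simulate(pick_by_weight))
```

In this model the weighted policy settles at exactly the 20 machines that are needed, while the random policy keeps resetting idle timers across the pool and so holds on to more machines than that.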

@naikrovek

naikrovek commented May 5, 2022

your autoscaling solution is probably a bit too naive.

your assumption that the jobs are assigned is also wrong (I'm pretty sure). jobs are pulled by the runners, and not pushed to them. each runner periodically queries to see if any work is available. the first runner to pull that info after it is available gets it, and does the work. in order for a priority system to be implemented, all runners you host would need to talk to each other to know who should poll next.

the GitHub Actions Runner system, in its entirety, is making the assumption that the runner virtual machines are used a single time, then destroyed and replaced with a fresh VM when needed. solutions which do not keep that assumption in mind are going to have a difficult time adjusting to how GHA works.

I wrote an orchestrator in Go which uses workflow_job payloads exclusively to know when to destroy runner VMs and to know when to bring more online. everything is very smooth since I accepted that single-use runners (GitHub call them ephemeral runners) were the correct approach, and stopped fighting it.
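
A minimal sketch of what such an event-driven orchestrator might decide for each webhook delivery (in Python here rather than Go, and not the orchestrator described above). It assumes the `workflow_job` webhook payload with its `action`, `workflow_job.labels`, and `workflow_job.runner_name` fields; `plan_action` and the returned `create_vm` / `destroy_vm` tuples are hypothetical names for illustration.

```python
def plan_action(payload):
    """Map a workflow_job webhook payload to a (hypothetical) orchestrator step."""
    action = payload.get("action")
    job = payload.get("workflow_job", {})
    if action == "queued":
        # A job is waiting: bring one fresh single-use runner online for its labels.
        return ("create_vm", tuple(sorted(job.get("labels", []))))
    if action == "completed" and job.get("runner_name"):
        # The job finished: the runner that ran it is spent, so destroy its VM.
        return ("destroy_vm", job["runner_name"])
    # Other actions (for example "in_progress") need no scaling decision here.
    return ("ignore", None)
```

Keeping the decision a pure function of the payload makes it easy to test without touching any VM provider.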

@jharris-tc

I would like something kind of relevant to this.

I would like to be able to give the runners the ability to opt out of polling, based on some health check. In my case, I have a few VMs, each with a few runners running, and I want a runner to be able to recognize that there is, let's say, 95% memory usage, and not pick up a job. This would allow a runner on a less congested VM to pick it up. Right now, sometimes a congested VM will still pick up the job, and then run out of memory (OOM).

This would not require runners communicating with each other, but basically just polling some endpoint, either through like curl or some file/socket, and if the number is 1 then pick up the job, if 0 then don't

Would be super helpful in distributing jobs
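
A sketch of the kind of health check described above, assuming a Linux host where memory usage is read from `/proc/meminfo`. The runner has no such opt-out hook today (that is the request), so `may_poll` and `memory_used_fraction` are hypothetical helpers that only show the 1/0 decision.

```python
def memory_used_fraction(meminfo_text):
    """Compute the used-memory fraction from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])  # first token is the value
    return 1.0 - fields["MemAvailable"] / fields["MemTotal"]


def may_poll(meminfo_text, threshold=0.95):
    """Return 1 if the runner may pick up a job, 0 if it should sit this one out."""
    return 1 if memory_used_fraction(meminfo_text) < threshold else 0


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(may_poll(f.read()))
```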

@idyll

idyll commented Aug 10, 2022

As another example for this, we have M1 and Intel self hosted Mac runners.

The M1s are so much faster that we'd love a way to give them priority over the intel runners and only send jobs to the intel runners if all the M1 runners are busy.

The weight solution would work but really anything that allows us to set precedence would be great.

@gerhard

gerhard commented Aug 25, 2022

I am also interested in this. This is my generic question which led me here: https://github.com/orgs/community/discussions/30693

This issue seems like a great place to add more context and make it specific so that you can better determine if this is a legit +1

Our Test Universe workflow is highly variable, but it usually takes ~20 minutes to complete. When we run it locally, it completes within 5 minutes:

image

When we parallelise, it completes in less than 2 minutes. That is a significant 10x speed-up. We cannot parallelise it in GitHub Actions because we hit the 7GB memory limit (context). We would like to use self-hosted GitHub Runners in order to achieve this 10x speed-up.

If our own self-hosted GitHub runners are not available (busy, offline, etc.), free GitHub Runners should pick up those jobs. Currently, if we were to use jobs.<job_id>.runs-on: self-hosted, we would be excluding the free GitHub Runners.

I have two questions:

  1. Would it make sense to add jobs.<job_id>.runs-on-preferred: [self-hosted, ubuntu-latest] ?
  2. Can you think of a different way of achieving the above without this feature?

Thank you!

@gerhard

gerhard commented Oct 7, 2022

The new larger GitHub Actions hosted runners make my previous comment a non-issue. This new feature has already made a huge positive difference for us: dagger/dagger#3277 (comment). Great job everyone! 🤘

@SPONGE-JL

Interested in this feature, too! Maybe some changes would be needed in actions-runner-controller as well, like a scale-down hook or relocating runner pods. Still, I'm curious what this feature would change. 🙂

Thanks to every contributor. 🤘

@Martiix

Martiix commented Jan 16, 2023

I'm also interested in this. We are using self-hosted runners to provide different testing hardware environments and therefore have some runners with few labels and some runners with many labels. Our issue is that quite often jobs with fewer labels get picked up by runners with many labels, and therefore the jobs that need more labels have to be queued.
If a weight or priority could be put on the runners with the fewest labels, then we can send the jobs there, and only use the ones with many labels if the rest are full, therefore more likely leaving room for when jobs that require many labels appear.
We don't want to exclude the jobs with only a couple of labels from some runners, but want them to not randomly take up capacity when a more suitable place is available.
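
The "fewest labels first" preference described above could look roughly like this. It is a sketch of the requested behaviour, not of anything GitHub does today; `pick_runner` and the runner names in the usage example are hypothetical.

```python
def pick_runner(job_labels, idle_runners):
    """Pick the idle runner that satisfies the job with the fewest labels.

    idle_runners maps a runner name to its set of labels.
    Returns the chosen runner name, or None if nothing matches.
    """
    wanted = set(job_labels)
    candidates = [
        (len(labels), name)
        for name, labels in idle_runners.items()
        if wanted <= labels  # the runner must carry every requested label
    ]
    # Fewest labels wins; ties are broken by name so the result is stable.
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    runners = {
        "basic": {"self-hosted", "linux"},
        "lab": {"self-hosted", "linux", "fpga", "scope"},
    }
    print(pick_runner(["linux"], runners))  # the plain runner is preferred
    print(pick_runner(["fpga"], runners))   # only the lab runner qualifies
```

This leaves the many-label runners free for the jobs that actually need them, without excluding simple jobs from those runners when nothing else is idle.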

@nedrebo

nedrebo commented Jan 16, 2023

I also support this use case. With a big heterogeneous runner pool (100+ runners with 2-64 core CPUs) with lots of label variations and attached HW, it is very hard to utilize all HW efficiently in both high-load and low-load scenarios.

Ideally, the load balancer should use the history of jobs and the history of runners to do dynamic scheduling.

Doing hard-coded weighting, as proposed here, is going to be hard to do correctly at scale, and maintain it when jobs change characteristics or when new types of runners are added. For this solution to work, I think at least it must be exposed in an API so that it is possible to re-weight all runners programmatically at a schedule.

Still, it would be a more impactful feature if GitHub could schedule for us automatically.

@vallabbharath

vallabbharath commented Jan 19, 2023

Adding this enhancement would make a lot of sense and help users a lot. For example, we could prioritize the runners with lower latency first (we can label them based on the location of those VMs) and treat the other runners (VMs placed elsewhere) as second priority.

@gabriel-samfira
Contributor

gabriel-samfira commented Feb 16, 2023

Adding my 2 cents here.

It would be great if we could remove the default labels (self-hosted, linux, x64 for example) attached to runners. In a lot of cases, workflows only include the self-hosted label to target runners. So it doesn't matter what custom labels we set and what pools of self hosted runners we manage, a random runner may pick up the job due to non unique label sets defined in the workflow.

For example, if you have a set of runners with custom labels like:

  • gpu
  • medium-memory

And another with:

  • fpga
  • high-memory

And the workflow author defines just self-hosted (because it's easy and the documentation encourages authors to use it to target self hosted runners), you may get a runner with a gpu when in fact you required one with fpga. Yes, org/enterprise members could be encouraged to target runners using only user defined labels, but when you have hundreds of teams, this can become quite tedious.

Runner groups are only available to enterprise users. Removing the default labels would be useful even for single repos with more collaborators and free tier orgs that a lot of open source projects use.

Allowing us to remove the default labels will give us the ability to define unique label sets and thus schedule jobs more efficiently. It also allows us to better react to queued workflow webhook and pick the right runner type to spin up if we have automation tools that allow us to define multiple pools with different characteristics (like detailed above). This way we don't need to spin up idle runners. We could just spin up one when a queued event is detected. But we can't do that efficiently if we have multiple runner types we define and the workflow just targets self-hosted. We potentially end up with the wrong runner type.

Hoping this makes sense 😄 .

@ChristopherHX
Contributor

It would be great if we could remove the default labels (self-hosted, linux, x64 for example) attached to runners.

An unsupported way to remove the default labels is to delete them from the configuration function.

agent.Labels.Add(new AgentLabel("self-hosted", LabelType.System));
agent.Labels.Add(new AgentLabel(VarUtil.OS, LabelType.System));
agent.Labels.Add(new AgentLabel(VarUtil.OSArchitecture, LabelType.System));

After you have deleted these 3 lines, compile the actions/runner and use it to configure all your runners.

Last time this worked just fine as long as you provided your own labels.

You don't have to worry about auto updates: as long as your runner is already configured, your label change has been stored online and won't change.

@gabriel-samfira
Contributor

gabriel-samfira commented Feb 16, 2023

An unsupported way to remove the default labels is to delete them from the configuration function.

Yup. I wanted to create a PR that adds a --no-default-labels knob 😄 . We wrote an auto scaler for self hosted runners which is used by a few folks, and it would help if we could tell them they are able to use the officially supported runners instead of a fork that may stop working at some point.

@naikrovek

if a workflow author is not specific with their requirements via the labels, that is on them, in my mind.

I would set up a webhook which sends all workflow runs to a tool which reads them and files a new issue on every repo which runs actions that only specify self-hosted.

our guidelines are to always specify the OS and CPU architecture in labels at a bare minimum.

@gabriel-samfira
Contributor

if a workflow author is not specific with their requirements via the labels, that is on them, in my mind.

Yes, it is on them, but in the meantime, they may end up needlessly consuming instances that are more expensive/scarce (like GPU enabled instances). It also makes it difficult to spin up the right instance types, on the right hierarchy level (repo vs org vs enterprise), on-demand.

In any case, I opened a PR here: #2443

It makes the default labels optional (by default they are added), while still ensuring at least one label is added to the runner.

It feels like better UX to add only the labels you want. If that includes the default labels, great. If not, also great.

@xucian

xucian commented Feb 20, 2023

any news on the prioritization? seems like a core feature that's missing.
sometimes there are many jobs waiting for machines, and many machines waiting for jobs. this shouldn't happen in any well-designed system

@vallabbharath

Yes, PR #2443 is a good one. But prioritization is definitely a much-needed feature.

A workflow author should be able to say "Use runners with these labels if they are available; if they are not available, use runners with another label". Currently that's not possible. As 'xucian' mentioned in the previous comment, if we set the labels strictly to match only the high-resource runners, we have to wait for them without utilizing the priority 2 (low-resource) runners. Or, if we set the labels to match both, we have to live with the compromise of not utilizing the high-resource runner 50% of the time, even though it might be available.
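
The fallback rule described above ("these labels if available, otherwise those") can be sketched as an ordered list of label sets. This illustrates the requested semantics only; `choose_runner` and the runner names are hypothetical, and nothing like this exists in workflow syntax today.

```python
def choose_runner(preferences, idle_runners):
    """Return the first idle runner matching the earliest preference.

    preferences is a list of label lists in priority order;
    idle_runners maps a runner name to its set of labels.
    """
    for wanted in preferences:
        for name in sorted(idle_runners):
            if set(wanted) <= idle_runners[name]:
                return name
    return None  # nothing idle matches any preference, so the job waits


if __name__ == "__main__":
    prefs = [["high-resource"], ["low-resource"]]
    idle = {
        "big-1": {"self-hosted", "high-resource"},
        "small-1": {"self-hosted", "low-resource"},
    }
    print(choose_runner(prefs, idle))  # high-resource runner while it is idle
    del idle["big-1"]
    print(choose_runner(prefs, idle))  # falls back to the low-resource runner
```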

@qoomon

qoomon commented Apr 17, 2023

I think it would be even more useful to be able to define required labels when configuring/registering a new runner, e.g. ./config.sh --labels "ubuntu,large:required" .... With this config, runners should act as follows:

  • A Job with runs-on: ['self-hosted', 'ubuntu', 'large'] would be executed on that runner
  • A Job with runs-on: ['self-hosted', 'large'] would be executed on that runner
  • A Job with runs-on: ['self-hosted', 'ubuntu'] would not be executed on that runner
  • A Job with runs-on: ['self-hosted'] would not be executed on that runner
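
The four cases above follow from a simple rule: the job's labels must all be present on the runner, and every required label must be requested by the job. A sketch of that rule, with loud assumptions: the `:required` suffix is only the proposal made here and is not something `config.sh` accepts today, `parse_labels` and `runner_accepts` are hypothetical names, and the default `self-hosted` label is assumed to be added automatically.

```python
def parse_labels(spec):
    """Split 'ubuntu,large:required' into (all_labels, required_labels)."""
    labels, required = {"self-hosted"}, set()  # assumes the default label is kept
    for item in spec.split(","):
        name, _, flag = item.strip().partition(":")
        labels.add(name)
        if flag == "required":
            required.add(name)
    return labels, required


def runner_accepts(job_labels, spec):
    """True if a runner registered with `spec` would take a job with `job_labels`."""
    labels, required = parse_labels(spec)
    job = set(job_labels)
    # Usual matching (job labels are a subset of the runner's labels),
    # plus the proposed rule: the job must ask for every required label.
    return job <= labels and required <= job


if __name__ == "__main__":
    spec = "ubuntu,large:required"
    for job in (["self-hosted", "ubuntu", "large"], ["self-hosted", "large"],
                ["self-hosted", "ubuntu"], ["self-hosted"]):
        print(job, runner_accepts(job, spec))
```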

@xucian

xucian commented Apr 18, 2023

I think it would be even more useful to be able to define required labels when configuring/registering a new runner, e.g. ./config.sh --labels "ubuntu,large:required" .... With this config, runners should act as follows:

  • A Job with runs-on: ['self-hosted', 'ubuntu', 'large'] would be executed on that runner
  • A Job with runs-on: ['self-hosted', 'large'] would be executed on that runner
  • A Job with runs-on: ['self-hosted', 'ubuntu'] would not be executed on that runner
  • A Job with runs-on: ['self-hosted'] would not be executed on that runner

good idea, finer-grained (optional) control is always welcome. before that, we can just have the positions of the labels to implicitly denote priority. I think 99% of the limitations will be solved this way

gerhard added a commit to gerhard/changelog.com that referenced this issue Jul 31, 2023
As you already know, we use Dagger for CI/CD. By default, this runs on
Fly.io (via Docker). In some cases, this can fail.

The last failure was when DNS resolution stopped working after the
Docker instance was auto-upgraded from apps v1 -> v2 (a.k.a. Fly.io
machines), e.g.
https://github.com/thechangelog/changelog.com/actions/runs/5673476702/attempts/1

As a temporary fix, we had to delete some secrets and re-run the job.
The job ran on GHA free runners & failed for genuine reasons
6 mins later:
https://github.com/thechangelog/changelog.com/actions/runs/5673476702/job/15395264391

While running on the free GHA runners can be 3x-8x slower, it's a good
fall-back. You heard us mention on multiple occasions: "always have
redundancies in place". Since we already have multiple CI runtimes in
place (Fly.io. K8s), let's make our GHA workflow resilient by:
- Run on our preferred back-end by default (Dagger on Fly.io)
  - ✅ If it succeeds, we are done
  - ❌ If it fails, fallback to running on the free GitHub runners
- In forks, use free GitHub runners by default (we cannot share `secrets`)

While this means that a workflow which fails for genuine reasons will
fail twice for us (1. Dagger on Fly.io, 2. Dagger on GitHub), it seems
like a better place to improve from.

This change goes one step further. We are using a third back-end: Dagger
on K8s. This uses a self-hosted GitHub runner on K8s which is already
integrated with Dagger. For now, we are using it just to see how the CI
part compares to our primary setup (Dagger on Fly.io). We are not using
Dagger on K8s to deploy the app. Let's see how this setup behaves over a
few weeks/months before we consider taking it further.

Part of this, we also improved on how we check for Fly.io connectivity.

Things that could be improved in follow-ups:
- the workflow should succeed if the `dagger-on-github-fallback` job succeeds
  - currently it fails if `dagger-on-fly-docker` fails
- add `dagger-on-k8s` job as secondary fallback
  - GitHub Actions is currently missing actions/runner#1665
- maybe leverage a Dagger cache that works in forks too 😉
- Run Dagger Engine as a Fly Machine (no more Docker)
  - thechangelog#471

Signed-off-by: Gerhard Lazu <[email protected]>
gerhard added a commit to thechangelog/changelog.com that referenced this issue Jul 31, 2023
@Kaspik

Kaspik commented Aug 1, 2023

Any update on this ticket? I want to prioritize M2 Pro machines instead of M1 machines (for example).

@naikrovek

if there were updates on this issue you would see updates right here.

@QuixThe2nd

I'm in the same boat. I'd like to prioritise certain servers over others, as build times can vary by as much as 5x depending on the server.

@naikrovek

The only way to do this currently is to label your larger runners with different labels than your smaller runners. And once you do this, your users will discover a new way to make you lose the fight, and everyone will choose the larger runners because they're faster.

@GeunSam2

I really need this feature.

@nedrebo

nedrebo commented Oct 17, 2023

The only way to do this currently is to label your larger runners with different labels than your smaller runners. And once you do this, your users will discover a new way to make you lose the fight, and everyone will choose the larger runners because they're faster.

Agree, that's not even a solution, just a rabbit hole. And, it's not prioritization, just hard coding and limiting the chance of saturation of runner resources :(

One more viable solution is to implement a daemon to dynamically add and remove labels based on queue size for various labels and runner capabilities.

Another is to use Kubernetes or similar to spin up and down runners on demand based on similar rules.

I have not yet tried implementing any of those as cost/value is not there yet for our case, it is not trivial to implement. In the meantime, I'm crossing my fingers (and paying the HW cost of unused/redundant resources) and hoping GitHub implements something usable before we hit the wall.

Another notable hack is to put some runners on the organization level and some on the repository level, as these have different priorities. However, it is a very limited hack with only one axis and only two levels of "prioritization".

I'm really interested to hear if anyone implemented a real solution to this problem, and they want to share it here.

@naikrovek

Another is to use Kubernetes or similar to spin up and down runners on demand based on similar rules.

GitHub have a solution for this called Actions-Runner-Controller. Apparently it works quite well. We're testing it at my employer.

@xucian

xucian commented Oct 18, 2023

Another is to use Kubernetes or similar to spin up and down runners on demand based on similar rules.

GitHub have a solution for this called Actions-Runner-Controller. Apparently it works quite well. We're testing it at my employer.

this seems to be more about scaling similar machines (which I think kubernetes is generally good at), whereas prioritizing is a bit different. for example, I have 3 windows runners, 2 mac runners, 6 linux runners, each with different specs, and I want to have them all running at the same time, and assign priorities at the job level (some jobs are expected to eat more ram than others, but I still want to run them if only a less-preferred runner is available. idling is the worst and sometimes causes cicd deadlocks)

@tach200

tach200 commented Mar 8, 2024

This would be fantastic

@lordmauve

We had this issue and just realised that something that works for us is to use a single space-separated label instead of a YAML list:

runs-on: python highmem gpu

or equivalently

runs-on: >-
  python
  highmem
  gpu

It works for us because we can parse these in any order and launch JIT runners, but maybe this could be part of a solution for others.
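
A sketch of the order-independent parsing mentioned above, as it might run inside an autoscaler that receives the requested label string. `label_key` and `runner_spec` are hypothetical helpers, and the memory sizes are made-up illustration values. As noted later in the thread, GitHub treats the whole string as a single label, so the launched runner would presumably be registered with exactly the string the job asked for.

```python
def label_key(runs_on):
    """Normalize a space-separated label string so word order does not matter."""
    return frozenset(runs_on.split())


def runner_spec(runs_on):
    """Derive a (hypothetical) machine spec from the capability words."""
    words = label_key(runs_on)
    return {
        "register_label": runs_on,  # must match the requested string exactly
        "gpu": "gpu" in words,
        "memory_gb": 64 if "highmem" in words else 8,  # illustrative sizes only
    }


if __name__ == "__main__":
    print(label_key("python highmem gpu") == label_key("gpu python highmem"))
    print(runner_spec("python highmem gpu"))
```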

@xucian

xucian commented Apr 6, 2024

We had this issue and just realised that something that works for us is to use a single space-separated label instead of a YAML list:

runs-on: python highmem gpu

or equivalently

runs-on: >-
  python
  highmem
  gpu

It works for us because we can parse these in any order and launch JIT runners, but maybe this could be part of a solution for others.

can you expand on this? is this making github look for "python highmem gpu" (and thus hang), and meanwhile you somehow intercept this and launch runners on-demand? thanks!

@ChristopherHX
Contributor

meanwhile you somehow intercept this and launch runners on-demand?

@gabriel-samfira
Contributor

gabriel-samfira commented Apr 6, 2024

can you expand on this? is this making github look for "python highmem gpu" (and thus hang), and meanwhile you somehow intercept this and launch runners on-demand? thanks!

@xucian

I am really sorry for the length of this post.

You can do this with an "autoscaler" (there are several out there - ARC, for example. We also wrote one, but I don't want to do any shameless plugs). It's fairly easy to roll your own if you don't need anything too generic, by relying on github webhooks to let you know when a new job is queued. Workflow job payloads that you get from github have a bunch of fields set and a few headers of interest. In the header you have the signature of the payload in case your webhook uses a secret (which it should), and the entity that the webhook is meant for (repo, organization or enterprise). In the payload itself, you get the repo that the workflow originated from, and details about the workflow that triggered it. As part of the payload you also get the labels that were set in runs-on.
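
A minimal sketch of the receiving side described above, assuming the webhook is configured with a secret so that GitHub sends an `X-Hub-Signature-256` header of the form `sha256=<hex HMAC of the raw body>`. The function names are hypothetical, and a real autoscaler would wrap this in an HTTP server.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret, body, header_value):
    """Check the X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison, on bytes so odd header input cannot raise.
    return hmac.compare_digest(expected.encode(), header_value.encode())


def queued_job_labels(secret, body, header_value):
    """Return the runs-on labels of a newly queued job, or None to ignore."""
    if not signature_is_valid(secret, body, header_value):
        return None  # reject deliveries that were not signed with our secret
    payload = json.loads(body)
    if payload.get("action") != "queued":
        return None
    return payload.get("workflow_job", {}).get("labels", [])
```

The signature must be computed over the raw bytes of the request, before any JSON parsing or re-serialization.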

Now, a few things to know about how runners pick up jobs:

Runners can be registered at the repo, org or enterprise levels. In some cases, you can have runners registered in all 3 hierarchy levels. So you can have for example an enterprise called example which has an org called example-org which has a repo called example-org/example-repo. You can have runners registered at all 3 levels and the enterprise can decide to share them with example-org which in turn can share the enterprise runners with example-org/example-repo.

The example-org entity can also create a set of runners which it can share with the repos created in that org. So in this scenario, the example-org/example-repo repository has access to the runners shared by the enterprise, the runners shared by the org, and potentially its own runners.

The runners from all hierarchy levels can have the exact same labels set, and in most cases, the default labels (self-hosted, x64, linux - can vary based on arch and OS) will be present on all runners by default. So if a workflow author decides to create a workflow in the example-org/example-repo repository that has the runs-on field set to just self-hosted, for example, any runner from any hierarchy level can pick up the job.

You can't predict which runner will pick up the job. And this is the challenge when dealing with workflows that use label sets that match multiple types of runners an entity might have.

To get the most predictable results, workflow authors should always use unique and non ambiguous label sets to target runners. For example, let's say you have the following sets of runners:

enterprise:
  labels: [enterprise, linux, self-hosted, gpu]
org:
  labels: [org, linux, self-hosted, sr-iov]
repo:
  labels: [repo, linux, self-hosted, gpu]

If you use only self-hosted as a label in your workflow, then any of the 3 groups of runners you have access to, can pick up the job. If you use [self-hosted, gpu], then the job can be picked up by runners at the enterprise level and the repo level. If however you specify [repo, linux, self-hosted], then only repo runners will match those labels and only those runners will pick up your job.
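
The matching rule used in this example (a runner is eligible when it carries every label the job asks for) can be checked with a few lines. This is a sketch of the rule as described in this comment, using the three label sets from the example; `matching_groups` is a hypothetical helper.

```python
RUNNER_GROUPS = {
    "enterprise": {"enterprise", "linux", "self-hosted", "gpu"},
    "org": {"org", "linux", "self-hosted", "sr-iov"},
    "repo": {"repo", "linux", "self-hosted", "gpu"},
}


def matching_groups(runs_on):
    """List the runner groups whose labels include every requested label."""
    wanted = set(runs_on)
    return [name for name, labels in RUNNER_GROUPS.items() if wanted <= labels]


if __name__ == "__main__":
    print(matching_groups(["self-hosted"]))                   # all three groups
    print(matching_groups(["self-hosted", "gpu"]))            # enterprise and repo
    print(matching_groups(["repo", "linux", "self-hosted"]))  # repo only
```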

You can even have multiple types of runners at the same hierarchy level. So you can have two types of runners at the repo, both of which have the self-hosted label. Both those runner types will match the workflows in question.

The point is to use a set of labels that uniquely identifies the runners you're interested in.

What @lordmauve suggests is to create a label that is a space delimited string. In effect, that is only one label, but it's unique to only that set of runners. If you register your runners with that label and craft your workflow to target that unique label, you should get a predictable result every time. This is not a bad approach, but depending on what you want to do, it may not be practical. You may want to be able to target runners at both the repo level and the org level. Or you may have different types of runners that share a set of labels and you'd be ok with either type picking up the job. The problem here is that we can't set a priority on which runners pick up the job. The only control any autoscaler has is what type of runner it spins up when a new job comes in. But this can never be 100% reliable.

For example, let's say we have an org with 2 repos. Each repo has a workflow that uses self-hosted as a label. One repo holds an app that requires a GPU to run, the other does not. Now let's also assume that the org has 2 types of runners available. One set of runners with the labels self-hosted, linux, x64 and one with self-hosted, linux, x64, gpu. If both these repos start a workflow at the same time, the repo that requires a GPU can pick up a runner that does not have the gpu label set, and that job will probably fail.

Now in this case, you might be inclined to say: "okay, but you can use a runner group to limit repos that can use GPUs". Yes, you could do that, but there are 2 potential issues with that approach:

This makes the job of the autoscaler almost impossible. If you only have the self-hosted label set and no group property in the workflow_job payload, an autoscaler that supports multiple runner types, will not know if you want a runner in one runner group or another.

So, currently, the best way to make sure your jobs run on the runners you want, is to use unique sets of labels to register your runners and inside your workflows.

The problem with this approach is that in large organizations with many teams, in most cases, individual teams won't use anything more than self-hosted and the ops team ends up having to track down those workflows and attempt to convince people to use unique label sets. But IMO, there is currently no way around it and unique sets of labels is the best way to get the desired result.

Ideally, we could have a set of "rules" or "filters" that we can set up in github, which will dictate to which runner a job will be routed. But we don't have that now, and I am not sure if it's something that the amazing folks at GitHub want to have as a feature. In the meantime, a unique set of labels should do.

@naikrovek

naikrovek commented Apr 7, 2024

I wrote something like this. It listened to webhooks and spawned containers whose runners were configured to match the labels requested. If a workflow asked for a larger runner, I created a larger container (more than default CPU and RAM) and registered it.

My problem was that I have enough simultaneous jobs being run at any moment that less specific jobs would be taken by the more specific runners, then the more specific job would be left high and dry, because the runner I spawned for it was taken by some job that only asked for 'Linux', for example.

It was murder to communicate to all 12k developers on our GHES instance about how to use labels, and we eventually gave up.

There is currently no way to guarantee that a specific runner takes a specific job and that is a minor source of annoyance.

All I can do is have so many runners of each specific configuration ready to go at all times.
