Add ability to opt out of agent auto-update #246
Comments
We're also looking at producing an image as part of the runner build, so if you are doing single-use container builds you never hit a required update - each job picks up a container that already has the runner version the service demands. Question - are you doing single-use containers, or does the runner / container lifetime span jobs and builds in your scenario? |
Instead of the entrypoint running run.sh (interactive), it should probably run runsvc.sh, which handles the update return codes. That, coupled with an image released with every runner version, would probably eliminate the issue. https://github.com/actions/runner/blob/master/src/Misc/layoutbin/runsvc.sh |
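A minimal sketch of that suggestion, assuming the runner is already unpacked in the container; the paths, variable names, and the location of runsvc.sh in the installed layout are assumptions, not an official recipe.

```bash
#!/bin/bash
# Hypothetical container entrypoint. Assumes the runner tarball is already
# unpacked in /actions-runner; RUNNER_URL, RUNNER_TOKEN and RUNNER_NAME are
# placeholder variables supplied by the container runtime.
set -euo pipefail
cd /actions-runner

# Register the runner non-interactively.
./config.sh --unattended \
  --url "${RUNNER_URL}" \
  --token "${RUNNER_TOKEN}" \
  --name "${RUNNER_NAME:-$(hostname)}"

# Hand control to runsvc.sh rather than run.sh so update return codes are
# handled. The ./bin/runsvc.sh path is an assumption about the packaged
# layout; in the repo the script lives under src/Misc/layoutbin/.
exec ./bin/runsvc.sh
```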
^^ I'm working on some docker containers to be part of the build - I'll look over your files |
just make sure the dockerfiles work for arm/arm64 please 😃 |
I have a customer requesting this. At the moment the self-hosted runner self-updates, making it potentially unpredictable for production workloads. |
Please don't include docker by default. I already run this inside a container. |
Here are my suggestions:
|
It's the service that demands and pulls the runner forward. There are new capabilities and features on a cloud service that's rolling forward, and those need changes in the runner. Even if the update is demanded only when you use a new feature, it still means a runner can get pulled forward. So, when it gets pulled forward, if the container is referencing

Also note that if we build some containers and release them in sync with the runner, that doesn't preclude you from building your own containers as you're doing now. It would be a pure convenience for others. |
I'd rather it stopped working than auto updated. |
That's possible as well. Call ./run.sh; it will exit and not handle the update code. However, please be aware that all your workflows will stop running until you build another container or manually configure all your self-hosted runners again - and you will only be able to do that with the version the service demands, the one you didn't want to auto-update to. So I'm not sure what the stop buys you. Note that the runner is a simple step runner: the yaml and the orchestration run service side (that's where things like matrixing and step outputs are handled) and delegate steps to this app. We're also not talking about the hosted images, dev tools, etc., which usually affect the work being done. The tools in the container or on the self-hosted runner will stay the same. |
☝️ Also note that we are painfully aware of how this affects container scenarios and building your own. We're thinking about that so I don't want to discount the problem here. Just trying to explain how it works and what the current state is. Hope that helps. |
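To make the distinction concrete, here is a rough sketch of the kind of supervision loop that runsvc.sh provides and a bare ./run.sh entrypoint lacks. The exit-code value below is a placeholder assumption, not taken from the runner source; the authoritative handling lives in runsvc.sh / RunnerService.js.

```bash
#!/bin/bash
# Illustrative wrapper only: relaunch the runner when it exits because it
# replaced itself with an updated version. UPDATE_EXIT_CODE is an assumed
# placeholder value, not the real constant from the runner source.
UPDATE_EXIT_CODE=3

while true; do
  ./run.sh
  code=$?
  if [ "$code" -eq "$UPDATE_EXIT_CODE" ]; then
    echo "Runner self-updated; relaunching the replaced binary..."
    continue
  fi
  exit "$code"
done
```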
I understand where you are coming from. With CI you are running a lot of untrusted code, so maybe my argument doesn't hold much water. That being said, you can isolate untrusted code using KVM, run it as an unprivileged user, and trash the results afterwards. However, the way the system is currently designed and distributed makes this difficult. One would normally expect a package to have a clean split between "executables/assets" and "run-time state". For whatever reason, this package doesn't have that; not even the configuration is separated out. Ideally, you'd install something like

Some of the consequences of this design are:

So, auto update makes the package unpredictable. Some package systems provide verification functions to check whether there are errors on the file system or whether something has been manipulated. Some package managers provide a permissions check that validates file permissions. Some provide file ownership queries and other operations. If the package files keep changing, these functions stop working. Additionally, it makes it hard to remove the package if the files are not known. For example, removing this package on Arch Linux boils down to

So, my advice is:
|
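To illustrate the kind of split being asked for in the comment above, here is a hypothetical FHS-style layout. None of these paths come from the runner project; they are assumptions about what a conventionally packaged install could look like.

```bash
#!/bin/sh
# Hypothetical layout only: a package-managed split between read-only program
# files and mutable runtime state. All paths are illustrative assumptions.
install -d -m 0755 /usr/lib/actions-runner   # binaries and bundled node, owned by the package manager
install -d -m 0755 /etc/actions-runner       # admin-managed configuration (URL, labels)
install -d -m 0750 /var/lib/actions-runner   # runtime state: work directory, registration, self-update output
install -d -m 0750 /var/log/actions-runner   # diagnostic logs
```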
Oh, I also wanted to add, auto update is a massive target for black hats, so completely agree with @SvanBoxel - hard to audit, hard to reproduce, hard to validate, etc. |
We'll reflect on ☝️ for a bit. But note that there's a ton of separation right now: the runner knows nothing about yaml, the language, and other service-side / user features. That's all compiled / processed service side into the simplest "job" message, which is one node of the whole run graph (a set of steps). Those use contexts and are passed through fairly opaquely to the runner to minimize changes. However, there are times when a feature is added that takes runner changes. As an example, js or container actions can have post-job scripts and we're about to add pre. Another example would be sidecar containers for tests (expressed in yaml but executed by the runner).

So there are actually multiple moving parts: (1) the service, (2) the runner, and (3) the actions that 3rd parties are releasing. :o So now, when an action packages a pre/post script (start an emulator, shut it down, etc.), the author expects it to run, and even though the outer workflow runs service side, actions run VM side and the steps are run by the runner.

In Azure (where this code started) we actually did a really complicated thing. We allowed the runners to stay back and only get updated when the user used a new service-side feature that demanded a new runner, and we also allowed action authors to demand a minimum version of the runner. The service would compute a minimum based on the features used, enumerate the actions, calculate the min, and if the runner was too old, send an update message. It still got updated, just less often and much less predictably. Task authors couldn't figure it out, didn't get it right, and it led to really hard to debug situations.

I think the fundamental design decision is that the yaml is not run by the runners. It's run service side with a durable, stateful orchestration which dishes out work to runners. That has a lot of benefits, with things like parallel graphs of nodes dishing out simple instructions to a runner. But it does mean that the runner binary is an RPC remote extension of that orchestration, and actions are separate from that. Completely open to refactoring the runner and to ideas. But I think the root of the problem is the multiple moving parts, with the other two moving forward and getting new features. :) If you got this far, thanks for reading my wall of text. Hope it provides some background. |
Yeah, I can understand what you are saying. I agree it's not simple and, to be honest, I see both sides of that coin. But as someone who was trying to put together a package for users, this was non-trivial to deal with. Right now, all the files are installed into

So somehow we need to find a middle ground that allows someone like myself to package it up in a more traditional way, but also supports things like on-the-fly updates. Whether you just vendor all the self-updating code into the runtime directory is kind of up to you / the engineers behind the code, but at least it should be possible. In essence, the package should install the core code, the system integration, the user account setup, systemd units, and pull in dependencies which may be provided by the system (nodejs, c#, etc.). Then, when you start up an instance, it would self-update its own scripts into the runtime directory, but this is largely state which could be wiped clean without affecting the package being reinstalled, running the update process in the future, and so on. |
It doesn't have to stop working in general if you show deprecation warnings saying "a new version is available" for one version. Running X would work for most scenarios. Deprecation is hard for sure, but everyone would have an upgrade window in this scenario |
We follow the model above when we (rarely) deprecate features. When we do it, that process ☝️ spans months. However, this is not about deprecation. This is about new features getting added to the service (which users adopt as soon as they're released, e.g. a new yaml step construct) and the runner having to add new code (not deprecate it) for that feature to work. This is also about the runner adding new features to support new capabilities for action authors (and for the others who use those actions and maintain runners). |
An important consideration I would like us to think about is that companies in some industries are simply not compliant if they run software that hasn't been explicitly verified/validated. So having a service that runs on your own infrastructure and automatically updates could mean your business is not compliant. An explicit upgrade process, or a way to opt out of automatic updates, would solve this compliance conflict. As an example, software used within companies that require FDA compliance needs to go through this process before a service is upgraded: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?CFRPart=820&showFR=1&subpartNode=21:8.0.1.1.12.7 |
☝️ is what I was going to follow up with. |
Yep. We've worked with those regulated environments in both azure and github. In those scenarios, the other parts do not move forward either. Everything is controlled or air-gapped, including the on-premises server (TFS at MS, the upcoming GHES at GitHub). It's not just the runner that's under scrutiny; the tasks/actions were also scrutinized and controlled. In those scenarios, the runner (and actions) do not update, since the service is also controlled. Even in on-premises, non-air-gapped setups, the service and actions do not continuously update, so the runners can stay back. The minimum runner version is recorded and controlled server / service side. Which goes back to the key point: if the other parts (service and actions) move forward, then the code that executes those features has to move forward. |
Note that I think we should still look at investing in options here. The points above are clarifying where we currently are with the current model. |
Also note, in the short term, if you are running the runner in a container, you should call the

In the medium term, we also want to support

Finally, we should consider refactoring, as @ioquatix referred to, into the base runner (listener) and the steps portion, which is derived from the service yaml orchestration. |
@bryanmacfarlane one thing I considered several months ago was writing a compatible daemon in Ruby just for fun. Do you have any published specs for the GitHub Actions runner? I started reading through the C# source code but... trying to derive a protocol/specification by reading through C# is not really my thing. |
I'm a bit concerned about the runner updating to a pre-release version. When I build my container to get around the auto-update autokill, I query |
I also did a double-take at the "pre-release" tag on the last 3 versions: v2.277.0, v2.277.1, and v2.278.0. It's not clear to me why those specifically are pre-releases. As far as I can tell, only the Linux arm/arm64 builds are pre-release, but those were added prior to the listed versions. |
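For the container-build use case described above, one way to pin the runner while avoiding pre-releases is to resolve the version through the releases/latest API endpoint, which only returns the newest non-draft, non-prerelease release. A rough sketch, assuming curl and jq are available:

```bash
#!/bin/bash
# Resolve the latest stable (non-prerelease) runner release and download it.
set -euo pipefail

TAG=$(curl -fsSL https://api.github.com/repos/actions/runner/releases/latest | jq -r '.tag_name')
VERSION=${TAG#v}   # e.g. "v2.278.0" -> "2.278.0"

curl -fsSLO "https://github.com/actions/runner/releases/download/${TAG}/actions-runner-linux-x64-${VERSION}.tar.gz"
tar xzf "actions-runner-linux-x64-${VERSION}.tar.gz"
```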
Rust's builders also got stuck yesterday when a pre-release was pushed as an update to our runners. We can't use auto-updates for security reasons. |
Is there any update on the issue? I am using Azure Kubernetes to host the runners and every time a build is triggered, the runners first go offline to auto-update which is super annoying. |
@pietroalbini I think you may be confusing the runner application (this repo) and the hosted images. Even if you could pin this application, the OS images and all the toolsets slide underneath it so I don't think this issue will address your issue and other's concerns. |
@bryanmacfarlane the reason Rust can't use auto-updates is not related to the hosted images; it's a byproduct of running a custom runner fork as a workaround for another runner limitation. The main problem Rust has with the self-hosted runner is #494 (the runner executing builds from PRs, causing security issues), and to work around it we're using a fork of the runner that rejects any build that's not a

In our fork we also added a change to prevent auto-updates, but since the GitHub backend only sends jobs to a runner if it's up to date, preventing auto-updates that way effectively disables the runner whenever a new runner release is made. We have to watch for new releases and rebase our fork before the update rolls out to our organization, and when we don't get around to it fast enough our queue stops, which is annoying. |
@bryanmacfarlane @hross Do you have plans to create some kind of communication channel for upgrades? With a grace period to upgrade? Or to support the last X versions for compatibility? |
@patryk-kozak-branch Sorry about what you experienced, do you have any runner diag log still available for me to check? I would like to understand why the 2.280.3 upgrade breaks everything for you. 🙇 |
@jenschelkopf is that private? I'm unable to see that issue |
Ah, yes @myoung34. That issue is part of a repository where we track customer feedback internally. I got a little link happy :) I'll hide the comment so other people don't try it as well. |
My little but effective workaround: my own runner implementation, written in golang, which obviously doesn't want to be replaced by the official runner:
As you can compile the runner in a fork or locally, why not change the runner version to 3.0.0 (file https://github.com/actions/runner/blob/main/src/runnerversion)? If you run the dev script / workflow now, your runner won't get updates for some months or even years (no idea when github runners will reach 3.x.x; then use 4.0.0 and so on). This method does bear the risk that github requires a new runner feature, like a new message encryption method (fips was the last one that was added), and your old runner version doesn't have that feature and fails to connect, because the actions service assumes the feature is present. |
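A rough sketch of the version-bump trick described above; the runnerversion path comes from that comment, while the build step is left to whatever dev script or release workflow your fork already uses.

```bash
#!/bin/bash
# Sketch: claim a version far ahead of anything the service will demand soon.
set -euo pipefail
git clone https://github.com/actions/runner.git
cd runner
echo "3.0.0" > src/runnerversion   # file referenced in the comment above
# Build/package the runner from this fork with its usual dev script / workflow.
```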
So is there anything new on this topic? Auto-updating when running "runner" in Kubernetes is constantly failing. An option to opt out would be nice. |
When using infrastructure as code, containers, and recipes, pinning versions for reproducibility is a good practice. Usually docker container images contain static tools and binaries, and the docker image is tagged with a specific version. Constructing a docker image of a runner with a specific runner version that then self-updates doesn't seem natural; instead, the docker image should use whatever binary it was built/tagged with. Additionally, this concept doesn't play well when using ephemeral runners and kubernetes. First, we need to pay the price of downloading/self-updating every single ephemeral pod for every single job, which causes delays in execution. Secondly, this doesn't work well and containers may get stuck.

Related issues that will be solved with this:
- actions#1396
- actions#246
- actions#485
- actions#422
- actions#442
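Since the commit message above mentions ephemeral runners, here is a minimal sketch of registering a single-job runner, the usual pattern for Kubernetes pods that exist for exactly one job. The --ephemeral flag exists in recent runner versions; the URL and token variables are placeholders.

```bash
#!/bin/bash
# Register a runner that takes exactly one job and then unconfigures itself.
# RUNNER_URL and RUNNER_TOKEN are placeholders supplied by the pod spec.
set -euo pipefail
cd /actions-runner

./config.sh --unattended --ephemeral \
  --url "${RUNNER_URL}" \
  --token "${RUNNER_TOKEN}"

./run.sh   # exits after the single job completes
```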
I would like to see this as well 🥇 |
This has been implemented in runner |
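A hedged sketch of configuring the opt-out once you are on a runner version that implements it. The --disableupdate flag name is not stated in this thread, so treat it as an assumption and verify it against ./config.sh --help for your runner version.

```bash
#!/bin/bash
# Assumed usage of the auto-update opt-out; verify the flag name for your version.
cd /actions-runner
./config.sh --unattended --disableupdate \
  --url "https://github.com/ORG/REPO" \
  --token "${RUNNER_TOKEN}"
./run.sh
```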
Describe the enhancement
Running the agent within docker and having it auto-update causes the container to throttle and never succeed, unless you also manage PID 1.
It would be nice to have something in config.sh or run.sh, possibly environment variables, to help opt out of auto-update.

Code Snippet
n/a
Additional information
n/a