
How to handle upgrade with ephemeral #1396

Closed
MichaelJJ opened this issue Oct 1, 2021 · 18 comments
Labels
bug Something isn't working

Comments

@MichaelJJ

To support ephemeral runners as docker containers, we created an init script which runs the following:

/config.sh <arguments>
/run.sh
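
For context, here is a minimal sketch of what that init script does (RUNNER_URL and RUNNER_TOKEN are placeholder environment variables, not our exact arguments):

#!/bin/bash
# Register this container as an ephemeral runner, then wait for a single job.
# RUNNER_URL and RUNNER_TOKEN are placeholders supplied to the container.
set -e
/config.sh --unattended --ephemeral \
    --url "${RUNNER_URL}" \
    --token "${RUNNER_TOKEN}" \
    --name "$(hostname)"
# run.sh picks up the registration written by config.sh and listens for a job
/run.sh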

We've noticed that if the actions runner version is old, the runner will self-update and then exit without actually running a job or de-registering. When there is no upgrade, the runner works correctly.

We are using the latest ubuntu docker image as a base.

Here is the log from the container:

# Authentication
√ Connected to GitHub
# Runner Registration
√ Runner successfully added
√ Runner connection is good
# Runner settings
√ Settings Saved.
√ Connected to GitHub
2021-10-01 20:50:54Z: Listening for Jobs
Runner update in progress, do not shutdown runner.
Downloading 2.283.2 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should be back online within 10 seconds.
√ Removed .credentials
√ Removed .runner

Is there a way to either skip the upgrade or have the runner process a job?

@MichaelJJ MichaelJJ added the bug Something isn't working label Oct 1, 2021
@TingluoHuang
Member

😢 @thboop I guess we have another condition about deleting the config file. 😢

@TingluoHuang
Member

@pje FYI...

@cgeers

cgeers commented Oct 1, 2021

I'm affected by this issue as well.

@TingluoHuang
Member

I think this is actually fixed by #1384

Before the fix, we blindly deleted the runner settings file when an ephemeral runner exited.
During auto-update, the old-version ephemeral runner is supposed to exit, then come back up as the newer version and pick up the job.

Since the settings file got deleted accidentally, the newer-version runner can't start back up again... 😢

@cgeers

cgeers commented Oct 2, 2021

@TingluoHuang #1384 seems relevant, but I think the crux of this problem is that the auto-update procedure creates a new process while the old one waits around for a few seconds and then exits. The old process does not appear to check whether the upgrade succeeded; it just waits a fixed time and exits. In a containerized runner, when the old process exits, all processes in the container are killed, whether or not the upgrade actually had time to complete. I see this routinely.

This changing PID throughout the upgrade procedure doesn't play well with containerized runners that don't also embed their own service manager (e.g. systemd). This is why I question whether ephemeral runners should be subject to auto-upgrade at all.

People are expecting the --ephemeral option to finally bring with it basic support for orchestrating containerized runners, and I'm not sure it does just yet. The design of the upgrade process seems to be a blocker.

Personally, I'd be fine with a --autoupdate=false option being available either with or without --ephemeral.

@MichaelJJ
Author

MichaelJJ commented Oct 2, 2021

The auto-upgrade also increases the time it takes for a runner to come online and be ready to run a job, depending on how long the download and install of the new runner takes.

@ViacheslavKudinov

I'm also interested in functionality to prevent the auto-upgrade.
Like some other people, we were affected by this release, and we believe that more control over the upgrade process is beneficial for enterprise organizations, where not being able to start new self-hosted runners in ephemeral mode is pretty critical.

@giorgiocerruti

We are affected too. We are running runners in an ECS cluster, and being able to disable the auto-update would be very handy.

fgalind1 added a commit to fgalind1/runner that referenced this issue Oct 28, 2021
When using infrastructure as code, containers and recipes, pinning
versions for reproducibility is a good practice. Usually docker
container images contain static tools and binaries, and the docker image
is tagged with a specific version.

Constructing a docker image of a runner with a specific runner version
that then self-updates doesn't seem natural; instead, the docker image
should use whatever version the binary was built/tagged with.

In addition, this concept doesn't play well when using ephemeral
runners and Kubernetes. First of all, we need to pay the price of
downloading/self-updating every single ephemeral pod for every single
job, which causes delays in execution. Secondly, this doesn't work well
and containers may get stuck.

Related issues that will be solved with this:
- actions#1396
- actions#246
- actions#485
- actions#422
- actions#442
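
As an illustration of the pinning pattern described above, an image build step might fetch a fixed release roughly like this (the version value is only an example):

# Pin the runner version explicitly instead of relying on self-update
RUNNER_VERSION=2.283.2
curl -O -L https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz
tar xzf actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz
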
@jimrazmus

This also affects my team. We deploy the runners using idempotent Docker containers running on Nomad. We utilize a system job to ensure we have 1 runner executing per node along with the ephemeral/run-once runner option. When a job completes, the Nomad orchestrator handles the clean up and starts another fresh runner. Automatic upgrades conflict with our orchestration strategy. We end up in a loop where runners terminate after upgrading, which terminates the job, which leads to a new container launch, which starts the whole loop again.

Disabling automatic upgrades would be a welcome improvement.

@ethomson
Contributor

ethomson commented Dec 1, 2021

We're adding an option to allow self-hosted ephemeral runners to opt out of automatic updates so that you can manage updates yourself.

Some background: we consider the runner software and the hosted Actions software as a cohesive whole. Many times when we add a new feature to GitHub Actions, these changes need to be made both on the hosted service and in the runner - for example, when we added conditional steps to composite actions. This is why we've always required runner updates, so that we can be sure that the runner is compatible with the service version.

Obviously this is a painful requirement for many ephemeral users. So we'll add an opt-out mechanism for ephemeral, where the runner will not try to do a self-update. This flag will allow you to control when you update your runners.

Because the runner versions are so tightly coupled to the overall service, you'll be required to update within a month of a new runner version being released. After a month, your runners will no longer be able to connect to GitHub, so you will need to perform updates regularly. Immediately upon a new release, the runner will begin notifying you on stdout and stderr that an update is available. We'll also start adding annotations to workflow runs on outdated runners.

This is in development now and we plan to have it generally available in the new year.

@tyrken

tyrken commented Dec 1, 2021

Thanks @ethomson - how will this "update within a month" limit work with Github Enterprise Server installations, where the server (where I presume the Actions server-side code resides) may not be on the bleeding edge upgrading to your latest releases all the time? Will you document a minimum runner version in each GHE release notes?

@ethomson
Contributor

ethomson commented Dec 1, 2021

Thanks @ethomson - how will this "update within a month" limit work with Github Enterprise Server installations, where the server (where I presume the Actions server-side code resides) may not be on the bleeding edge upgrading to your latest releases all the time? Will you document a minimum runner version in each GHE release notes?

@tyrken We will, yes. We're still working on the details here but we'll have guidance - and I'm paraphrasing - "update your runner fleet first to version ". In a sense this is much easier on GHES since you control the upgrade of both pieces.

@tonywildey-valstro

tonywildey-valstro commented Dec 3, 2021

I think that the grace period should work if we can point our build scripts at the latest version (rather than a specific named one) and then automate a rebuild when the version changes. It would be nice if the check the runner does were exposed as an action, so we can easily do the same check.

e.g. create this release structure, as other apps do:

curl -O -L https://github.com/actions/runner/releases/download/latest/actions-runner-linux-x64.tar.gz
vs
curl -O -L https://github.com/actions/runner/releases/download/v2.285.0/actions-runner-linux-x64-2.285.0.tar.gz

This would allow us to automate builds without having to override build-args and without having to interrogate the release tag
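
For illustration, one way to resolve the latest tag today is the public releases API (just a sketch; it assumes jq is available and still requires interrogating the tag over the network):

LATEST=$(curl -s https://api.github.com/repos/actions/runner/releases/latest | jq -r '.tag_name')   # e.g. v2.285.0
VERSION=${LATEST#v}   # strip the leading "v" for the tarball name
curl -O -L https://github.com/actions/runner/releases/download/${LATEST}/actions-runner-linux-x64-${VERSION}.tar.gz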

@bryanmacfarlane
Member

@tonywildey-valstro agreed. In some discussions we asserted the runner build process should not only build a tar.gz but also publish a container. That would allow you to consume latest, or maybe we could even move a stable label that's guaranteed to be < 30 days old but not quite on the bleeding edge (a publish from an hour ago).

One caveat to something like that: you would not only want to use the ephemeral runner concept but would also likely want to use the yaml job containers feature. That means the runner container is orthogonal to your build/tools/app container with your stuff in it, and they can move independently (one reason why we created that yaml feature). If you couple the two, then you have to rebuild every time we build (within 30 days).

@tonywildey-valstro

@bryanmacfarlane thanks for the response, makes perfect sense.
Will take a look at the yaml job containers, as this is exactly the model we want.

@GregoireW

GregoireW commented Dec 9, 2021

@ethomson

If you work on an option to disable the auto-upgrade, please also add an option to do an upgrade if needed! (--upgrade-only, for instance)

It will be much simpler for me to have a script like:

./run.sh --upgrade-only
./config.sh .....
./run.sh 

Edit: or a name like --check-upgrade, but the idea is to get the recommended version (which may be different from the latest version).

@bsc-dev-ops

Can we add a feature to specify the auto-update time? Example: we could add a cron expression to run the auto-update every Sunday night PST or at some global time. This way we can still keep our runners up to date and also not disturb runs during business hours.
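
For example, in standard cron syntax, the kind of schedule meant here would look like this (illustration only, not an existing runner option):

# minute hour day-of-month month day-of-week
0 23 * * 0    # run the update check at 23:00 every Sunday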

@thboop
Collaborator

thboop commented Feb 1, 2022

We've shipped the ability to disable auto-upgrade; please see the changelog for more information.
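
For reference, registration with the update opt-out looks roughly like this (assuming the --disableupdate flag described in the changelog; <org>/<repo> and <token> are placeholders):

./config.sh --url https://github.com/<org>/<repo> --token <token> --ephemeral --disableupdate
./run.sh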
