Ephemeral runner #831

Closed
rofafor wants to merge 5 commits from the feature/ephemeral_runner branch

rofafor
Contributor

@rofafor rofafor commented Sep 20, 2021

The --ephemeral switch is now released and --once was (accidentally?) removed in https://github.com/actions/runner/pull/660/files:

Actually, the --ephemeral switch is already in the images published via CI, but missing one small fix. Here's my log from the current latest tagged image:

Passing --once to runsvc.sh to enable the legacy ephemeral runner.
.path=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Starting Runner listener with startup type: service
Started listener process
Started running service
Unrecognized command-line input arguments: 'once'. For usage refer to: .\config.cmd --help or ./config.sh --help

√ Connected to GitHub

I had to leave RUNNER_FEATURE_FLAG_EPHEMERAL logic due to the fact that GitHub Enterprise Server 3.0 or 3.1 doesn't support this new ephemeral feature yet:

√ Connected to GitHub

# Runner Registration

An Internal Error Occurred. Activity Id: XXXXXXXX-e335-4995-9f33-659731da78eb.
Configuration failed. Retrying

@rofafor rofafor force-pushed the feature/ephemeral_runner branch 3 times, most recently from 4e2d4a9 to f35ce70 Compare September 20, 2021 18:31
@mumoshu
Collaborator

mumoshu commented Sep 20, 2021

@rofafor Hey! Thanks for this PR. You don't need to worry about that error, though; it's only a validation error. --once wasn't removed in PR 660. The author removed --once only from the list of valid flags, which is why you see Unrecognized command-line input arguments: 'once'. For usage refer to: .\config.cmd --help or ./config.sh --help , but the legacy behaviour itself hasn't been removed.

@mumoshu
Collaborator

mumoshu commented Sep 20, 2021

@TingluoHuang JFYI, this seems like a revival of the discussion we had in the slack channel before.

@mumoshu
Collaborator

mumoshu commented Sep 20, 2021

I had to leave RUNNER_FEATURE_FLAG_EPHEMERAL logic due to the fact that GitHub Enterprise Server 3.0 or 3.1 doesn't support this new ephemeral feature yet:

My original plan was to drop support for --once only after the --ephemeral flag gets rolled out to every flavor of GitHub.
That said, all your changes are almost perfect, except this part.

Could you also change L82 from:

if [ "${RUNNER_FEATURE_FLAG_EPHEMERAL:-}" == "true" -a "${RUNNER_EPHEMERAL}" != "false" ]; then

to

if [ "${RUNNER_FEATURE_FLAG_EPHEMERAL:-}" != "false" -a "${RUNNER_EPHEMERAL}" != "false" ]; then

and revert the change on L160, and finally revert removal of this part:

args=()
if [ "${RUNNER_FEATURE_FLAG_EPHEMERAL:-}" != "true" -a "${RUNNER_EPHEMERAL}" != "false" ]; then
  args+=(--once)
  echo "Passing --once to runsvc.sh to enable the legacy ephemeral runner."
fi

and change it to:

args=()
if [ "${RUNNER_FEATURE_FLAG_EPHEMERAL:-}" == "false" -a "${RUNNER_EPHEMERAL}" != "false" ]; then
  args+=(--once)
  echo "Passing --once to runsvc.sh to enable the legacy ephemeral runner."
fi

That way the new --ephemeral flag becomes the default option, while the legacy --once flag can still be used for older GHES installations.
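
For illustration, the suggested conditions can be collapsed into one decision. This is a minimal sketch, not the actual entrypoint: `select_runner_flag` is a hypothetical helper, and it merges the two separate checks (one for the configuration step, one for `runsvc.sh`) into a single function that prints the flag that would be chosen.

```shell
#!/usr/bin/env bash
# Sketch only: select_runner_flag is a hypothetical helper, not part of the
# real entrypoint script. It mirrors the conditions suggested above.
select_runner_flag() {
  local feature_flag="${RUNNER_FEATURE_FLAG_EPHEMERAL:-}"
  local ephemeral="${RUNNER_EPHEMERAL:-}"

  if [ "${ephemeral}" = "false" ]; then
    # Static (persistent) runner: neither flag is passed.
    echo ""
  elif [ "${feature_flag}" = "false" ]; then
    # Explicit opt-out, e.g. for older GHES installations.
    echo "--once"
  else
    # Unset or any other value: the new flag is the default.
    echo "--ephemeral"
  fi
}
```

With neither variable set, `select_runner_flag` prints `--ephemeral`; setting `RUNNER_FEATURE_FLAG_EPHEMERAL=false` switches it to `--once`.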

@rofafor
Contributor Author

rofafor commented Sep 21, 2021

I tested this and it seems to work with Enterprise servers as well when setting RUNNER_FEATURE_FLAG_EPHEMERAL: "false", despite the "Unrecognized command-line input" error.

@toast-gear
Collaborator

toast-gear commented Oct 19, 2021

@rofafor we've been talking about how to roll this out without breaking people's setups. Can you confirm whether your GHES https://ghe.company.com/api/v3/meta endpoint is callable from a runner without a token, and whether the endpoint is rate limited? Additionally, can you call it with a token and get an authenticated rate limit budget? What does the rate limit header look like with curl -v?

Finally, could you send along what the JSON looks like?

@rofafor
Contributor Author

rofafor commented Oct 21, 2021

Target system: GHES 3.1.8. Tested both on my laptop and in an actual GHA workflow, with similar results (as expected):

    - name: Access meta
      run: |
        curl -vsSL -H "Authorization: Bearer foobar" https://ghe.company.com/api/v3/meta
        curl -vsSL -H "Authorization: Bearer ${{ secrets.GITHUB_TOKEN }}" https://ghe.company.com/api/v3/meta
        curl -vsSL https://ghe.company.com/api/v3/meta

Bad credentials:

curl -vsSL -H "Authorization: Bearer foobar" https://ghe.company.com/api/v3/meta

< HTTP/2 401
< server: GitHub.com
< date: Thu, 21 Oct 2021 15:30:45 GMT
< content-type: application/json; charset=utf-8
< content-length: 105
< x-github-enterprise-version: 3.1.8
< x-github-media-type: github.v3; format=json
< access-control-expose-headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset
< access-control-allow-origin: *
< x-github-request-id: ***
< strict-transport-security: max-age=31536000; includeSubdomains
< x-frame-options: deny
< x-content-type-options: nosniff
< x-xss-protection: 1; mode=block
< referrer-policy: origin-when-cross-origin, strict-origin-when-cross-origin
< content-security-policy: default-src 'none'
< x-runtime-rack: 0.009127
<
{
  "message": "Bad credentials",
  "documentation_url": "https://docs.github.com/enterprise/3.1/rest"
}

With a proper token:

curl -vsSL -H "Authorization: Bearer ***" https://ghe.company.com/api/v3/meta

< HTTP/2 200
< server: GitHub.com
< date: Thu, 21 Oct 2021 15:33:57 GMT
< content-type: application/json; charset=utf-8
< content-length: 81
< cache-control: private, max-age=60, s-maxage=60
< vary: Accept, Authorization, Cookie, X-GitHub-OTP
< etag: "***"
< x-oauth-scopes: admin:org_hook, read:org, read:repo_hook, repo, workflow
< x-accepted-oauth-scopes:
< x-github-enterprise-version: 3.1.8
< x-github-media-type: github.v3; format=json
< access-control-expose-headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset
< access-control-allow-origin: *
< x-github-request-id: ***
< strict-transport-security: max-age=31536000; includeSubdomains
< x-frame-options: deny
< x-content-type-options: nosniff
< x-xss-protection: 1; mode=block
< referrer-policy: origin-when-cross-origin, strict-origin-when-cross-origin
< content-security-policy: default-src 'none'
< x-runtime-rack: 0.015951
<
{
  "verifiable_password_authentication": true,
  "installed_version": "3.1.8"
}

Without authentication:

curl -vsSL https://ghe.company.com/api/v3/meta

< HTTP/2 200
< server: GitHub.com
< date: Thu, 21 Oct 2021 15:36:07 GMT
< content-type: application/json; charset=utf-8
< content-length: 81
< cache-control: public, max-age=60, s-maxage=60
< vary: Accept
< etag: "***"
< x-github-enterprise-version: 3.1.8
< x-github-media-type: github.v3; format=json
< access-control-expose-headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset
< access-control-allow-origin: *
< x-github-request-id: ***
< strict-transport-security: max-age=31536000; includeSubdomains
< x-frame-options: deny
< x-content-type-options: nosniff
< x-xss-protection: 1; mode=block
< referrer-policy: origin-when-cross-origin, strict-origin-when-cross-origin
< content-security-policy: default-src 'none'
< x-runtime-rack: 0.015265
<
{
  "verifiable_password_authentication": true,
  "installed_version": "3.1.8"
}

...and the same with newer GHES 3.2.0:

curl -vsSL https://ghe-test.company.com/api/v3/meta

< HTTP/2 200
< server: GitHub.com
< date: Thu, 21 Oct 2021 15:46:39 GMT
< content-type: application/json; charset=utf-8
< content-length: 81
< cache-control: public, max-age=60, s-maxage=60
< vary: Accept
< etag: "***"
< x-github-enterprise-version: 3.2.0
< x-github-media-type: github.v3; format=json
< access-control-expose-headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset
< access-control-allow-origin: *
< x-github-request-id: ***
< strict-transport-security: max-age=31536000; includeSubdomains
< x-frame-options: deny
< x-content-type-options: nosniff
< x-xss-protection: 0
< referrer-policy: origin-when-cross-origin, strict-origin-when-cross-origin
< content-security-policy: default-src 'none'
< x-runtime-rack: 0.019815
<
{
  "verifiable_password_authentication": true,
  "installed_version": "3.2.0"
}
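
As a rough sketch of how these responses could be used, the snippet below pulls `installed_version` out of a `/meta` response body and compares it with a minimum version. The helper names `installed_version` and `version_ge` are hypothetical, and the 3.3 threshold used in the usage note is the version mentioned in this thread, not something the endpoint reports.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical names introduced for illustration.

# Print the value of "installed_version" from a JSON body on stdin
# (prints nothing if the key is absent).
installed_version() {
  sed -n 's/.*"installed_version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed when version $1 is greater than or equal to version $2.
# Components are compared numerically; missing components count as 0.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, ".")
    m = split(b, y, ".")
    len = (n > m) ? n : m
    for (i = 1; i <= len; i++) {
      xi = (i <= n) ? x[i] + 0 : 0
      yi = (i <= m) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}
```

A possible use would be `version="$(curl -sSL https://ghe.company.com/api/v3/meta | installed_version)"` followed by `version_ge "$version" 3.3`, which would fail for the 3.1.8 and 3.2.0 instances shown above.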

@rofafor rofafor force-pushed the feature/ephemeral_runner branch from ebc49e1 to 1b146e6 Compare October 21, 2021 15:54
@toast-gear
Collaborator

toast-gear commented Nov 7, 2021

We've talked about this and we can't merge it as is. The problem is that once this is merged we will break any GitHub Enterprise Server deployment, as the ephemeral flag isn't available there. The new flag will be supported in 3.3, I believe; however, that still isn't enough, as not everyone will upgrade quickly (they could conceivably not upgrade for years when talking about banks).

In order to support setting the new flag we need to expose the GitHub environment to the runner's entrypoint, i.e. environment variables such as GITHUB_TYPE: CLOUD vs GITHUB_TYPE: SERVER. If GITHUB_TYPE: SERVER, what version of server? GITHUB_SERVER_VERSION: 3.3 (or whatever it is). This logic needs to be done server-side in the controller and held in memory, with the new environment variables bolted onto the runner spec once established. We also need to default to not setting the new flag unless we have confirmed through these new env vars that the deployment supports it; if it does, then ENABLE_LEGACY_EPHEMERAL_RUNNER: false.
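
A minimal sketch of that default-to-legacy rule, assuming the proposed variables exist with exactly these names and values. `legacy_ephemeral_value` is a hypothetical helper, and the 3.3 minimum is the version guessed at in this thread; nothing here reflects what the controller actually does.

```shell
#!/usr/bin/env bash
# Sketch only: prints the value ENABLE_LEGACY_EPHEMERAL_RUNNER would take,
# given the proposed GITHUB_TYPE and GITHUB_SERVER_VERSION variables.
# legacy_ephemeral_value is a hypothetical name; 3.3 is an assumed minimum.
legacy_ephemeral_value() {
  local major minor rest part

  case "${GITHUB_TYPE:-}" in
    CLOUD)
      echo "false"
      return 0
      ;;
    SERVER)
      ;;
    *)
      # Unknown environment: stay on the legacy flag.
      echo "true"
      return 0
      ;;
  esac

  # Require at least "major.minor"; anything else is treated as unconfirmed.
  case "${GITHUB_SERVER_VERSION:-}" in
    *.*) ;;
    *) echo "true"; return 0 ;;
  esac
  major="${GITHUB_SERVER_VERSION%%.*}"
  rest="${GITHUB_SERVER_VERSION#*.}"
  minor="${rest%%.*}"
  for part in "${major}" "${minor}"; do
    case "${part}" in
      ''|*[!0-9]*) echo "true"; return 0 ;;
    esac
  done

  if [ "${major}" -gt 3 ] || { [ "${major}" -eq 3 ] && [ "${minor}" -ge 3 ]; }; then
    echo "false"
  else
    echo "true"
  fi
}
```

The design choice is that every unconfirmed case (unset type, missing or malformed version) falls back to `true`, so the new flag is only used when support has been positively established.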

@rofafor
Contributor Author

rofafor commented Nov 7, 2021

Acknowledged. Closing this one.

@rofafor rofafor closed this Nov 7, 2021
@mumoshu
Collaborator

mumoshu commented Nov 9, 2021

Thanks, everyone! Please let me try adopting this, perhaps by adding the server-side of this.

@mumoshu mumoshu reopened this Nov 9, 2021
@mumoshu mumoshu added this to the v0.21.0 milestone Nov 9, 2021
@mumoshu
Collaborator

mumoshu commented Nov 10, 2021

This will resolve issues like #694 if done correctly!

@toast-gear
Collaborator

toast-gear commented Nov 12, 2021

#941 is worth highlighting: this is what will happen if we set the new runner ephemeral flag by default without knowing whether it's safe to do so.

@mumoshu
Collaborator

mumoshu commented Dec 12, 2021

@toast-gear Given GHES 3.3 has been released recently, would it be safe to merge this as-is after some grace period (a month or two, maybe)?

@toast-gear
Collaborator

toast-gear commented Dec 27, 2021

I think once 3.3 has been out for a while, something like 2 or 3 months, we should probably consider dropping the logic for setting the old flag entirely and supporting only the new flag. (It's been a month so far, https://docs.github.com/en/[email protected]/admin/release-notes, and lots of people will probably upgrade their GHES instances over Dec and Jan, as code freezes are fairly common at this time of year in regulated environments.) Baselining onto GHES 3.3 as a minimum seems reasonable to me, will help keep the codebase maintainable and as simple as possible, and would be my preferred approach.

If we do want to keep the logic for the old flag (--once), then once 3.3 has been out for the same suggested period and the ability to disable auto-update is released, I think we should switch to the new flag as the default and rename the environment variable to something that matches what it actually does now, maybe USE_LEGACY_EPHEMERAL_FLAG (I prefer this) or USE_ONCE_EPHEMERAL_FLAG? There are people who want --once to stick around (actions/runner#1339); however, their reasoning for keeping it can be worked around with existing features in actions-runner-controller or with multiple deployments of actions-runner-controller. GitHub are also aware of that reasoning and are sorting it out on their end, as they ultimately wish to drop the --once flag. (I'm an advocate of just removing the logic for the --once flag entirely, though perhaps that's a bit aggressive given we can just hide it behind an env var.)

@toast-gear toast-gear modified the milestones: v0.21.0, v0.22.0 Jan 6, 2022
@crackedupcorson

So I was trying to set the runners to use "--ephemeral" instead of "--once" as we kept seeing the race condition.
The guide doesn't make it clear that you need GHE 3.3 or later.
I ran into this issue, and it took me quite a while to figure out that it's a GHE versioning issue.

We're still running 3.2.6

I was running
runner 250 133 0 09:12 ? 00:00:00 /bin/bash ./config.sh --unattended --replace --name redacted-app-runners-ls9rf-kv4bm --url https://git.redacted.com/redacted --token BLAH --runnergroup --labels redacted-nonprod-apps --work /runner/_work --ephemeral
runner 283 250 71 09:12 ? 00:00:00 ./bin/Runner.Listener configure --unattended --replace --name redacted-app-runners-ls9rf-kv4bm --url https://git.redacted.com/redacted --token BLAG --runnergroup --labels redacted-nonprod-apps --work /runner/_work --ephemeral

And saw this:
tail /runner/_diag/Runner_20220204-091406-utc.log
   at GitHub.Services.WebApi.VssHttpClientBase.HandleResponseAsync(HttpResponseMessage response, CancellationToken cancellationToken)
   at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
   at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpRequestMessage message, Object userState, CancellationToken cancellationToken)
   at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken)
   at GitHub.Runner.Listener.Configuration.ConfigurationManager.ConfigureAsync(CommandSettings command)
   at GitHub.Runner.Listener.Runner.ExecuteCommand(CommandSettings command)
[2022-02-04 09:14:10Z ERR Runner] #####################################################
[2022-02-04 09:14:10Z ERR Runner] System.Exception: An Internal Error Occurred. Activity Id: 3f74acf1-65a3-40bd-8615-0d7badde284b.
[2022-02-04 09:14:10Z ERR Terminal] WRITE ERROR: An Internal Error Occurred. Activity Id: 3f74acf1-65a3-40bd-8615-0d7badde284b.
[2022-02-04 09:14:10Z INFO Listener] Runner execution has finished with return code 1

Can the docs be updated to clarify this as a caveat?

@toast-gear
Collaborator

toast-gear commented Feb 4, 2022

We don't document GitHub's documentation. The README already says that you need 3.3 https://github.com/actions-runner-controller/actions-runner-controller#github-enterprise-support to use ARC, we have baselined onto 3.3 and so you need to upgrade to 3.3.

In addition, we are going to consider moving to the new flag by default soon, so upgrading to 3.3 is going to be critical. This PR has been held up by the need for GHES to be on 3.3, so we needed to give people time; 3 months, including the Dec/Jan period where code freezes are common, is enough time. Please do organise upgrading as soon as possible. We will retain the --once flag for backwards compatibility with older installations; however, once we have moved to the new flag, we will assume GHES 3.3 or greater for any changes made going forward.

Also of note: the next release of the controller removes the registration-only runners, as they aren't needed on 3.3, so if scaling from 0 is a feature you use, you will need to upgrade to 3.3 to keep using it on the next release.

@crackedupcorson

My apologies, I had only scanned the readme when I was upgrading the chart. I'll be more attentive in future.
I had done a POC before Christmas using this chart, and was firmly sold on it but put it aside for higher priority work.
I've started working on it again, and managed to get webhooks working on 3.2.6 (but have had a lot of noise and inconsistency with the ephemerality of the runners).

Our GHE instance is maintained by a central team, so I will pass your feedback on to them. Currently we're only using organization and repo level runners (hosted by various SRE teams), but if we do end up using enterprise runners in the future, I'll provide feedback if it helps.

@toast-gear
Collaborator

toast-gear commented Feb 4, 2022

My apologies, I had only scanned the readme when I was upgrading the chart

It's no worries. The solution technically does support < 3.3, just not all the features; however, the maintainers don't get loads of time to work on ARC, so we need to be practical about versioning. We also don't have an enterprise account (GHES or GHEC), so ARC is far easier to maintain in general if we are strict on version support.

In general, when new features come out which require a newer GHES version, we will baseline onto that version after what we consider to be a reasonable time frame, so it's important your organisation keeps on top of upgrading your GHES version. We won't outright remove stuff until quite a while has passed (e.g. the --once flag), but we will change the defaults to match the baseline after a reasonable amount of time has passed, so keeping on top of upgrades is key.

@crackedupcorson

We've been encouraged to contribute to OSS a bit more, and I do know golang & k8s, so I'll happily contribute a bit once I've a few personal projects off my plate.

@toast-gear
Collaborator

@crackedupcorson that's great to hear pal, PRs are welcome, a good first issue is #728 for someone familiar with golang & k8s

@mumoshu mumoshu force-pushed the master branch 2 times, most recently from ac017f0 to 25570a0 Compare March 3, 2022 02:05
@toast-gear
Collaborator

toast-gear commented Mar 15, 2022

@rofafor, we appreciate the PR and the work you put into it; however, we've decided to roll out the ephemeral runners via the next controller release instead (0510897, #1189). We are subsequently going to remove the --once flag entirely (#1196), as we can't see a reason to keep it, to be honest. What this means is your PR won't get merged and can be closed. Again, we appreciate the work; if we didn't have GHES to worry about we would have merged this a long time ago 😅.

@rofafor
Contributor Author

rofafor commented Mar 15, 2022

@toast-gear, I haven't been able to upgrade to GHES 3.3 yet, so this PR has been more or less abandoned on my side; no harm done.

@rofafor rofafor closed this Mar 15, 2022
@rofafor rofafor deleted the feature/ephemeral_runner branch March 15, 2022 20:08
@mumoshu
Collaborator

mumoshu commented Mar 16, 2022

@rofafor This has triggered several important discussions about when and how we should roll it out and without this we would have ended up breaking many things on the way. So, thank you! I appreciate your work ☺️

Also, I hope you'll soon be able to upgrade to GHES 3.3.

FWIW, static (persistent) runners are also improved in the next version of ARC, 0.22.0. If you see issues with --once, you can also try falling back to static runners, because they are more reliable than before.
