Run Dagger Engine as a Fly Machine (no more Docker) #471
Merged
gerhard
merged 6 commits into
thechangelog:master
from
gerhard:dagger-engine-on-fly-machines
Aug 4, 2023
68b3ef4
to
499d19b
Compare
gerhard
added a commit
to gerhard/changelog.com
that referenced
this pull request
Jul 31, 2023
As you already know, we use Dagger for CI/CD. By default, this runs on Fly.io (via Docker). In some cases, this can fail. The last failure was when DNS resolution stopped working after the Docker instance was auto-upgraded from apps v1 -> v2 (a.k.a. Fly.io machines), e.g. https://github.com/thechangelog/changelog.com/actions/runs/5673476702/attempts/1

As a temporary fix, we had to delete some secrets and re-run the job. The job ran on GHA free runners & failed for genuine reasons 6 mins later: https://github.com/thechangelog/changelog.com/actions/runs/5673476702/job/15395264391

While running on the free GHA runners can be 3x-8x slower, it's a good fall-back. You heard us mention on multiple occasions: "always have redundancies in place". Since we already have multiple CI runtimes in place (Fly.io, K8s), let's make our GHA workflow resilient:

- Run on our preferred back-end by default (Dagger on Fly.io)
  - ✅ If it succeeds, we are done
  - ❌ If it fails, fall back to running on the free GitHub runners
- In forks, use free GitHub runners by default (we cannot share `secrets`)

While this means that a workflow which fails for genuine reasons will fail twice for us (1. Dagger on Fly.io, 2. Dagger on GitHub), it seems like a better place to improve from.

This change goes one step further. We are using a third back-end: Dagger on K8s. This uses a self-hosted GitHub runner on K8s which is already integrated with Dagger. For now, we are using it just to see how the CI part compares to our primary setup (Dagger on Fly.io). We are not using Dagger on K8s to deploy the app. Let's see how this setup behaves over a few weeks/months before we consider taking it further.

As part of this, we also improved how we check for Fly.io connectivity.

Things that could be improved in follow-ups:

- the workflow should succeed if the `dagger-on-github-fallback` job succeeds
- currently it fails if `dagger-on-fly-docker` fails
- add `dagger-on-k8s` job as secondary fallback
- GitHub Actions is currently missing actions/runner#1665
- maybe leverage a Dagger cache that works in forks too 😉
- Run Dagger Engine as a Fly Machine (no more Docker) - thechangelog#471

Signed-off-by: Gerhard Lazu <[email protected]>
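The fallback order described in this commit message can be sketched in Go. This is only an illustration of the decision logic, not the actual workflow (which is a GitHub Actions YAML file); `backends`, `runWithFallback`, `errUnreachable` and the `"fly"`/`"github"` labels are hypothetical names introduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable is a hypothetical error used to simulate a back-end outage.
var errUnreachable = errors.New("back-end unreachable")

// backends returns the CI back-ends to try, in order of preference.
// The labels are illustrative, not the workflow's real job names.
func backends(isFork bool) []string {
	if isFork {
		// Forks cannot read repository secrets, so only the free
		// GitHub runners are usable.
		return []string{"github"}
	}
	return []string{"fly", "github"}
}

// runWithFallback runs the job on each back-end in turn and returns the
// name of the first one that succeeds. If all fail, the errors are joined.
func runWithFallback(order []string, run func(backend string) error) (string, error) {
	var errs []error
	for _, b := range order {
		err := run(b)
		if err == nil {
			return b, nil
		}
		errs = append(errs, fmt.Errorf("%s: %w", b, err))
	}
	return "", errors.Join(errs...)
}

func main() {
	// Simulate the preferred back-end being unavailable.
	run := func(backend string) error {
		if backend == "fly" {
			return errUnreachable
		}
		return nil
	}
	used, err := runWithFallback(backends(false), run)
	fmt.Println("ran on:", used, "error:", err)
}
```

Note that a job which fails for genuine reasons fails on every back-end in the list, which matches the "fail twice" trade-off mentioned above.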
499d19b
to
9f2cb95
Compare
+ includes `mage ci` benchmarks for various VM sizes + stress-tested Fly.io apps v2 by creating volumes & machines at least 30 times across several days (all worked as expected!) Signed-off-by: Gerhard Lazu <[email protected]>
Rather than opening multiple connections from each task when they run in parallel, open one connection and share it via Go's context. Signed-off-by: Gerhard Lazu <[email protected]>
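A minimal sketch of the pattern this commit describes: open one connection up front and hand it to parallel tasks through a `context.Context`. The `engineConn` type and the `dial`, `withConn`, `connFrom`, `sharedContext` and `task` helpers are hypothetical stand-ins, not the code from this PR or the Dagger SDK.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// engineConn is a hypothetical stand-in for a connection to the Dagger Engine.
type engineConn struct{ id int }

// connKey is an unexported key type, so no other package can collide with it.
type connKey struct{}

// dials counts opened connections, for illustration only.
var dials int

// dial simulates opening a new connection.
func dial() *engineConn {
	dials++
	return &engineConn{id: dials}
}

// withConn stores the connection in a derived context.
func withConn(ctx context.Context, c *engineConn) context.Context {
	return context.WithValue(ctx, connKey{}, c)
}

// connFrom retrieves the shared connection, if any.
func connFrom(ctx context.Context) (*engineConn, bool) {
	c, ok := ctx.Value(connKey{}).(*engineConn)
	return c, ok
}

// sharedContext dials once and returns a context carrying that connection.
func sharedContext() context.Context {
	return withConn(context.Background(), dial())
}

// task reuses the connection found in ctx instead of dialing again.
func task(ctx context.Context, name string) string {
	c, ok := connFrom(ctx)
	if !ok {
		return name + ": no connection"
	}
	return fmt.Sprintf("%s: using connection %d", name, c.id)
}

func main() {
	ctx := sharedContext() // one connection for all tasks

	var wg sync.WaitGroup
	for _, name := range []string{"test", "build"} {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			fmt.Println(task(ctx, n))
		}(name)
	}
	wg.Wait()
}
```

Using an unexported struct type as the context key is the usual way to avoid key collisions between packages.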
20c38d1
to
877c27d
Compare
Signed-off-by: Gerhard Lazu <[email protected]>
If multiple instances of this run in parallel, one of them will fail in various non-obvious ways. Since resolving deps now has a common series of steps, Dagger is able to de-duplicate the ops and run it just once, before the two pipelines diverge. Running this in the new inline TUI - `export _EXPERIMENTAL_DAGGER_TUI=inline` - makes this easy to understand. Signed-off-by: Gerhard Lazu <[email protected]>
b0b207b
to
45f854f
Compare
There is a new GHA workflow - dagger_on_fly.yml - that makes use of this new Engine.

There are also two new mage targets that manage the remote Dagger Engine. We use them to start the remote Dagger Engine in the pipeline, on-demand, and stop it when the job finishes running. If the job fails before it attempts to stop the Engine, the Engine will remain running. This is exactly what we want since re-running failed jobs should be quicker (only by 20s, Dagger Engines are quick to start on Fly.io).

The other interesting aspect of this is that we have a primary Engine, and a secondary one in case the primary one fails (regardless of the reason). Yes, always run 2 of everything™.

Fly-by improvement: bump direnv to latest.

Signed-off-by: Gerhard Lazu <[email protected]>
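The lifecycle described in this commit (start on demand, stop only after a successful job, use the secondary Engine when the primary cannot start) can be sketched as follows. This is a simulation under stated assumptions: the `engine` type, its `start`/`stop` methods, `runJob`, `errStart` and `errJob` are hypothetical and do not reflect the real mage targets or the Fly.io API.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical errors used by the simulation.
var (
	errStart = errors.New("engine failed to start")
	errJob   = errors.New("job failed")
)

// engine is a hypothetical handle for a remote Dagger Engine machine.
type engine struct {
	name    string
	healthy bool // whether start succeeds in this simulation
	running bool
}

func (e *engine) start() error {
	if !e.healthy {
		return fmt.Errorf("%s: %w", e.name, errStart)
	}
	e.running = true
	return nil
}

func (e *engine) stop() { e.running = false }

// runJob starts the first engine that comes up (primary first, then
// secondary), runs the job on it, and stops it only if the job succeeds.
func runJob(engines []*engine, job func(*engine) error) (*engine, error) {
	var started *engine
	var errs []error
	for _, e := range engines {
		if err := e.start(); err != nil {
			errs = append(errs, err)
			continue
		}
		started = e
		break
	}
	if started == nil {
		return nil, errors.Join(errs...)
	}
	if err := job(started); err != nil {
		// Leave the engine running so a re-run skips the start-up time.
		return started, err
	}
	started.stop()
	return started, nil
}

func main() {
	primary := &engine{name: "primary", healthy: false}
	secondary := &engine{name: "secondary", healthy: true}

	used, err := runJob([]*engine{primary, secondary}, func(*engine) error { return nil })
	fmt.Println("ran on:", used.name, "still running:", used.running, "error:", err)
}
```

Returning early on job failure, without calling `stop`, is what leaves the Engine warm for the re-run.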
dagger/dagger#5535 Signed-off-by: Gerhard Lazu <[email protected]>
45f854f
to
d728bc2
Compare
dagger-on-fly.mp4
This was mentioned in the context of #452 - here is the exact point in the episode transcript https://changelog.com/friends/2#transcript-58. This PR goes one step further and it:
- `FLY_DNS_SERVER`
- `DAGGER_ENGINE_HOST`
- `DAGGER_ENGINE_HOST_PORT`
- `FLY_PRIMARY_DAGGER_ENGINE_MACHINE_ID`
- `FLY_SECONDARY_DAGGER_ENGINE_MACHINE_ID`
Find out more, including how we mitigate Fly.io machine SPOF, from `fly.io/dagger-engine-2023-05-20/README.md`.

This PR introduces a couple of extra improvements (boy scout at heart):
- `mix deps.get` concurrency issue
- `fly:daggerStart` & `fly:daggerStop`
Follow-ups:
- `DOCKER_ENGINE_HOST` from Settings > Actions > Variables
- `DOCKER_ENGINE_HOST_FQDN` from Settings > Actions > Variables
- ~24h