Run Dagger Engine as a Fly Machine (no more Docker) #471

gerhard · 2023-07-02T11:26:37Z

dagger-on-fly.mp4

This was mentioned in the context of #452 - here is the exact point in the episode transcript https://changelog.com/friends/2#transcript-58. This PR goes one step further and it:

- Adds instructions on how to provision a Dagger Engine on Fly.io
  - notice the impact of various instance sizes on our pipeline run duration (spoiler alert: biggest is not best)
- Integrates it with GitHub Actions
  - notice the new Settings > Actions > Variables
    - FLY_DNS_SERVER
    - DAGGER_ENGINE_HOST
    - DAGGER_ENGINE_HOST_PORT
    - FLY_PRIMARY_DAGGER_ENGINE_MACHINE_ID
    - FLY_SECONDARY_DAGGER_ENGINE_MACHINE_ID
  - notice how we use the Dagger Engine running in GitHub Actions to manage a remote Dagger Engine on Fly.io
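
To make the role of `DAGGER_ENGINE_HOST` and `DAGGER_ENGINE_HOST_PORT` more concrete, here is a minimal Go sketch of how the two variables could be combined into a remote Engine address. This is an assumption, not the code from this PR. The `runnerHost` helper is hypothetical, and it relies on `_EXPERIMENTAL_DAGGER_RUNNER_HOST` with a `tcp://` address being the way Dagger is pointed at a remote Engine.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// runnerHost is a hypothetical helper: it joins host and port into a
// tcp:// address. net.JoinHostPort also brackets IPv6 literals.
func runnerHost(host, port string) (string, error) {
	if host == "" || port == "" {
		return "", fmt.Errorf("DAGGER_ENGINE_HOST and DAGGER_ENGINE_HOST_PORT must both be set")
	}
	return "tcp://" + net.JoinHostPort(host, port), nil
}

func main() {
	addr, err := runnerHost(os.Getenv("DAGGER_ENGINE_HOST"), os.Getenv("DAGGER_ENGINE_HOST_PORT"))
	if err != nil {
		// Without both variables we would keep using the local Engine.
		fmt.Println("remote engine not configured:", err)
		return
	}
	// Assumption: this is the variable Dagger reads to find a remote Engine.
	if err := os.Setenv("_EXPERIMENTAL_DAGGER_RUNNER_HOST", addr); err != nil {
		fmt.Println("could not set runner host:", err)
		return
	}
	fmt.Println("using remote engine at", addr)
}
```

Using `net.JoinHostPort` rather than plain string concatenation keeps the address valid if the host ever resolves to an IPv6 literal.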

Find out more, including how we mitigate Fly.io machine SPOF, from fly.io/dagger-engine-2023-05-20/README.md.

This PR introduces a couple of extra improvements (boy scout at heart):

- Share the same Dagger Client across all mage tasks
- Resolve the mix deps.get concurrency issue
- Add mage targets for managing our remote Dagger Engine - fly:daggerStart & fly:daggerStop

Follow-ups:

- Remove DOCKER_ENGINE_HOST from Settings > Actions > Variables
- Remove DOCKER_ENGINE_HOST_FQDN from Settings > Actions > Variables
- Delete https://fly.io/apps/docker-2022-06-13 in ~24h

As you already know, we use Dagger for CI/CD. By default, this runs on Fly.io (via Docker). In some cases, this can fail. The last failure was when DNS resolution stopped working after the Docker instance was auto-upgraded from apps v1 -> v2 (a.k.a. Fly.io machines), e.g. https://github.com/thechangelog/changelog.com/actions/runs/5673476702/attempts/1

As a temporary fix, we had to delete some secrets and re-run the job. The job ran on GHA free runners & failed for genuine reasons 6 mins later: https://github.com/thechangelog/changelog.com/actions/runs/5673476702/job/15395264391

While running on the free GHA runners can be 3x-8x slower, it's a good fall-back. You heard us mention on multiple occasions: "always have redundancies in place". Since we already have multiple CI runtimes in place (Fly.io, K8s), let's make our GHA workflow resilient:

- Run on our preferred back-end by default (Dagger on Fly.io)
  - ✅ If it succeeds, we are done
  - ❌ If it fails, fall back to running on the free GitHub runners
- In forks, use free GitHub runners by default (we cannot share `secrets`)

While this means that a workflow which fails for genuine reasons will fail twice for us (1. Dagger on Fly.io, 2. Dagger on GitHub), it seems like a better place to improve from.

This change goes one step further. We are using a third back-end: Dagger on K8s. This uses a self-hosted GitHub runner on K8s which is already integrated with Dagger. For now, we are using it just to see how the CI part compares to our primary setup (Dagger on Fly.io). We are not using Dagger on K8s to deploy the app. Let's see how this setup behaves over a few weeks/months before we consider taking it further.

As part of this, we also improved how we check for Fly.io connectivity.

Things that could be improved in follow-ups:

- the workflow should succeed if the `dagger-on-github-fallback` job succeeds
  - currently it fails if `dagger-on-fly-docker` fails
- add `dagger-on-k8s` job as secondary fallback
  - GitHub Actions is currently missing actions/runner#1665
- maybe leverage a Dagger cache that works in forks too 😉
- Run Dagger Engine as a Fly Machine (no more Docker) - thechangelog#471

Signed-off-by: Gerhard Lazu <[email protected]>


+ includes `mage ci` benchmarks for various VM sizes
+ stress-tested Fly.io apps v2 by creating volumes & machines at least 30 times across several days (all worked as expected!)

Signed-off-by: Gerhard Lazu <[email protected]>


If multiple instances of `mix deps.get` run in parallel, one of them will fail in various non-obvious ways. Since resolving deps now has a common series of steps, Dagger is able to de-duplicate the ops and run them just once, before the two pipelines diverge. Running this in the new inline TUI - `export _EXPERIMENTAL_DAGGER_TUI=inline` - makes this easy to understand. Signed-off-by: Gerhard Lazu <[email protected]>
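
As an analogy for "run the shared step once, before the pipelines diverge", here is a small Go sketch that uses `sync.Once`. This is not how Dagger de-duplicates ops internally. It only shows the effect we rely on: two parallel pipelines ask for the same step and it executes a single time. The `deps` type and `runPipelines` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// deps models a shared "resolve dependencies" step.
type deps struct {
	once sync.Once
	runs int    // how many times the step actually executed
	out  string // result shared by every caller
}

// resolve runs the step on first use; later callers wait for it and reuse the result.
func (d *deps) resolve() string {
	d.once.Do(func() {
		d.runs++
		d.out = "deps resolved"
	})
	return d.out
}

// runPipelines starts one goroutine per pipeline; each needs the deps first.
func runPipelines(d *deps, names ...string) []string {
	results := make([]string, len(names))
	var wg sync.WaitGroup
	for i, name := range names {
		wg.Add(1)
		go func(i int, name string) {
			defer wg.Done()
			results[i] = name + ": " + d.resolve()
		}(i, name)
	}
	wg.Wait()
	return results
}

func main() {
	d := &deps{}
	fmt.Println(runPipelines(d, "test", "build"))
	fmt.Println("deps step executed", d.runs, "time(s)")
}
```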

There is a new GHA workflow - dagger_on_fly.yml - that makes use of this new Engine. There are also two new mage targets that manage the remote Dagger Engine. We use them to start the remote Dagger Engine in the pipeline, on-demand, and stop it when the job finishes running. If the job fails before it attempts to stop the Engine, the Engine will remain running. This is exactly what we want since re-running failed jobs should be quicker (only by 20s, Dagger Engines are quick to start on Fly.io). The other interesting aspect of this is that we have a primary Engine, and a secondary one in case the primary one fails (regardless of the reason). Yes, always run 2 of everything™. Fly-by improvement: bump direnv to latest. Signed-off-by: Gerhard Lazu <[email protected]>


gerhard changed the title ~~Run Dagger Engine on Fly Machines (a.k.a. Apps v2)~~ Run Dagger Engines on Fly Machines (a.k.a. Apps v2) Jul 2, 2023

gerhard force-pushed the dagger-engine-on-fly-machines branch 7 times, most recently from 68b3ef4 to 499d19b Compare July 2, 2023 17:04

gerhard mentioned this pull request Jul 30, 2023

Make our ship_it.yml GHA workflow resilient #476

Merged

gerhard changed the title ~~Run Dagger Engines on Fly Machines (a.k.a. Apps v2)~~ Run Dagger Engine as a Fly Machine (no more Docker) Jul 31, 2023

gerhard force-pushed the dagger-engine-on-fly-machines branch from 499d19b to 9f2cb95 Compare July 31, 2023 09:09

gerhard added 2 commits August 2, 2023 10:28

Share the same Dagger Client across all mage tasks

a19f94a

Rather than opening multiple connections from each task when they run in parallel, open one connection and share it via Go's context. Signed-off-by: Gerhard Lazu <[email protected]>

gerhard force-pushed the dagger-engine-on-fly-machines branch 6 times, most recently from 20c38d1 to 877c27d Compare August 2, 2023 11:11

gerhard added 2 commits August 2, 2023 13:25

Bump Node.js to latest v18

af5b62b

Signed-off-by: Gerhard Lazu <[email protected]>

gerhard force-pushed the dagger-engine-on-fly-machines branch 2 times, most recently from b0b207b to 45f854f Compare August 3, 2023 10:03

gerhard marked this pull request as ready for review August 3, 2023 10:10

gerhard added 2 commits August 3, 2023 11:25

Fix OCI image labels

d728bc2

dagger/dagger#5535 Signed-off-by: Gerhard Lazu <[email protected]>

gerhard force-pushed the dagger-engine-on-fly-machines branch from 45f854f to d728bc2 Compare August 3, 2023 10:25

gerhard merged commit 60664a2 into thechangelog:master Aug 4, 2023
5 checks passed

gerhard deleted the dagger-engine-on-fly-machines branch August 4, 2023 07:12
