Run Dagger Engine as a Fly Machine (no more Docker) #471

Merged
merged 6 commits into from
Aug 4, 2023

Conversation

gerhard
Member

@gerhard gerhard commented Jul 2, 2023

dagger-on-fly.mp4

This was mentioned in the context of #452 - here is the exact point in the episode transcript https://changelog.com/friends/2#transcript-58. This PR goes one step further and it:

  • Adds instructions on how to provision a Dagger Engine on Fly.io
    • notice the impact of various instance sizes on our pipeline run duration (spoiler alert: biggest is not best)
  • Integrates it with GitHub Actions
    • notice the new Settings > Actions > Variables
      • FLY_DNS_SERVER
      • DAGGER_ENGINE_HOST
      • DAGGER_ENGINE_HOST_PORT
      • FLY_PRIMARY_DAGGER_ENGINE_MACHINE_ID
      • FLY_SECONDARY_DAGGER_ENGINE_MACHINE_ID
    • notice how we use the Dagger Engine running in GitHub Actions to manage a remote Dagger Engine on Fly.io

Find out more, including how we mitigate Fly.io machine SPOF, from fly.io/dagger-engine-2023-05-20/README.md.


This PR introduces a couple of extra improvements (boy scout at heart):

  • Share the same Dagger Client across all mage tasks
  • Resolve the mix deps.get concurrency issue
  • Add mage targets for managing our remote Dagger Engine - fly:daggerStart & fly:daggerStop

Follow-ups:

@gerhard gerhard changed the title from "Run Dagger Engine on Fly Machines (a.k.a. Apps v2)" to "Run Dagger Engines on Fly Machines (a.k.a. Apps v2)" Jul 2, 2023
@gerhard gerhard force-pushed the dagger-engine-on-fly-machines branch 7 times, most recently from 68b3ef4 to 499d19b Compare July 2, 2023 17:04
gerhard added a commit to gerhard/changelog.com that referenced this pull request Jul 31, 2023
As you already know, we use Dagger for CI/CD. By default, this runs on
Fly.io (via Docker). In some cases, this can fail.

The last failure was when DNS resolution stopped working after the
Docker instance was auto-upgraded from apps v1 -> v2 (a.k.a. Fly.io
machines), e.g.
https://github.com/thechangelog/changelog.com/actions/runs/5673476702/attempts/1

As a temporary fix, we had to delete some secrets and re-run the job.
The job ran on GHA free runners & failed for genuine reasons
6 mins later:
https://github.com/thechangelog/changelog.com/actions/runs/5673476702/job/15395264391

While running on the free GHA runners can be 3x-8x slower, it's a good
fall-back. You have heard us mention this on multiple occasions: "always have
redundancies in place". Since we already have multiple CI runtimes in
place (Fly.io, K8s), let's make our GHA workflow resilient by:
- Run on our preferred back-end by default (Dagger on Fly.io)
  - ✅ If it succeeds, we are done
  - ❌ If it fails, fall back to running on the free GitHub runners
- In forks, use free GitHub runners by default (we cannot share `secrets`)

While this means that a workflow which fails for genuine reasons will
fail twice for us (1. Dagger on Fly.io, 2. Dagger on GitHub), it seems
like a better place to improve from.
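The fallback order described above can be sketched in Go as follows. This is only an illustration of the decision logic, not the workflow itself: the real implementation lives in the GHA workflow, and the `plan` and `runWithFallback` helpers and the `errBackendDown` error are hypothetical names. The two constants reuse the job names mentioned in the follow-ups below.

```go
package main

import (
	"errors"
	"fmt"
)

// backend names a CI runtime. The values reuse the workflow job names.
type backend string

const (
	flyBackend    backend = "dagger-on-fly-docker"
	githubBackend backend = "dagger-on-github-fallback"
)

// errBackendDown is a hypothetical error used to simulate an outage.
var errBackendDown = errors.New("back-end unavailable")

// plan returns the back-ends to try, in order.
// Forks cannot read secrets, so they only get the free GitHub runners.
func plan(isFork bool) []backend {
	if isFork {
		return []backend{githubBackend}
	}
	return []backend{flyBackend, githubBackend}
}

// runWithFallback tries each back-end in order and stops at the first success.
// It returns the back-end that succeeded, the number of attempts, and the
// last error if every attempt failed.
func runWithFallback(order []backend, run func(backend) error) (backend, int, error) {
	err := errors.New("no back-end configured")
	for i, b := range order {
		if err = run(b); err == nil {
			return b, i + 1, nil
		}
	}
	return "", len(order), err
}

func main() {
	// Simulate an infrastructure failure: Fly.io is down, GitHub works.
	used, attempts, err := runWithFallback(plan(false), func(b backend) error {
		if b == flyBackend {
			return errBackendDown
		}
		return nil
	})
	fmt.Println(used, attempts, err)
}
```

Note that a run which fails for genuine reasons exhausts both back-ends, which is the "fails twice" trade-off mentioned above.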

This change goes one step further. We are using a third back-end: Dagger
on K8s. This uses a self-hosted GitHub runner on K8s which is already
integrated with Dagger. For now, we are using it just to see how the CI
part compares to our primary setup (Dagger on Fly.io). We are not using
Dagger on K8s to deploy the app. Let's see how this setup behaves over a
few weeks/months before we consider taking it further.

As part of this, we also improved how we check for Fly.io connectivity.
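The check itself is not described in this commit message. As a minimal sketch, assuming it boils down to "can we open a TCP connection to the Engine", it could look like the Go below. The `engineReachable` helper and the 3 second timeout are assumptions; `DAGGER_ENGINE_HOST` and `DAGGER_ENGINE_HOST_PORT` are the Actions variables listed in the PR description.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// dialTimeout is an assumed value, not taken from the workflow.
const dialTimeout = 3 * time.Second

// engineReachable reports whether a TCP connection to host:port can be opened.
// It returns nil on success and a wrapped error otherwise.
func engineReachable(host, port string) error {
	addr := net.JoinHostPort(host, port)
	conn, err := net.DialTimeout("tcp", addr, dialTimeout)
	if err != nil {
		return fmt.Errorf("dagger engine unreachable at %s: %w", addr, err)
	}
	return conn.Close()
}

func main() {
	host := os.Getenv("DAGGER_ENGINE_HOST")
	port := os.Getenv("DAGGER_ENGINE_HOST_PORT")
	if err := engineReachable(host, port); err != nil {
		// In the workflow, this is the point where a fallback would kick in.
		fmt.Println("connectivity check failed:", err)
		return
	}
	fmt.Println("engine reachable")
}
```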

Things that could be improved in follow-ups:
- the workflow should succeed if the `dagger-on-github-fallback` job succeeds
  - currently it fails if `dagger-on-fly-docker` fails
- add `dagger-on-k8s` job as secondary fallback
  - GitHub Actions is currently missing actions/runner#1665
- maybe leverage a Dagger cache that works in forks too 😉
- Run Dagger Engine as a Fly Machine (no more Docker)
  - thechangelog#471

Signed-off-by: Gerhard Lazu <[email protected]>
@gerhard gerhard changed the title from "Run Dagger Engines on Fly Machines (a.k.a. Apps v2)" to "Run Dagger Engine as a Fly Machine (no more Docker)" Jul 31, 2023
gerhard added a commit that referenced this pull request Jul 31, 2023
@gerhard gerhard force-pushed the dagger-engine-on-fly-machines branch from 499d19b to 9f2cb95 Compare July 31, 2023 09:09
+ includes `mage ci` benchmarks for various VM sizes
+ stress-tested Fly.io apps v2 by creating volumes & machines at least
  30 times across several days (all worked as expected!)

Signed-off-by: Gerhard Lazu <[email protected]>
Rather than opening multiple connections from each task when they run in
parallel, open one connection and share it via Go's context.

Signed-off-by: Gerhard Lazu <[email protected]>
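A minimal sketch of the pattern, using only the standard library: `engineClient` stands in for the real Dagger client, and `connect`, `withClient`, `clientFrom`, `newSharedContext` and `runTasks` are hypothetical helpers, not the names used in the magefiles.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// engineClient stands in for the real Dagger client (hypothetical type).
type engineClient struct{ id int }

// clientKey is an unexported key type, so no other package can collide with it.
type clientKey struct{}

// connections counts how often connect ran, for demonstration only.
var connections int

// connect simulates opening one connection to the Engine.
func connect() *engineClient {
	connections++
	return &engineClient{id: connections}
}

// withClient stores the client in the context.
func withClient(ctx context.Context, c *engineClient) context.Context {
	return context.WithValue(ctx, clientKey{}, c)
}

// clientFrom retrieves the shared client, if one was stored.
func clientFrom(ctx context.Context) (*engineClient, bool) {
	c, ok := ctx.Value(clientKey{}).(*engineClient)
	return c, ok
}

// newSharedContext opens a single connection and wraps it in a context.
func newSharedContext() context.Context {
	return withClient(context.Background(), connect())
}

// runTasks runs the named tasks in parallel and records which client each used.
func runTasks(ctx context.Context, names ...string) []int {
	ids := make([]int, len(names))
	var wg sync.WaitGroup
	for i := range names {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if c, ok := clientFrom(ctx); ok {
				ids[i] = c.id // every task sees the same client
			}
		}(i)
	}
	wg.Wait()
	return ids
}

func main() {
	ctx := newSharedContext()
	fmt.Println(runTasks(ctx, "test", "build"), "connections:", connections)
}
```

The unexported key type is the usual way to keep context values private to one package.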
@gerhard gerhard force-pushed the dagger-engine-on-fly-machines branch 6 times, most recently from 20c38d1 to 877c27d Compare August 2, 2023 11:11
Signed-off-by: Gerhard Lazu <[email protected]>
If multiple instances of `mix deps.get` run in parallel, one of them will fail in
various non-obvious ways. Since resolving deps now has a common series
of steps, Dagger is able to de-duplicate the ops and run them just once,
before the two pipelines diverge.

Running this in the new inline TUI - `export
_EXPERIMENTAL_DAGGER_TUI=inline` - makes this easy to understand.

Signed-off-by: Gerhard Lazu <[email protected]>
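To make the de-duplication idea concrete, here is a small Go analogy. It is not how the Dagger Engine is implemented; it only shows the effect of two parallel pipelines sharing an identical first step. The `opCache` type and the `runPipelines` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// opEntry holds the result of one operation and guards it with a sync.Once.
type opEntry struct {
	once   sync.Once
	result string
}

// opCache maps an operation key to its (possibly in-flight) result.
type opCache struct {
	mu      sync.Mutex
	entries map[string]*opEntry
}

func newOpCache() *opCache {
	return &opCache{entries: map[string]*opEntry{}}
}

// do runs fn at most once per key. Concurrent callers with the same key
// wait for the first call to finish and then share its result.
func (c *opCache) do(key string, fn func() string) string {
	c.mu.Lock()
	e, ok := c.entries[key]
	if !ok {
		e = &opEntry{}
		c.entries[key] = e
	}
	c.mu.Unlock()
	e.once.Do(func() { e.result = fn() })
	return e.result
}

// runPipelines starts one goroutine per pipeline. Each begins with the same
// deps step and then diverges. It returns how often the deps step really ran.
func runPipelines(pipelines ...string) int {
	cache := newOpCache()
	var mu sync.Mutex
	runs := 0
	var wg sync.WaitGroup
	for _, p := range pipelines {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			deps := cache.do("mix deps.get", func() string {
				mu.Lock()
				runs++
				mu.Unlock()
				return "deps"
			})
			_ = p + ":" + deps // the pipeline-specific steps would start here
		}(p)
	}
	wg.Wait()
	return runs
}

func main() {
	fmt.Println("deps step executions:", runPipelines("test", "prod"))
}
```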
@gerhard gerhard force-pushed the dagger-engine-on-fly-machines branch 2 times, most recently from b0b207b to 45f854f Compare August 3, 2023 10:03
@gerhard gerhard marked this pull request as ready for review August 3, 2023 10:10
There is a new GHA workflow - dagger_on_fly.yml - that makes use of this
new Engine. There are also two new mage targets that manage the remote
Dagger Engine. We use them to start the remote Dagger Engine in the
pipeline, on-demand, and stop it when the job finishes running. If the
job fails before it attempts to stop the Engine, the Engine will remain
running. This is exactly what we want since re-running failed jobs
should be quicker (only by 20s, since Dagger Engines are quick to start on
Fly.io).

The other interesting aspect of this is that we have a primary Engine,
and a secondary one in case the primary one fails (regardless of the
reason). Yes, always run 2 of everything™.

Fly-by improvement: bump direnv to latest.

Signed-off-by: Gerhard Lazu <[email protected]>
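The start/stop behaviour with a primary and a secondary Engine can be sketched like this. It is a simplified model, not the code behind `fly:daggerStart` and `fly:daggerStop`: the `start`, `stop` and `job` functions are injected stubs, and `startFirstAvailable`, `runJob` and `errMachineDown` are hypothetical names. Only the two `FLY_*_MACHINE_ID` variable names come from the PR description.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errMachineDown is a hypothetical error used to simulate a failed start.
var errMachineDown = errors.New("machine failed to start")

// startFirstAvailable tries the machine IDs in order (primary first) and
// returns the ID of the first machine that starts.
func startFirstAvailable(start func(id string) error, ids ...string) (string, error) {
	err := errors.New("no machine IDs configured")
	for _, id := range ids {
		if id == "" {
			continue // variable not set
		}
		if err = start(id); err == nil {
			return id, nil
		}
	}
	return "", err
}

// runJob starts an Engine, runs the job, and stops the Engine afterwards.
// If the job fails, the Engine is deliberately left running so that a
// re-run does not pay the start-up cost again.
func runJob(start, stop func(id string) error, job func() error, ids ...string) error {
	id, err := startFirstAvailable(start, ids...)
	if err != nil {
		return err
	}
	if err := job(); err != nil {
		return err // no stop: the Engine stays up for the re-run
	}
	return stop(id)
}

func main() {
	start := func(id string) error { fmt.Println("starting", id); return nil }
	stop := func(id string) error { fmt.Println("stopping", id); return nil }
	job := func() error { fmt.Println("running mage ci"); return nil }
	err := runJob(start, stop, job,
		os.Getenv("FLY_PRIMARY_DAGGER_ENGINE_MACHINE_ID"),
		os.Getenv("FLY_SECONDARY_DAGGER_ENGINE_MACHINE_ID"))
	if err != nil {
		fmt.Println("job failed:", err)
	}
}
```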
@gerhard gerhard force-pushed the dagger-engine-on-fly-machines branch from 45f854f to d728bc2 Compare August 3, 2023 10:25
@gerhard gerhard merged commit 60664a2 into thechangelog:master Aug 4, 2023
5 checks passed
@gerhard gerhard deleted the dagger-engine-on-fly-machines branch August 4, 2023 07:12