
ci: drop arm64 from main pushes + parallelize build-app for ~30min faster deploys #1128

Merged
buremba merged 5 commits into main from feat/ci-fast-deploys on May 28, 2026

buremba (Member) commented May 28, 2026

Motivation

Today main → prod takes ~40min. Earlier today we hit a failure mode that has been looming: PR #1116's build-app failed mid-rollout with 'no space left on device' during the arm64 image export, leaving prod in a half-rolled state for ~30min before manual recovery. The disk pressure comes from cross-arch builds via QEMU on a 14GB GHA runner.

What changes (after pi v2 feedback)

Drop arm64 from main pushes. Prod is single-node Hetzner cpx41 (x86_64) per project_hetzner_prod_cost. Building linux/arm64 via QEMU on every main push is dead weight:

  • ~15-20min of emulated build time per push
  • Doubles Playwright/Chrome binary downloads (one set per arch) — directly the cause of the disk-full failure
  • No prod node ever pulls the arm64 manifest

Arm64 is still available via workflow_dispatch — a new platforms input defaults to linux/amd64 but can be set to linux/amd64,linux/arm64 for a manual multi-arch rebuild.
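
As a rough illustration of the platforms input described above, the trigger and platform selection might look like the following. This is a sketch, not the actual diff: the input description text, the fallback expression, the action versions, and the use of docker/build-push-action are assumptions. Only the platforms input, its linux/amd64 default, and the BUILD_PLATFORMS name come from this PR.

```yaml
# Hypothetical excerpt of .github/workflows/build-images.yml; job and step
# details are illustrative, not copied from the real workflow.
on:
  push:
    branches: [main]
  workflow_dispatch:
    inputs:
      platforms:
        description: "Comma-separated target platforms"  # assumed wording
        required: false
        default: linux/amd64
        type: string

env:
  # Push events carry no inputs, so the expression falls back to amd64 only.
  BUILD_PLATFORMS: ${{ github.event.inputs.platforms || 'linux/amd64' }}

jobs:
  build-app:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          context: .
          platforms: ${{ env.BUILD_PLATFORMS }}
          push: false  # the real job pushes to ghcr; disabled in this sketch
```

Under these assumptions, a manual multi-arch rebuild could be started from the Actions UI or with `gh workflow run build-images.yml -f platforms=linux/amd64,linux/arm64`.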

What I tried and walked back

I originally also parallelized build-app by dropping its needs: build-worker/build-embeddings-service chain. Pi v1 caught that this lost the transitive connector-parity-smoke gate. Pi v2 caught that even with parity-smoke restored, parallelizing still opens a low-probability but real failure window: if build-worker or build-embeddings-service fails for any other reason, build-app has already pushed the Flux-watched tag, and Flux rolls out a release whose sibling images don't exist. Both pi flags were correct. The ~7min critical-path saving isn't worth the risk, so I reverted to the safe gate.
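
The safe gate that was kept can be sketched as a needs chain. The job bodies below are placeholders, and the exact position of connector-parity-smoke in the graph (here a dependency of build-worker, which would make it transitive for build-app) is an assumption based on the description above, not the real workflow.

```yaml
# Hypothetical dependency graph; steps are placeholders.
jobs:
  connector-parity-smoke:
    runs-on: ubuntu-latest
    steps:
      - run: echo "runtime self-check placeholder"

  build-worker:
    needs: [connector-parity-smoke]  # assumed edge that makes the gate transitive
    runs-on: ubuntu-latest
    steps:
      - run: echo "build and push worker image"

  build-embeddings-service:
    runs-on: ubuntu-latest
    steps:
      - run: echo "build and push embeddings image"

  build-app:
    # App pushes last: it starts only after both sibling images were pushed,
    # so the Flux-watched tag never appears without them.
    needs: [build-worker, build-embeddings-service]
    runs-on: ubuntu-latest
    steps:
      - run: echo "build and push app image"
```

By default GitHub Actions skips a job when any job listed in its needs fails, so in this shape a failed build-worker or build-embeddings-service leaves the app tag unpublished and Flux has nothing new to roll.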

Expected outcome

| Metric | Before | After |
| --- | --- | --- |
| build-app job | ~25min | ~8min |
| Critical path (main → prod) | ~40min | ~20-22min |
| GHA runner disk usage | ~14GB peak | ~6GB peak |
| Disk-full failure surface | Recurring | Eliminated |

Smaller win than I originally claimed, but no reliability regression.

Validation

Validation will be empirical, on the first main push after this lands.

coderabbitai (Bot) commented May 28, 2026

Warning

Review limit reached

@buremba, we couldn't start this review because you've reached your PR review rate limit.

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.

buremba (Member, Author) commented May 28, 2026

bug_free 86, simplicity 82, slop 25, bugs 0, 0 blockers

Typecheck/unit/integration passed. Ran actionlint on build-images.yml; [env] it only reported pre-existing SC2086 on generate-tag line 44, not from this diff. Did not run Docker image builds/GHA workflow locally.

Suggested fixes

| File | Line | Change |
| --- | --- | --- |
| .github/workflows/build-images.yml | 27 | Replace the long BUILD_PLATFORMS comment with a concise note that main builds amd64 by default and workflow_dispatch can request multi-arch; remove the internal memory reference and timing narrative. |
| .github/workflows/build-images.yml | 90 | Replace the PR-history-heavy build-app comment with a short invariant: app must push last because Flux watches app while the chart uses one shared tag for app/worker/embeddings. |
| .github/workflows/build-images.yml | 116 | Add `if: contains(env.BUILD_PLATFORMS, 'arm64')` to each `Set up QEMU` step so default amd64-only pushes skip unnecessary setup work. |
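
The third suggestion could look roughly like this in each build job. It is a sketch only: the docker/setup-qemu-action version pin is an assumption, and the condition is the one quoted in the suggestion.

```yaml
steps:
  - name: Set up QEMU
    # Skipped on default amd64-only builds; runs only when arm64 is requested.
    if: contains(env.BUILD_PLATFORMS, 'arm64')
    uses: docker/setup-qemu-action@v3  # version pin assumed
```

Because contains() does a substring match on the string, the step would still run for a manual dispatch with linux/amd64,linux/arm64.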
Full verdict JSON
{
  "bug_free_confidence": 86,
  "bugs": 0,
  "slop": 25,
  "simplicity": 82,
  "blockers": [],
  "change_type": "chore",
  "behavior_change_risk": "medium",
  "tests_adequate": true,
  "suggested_fixes": [
    {
      "file": ".github/workflows/build-images.yml",
      "line": 27,
      "change": "Replace the long BUILD_PLATFORMS comment with a concise note that main builds amd64 by default and workflow_dispatch can request multi-arch; remove the internal memory reference and timing narrative."
    },
    {
      "file": ".github/workflows/build-images.yml",
      "line": 90,
      "change": "Replace the PR-history-heavy build-app comment with a short invariant: app must push last because Flux watches app while the chart uses one shared tag for app/worker/embeddings."
    },
    {
      "file": ".github/workflows/build-images.yml",
      "line": 116,
      "change": "Add `if: contains(env.BUILD_PLATFORMS, 'arm64')` to each `Set up QEMU` step so default amd64-only pushes skip unnecessary setup work."
    }
  ],
  "notes": "Typecheck/unit/integration passed. Ran actionlint on build-images.yml; [env] it only reported pre-existing SC2086 on generate-tag line 44, not from this diff. Did not run Docker image builds/GHA workflow locally.",
  "categories": {
    "src": 0,
    "tests": 0,
    "docs": 0,
    "config": 0,
    "deps": 2,
    "migrations": 0,
    "ci": 36,
    "generated": 0
  }
}

Local review gate — branch protection can require the pi-review commit status. See docs/REVIEW_SCHEMA.md.

buremba enabled auto-merge (squash) May 28, 2026 16:23
buremba added 5 commits May 28, 2026 19:22
…ster deploys

Today main → prod takes ~40min and has a recurring disk-full failure
mode (PR #1116's build-app failed mid-rollout earlier today on
'no space left on device' during the arm64 image export).

Two scoped changes that compound:

## Drop arm64 on main pushes

Prod is single-node Hetzner cpx41 — x86_64 only. Building linux/arm64
via QEMU on every main push is pure waste:
- ~15-20min of emulated build time per main push
- Doubles Playwright/Chrome binary downloads (one set per arch) which
  is what fills the ~14GB GHA runner disk
- No prod node ever pulls the arm64 manifest

The arm64 leg is still available on demand: workflow_dispatch now takes
a 'platforms' input defaulting to linux/amd64 — a dev who needs the
multi-arch image for an arm64 machine can re-run with
'linux/amd64,linux/arm64'.

## Parallelize build-app

build-app previously had needs: [build-worker, build-embeddings-service]
to keep all three image tags appearing on ghcr in sync (Flux watches
only the app policy but rolls the shared tag to all three Deployments).

The serial gate cost ~7-10min critical path. In practice build-app is
already the longest job and finishes 2-3min after build-worker even
when launched in parallel — so the new app tag appears AFTER the
worker/embeddings tags in nearly every case. kubelet's image-pull
back-off handles the rare outlier (pod retries until the tag exists,
no permanent failure).

## Expected outcome

- Per-job: build-app ~25min → ~8min (no QEMU arm64)
- Critical path: ~40min → ~10-12min
- Disk pressure on the runner halved, eliminating the failure mode
  that nearly stranded today's CORS rollout

Validation will be empirical on the first main push after this lands.

Pi flagged that dropping the build-worker dependency also dropped the
transitive connector-parity-smoke gate, which is the runtime self-check
that catches worker images that crash on boot (the failure mode #774
was added to prevent).

Restore connector-parity-smoke as a direct build-app dependency. Still
parallel with build-worker and build-embeddings-service for the
~7-10min savings — just guards against the higher-severity 'broken
worker image gets shipped silently' class instead of the lower-severity
'app tag appears 30s before worker tag' class.
… blocker)

Pi v2 flagged that the parallelization win opens a real failure
window: even with the connector-parity-smoke gate in place, a failed
build-worker or build-embeddings-service (disk pressure, registry
hiccup, unrelated Dockerfile regression) lets build-app still publish
the Flux-watched shared tag — Flux then rolls a release to a tag
whose sibling images don't exist.

The ~7min critical-path saving isn't worth that risk window. Reverted
to the safe gate. The arm64-drop alone still cuts ~15-20min off the
critical path (the bigger lever anyway), and addresses today's
disk-full incident root cause directly.

Net: PR is now the simpler 'drop arm64 from main pushes' change. The
earlier ~25min → ~10-12min critical-path estimate was overoptimistic;
the realistic outcome is ~25min → ~17-20min, with no reliability
regression.
buremba force-pushed the feat/ci-fast-deploys branch from b761350 to e913388 May 28, 2026 18:23
buremba merged commit 0b10a0d into main May 28, 2026 (21 checks passed)
buremba deleted the feat/ci-fast-deploys branch May 28, 2026 18:28