ci: drop arm64 from main pushes + parallelize build-app for ~30min faster deploys #1128
bug_free 86, simplicity 82, slop 25, bugs 0, 0 blockers
Typecheck/unit/integration passed. Ran actionlint on build-images.yml; [env] it only reported pre-existing SC2086 on generate-tag line 44, not from this diff. Did not run Docker image builds/GHA workflow locally.
Full verdict JSON
{
"bug_free_confidence": 86,
"bugs": 0,
"slop": 25,
"simplicity": 82,
"blockers": [],
"change_type": "chore",
"behavior_change_risk": "medium",
"tests_adequate": true,
"suggested_fixes": [
{
"file": ".github/workflows/build-images.yml",
"line": 27,
"change": "Replace the long BUILD_PLATFORMS comment with a concise note that main builds amd64 by default and workflow_dispatch can request multi-arch; remove the internal memory reference and timing narrative."
},
{
"file": ".github/workflows/build-images.yml",
"line": 90,
"change": "Replace the PR-history-heavy build-app comment with a short invariant: app must push last because Flux watches app while the chart uses one shared tag for app/worker/embeddings."
},
{
"file": ".github/workflows/build-images.yml",
"line": 116,
"change": "Add `if: contains(env.BUILD_PLATFORMS, 'arm64')` to each `Set up QEMU` step so default amd64-only pushes skip unnecessary setup work."
}
],
"notes": "Typecheck/unit/integration passed. Ran actionlint on build-images.yml; [env] it only reported pre-existing SC2086 on generate-tag line 44, not from this diff. Did not run Docker image builds/GHA workflow locally.",
"categories": {
"src": 0,
"tests": 0,
"docs": 0,
"config": 0,
"deps": 2,
"migrations": 0,
"ci": 36,
"generated": 0
}
}
Local review gate: branch protection can require the
…ster deploys

Today main → prod takes ~40min and has a recurring disk-full failure mode (PR #1116's build-app failed mid-rollout earlier today on 'no space left on device' during the arm64 image export). Two scoped changes that compound:

## Drop arm64 on main pushes

Prod is single-node Hetzner cpx41, x86_64 only. Building linux/arm64 via QEMU on every main push is pure waste:

- ~15-20min of emulated build time per main push
- Doubles Playwright/Chrome binary downloads (one set per arch), which is what fills the ~14GB GHA runner disk
- No prod node ever pulls the arm64 manifest

The arm64 leg is still available on demand: workflow_dispatch now takes a 'platforms' input defaulting to linux/amd64, so a dev who needs the multi-arch image for an arm64 machine can re-run with 'linux/amd64,linux/arm64'.

## Parallelize build-app

build-app previously had needs: [build-worker, build-embeddings-service] to keep all three image tags appearing on ghcr in sync (Flux watches only the app policy but rolls the shared tag to all three Deployments). The serial gate cost ~7-10min critical path. In practice build-app is already the longest job and finishes 2-3min after build-worker even when launched in parallel, so the new app tag appears AFTER the worker/embeddings tags in nearly every case. kubelet's image-pull back-off handles the rare outlier (pod retries until the tag exists, no permanent failure).

## Expected outcome

- Per-job: build-app ~25min → ~8min (no QEMU arm64)
- Critical path: ~40min → ~10-12min
- Disk pressure on the runner halved, eliminating the failure mode that nearly stranded today's CORS rollout

Validation will be empirical on the first main push after this lands.
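The on-demand arm64 path described in this commit message could look roughly like the fragment below. It is a sketch under stated assumptions, not the actual contents of build-images.yml: the input name `platforms` and its `linux/amd64` default come from the commit message, the `BUILD_PLATFORMS` variable name appears in the review notes, and the description text and fallback expression are illustrative guesses.

```yaml
# Sketch only; the real .github/workflows/build-images.yml may differ.
on:
  push:
    branches: [main]
  workflow_dispatch:
    inputs:
      platforms:
        description: "Comma-separated target platforms"  # assumed wording
        type: string
        default: linux/amd64

env:
  # Push events carry no inputs, so the expression falls back to amd64 only.
  # A manual dispatch can override it, e.g. with linux/amd64,linux/arm64.
  BUILD_PLATFORMS: ${{ inputs.platforms || 'linux/amd64' }}
```

With a setup like this, a manual multi-arch rebuild would be a dispatch with `platforms` set to `linux/amd64,linux/arm64`, either from the Actions UI or with something like `gh workflow run build-images.yml -f platforms=linux/amd64,linux/arm64`.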
Pi flagged that dropping the build-worker dependency also dropped the transitive connector-parity-smoke gate, which is the runtime self-check that catches worker images that crash on boot (the failure mode #774 was added to prevent). Restore connector-parity-smoke as a direct build-app dependency. Still parallel with build-worker and build-embeddings-service for the ~7-10min savings — just guards against the higher-severity 'broken worker image gets shipped silently' class instead of the lower-severity 'app tag appears 30s before worker tag' class.
… blocker)

Pi v2 flagged that the parallelization win opens a real failure window: even with the connector-parity-smoke gate in place, a failed build-worker or build-embeddings-service (disk pressure, registry hiccup, unrelated Dockerfile regression) lets build-app still publish the Flux-watched shared tag. Flux then rolls a release to a tag whose sibling images don't exist. The ~7min critical-path saving isn't worth that risk window. Reverted to the safe gate.

The arm64 drop alone still cuts ~15-20min off the critical path (the bigger lever anyway) and directly addresses the root cause of today's disk-full incident.

Net: the PR is now the simpler 'drop arm64 from main pushes' change. The earlier ~25min → ~10-12min critical-path estimate was overoptimistic; the realistic outcome is ~25min → ~17-20min, with no reliability regression.
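The restored gate amounts to keeping build-app downstream of both sibling builds. The fragment below is a minimal sketch of that shape: the job names come from the commit messages above, while the runner label and the step bodies are placeholders rather than the real workflow.

```yaml
# Sketch only; step bodies are placeholders.
jobs:
  build-worker:
    runs-on: ubuntu-latest   # assumed runner label
    steps:
      - run: echo "build and push worker image"

  build-embeddings-service:
    runs-on: ubuntu-latest
    steps:
      - run: echo "build and push embeddings image"

  build-app:
    # Serial gate: app pushes last, so the Flux-watched tag only appears
    # after both sibling images have been published.
    needs: [build-worker, build-embeddings-service]
    runs-on: ubuntu-latest
    steps:
      - run: echo "build and push app image"
```

By default, GitHub Actions skips a job when any job listed in its `needs` fails, so a failed worker or embeddings build keeps the app tag from being published at all. That default is what closes the failure window Pi v2 described.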
b761350 to e913388
Motivation
Today main → prod takes ~40min. Earlier today we hit the failure mode that's been ticking: PR #1116's `build-app` failed mid-rollout with `no space left on device` during the arm64 image export, leaving prod in a half-rolled state for ~30min before manual recovery. The disk pressure comes from cross-arch builds via QEMU on a 14GB GHA runner.
What changes (after pi v2 feedback)
Drop arm64 from `main` pushes. Prod is single-node Hetzner `cpx41` (x86_64) per `project_hetzner_prod_cost`. Building `linux/arm64` via QEMU on every main push is dead weight.
Arm64 is still available via `workflow_dispatch`: a new `platforms` input defaults to `linux/amd64` but can be set to `linux/amd64,linux/arm64` for a manual multi-arch rebuild.
What I tried and walked back
Originally also parallelized `build-app` (dropped its `needs: build-worker/build-embeddings-service` chain). Pi v1 caught that this lost the transitive `connector-parity-smoke` gate. Pi v2 caught that even with parity-smoke restored, parallelizing still opens a low-probability but real failure window: if `build-worker` or `build-embeddings-service` then fails for any other reason, `build-app` has already pushed the Flux-watched tag and Flux rolls a half-existent release. Both pi flags were correct. The ~7min critical-path saving isn't worth the risk. Reverted to the safe gate.
Expected outcome
Smaller win than I originally claimed, but no reliability regression.
Validation
Empirical on the first main push after this lands.
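The review's third suggested fix (skip QEMU setup when no arm64 build is requested) could be sketched as the job fragment below. Whether this PR adopted it is not stated here. The action versions, the image tag, and the surrounding steps are assumptions for illustration, and `BUILD_PLATFORMS` is set inline only to keep the fragment self-contained.

```yaml
# Sketch only; not the actual build-images.yml.
env:
  BUILD_PLATFORMS: linux/amd64   # assumed default for main pushes

jobs:
  build-app:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up QEMU
        # Emulation is only needed when an arm64 image is requested.
        if: contains(env.BUILD_PLATFORMS, 'arm64')
        uses: docker/setup-qemu-action@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build image
        uses: docker/build-push-action@v6
        with:
          context: .
          platforms: ${{ env.BUILD_PLATFORMS }}
          push: false                 # sketch: no registry login shown
          tags: example/app:sketch    # hypothetical tag
```

On a default amd64-only push the QEMU step is skipped, and a dispatch that requests `linux/amd64,linux/arm64` runs it as before.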