ci: drop arm64 from main pushes + parallelize build-app for ~30min faster deploys by buremba · Pull Request #1128 · lobu-ai/lobu

buremba · 2026-05-28T16:12:46Z

Motivation

Today main → prod takes ~40min. Earlier today we hit a failure mode that had been waiting to happen: PR #1116's build-app failed mid-rollout with "no space left on device" during the arm64 image export, leaving prod in a half-rolled state for ~30min before manual recovery. The disk pressure comes from cross-arch builds via QEMU on a 14GB GHA runner.

What changes (after pi v2 feedback)

Drop arm64 from main pushes. Prod is single-node Hetzner cpx41 (x86_64) per project_hetzner_prod_cost. Building linux/arm64 via QEMU on every main push is dead weight:

~15-20min of emulated build time per push
Doubles Playwright/Chrome binary downloads (one set per arch) — directly the cause of the disk-full failure
No prod node ever pulls the arm64 manifest

Arm64 is still available via workflow_dispatch — a new platforms input defaults to linux/amd64 but can be set to linux/amd64,linux/arm64 for a manual multi-arch rebuild.
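A minimal sketch of how the platforms input could be wired, not the actual diff. Only `workflow_dispatch`, the `platforms` input, its `linux/amd64` default, and the `BUILD_PLATFORMS` name come from this PR; the trigger layout and the fallback expression are assumptions.

```yaml
# Sketch only: assumed layout, not copied from build-images.yml.
on:
  push:
    branches: [main]
  workflow_dispatch:
    inputs:
      platforms:
        description: "Comma-separated target platforms"
        required: false
        default: linux/amd64
        type: string

env:
  # On push events `inputs.platforms` is empty, so the fallback applies
  # and main pushes build amd64 only.
  BUILD_PLATFORMS: ${{ inputs.platforms || 'linux/amd64' }}
```

Under that assumption, a manual multi-arch rebuild would be triggered with something like `gh workflow run build-images.yml -f platforms=linux/amd64,linux/arm64`.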

What I tried and walked back

Originally also parallelized build-app (dropped its needs: build-worker/build-embeddings-service chain). Pi v1 caught that this lost the transitive connector-parity-smoke gate. Pi v2 caught that even with parity-smoke restored, parallelizing still opens a low-probability but real failure window: if build-worker or build-embeddings-service then fails for any other reason, build-app has already pushed the Flux-watched tag and Flux rolls a half-existent release. Both pi flags were correct. The ~7min critical-path saving isn't worth the risk. Reverted to the safe gate.
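The safe gate described above can be sketched roughly as follows. Job names are the ones mentioned in this PR; the step bodies are placeholders, and I'm assuming build-worker is the job that needs connector-parity-smoke, which is what would make that gate transitive for build-app.

```yaml
# Sketch of the dependency chain only; step bodies are placeholders.
jobs:
  connector-parity-smoke:
    runs-on: ubuntu-latest
    steps:
      - run: echo "run the worker boot self-check"

  build-worker:
    needs: [connector-parity-smoke]   # assumed edge
    runs-on: ubuntu-latest
    steps:
      - run: echo "build and push the worker image"

  build-embeddings-service:
    runs-on: ubuntu-latest
    steps:
      - run: echo "build and push the embeddings image"

  build-app:
    # Runs only after both siblings succeed, so the Flux-watched app tag
    # is never pushed without the worker/embeddings tags.
    needs: [build-worker, build-embeddings-service]
    runs-on: ubuntu-latest
    steps:
      - run: echo "build and push the app image last"
```

By default GitHub Actions skips a job when any job in its `needs` list fails, which is what keeps build-app from publishing the shared tag after a sibling failure.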

Expected outcome

Metric	Before	After
build-app job	~25min	~8min
Critical path (main → prod)	~40min	~20-22min
GHA runner disk usage	~14GB peak	~6GB peak
Disk-full failure surface	Recurring	Eliminated

Smaller win than I originally claimed, but no reliability regression.

Validation

Validation will be empirical, on the first main push after this lands.

coderabbitai · 2026-05-28T16:12:55Z

Warning

Review limit reached

@buremba, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 12 minutes and 24 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f2f2a2fb-52a5-4661-9004-d751c9a93ef7

📥 Commits

Reviewing files that changed from the base of the PR and between 51124de and e913388.

📒 Files selected for processing (2)

.github/workflows/build-images.yml
packages/owletto

codecov-commenter · 2026-05-28T16:15:20Z

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

buremba · 2026-05-28T16:15:35Z

bug_free 86, simplicity 82, slop 25, bugs 0, 0 blockers

Typecheck/unit/integration passed. Ran actionlint on build-images.yml; [env] it only reported pre-existing SC2086 on generate-tag line 44, not from this diff. Did not run Docker image builds/GHA workflow locally.

Suggested fixes

File	Line	Change
`.github/workflows/build-images.yml`	27	Replace the long BUILD_PLATFORMS comment with a concise note that main builds amd64 by default and workflow_dispatch can request multi-arch; remove the internal memory reference and timing narrative.
`.github/workflows/build-images.yml`	90	Replace the PR-history-heavy build-app comment with a short invariant: app must push last because Flux watches app while the chart uses one shared tag for app/worker/embeddings.
`.github/workflows/build-images.yml`	116	Add `if: contains(env.BUILD_PLATFORMS, 'arm64')` to each `Set up QEMU` step so default amd64-only pushes skip unnecessary setup work.
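The third suggestion could look roughly like this. The `if:` expression is the one from the review; the `docker/setup-qemu-action` and `docker/setup-buildx-action` steps are assumptions about what `Set up QEMU` wraps, since the workflow itself isn't quoted here.

```yaml
# Sketch only: action names and versions are assumed, not taken from the diff.
steps:
  - name: Set up QEMU
    # Skipped on default amd64-only pushes; runs when a manual dispatch
    # sets BUILD_PLATFORMS to include linux/arm64.
    if: contains(env.BUILD_PLATFORMS, 'arm64')
    uses: docker/setup-qemu-action@v3

  - name: Set up Docker Buildx
    uses: docker/setup-buildx-action@v3
```

The `env` context is available in step-level `if` conditions, so the check can read the workflow-level `BUILD_PLATFORMS` value directly.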

Full verdict JSON

{
  "bug_free_confidence": 86,
  "bugs": 0,
  "slop": 25,
  "simplicity": 82,
  "blockers": [],
  "change_type": "chore",
  "behavior_change_risk": "medium",
  "tests_adequate": true,
  "suggested_fixes": [
    {
      "file": ".github/workflows/build-images.yml",
      "line": 27,
      "change": "Replace the long BUILD_PLATFORMS comment with a concise note that main builds amd64 by default and workflow_dispatch can request multi-arch; remove the internal memory reference and timing narrative."
    },
    {
      "file": ".github/workflows/build-images.yml",
      "line": 90,
      "change": "Replace the PR-history-heavy build-app comment with a short invariant: app must push last because Flux watches app while the chart uses one shared tag for app/worker/embeddings."
    },
    {
      "file": ".github/workflows/build-images.yml",
      "line": 116,
      "change": "Add `if: contains(env.BUILD_PLATFORMS, 'arm64')` to each `Set up QEMU` step so default amd64-only pushes skip unnecessary setup work."
    }
  ],
  "notes": "Typecheck/unit/integration passed. Ran actionlint on build-images.yml; [env] it only reported pre-existing SC2086 on generate-tag line 44, not from this diff. Did not run Docker image builds/GHA workflow locally.",
  "categories": {
    "src": 0,
    "tests": 0,
    "docs": 0,
    "config": 0,
    "deps": 2,
    "migrations": 0,
    "ci": 36,
    "generated": 0
  }
}

Local review gate — branch protection can require the pi-review commit status. See docs/REVIEW_SCHEMA.md.

…ster deploys

Today main → prod takes ~40min and has a recurring disk-full failure mode (PR #1116's build-app failed mid-rollout earlier today on 'no space left on device' during the arm64 image export). Two scoped changes that compound:

## Drop arm64 on main pushes

Prod is single-node Hetzner cpx41, x86_64 only. Building linux/arm64 via QEMU on every main push is pure waste:

- ~15-20min of emulated build time per main push
- Doubles Playwright/Chrome binary downloads (one set per arch) which is what fills the ~14GB GHA runner disk
- No prod node ever pulls the arm64 manifest

The arm64 leg is still available on demand: workflow_dispatch now takes a 'platforms' input defaulting to linux/amd64, so a dev who needs the multi-arch image for an arm64 machine can re-run with 'linux/amd64,linux/arm64'.

## Parallelize build-app

build-app previously had needs: [build-worker, build-embeddings-service] to keep all three image tags appearing on ghcr in sync (Flux watches only the app policy but rolls the shared tag to all three Deployments). The serial gate cost ~7-10min critical path.

In practice build-app is already the longest job and finishes 2-3min after build-worker even when launched in parallel, so the new app tag appears AFTER the worker/embeddings tags in nearly every case. kubelet's image-pull back-off handles the rare outlier (pod retries until the tag exists, no permanent failure).

## Expected outcome

- Per-job: build-app ~25min → ~8min (no QEMU arm64)
- Critical path: ~40min → ~10-12min
- Disk pressure on the runner halved, eliminating the failure mode that nearly stranded today's CORS rollout

Validation will be empirical on the first main push after this lands.

Pi flagged that dropping the build-worker dependency also dropped the transitive connector-parity-smoke gate, which is the runtime self-check that catches worker images that crash on boot (the failure mode #774 was added to prevent). Restore connector-parity-smoke as a direct build-app dependency. Still parallel with build-worker and build-embeddings-service for the ~7-10min savings — just guards against the higher-severity 'broken worker image gets shipped silently' class instead of the lower-severity 'app tag appears 30s before worker tag' class.

… blocker)

Pi v2 flagged that the parallelization win opens a real failure window: even with the connector-parity-smoke gate in place, a failed build-worker or build-embeddings-service (disk pressure, registry hiccup, unrelated Dockerfile regression) still lets build-app publish the Flux-watched shared tag, and Flux then rolls a release to a tag whose sibling images don't exist. The ~7min critical-path saving isn't worth that risk window. Reverted to the safe gate.

The arm64-drop alone still cuts ~15-20min off the critical path (the bigger lever anyway), and addresses today's disk-full incident root cause directly.

Net: PR is now the simpler 'drop arm64 from main pushes' change. The earlier ~25min → ~10-12min critical-path estimate was overoptimistic; the realistic outcome is ~25min → ~17-20min, with no reliability regression.

buremba enabled auto-merge (squash) May 28, 2026 16:23

buremba added 5 commits May 28, 2026 19:22

chore: bump owletto pointer to current main (clear check-drift)

26ac162

chore: point owletto at current main (b05d2fa) to clear check-drift

e913388

buremba force-pushed the feat/ci-fast-deploys branch from b761350 to e913388 Compare May 28, 2026 18:23

buremba merged commit 0b10a0d into main May 28, 2026
21 checks passed

buremba deleted the feat/ci-fast-deploys branch May 28, 2026 18:28
