ci: bump install retry from 3 to 5 attempts (Aaron 2026-04-28) #81

Merged
AceHack merged 3 commits into main from
fix/ci-cache-mise-on-lint-jobs-2026-04-28
Apr 28, 2026

AceHack (Owner) commented Apr 28, 2026

Summary

Bumps the install-step retry wrapper added in PR #80 from 3 → 5 attempts per Aaron's 2026-04-28 input. Backoff schedule extended from 10s/30s to 10s/30s/60s/120s — covers transient CDN outages from short-burst (~10s recovery) through multi-minute (full 1-2 minute outage).

Why

PR #23's mise+bun-1.3.13 502 burned all 3 of mise's internal retries; a 3-attempt wrapper on top of that didn't add enough margin (3 wrapper × 3 mise = 9 underlying attempts, but they all stacked within a ~30s window where the CDN was still 502'ing).

Aaron explicit: "Mise's internal 3-attempt retry was exhausted on PR #23. go to 5 or 10". Picked 5 (less aggressive; 5 wrapper × 3 mise = 15 underlying attempts spanning ~4 minutes).

Changes

All 3 lint jobs (lint-shell, lint-workflows, lint-markdown):

-          for attempt in 1 2 3; do
+          for attempt in 1 2 3 4 5; do
             if ./tools/setup/install.sh; then exit 0; fi
-            [ "$attempt" = "3" ] && { echo "install.sh failed after 3 attempts"; exit 1; }
-            backoff=$((attempt * 20 - 10))
+            [ "$attempt" = "5" ] && { echo "install.sh failed after 5 attempts"; exit 1; }
+            case "$attempt" in
+              1) backoff=10 ;;
+              2) backoff=30 ;;
+              3) backoff=60 ;;
+              4) backoff=120 ;;
+            esac
             echo "install.sh attempt $attempt failed; retrying in ${backoff}s..." >&2
             sleep "$backoff"
           done

Plus comment update on the lint-shell job documenting the 3 → 5 bump rationale.
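
The loop above can also be read as a small table-driven helper. The sketch below is a hypothetical refactor, not what this PR ships: the names `retry_with_backoff`, `BACKOFF_SCHEDULE`, and `SLEEP_CMD` are assumptions introduced for illustration, and `SLEEP_CMD` exists only so the schedule can be exercised without actually waiting.

```shell
#!/bin/sh
set -eu

# Hypothetical names: BACKOFF_SCHEDULE, SLEEP_CMD, retry_with_backoff.
# One delay per retry, so 4 delays means 5 attempts in total.
BACKOFF_SCHEDULE="10 30 60 120"
SLEEP_CMD="${SLEEP_CMD:-sleep}"

retry_with_backoff() {
  attempt=1
  for backoff in $BACKOFF_SCHEDULE; do
    # Run the wrapped command; stop as soon as it succeeds.
    if "$@"; then return 0; fi
    echo "$1 attempt $attempt failed; retrying in ${backoff}s..." >&2
    "$SLEEP_CMD" "$backoff"
    attempt=$((attempt + 1))
  done
  # Final attempt: no delay remains in the schedule.
  if "$@"; then return 0; fi
  echo "$1 failed after $attempt attempts" >&2
  return 1
}

# Possible use in a workflow step:
#   retry_with_backoff ./tools/setup/install.sh
```

Keeping the delays in one list means the attempt count and the failure message follow from the schedule, so extending it does not require touching a hard-coded "5" in several places.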

Test plan

  • All 3 lint-job retry loops updated consistently
  • Backoff total: 10+30+60+120 = 220s ≈ 3.7 min — covers most observed CDN outages
  • CI passes

🤖 Generated with Claude Code

AceHack and others added 2 commits April 28, 2026 01:38
…26-04-28)

Three structural fixes for the PR #23 mise+bun-1.3.13 502 transient
class, addressing Aaron 2026-04-28 directives:

  "is there not a way to fix this?" (don't default to rerun)
  "we want to use stock and we better not be using that old
   version of ubuntu"
  "can you cache and retry?"
  "we want to make sure dev seutp and build machine setup are as
   close to the same a possible"
  "why not cache the whole install/setup"

1. **Comprehensive install cache** on lint-shell, lint-workflows,
   lint-markdown jobs (previously uncached). Caches everything
   tools/setup/install.sh writes:
     ~/.local/bin/mise (the mise binary)
     ~/.local/share/mise (mise runtimes — bun/dotnet/python/uv/java)
     ~/.cache/mise (mise download cache)
     ~/.dotnet/tools (dotnet global tools)
     ~/.elan (Lean toolchain)
     ~/.config/zeta (managed shellenv)
     tools/tla, tools/alloy (verifier jars)
   Cache key hashes BOTH .mise.toml AND tools/setup/** so install
   logic changes invalidate cache → vanilla install path gets
   re-tested whenever discipline changes.

2. **Retry layer** on the install step (CI-only — dev runs stay
   interactive). Three attempts with 10s/30s backoff. Mise's
   internal 3-attempt retry was exhausted on PR #23's bun download;
   wrapping at the install.sh layer catches the case where mise
   itself gives up. Same shape across all 3 lint jobs.

3. **Ubuntu 24.04 bump** on every workflow that pinned ubuntu-22.04
   (gate.yml lint jobs ×6, resume-diff.yml, scorecard.yml,
   memory-index-duplicate-lint.yml, budget-snapshot-cadence.yml).
   ubuntu-latest = ubuntu-24.04 since Jan 2025 per Otto-247 WebSearch
   verification; 22.04 is now LTS-2 stale. Stays on stock GitHub-
   hosted runner image (no custom pre-installed bun) per Aaron's
   "we want to use stock" + "vanilla ubuntu so we test do our install
   scripts work on vanalla and deve machines."

Dev↔CI parity: install.sh runs on both surfaces; cache restores
state similar to a dev's already-bootstrapped local env; cache key
on tools/setup/** + .mise.toml matches what a dev's environment
depends on. install.sh stays idempotent so cache hit = fast no-op,
cache miss = full vanilla install (which is the install-script
validation Aaron wants).
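
To illustrate the cache-key idea, here is a rough local approximation in shell. It is not how GitHub Actions computes `hashFiles()`; it only shows that a key derived from `.mise.toml` plus every file under `tools/setup/` changes whenever either changes. The function name `install_cache_key` and the `install-` prefix are hypothetical, and the sketch assumes GNU coreutils `sha256sum` is available.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: prints a key that changes when .mise.toml or any file
# under tools/setup/ changes. Approximates the idea, not hashFiles() itself.
install_cache_key() {
  root="${1:-.}"
  digest=$(
    cd "$root"
    {
      if [ -f .mise.toml ]; then printf '%s\n' .mise.toml; fi
      if [ -d tools/setup ]; then find tools/setup -type f; fi
    } | LC_ALL=C sort | while IFS= read -r f; do
      # Hash each input file; the output line includes the path.
      sha256sum "$f"
    done | sha256sum | cut -d' ' -f1
  )
  printf 'install-%s\n' "$digest"
}

# Possible use from the repository root:
#   install_cache_key .
```

Sorting the file list first keeps the key stable across runs, so an unchanged tree maps to the same key and a cache hit.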

Composes with PR #75 curl_fetch helper (downstream curl retries),
PR #76 + #79 markdownlint carve-outs (verbatim ferry preservation),
Otto-247 version-currency, Otto-235 4-shell portability, Otto-341
mechanism-over-vigilance, and `feedback_structural_fix_beats_process_discipline_velocity_multiplier_aaron_2026_04_28.md`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 28, 2026 05:42
@AceHack AceHack enabled auto-merge (squash) April 28, 2026 05:42
…-lint-jobs-2026-04-28

# Conflicts:
#	.github/workflows/gate.yml
@AceHack AceHack merged commit 61c0a93 into main Apr 28, 2026
15 checks passed
@AceHack AceHack deleted the fix/ci-cache-mise-on-lint-jobs-2026-04-28 branch April 28, 2026 05:47

Copilot AI left a comment

Pull request overview

Updates CI workflows to be more resilient to transient toolchain download failures and to run on a newer Ubuntu runner image.

Changes:

  • Increase tools/setup/install.sh retry loop in gate.yml lint jobs to 5 attempts with an explicit 10s/30s/60s/120s backoff.
  • Bump several workflows’ runs-on from ubuntu-22.04 to ubuntu-24.04.
  • Update/expand workflow comments around the install caching/retry rationale.

Reviewed changes

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| .github/workflows/gate.yml | Updates lint jobs to ubuntu-24.04 and adds/expands install caching + 5-attempt retry/backoff logic. |
| .github/workflows/scorecard.yml | Runner bump to ubuntu-24.04. |
| .github/workflows/resume-diff.yml | Runner bump to ubuntu-24.04 and associated comment update. |
| .github/workflows/memory-index-duplicate-lint.yml | Runner bump to ubuntu-24.04 and associated comment update. |
| .github/workflows/budget-snapshot-cadence.yml | Runner bump to ubuntu-24.04. |
Comments suppressed due to low confidence (6)

.github/workflows/gate.yml:256

  • P1: timeout-minutes: 5 is likely too low now that the install step can sleep up to 220s (10+30+60+120) plus the runtime of up to 5 install attempts. If the network is flaky, the job can hit the workflow timeout and lose the intended retry/fail-after-5 behavior; consider increasing the job timeout (or moving retry/backoff into a separate step with its own timeout).
    name: lint (shellcheck)
    timeout-minutes: 5
    runs-on: ubuntu-24.04

.github/workflows/gate.yml:358

  • P1: timeout-minutes: 5 may be too low given the new 5-attempt install retry with up to 220s of backoff sleeps (plus install runtimes). Consider increasing the timeout so the retry strategy can actually complete and produce the explicit "failed after 5 attempts" error instead of timing out.
    name: lint (actionlint)
    timeout-minutes: 5
    runs-on: ubuntu-24.04

.github/workflows/gate.yml:523

  • P1: With the install step now retried up to 5 times and backoff totaling 220s, timeout-minutes: 5 risks the job timing out before the retry loop completes (especially on cold cache / slow CDN). Bumping the timeout would make the retries effective and preserve the intended failure message.
    name: lint (markdownlint)
    timeout-minutes: 5
    runs-on: ubuntu-24.04

.github/workflows/gate.yml:19

  • P1: This comment introduces personal-name attribution ("Aaron" and the task-style "Otto-247") in a workflow file. Repo convention is to use role references on current-state surfaces (like workflows) and keep personal/persona names limited to history surfaces; please rewrite this to use role-refs (e.g., "the human maintainer") and a non-name task reference if needed (see docs/AGENT-BEST-PRACTICES.md:284-346).
#     Lint jobs pinned to ubuntu-24.04 (short-lived, OS-independent
#     work; bumped from 22.04 on 2026-04-28 per Aaron's "we better
#     not be using that old version of ubuntu" + Otto-247 version-
#     currency WebSearch — ubuntu-latest = 24.04 since Jan 2025).

.github/workflows/gate.yml:266

  • Typo in quoted text: "dev seutp" should be "dev setup".
        #   "we want to make sure dev seutp and build machine setup

.github/workflows/gate.yml:267

  • Grammar/typo: "as close to the same a possible" should be "as close to the same as possible".
        #    are as close to the same a possible"


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: db62467258

  run: |
    set -euo pipefail
-   for attempt in 1 2 3; do
+   for attempt in 1 2 3 4 5; do

P2: Keep retry budget within lint job timeout

This retry loop now allows five install attempts with 10/30/60/120s backoff, i.e. 220s of sleep before the final try, but the lint jobs still have a 5-minute timeout. That leaves less than 80s total for all install.sh executions, so under the transient CDN failures this change is trying to absorb (where install.sh itself already spends time retrying), the job can time out before attempt 5 is reached. The same pattern is duplicated in lint-workflows and lint-markdown, so the configured “5 attempts” is not reliably achievable unless timeout/backoff are adjusted together.
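
The arithmetic behind this concern can be checked in a few lines. The sketch below assumes the values quoted in this thread (a 5-minute job timeout and the 10/30/60/120s schedule); the variable names are illustrative, and it ignores time spent in checkout, cache restore, and the lint itself, so the real headroom is smaller.

```shell
#!/bin/sh
set -eu

# Values taken from the thread; names are illustrative.
timeout_minutes=5
backoff_schedule="10 30 60 120"

# Sum the sleeps and count the delays in the schedule.
total_sleep=0
delays=0
for s in $backoff_schedule; do
  total_sleep=$((total_sleep + s))
  delays=$((delays + 1))
done
# One more attempt than there are delays.
attempts=$((delays + 1))

timeout_seconds=$((timeout_minutes * 60))
remaining=$((timeout_seconds - total_sleep))
per_attempt=$((remaining / attempts))

echo "total sleep: ${total_sleep}s"
echo "left for ${attempts} install attempts: ${remaining}s"
echo "average per attempt: ${per_attempt}s"
```

With these inputs the loop sleeps 220s and leaves 80s, about 16s per install attempt, before any other step is counted.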

AceHack added a commit that referenced this pull request Apr 28, 2026
…Otto-357 strengthen (#83)

* tick-history: 2026-04-28T05:44Z — PR #80 MERGED + #81 retry-bump + #82 Otto-357 strengthen + 3 conflict resolutions

* fix(pr-83): reconcile verify-don't-parrot streak count — 4 ticks running (was inconsistent 3 vs 4)

PR #83 review thread (P2 copilot): the row described the streak count as
both "3 ticks running" early and "4 ticks running" later. The conflict
was a scope mismatch — the early count was meant to be cumulative
ticks-of-discipline-applied (4, matching the observations enumeration),
but I'd written it as 3 from an older draft state.

Reconciled to a single 4-count framing that explicitly references the
observations column (which enumerates the 4 distinct verifications
applied this tick: cron-id verify / AUTONOMOUS-LOOP.md grep / CronList
freshness / retry-3-failed-on-#23 sourcing).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>