ci: bump install retry from 3 to 5 attempts (Aaron 2026-04-28)#81
…26-04-28)

Three structural fixes for the PR #23 mise+bun-1.3.13 502 transient class, addressing Aaron 2026-04-28 directives:

- "is there not a way to fix this?" (don't default to rerun)
- "we want to use stock and we better not be using that old version of ubuntu"
- "can you cache and retry?"
- "we want to make sure dev seutp and build machine setup are as close to the same a possible"
- "why not cache the whole install/setup"

1. **Comprehensive install cache** on lint-shell, lint-workflows, lint-markdown jobs (previously uncached). Caches everything `tools/setup/install.sh` writes:
   - `~/.local/bin/mise` (the mise binary)
   - `~/.local/share/mise` (mise runtimes: bun/dotnet/python/uv/java)
   - `~/.cache/mise` (mise download cache)
   - `~/.dotnet/tools` (dotnet global tools)
   - `~/.elan` (Lean toolchain)
   - `~/.config/zeta` (managed shellenv)
   - `tools/tla`, `tools/alloy` (verifier jars)

   Cache key hashes BOTH `.mise.toml` AND `tools/setup/**` so install logic changes invalidate cache → vanilla install path gets re-tested whenever discipline changes.
2. **Retry layer** on the install step (CI-only; dev runs stay interactive). Three attempts with 10s/30s backoff. Mise's internal 3-attempt retry was exhausted on PR #23's bun download; wrapping at the install.sh layer catches the case where mise itself gives up. Same shape across all 3 lint jobs.
3. **Ubuntu 24.04 bump** on every workflow that pinned ubuntu-22.04 (gate.yml lint jobs ×6, resume-diff.yml, scorecard.yml, memory-index-duplicate-lint.yml, budget-snapshot-cadence.yml). ubuntu-latest = ubuntu-24.04 since Jan 2025 per Otto-247 WebSearch verification; 22.04 is now LTS-2 stale. Stays on stock GitHub-hosted runner image (no custom pre-installed bun) per Aaron's "we want to use stock" + "vanilla ubuntu so we test do our install scripts work on vanalla and deve machines."

Dev↔CI parity: install.sh runs on both surfaces; cache restores state similar to a dev's already-bootstrapped local env; cache key on `tools/setup/**` + `.mise.toml` matches what a dev's environment depends on. install.sh stays idempotent so cache hit = fast no-op, cache miss = full vanilla install (which is the install-script validation Aaron wants).

Composes with PR #75 curl_fetch helper (downstream curl retries), PR #76 + #79 markdownlint carve-outs (verbatim ferry preservation), Otto-247 version-currency, Otto-235 4-shell portability, Otto-341 mechanism-over-vigilance, and `feedback_structural_fix_beats_process_discipline_velocity_multiplier_aaron_2026_04_28.md`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
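The cache-key rule from item 1 above (hash both `.mise.toml` and everything under `tools/setup/`) can be approximated locally. The following is a minimal sketch, not the workflow's actual key expression, which is not shown on this page. The function name `install_cache_key` is hypothetical, and the sketch assumes a Linux environment with `sha256sum` available. In GitHub Actions the built-in `hashFiles` expression usually plays this role.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: derive a cache key from .mise.toml plus every file
# under tools/setup/, relative to the repo root passed as $1.
install_cache_key() {
  local root="$1"
  (
    cd "$root"
    # Hash each input file in a stable (sorted) order, then hash the list of
    # per-file digests, so any content change in any input yields a new key.
    { printf '%s\n' .mise.toml; find tools/setup -type f; } \
      | LC_ALL=C sort \
      | while IFS= read -r f; do sha256sum "$f"; done \
      | sha256sum \
      | cut -d' ' -f1
  )
}
```

Calling `install_cache_key "$PWD"` from the repo root prints a 64-character hex digest. Editing either `.mise.toml` or any file under `tools/setup/` changes the digest, which is the invalidation behaviour the commit message describes.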
…koff (Aaron 2026-04-28)
…-lint-jobs-2026-04-28 # Conflicts: # .github/workflows/gate.yml
Pull request overview
Updates CI workflows to be more resilient to transient toolchain download failures and to run on a newer Ubuntu runner image.
Changes:
- Increase `tools/setup/install.sh` retry loop in `gate.yml` lint jobs to 5 attempts with an explicit 10s/30s/60s/120s backoff.
- Bump several workflows' `runs-on` from `ubuntu-22.04` to `ubuntu-24.04`.
- Update/expand workflow comments around the install caching/retry rationale.
Reviewed changes
Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| .github/workflows/gate.yml | Updates lint jobs to ubuntu-24.04 and adds/expands install caching + 5-attempt retry/backoff logic. |
| .github/workflows/scorecard.yml | Runner bump to ubuntu-24.04. |
| .github/workflows/resume-diff.yml | Runner bump to ubuntu-24.04 and associated comment update. |
| .github/workflows/memory-index-duplicate-lint.yml | Runner bump to ubuntu-24.04 and associated comment update. |
| .github/workflows/budget-snapshot-cadence.yml | Runner bump to ubuntu-24.04. |
Comments suppressed due to low confidence (6)
.github/workflows/gate.yml:256
- P1: `timeout-minutes: 5` is likely too low now that the install step can sleep up to 220s (10+30+60+120) plus the runtime of up to 5 install attempts. If the network is flaky, the job can hit the workflow timeout and lose the intended retry/fail-after-5 behavior; consider increasing the job timeout (or moving retry/backoff into a separate step with its own timeout).
name: lint (shellcheck)
timeout-minutes: 5
runs-on: ubuntu-24.04
.github/workflows/gate.yml:358
- P1: `timeout-minutes: 5` may be too low given the new 5-attempt install retry with up to 220s of backoff sleeps (plus install runtimes). Consider increasing the timeout so the retry strategy can actually complete and produce the explicit "failed after 5 attempts" error instead of timing out.
name: lint (actionlint)
timeout-minutes: 5
runs-on: ubuntu-24.04
.github/workflows/gate.yml:523
- P1: With the install step now retried up to 5 times and backoff totaling 220s, `timeout-minutes: 5` risks the job timing out before the retry loop completes (especially on cold cache / slow CDN). Bumping the timeout would make the retries effective and preserve the intended failure message.
name: lint (markdownlint)
timeout-minutes: 5
runs-on: ubuntu-24.04
.github/workflows/gate.yml:19
- P1: This comment introduces personal-name attribution ("Aaron" and the task-style "Otto-247") in a workflow file. Repo convention is to use role references on current-state surfaces (like workflows) and keep personal/persona names limited to history surfaces; please rewrite this to use role-refs (e.g., "the human maintainer") and a non-name task reference if needed (see docs/AGENT-BEST-PRACTICES.md:284-346).
# Lint jobs pinned to ubuntu-24.04 (short-lived, OS-independent
# work; bumped from 22.04 on 2026-04-28 per Aaron's "we better
# not be using that old version of ubuntu" + Otto-247 version-
# currency WebSearch — ubuntu-latest = 24.04 since Jan 2025).
.github/workflows/gate.yml:266
- Typo in quoted text: "dev seutp" should be "dev setup".
# "we want to make sure dev seutp and build machine setup
.github/workflows/gate.yml:267
- Grammar/typo: "as close to the same a possible" should be "as close to the same as possible".
# are as close to the same a possible"
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: db62467258
      run: |
        set -euo pipefail
    -   for attempt in 1 2 3; do
    +   for attempt in 1 2 3 4 5; do
Keep retry budget within lint job timeout
This retry loop now allows five install attempts with 10/30/60/120s backoff, i.e. 220s of sleep before the final try, but the lint jobs still have a 5-minute timeout. That leaves less than 80s total for all install.sh executions, so under the transient CDN failures this change is trying to absorb (where install.sh itself already spends time retrying), the job can time out before attempt 5 is reached. The same pattern is duplicated in lint-workflows and lint-markdown, so the configured “5 attempts” is not reliably achievable unless timeout/backoff are adjusted together.
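The budget arithmetic in this comment can be checked with a short sketch. The helper name `retry_budget` is hypothetical. The inputs (a 5-minute job timeout, 5 attempts, 10/30/60/120s sleeps) come from the review text. The sketch ignores time spent in checkout, cache restore, and the other steps, which is why the real headroom is below the figure it prints.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: given a job timeout in seconds, an attempt count, and
# the backoff sleeps, report how much wall-clock time is left for the
# install runs themselves.
retry_budget() {
  local timeout_s="$1" attempts="$2"
  shift 2
  local total_sleep=0 s
  for s in "$@"; do
    total_sleep=$((total_sleep + s))
  done
  local remaining=$((timeout_s - total_sleep))
  echo "sleep=${total_sleep}s remaining=${remaining}s per_attempt=$((remaining / attempts))s"
}

retry_budget 300 5 10 30 60 120
# prints: sleep=220s remaining=80s per_attempt=16s
```

With the current settings each `install.sh` run would have to average about 16s for attempt 5 to be reachable. Re-running the helper with a larger first argument shows how much a higher `timeout-minutes` buys.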
…Otto-357 strengthen (#83)

* tick-history: 2026-04-28T05:44Z - PR #80 MERGED + #81 retry-bump + #82 Otto-357 strengthen + 3 conflict resolutions
* fix(pr-83): reconcile verify-don't-parrot streak count - 4 ticks running (was inconsistent 3 vs 4)

PR #83 review thread (P2 copilot): the row described the streak count as both "3 ticks running" early and "4 ticks running" later. The conflict was a scope mismatch: the early count was meant to be cumulative ticks-of-discipline-applied (4, matching the observations enumeration), but I'd written it as 3 from an older draft state. Reconciled to a single 4-count framing that explicitly references the observations column (which enumerates the 4 distinct verifications applied this tick: cron-id verify / AUTONOMOUS-LOOP.md grep / CronList freshness / retry-3-failed-on-#23 sourcing).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Bumps the install-step retry wrapper added in PR #80 from 3 → 5 attempts per Aaron's 2026-04-28 input. Backoff schedule extended from 10s/30s to 10s/30s/60s/120s — covers transient CDN outages from short-burst (~10s recovery) through multi-minute (full 1-2 minute outage).
Why
PR #23's `mise+bun-1.3.13` 502 burned all 3 of mise's internal retries; a 3-attempt wrapper on top of that didn't add enough margin (3 wrapper × 3 mise = 9 underlying attempts, but they all stacked within a ~30s window where the CDN was still 502'ing).

Aaron explicit: "Mise's internal 3-attempt retry was exhausted on PR #23. go to 5 or 10". Picked 5 (less aggressive; 5 wrapper × 3 mise = 15 underlying attempts spanning ~4 minutes).
Changes
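All 3 lint jobs (`lint-shell`, `lint-workflows`, `lint-markdown`): install retry loop bumped from 3 to 5 attempts, with backoff extended from 10s/30s to 10s/30s/60s/120s.

Plus comment update on the lint-shell job documenting the 3 → 5 bump rationale.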
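For orientation, a 5-attempt wrapper with a 10s/30s/60s/120s backoff might look roughly like the sketch below. This is an illustration under stated assumptions, not the code in `gate.yml`: the function name `retry_install` and the `SLEEP_CMD` override (added only so the loop can be exercised without real waiting) are hypothetical, and the log wording is modelled on the "failed after 5 attempts" message the reviewers mention.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper: run "$@" up to 5 times, sleeping 10s/30s/60s/120s
# between attempts. Set SLEEP_CMD=true to skip the real sleeps in a dry run.
retry_install() {
  local backoffs=(10 30 60 120)
  local max=$(( ${#backoffs[@]} + 1 ))
  local attempt
  for (( attempt = 1; attempt <= max; attempt++ )); do
    if "$@"; then
      return 0
    fi
    # No sleep after the final attempt: fail immediately instead.
    if (( attempt < max )); then
      local delay="${backoffs[attempt - 1]}"
      echo "attempt ${attempt}/${max} failed; retrying in ${delay}s" >&2
      "${SLEEP_CMD:-sleep}" "$delay"
    fi
  done
  echo "install failed after ${max} attempts" >&2
  return 1
}
```

In a CI step this would be invoked as `retry_install tools/setup/install.sh`. Deriving the attempt count from the backoff array keeps the two from drifting apart when the schedule is changed again.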
Test plan
🤖 Generated with Claude Code