Skip to content

fix(setup): retry verifier-jar download on transient 5xx (Otto-285)#484

Merged
AceHack merged 1 commit intomainfrom
fix/install-script-retry-on-transient-502
Apr 25, 2026
Merged

fix(setup): retry verifier-jar download on transient 5xx (Otto-285)#484
AceHack merged 1 commit intomainfrom
fix/install-script-retry-on-transient-502

Conversation

@AceHack
Copy link
Copy Markdown
Member

@AceHack AceHack commented Apr 25, 2026

Summary

tools/setup/common/verifiers.sh calls curl -fsSL to download tla2tools.jar and alloy.jar without retries. GitHub's release-asset CDN occasionally returns 502 / 5xx errors during install, killing CI on every job that uses the install script (which is most of them).

Most recent observed: 2026-04-25 ~13:52 UTC, hit:

Both PRs had zero code changes related to the failures — pure environmental flake.

The fix

Per Otto-285 (don't use DST/determinism to avoid edge-case handling — when we can't control real-world chaos, react to it algorithmically), the curl call now retries:

  • `--retry 5` attempts
  • `--retry-delay 2` (initial; exponential-backs to 2/4/8/16/32 s)
  • `--retry-all-errors` (so 4xx/5xx retry too — curl's default only retries connect/dns/408/429)

`-fsSL` semantics preserved: if all 5 attempts hit the same transient, the final exit still fails. We don't silently swallow persistent CDN outages.

Aaron's framing

"the right fix is making install robust to network blips ... that is the right answer we cant control that part of the real world environment we have to react to it, good call"

Test plan

  • `bash -n tools/setup/common/verifiers.sh` syntax OK
  • CI runs (this PR exercises the install script via every job that runs setup)
  • Future CI flake counts on transient 5xx should drop measurably

Doesn't fix

The mise:elan curl on PR #482 is in mise's plugin code, not ours. Filing a separate BACKLOG row to investigate whether mise has a similar retry config or whether we need to mirror these assets.

🤖 Generated with Claude Code

GitHub's release-asset CDN occasionally returns 502 / 5xx
errors. Most recent observed: 2026-04-25 ~13:52 UTC, hit
PR #481 (CodeQL csharp install) + PR #482 (markdownlint
install via mise:elan). Both PRs were unrelated to the
network — pure environmental flake.

Per Otto-285 (don't use determinism to avoid edge-case
handling — when we can't control real-world chaos, react to
it algorithmically), the install script's curl call now
includes:

- `--retry 5` attempts
- `--retry-delay 2` (initial; curl exponential-backs to
  2/4/8/16/32 s)
- `--retry-all-errors` (so 4xx/5xx errors retry too —
  curl's default only retries connect / dns / 408 / 429 /
  5xx-with-Retry-After header)

`-fsSL` semantics preserved — if all 5 attempts hit the
same transient, the final exit code still fails, so we
don't silently swallow persistent CDN outages.

Aaron 2026-04-25 affirmation: "that is the right answer we
cant control that part of the real world environment we
have to react to it, good call."

Verified: `bash -n` syntax check passes.

Composes with Otto-285 (DST is not edge-case avoidance) +
Otto-281 (fix determinism not the comment) + Otto-264 (rule
of balance — every CI flake we hit gets a counterweight
fix).
Copilot AI review requested due to automatic review settings April 25, 2026 14:08
@AceHack AceHack enabled auto-merge (squash) April 25, 2026 14:08
@AceHack AceHack merged commit 2e5579e into main Apr 25, 2026
16 checks passed
@AceHack AceHack deleted the fix/install-script-retry-on-transient-502 branch April 25, 2026 14:10
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the tools/setup/ bootstrap path by making verifier JAR downloads resilient to transient GitHub release-asset CDN failures (5xx), reducing CI/install flakiness while preserving failure-on-persistent-outage behavior.

Changes:

  • Add curl retry flags (--retry, --retry-delay, --retry-all-errors) when downloading verifier JARs.
  • Add inline rationale comments explaining the transient-failure motivation and retry behavior.

Comment on lines +47 to +52
# ~13:52 UTC, hit PR #481 CodeQL csharp + PR #482 markdownlint
# CI runs). Per Otto-285 (don't use determinism to avoid
# edge-case handling — handle the network-non-determinism
# algorithmically), curl handles the retry: `--retry 5` attempts,
# exponential backoff (2/4/8/16/32 s default), `--retry-all-errors`
# so 4xx/5xx server errors retry too (curl's default only retries
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: This comment block’s cross-reference looks incorrect: it says “PR #482 markdownlint”, but in this repo PR #482 is a different change (HLL fuzz test). Consider removing the PR-number references (they’ll drift) or updating them to the correct incident/PR IDs so the rationale stays accurate.

Suggested change
# ~13:52 UTC, hit PR #481 CodeQL csharp + PR #482 markdownlint
# CI runs). Per Otto-285 (don't use determinism to avoid
# edge-case handling — handle the network-non-determinism
# algorithmically), curl handles the retry: `--retry 5` attempts,
# exponential backoff (2/4/8/16/32 s default), `--retry-all-errors`
# so 4xx/5xx server errors retry too (curl's default only retries
# ~13:52 UTC during CI runs). Per Otto-285 (don't use
# determinism to avoid edge-case handling — handle the
# network-non-determinism algorithmically), curl handles the
# retry: `--retry 5` attempts, exponential backoff
# (2/4/8/16/32 s default), `--retry-all-errors` so 4xx/5xx
# server errors retry too (curl's default only retries

Copilot uses AI. Check for mistakes.
Comment on lines +50 to +52
# algorithmically), curl handles the retry: `--retry 5` attempts,
# exponential backoff (2/4/8/16/32 s default), `--retry-all-errors`
# so 4xx/5xx server errors retry too (curl's default only retries
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: The comment claims “exponential backoff (2/4/8/16/32 s default)”, but the command uses --retry-delay 2, which makes the delay behavior differ from curl’s default backoff (and may be a constant delay depending on curl version). Please align the comment with the actual retry behavior, or adjust the curl flags to match what the comment promises.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants