fix(setup): retry verifier-jar download on transient 5xx (Otto-285)#484
fix(setup): retry verifier-jar download on transient 5xx (Otto-285)#484
Conversation
GitHub's release-asset CDN occasionally returns 502 / 5xx errors. Most recent observed: 2026-04-25 ~13:52 UTC, hit PR #481 (CodeQL csharp install) + PR #482 (markdownlint install via mise:elan). Both PRs were unrelated to the network — pure environmental flake. Per Otto-285 (don't use determinism to avoid edge-case handling — when we can't control real-world chaos, react to it algorithmically), the install script's curl call now includes: - `--retry 5` attempts - `--retry-delay 2` (initial; curl exponential-backs to 2/4/8/16/32 s) - `--retry-all-errors` (so 4xx/5xx errors retry too — curl's default only retries connect / dns / 408 / 429 / 5xx-with-Retry-After header) `-fsSL` semantics preserved — if all 5 attempts hit the same transient, the final exit code still fails, so we don't silently swallow persistent CDN outages. Aaron 2026-04-25 affirmation: "that is the right answer we cant control that part of the real world environment we have to react to it, good call." Verified: `bash -n` syntax check passes. Composes with Otto-285 (DST is not edge-case avoidance) + Otto-281 (fix determinism not the comment) + Otto-264 (rule of balance — every CI flake we hit gets a counterweight fix).
There was a problem hiding this comment.
Pull request overview
This PR hardens the tools/setup/ bootstrap path by making verifier JAR downloads resilient to transient GitHub release-asset CDN failures (5xx), reducing CI/install flakiness while preserving failure-on-persistent-outage behavior.
Changes:
- Add
curlretry flags (--retry,--retry-delay,--retry-all-errors) when downloading verifier JARs. - Add inline rationale comments explaining the transient-failure motivation and retry behavior.
| # ~13:52 UTC, hit PR #481 CodeQL csharp + PR #482 markdownlint | ||
| # CI runs). Per Otto-285 (don't use determinism to avoid | ||
| # edge-case handling — handle the network-non-determinism | ||
| # algorithmically), curl handles the retry: `--retry 5` attempts, | ||
| # exponential backoff (2/4/8/16/32 s default), `--retry-all-errors` | ||
| # so 4xx/5xx server errors retry too (curl's default only retries |
There was a problem hiding this comment.
P1: This comment block’s cross-reference looks incorrect: it says “PR #482 markdownlint”, but in this repo PR #482 is a different change (HLL fuzz test). Consider removing the PR-number references (they’ll drift) or updating them to the correct incident/PR IDs so the rationale stays accurate.
| # ~13:52 UTC, hit PR #481 CodeQL csharp + PR #482 markdownlint | |
| # CI runs). Per Otto-285 (don't use determinism to avoid | |
| # edge-case handling — handle the network-non-determinism | |
| # algorithmically), curl handles the retry: `--retry 5` attempts, | |
| # exponential backoff (2/4/8/16/32 s default), `--retry-all-errors` | |
| # so 4xx/5xx server errors retry too (curl's default only retries | |
| # ~13:52 UTC during CI runs). Per Otto-285 (don't use | |
| # determinism to avoid edge-case handling — handle the | |
| # network-non-determinism algorithmically), curl handles the | |
| # retry: `--retry 5` attempts, exponential backoff | |
| # (2/4/8/16/32 s default), `--retry-all-errors` so 4xx/5xx | |
| # server errors retry too (curl's default only retries |
| # algorithmically), curl handles the retry: `--retry 5` attempts, | ||
| # exponential backoff (2/4/8/16/32 s default), `--retry-all-errors` | ||
| # so 4xx/5xx server errors retry too (curl's default only retries |
There was a problem hiding this comment.
P1: The comment claims “exponential backoff (2/4/8/16/32 s default)”, but the command uses --retry-delay 2, which makes the delay behavior differ from curl’s default backoff (and may be a constant delay depending on curl version). Please align the comment with the actual retry behavior, or adjust the curl flags to match what the comment promises.
Summary
tools/setup/common/verifiers.shcallscurl -fsSLto downloadtla2tools.jarandalloy.jarwithout retries. GitHub's release-asset CDN occasionally returns 502 / 5xx errors during install, killing CI on every job that uses the install script (which is most of them).Most recent observed: 2026-04-25 ~13:52 UTC, hit:
Both PRs had zero code changes related to the failures — pure environmental flake.
The fix
Per Otto-285 (don't use DST/determinism to avoid edge-case handling — when we can't control real-world chaos, react to it algorithmically), the curl call now retries:
`-fsSL` semantics preserved: if all 5 attempts hit the same transient, the final exit still fails. We don't silently swallow persistent CDN outages.
Aaron's framing
Test plan
Doesn't fix
The mise:elan curl on PR #482 is in mise's plugin code, not ours. Filing a separate BACKLOG row to investigate whether mise has a similar retry config or whether we need to mirror these assets.
🤖 Generated with Claude Code