Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions memory/MEMORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
- [**The git repo is the soulfile — binaries are scary, text history is fine (Aaron + Amara, 2026-04-29 RECALIBRATED)**](feedback_repo_is_soulfile_dont_commit_raw_diagnostic_dumps_aaron_amara_2026_04_29.md) — Recalibrated 2026-04-29: text compresses well in git pack-delta storage and is NOT the soulfile threat; binaries (compiled outputs, archives, large media, profile dumps in binary form) DON'T delta-compress and balloon clones forever. Aaron: *"don't go too hardcore on soulfile protection, text compresses very well, bin is what we are scared of and need to really really think about not history in text form."* PR-review readability for noisy text is a separate concern → recommended `.gitattributes` mitigation: `linguist-generated=true -diff` (NOT `-merge` — that unsets the merge driver and breaks 3-way text merges; landing in PR #761). Default: text → track freely; binary → git-lfs or non-soul repo.
- [**Corruption triage is a substrate health incident, not a backlog item (Aaron + Amara, 2026-04-29)**](feedback_corruption_triage_discipline_object_health_incident_aaron_amara_2026_04_29.md) — When `git fsck` reports corrupt objects, lane narrows hard: stop all background work, do read-only diagnosis first (no `git fsck --lost-found` — it writes), three-bucket reachability scan (live-ref / reflog-stash / dangling-only) — reachability is mode-dependent on fsck flags, fresh-clone verification BEFORE declaring "origin has it," verify squash-preservation by content not ancestry, stale remote-tracking refs are evidence not origin recovery. Worked example: 2026-04-29 audit found 2 corrupt objects — 9bf2daee (RECOVERABLE_FROM_ORIGIN) + 8d5e67fd (CORRUPT_BLOB_REFERENCED_BY_LIVE_LOCAL_BRANCH_AND_STALE_REMOTE_TRACKING_REF after three rounds of triage; branch tip clean, intermediate-history corrupt). Aaron emphasized: *"future self remembers this, this is very important."*
- [**PR-boundary restraint validation — bead promoted (Aaron + Aurora + Amara, 2026-04-29)**](feedback_pr_boundary_restraint_validation_bead_promoted_aaron_amara_2026_04_29.md) — Falsifier-not-fired bead-promotion on PR #699; canonical rule *"once a PR enters validation, only validation defects enter that PR; new ideas go to the next PR."* Validation-condition refinement (*"validated when original PR lands clean, not when follow-up opens"*) was Aurora's catch; Amara reactive-elaborator. Allowed/disallowed-changes lists in body.
- [**External dependency download retries — durable fix in code, not ephemeral rerun (Aaron, 2026-04-29)**](feedback_external_dependency_download_retries_durable_fix_not_ephemeral_rerun_aaron_2026_04_29.md) — Aaron 2026-04-29: *"we can retury on external dependency download failures, this goes against DST but we have not choice they are external dependencies we need. Next time instead of kicking a 2nd build we should fix it and reduce friction for future builds."* External dep downloads (toolchain installers, package mirrors, registry fetches) ARE the DST exception class — we have no choice. But the fix LOCATION matters: durable retry-with-backoff inside the code (`curl_fetch` --retry 5 in `tools/setup/`) reduces friction for FUTURE builds; ephemeral `gh run rerun --failed` only papers over THIS build and the friction returns next time. Refines the 2026-04-23 DST-retries-are-smell rule by naming the concrete domain + pinning the fix to the durable layer. Trigger: the elan-toolchain-502 rerun earlier this same session (anti-pattern Aaron caught). Durable fix landed alongside this memory: `tools/setup/common/elan.sh` + `tools/setup/linux.sh` migrated from raw `curl -fsSL` to `curl_fetch`.
- [**Beacon-promotion pattern — load-bearing rules earn external anchors when they're correct (Aaron + Amara + Claude.ai, 2026-04-28)**](feedback_beacon_promotion_load_bearing_rules_earn_external_anchors_aaron_amara_2026_04_28.md) — Round-level observation: 5 Mirror→Beacon graduations in one round (input-is-not-directive → SDT + RFC 2119; public-company compliance → SEC / Reg FD / SOX; metric corrections → Goodhart / Campbell; evidence lattice → lattice theory; commit-vs-tree → Git internals). Pattern: when an internal factory coinage becomes load-bearing, look for external lineage. Found = graduate Mirror → Beacon. Absent = drift signal worth investigating. Connects to alignment-experiment surface: the rate of load-bearing rules earning external lineage is itself a measurable signal.

These per-maintainer distillations show what's currently in force. Raw memories below are the history; CURRENT files are the projection. (`CURRENT-aaron.md` refreshed 2026-04-28 with sections 26-31 — speculation rule + EVIDENCE-BASED labeling + JVM preference + dependency honesty + threading lineage Albahari/Toub/Fowler + TypeScript/Bun-default discipline + Amara authority rule (default to reversible preservation).)
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,216 @@
---
name: External dependency download retries — durable fix in code, not ephemeral rerun via workflow (Aaron 2026-04-29)
description: Aaron 2026-04-29 — *"we can retury on external dependency download failures, this goes against DST but we have not choice they are external dependencies we need. Next time instead of kicking a 2nd build we should fix it and reduce friction for future builds."* External dependency downloads (toolchain installers, language runtimes, package mirrors, registry fetches) ARE the DST exception class — we have no choice; they are external and uncontrollable. BUT the fix location matters: durable retry-with-backoff inside the code (e.g., curl_fetch with --retry 5 in tools/setup/) reduces friction for FUTURE builds; ephemeral `gh run rerun --failed` only papers over THIS build and the friction returns next time. Composes with the 2026-04-23 DST-retries-are-smell rule (external-uncontrollable exception was already implied; this names the concrete domain + pins the fix to the durable layer).
type: feedback
---

# External dependency download retries — durable fix, not ephemeral rerun

## Source

Aaron 2026-04-29 (mid-tick correction during autonomous-loop):

> *"we can retury on external dependency download failures,
> this goes against DST but we have not choice they are
> external dependencies we need. Next time instead of kicking
> a 2nd build we should fix it and reduce friction for future
> builds."*

Typos preserved per
`memory/feedback_aaron_channel_verbatim_preservation_anything_through_this_channel_2026_04_29.md`
("retury" / "have not choice" / double-space — Aaron's exact
wording is signal).

## Trigger context

Earlier in the same session (`docs/hygiene-history/ticks/2026/04/29/`
shards prior to 0546Z), the autonomous-loop hit a transient
elan-toolchain `502` while CI was downloading the Lean 4
installer. I resolved it by running `gh run rerun --failed` —
the 2nd build succeeded because the upstream blip cleared.
Aaron's correction names that resolution as the wrong fix
location: the rerun made THIS build pass, but it did nothing
to reduce friction for the NEXT build that hits the same
mirror flake.

The right fix location is **inside the code**: the call site
that downloaded the elan installer in
`tools/setup/common/elan.sh` (the `elan-init.sh` curl-to-tmp
download) used raw `curl -fsSL` rather than the retry-equipped
helper `tools/setup/common/curl-fetch.sh::curl_fetch` that
already existed. Migrating elan.sh (and the parallel gap in
`tools/setup/linux.sh` for the mise tarball download) to
`curl_fetch` is the durable fix. (Line numbers omitted —
shell sources churn lines fast; grep for the URL constants
or the `curl_fetch` call to locate the current sites.)

## The rule (load-bearing)

```text
External dependency downloads are a recognized DST exception:
- toolchain installers (elan-init.sh, mise tarballs,
homebrew install.sh, dotnet-install.sh, rustup-init.sh)
- language runtime fetches (Lean toolchain pull, .NET SDK
download)
- package-registry pulls (NuGet restore, npm/bun install,
cargo fetch, lake update)
- GitHub release artifact downloads
- container registry pulls (in CI runners)
- git clone / fetch from upstream remotes

These ARE external + uncontrollable. We have no choice; the
factory needs them. Retry-with-backoff is legitimate per the
existing 2026-04-23 DST-retries-are-smell rule's "real
external uncontrollable reasons we can't control" exception.

The fix location matters:

Durable layer (PREFERRED) — retry inside the code:
- curl_fetch with --retry 5 --retry-delay 2
--retry-all-errors (file-output)
- dotnet restore with retry-equipped HTTP client
- mise install with curl-fetch.sh helper
- bun install / npm install with retry config
Effect: future builds inherit the retry; same flake
doesn't fail again.

Ephemeral layer (LAST RESORT) — rerun the build:
- gh run rerun --failed
- re-trigger the workflow
Effect: THIS build passes; same flake fails next time;
friction reaccumulates over the fleet of contributors
+ autonomous loop ticks.

The rule: when an external-dep download fails, FIRST check
whether the call site uses the retry-equipped helper. If not,
THAT is the durable fix. Only fall back to rerun if the
helper is already in use AND the failure is a one-shot
upstream incident that retry-equipped code already handled
on the second attempt.
```

## How to apply

When CI fails on a transient external-dependency download:

1. **Identify the call site.** The failure log names the
URL + the script. Open that script.
2. **Check whether it uses the retry-equipped helper.**
For shell scripts: is `curl_fetch` (or `curl_fetch_stream`
for installers piped to `sh`) sourced and used? For .NET:
does the dotnet command use the retry-equipped
`--source` / `--retry-on-error` flags or HTTP client?
For Node: is the package manager configured with retries?
3. **If the helper is bypassed, that IS the durable fix.**
Migrate the call site to the helper.
4. **If the helper is already in use AND the retry didn't
recover**, the failure is a deeper incident:
- Is the upstream service down for an extended window?
(Check status pages.)
- Is the URL pinned to a removed release? (Check
upstream.)
- Is the SHA256 mismatch deliberate (upstream
republished)? (Investigate.)
In these cases, `gh run rerun --failed` is rarely the
right answer — the friction needs different shape (URL
bump, mirror swap, fallback source).
5. **If genuine one-shot blip** (status pages clear, helper
in use, no upstream incident, retry-policy exhausted on
this exact moment), THEN `gh run rerun --failed` is OK.
But this is the rare case, not the default reach.

## Worked example: 2026-04-29 elan-toolchain 502

Sequence:
1. Autonomous-loop tick CI hit a 502 from
`raw.githubusercontent.com/leanprover/elan/...elan-init.sh`.
2. I ran `gh run rerun --failed`. 2nd build passed (transient
blip cleared).
3. Tick-history shard logged this as
"transient-CI-infrastructure-flake recovered via rerun."

Aaron's correction surfaced what was wrong with this
sequence: I treated the rerun as the resolution. The actual
resolution (per this rule) is migrating the elan.sh
call site to `curl_fetch` so the next 502 is absorbed by
`--retry 5` inside the install script.

The durable fix lands as part of the 2026-04-29 mid-tick
correction:

```diff
# tools/setup/common/elan.sh
- curl -fsSL "${ELAN_INIT_URL}" -o "${ELAN_INIT_TMP}"
+ source "$REPO_ROOT/tools/setup/common/curl-fetch.sh"
+ curl_fetch --output "${ELAN_INIT_TMP}" "${ELAN_INIT_URL}"

# tools/setup/linux.sh — same migration for mise tarball
- curl -fsSL "${MISE_URL}" -o "${MISE_TMP}/${MISE_TARBALL}"
+ curl_fetch --output "${MISE_TMP}/${MISE_TARBALL}" "${MISE_URL}"
```

(linux.sh already has REPO_ROOT + SETUP_DIR set up, just
needs the `source` line. elan.sh is invoked as a subprocess
from linux.sh + macos.sh, so it must source curl-fetch.sh
itself — sourcing in the parent does NOT propagate to
subprocess shells.)

## What this rule does NOT mean

- Doesn't mean "add retries everywhere." The DST rule still
applies for INTERNAL non-determinism — internal retries
remain a smell that needs root-causing.
- Doesn't mean "never rerun a build." Genuine one-shot
upstream blips that the retry-equipped code already
exhausted ARE the rare legitimate rerun case.
- Doesn't mean "all external = retryable." The exception is
scoped to **download / fetch failures from external
services that the factory legitimately depends on**. A
retry is NOT legitimate for, e.g., a flaky test that
happens to hit an external service if the underlying flake
is in our test code, not the external dep.
- Doesn't mean "ignore the friction entirely." Retry adds
latency on the slow path; the right shape is exponential
backoff with a bounded ceiling, not infinite retries.

## Composes with

- `memory/feedback_retries_are_non_determinism_smell_DST_holds_investigate_first_2026_04_23.md`
— the parent DST rule. This 2026-04-29 entry is the
concrete-domain refinement: external dep downloads ARE the
exception class, AND the fix lives in the code not the
workflow.
- `tools/setup/common/curl-fetch.sh` — the existing
retry-equipped helper, originally landed per Aaron's
2026-04-28 framing
(*"curl 502 pattern i mean why should a PR ever fail for
this? our code does not handle the retries already?"* +
*"sounds like a common helper would help too rather than
copy/paste."*). This rule extends that direction by adding
the rerun-vs-fix-location distinction.
- `memory/feedback_aaron_channel_verbatim_preservation_anything_through_this_channel_2026_04_29.md`
— Aaron's correction is preserved verbatim above per the
channel-verbatim rule.
- `memory/feedback_otto_355_blocked_with_green_ci_means_investigate_review_threads_first_dont_wait_2026_04_27.md`
— same pattern at the PR layer: investigate the cause
first, don't reach for the wait/rerun shortcut.

## Distilled keepers

```text
External dep downloads ARE the DST exception.
We have no choice; they're external.
Retry is legitimate at this boundary.
```

```text
Fix the friction in the code, not in the workflow.
gh run rerun --failed only saves THIS build.
curl_fetch in the script saves EVERY future build.
```

```text
Aaron 2026-04-29:
"instead of kicking a 2nd build we should fix it
and reduce friction for future builds."
```
17 changes: 16 additions & 1 deletion tools/setup/common/elan.sh
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,17 @@

set -euo pipefail

# elan.sh runs as a subprocess from linux.sh + macos.sh, so it must
# source curl-fetch.sh itself (a sourced helper in the parent shell
# does NOT propagate to subprocess shells). Provides the
# retry-equipped curl_fetch helper for the elan-init.sh download —
# absorbs transient upstream 5xx without requiring a workflow
# rerun (durable fix in code, not ephemeral retry at the workflow
# layer).
REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)"
# shellcheck source=tools/setup/common/curl-fetch.sh
source "$REPO_ROOT/tools/setup/common/curl-fetch.sh"

if ! command -v elan >/dev/null 2>&1; then
echo "↓ installing elan (Lean 4 toolchain manager)..."
# Pinned to v4.2.1 commit SHA + verified SHA256 of elan-init.sh.
Expand All @@ -26,7 +37,11 @@ if ! command -v elan >/dev/null 2>&1; then
# mismatch, installer non-zero exit). `set -euo pipefail` would
# otherwise leak the file on any failure path.
trap 'rm -f "${ELAN_INIT_TMP}"' EXIT
curl -fsSL "${ELAN_INIT_URL}" -o "${ELAN_INIT_TMP}"
# Retry-equipped fetch — absorbs transient upstream 5xx (the
# elan-init.sh URL on raw.githubusercontent.com has hit 502 in
# CI runs; durable retry in the script avoids needing a workflow
# rerun).
curl_fetch --output "${ELAN_INIT_TMP}" "${ELAN_INIT_URL}"
# Portable SHA256 verification: sha256sum (Linux/git-bash/WSL) or
# shasum -a 256 (macOS default). Per the 4-shell portability target
# (macOS bash 3.2 / Ubuntu / git-bash / WSL).
Expand Down
13 changes: 12 additions & 1 deletion tools/setup/linux.sh
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,15 @@ set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
SETUP_DIR="$REPO_ROOT/tools/setup"

# Retry-equipped curl helper — DST exception for external dep
# downloads, durable retry inside the script instead of ephemeral
# `gh run rerun --failed`. Sources curl_fetch (file-output, with
# `--retry 5 --retry-delay 2`, plus `--retry-all-errors` when the
# local curl supports it — curl-fetch.sh feature-detects and falls
# back on older curl builds).
# shellcheck source=tools/setup/common/curl-fetch.sh
source "$SETUP_DIR/common/curl-fetch.sh"

# ── Detect apt availability (Debian/Ubuntu) ─────────────────────────
if ! command -v apt-get >/dev/null 2>&1; then
echo "error: this script currently supports Debian/Ubuntu only"
Expand Down Expand Up @@ -84,7 +93,9 @@ if ! command -v mise >/dev/null 2>&1; then
# mismatch, tar extract failure). `set -euo pipefail` would otherwise
# leak the directory on any failure path.
trap 'rm -rf "${MISE_TMP}"' EXIT
curl -fsSL "${MISE_URL}" -o "${MISE_TMP}/${MISE_TARBALL}"
# Retry-equipped fetch — absorbs transient upstream 5xx without
# requiring a workflow rerun.
curl_fetch --output "${MISE_TMP}/${MISE_TARBALL}" "${MISE_URL}"
# Portable SHA256 verification: sha256sum (Linux) or shasum (macOS,
# though linux.sh runs on Linux only). Per the 4-shell portability
# target (macOS bash 3.2 / Ubuntu / git-bash / WSL).
Expand Down
Loading