diff --git a/memory/MEMORY.md b/memory/MEMORY.md index 4d25f706d..511acf364 100644 --- a/memory/MEMORY.md +++ b/memory/MEMORY.md @@ -9,6 +9,7 @@ - [**The git repo is the soulfile — binaries are scary, text history is fine (Aaron + Amara, 2026-04-29 RECALIBRATED)**](feedback_repo_is_soulfile_dont_commit_raw_diagnostic_dumps_aaron_amara_2026_04_29.md) — Recalibrated 2026-04-29: text compresses well in git pack-delta storage and is NOT the soulfile threat; binaries (compiled outputs, archives, large media, profile dumps in binary form) DON'T delta-compress and balloon clones forever. Aaron: *"don't go too hardcore on soulfile protection, text compresses very well, bin is what we are scared of and need to really really think about not history in text form."* PR-review readability for noisy text is a separate concern → recommended `.gitattributes` mitigation: `linguist-generated=true -diff` (NOT `-merge` — that unsets the merge driver and breaks 3-way text merges; landing in PR #761). Default: text → track freely; binary → git-lfs or non-soul repo. - [**Corruption triage is a substrate health incident, not a backlog item (Aaron + Amara, 2026-04-29)**](feedback_corruption_triage_discipline_object_health_incident_aaron_amara_2026_04_29.md) — When `git fsck` reports corrupt objects, lane narrows hard: stop all background work, do read-only diagnosis first (no `git fsck --lost-found` — it writes), three-bucket reachability scan (live-ref / reflog-stash / dangling-only) — reachability is mode-dependent on fsck flags, fresh-clone verification BEFORE declaring "origin has it," verify squash-preservation by content not ancestry, stale remote-tracking refs are evidence not origin recovery. Worked example: 2026-04-29 audit found 2 corrupt objects — 9bf2daee (RECOVERABLE_FROM_ORIGIN) + 8d5e67fd (CORRUPT_BLOB_REFERENCED_BY_LIVE_LOCAL_BRANCH_AND_STALE_REMOTE_TRACKING_REF after three rounds of triage; branch tip clean, intermediate-history corrupt). Aaron emphasized: *"future self remembers this, this is very important."* - [**PR-boundary restraint validation — bead promoted (Aaron + Aurora + Amara, 2026-04-29)**](feedback_pr_boundary_restraint_validation_bead_promoted_aaron_amara_2026_04_29.md) — Falsifier-not-fired bead-promotion on PR #699; canonical rule *"once a PR enters validation, only validation defects enter that PR; new ideas go to the next PR."* Validation-condition refinement (*"validated when original PR lands clean, not when follow-up opens"*) was Aurora's catch; Amara reactive-elaborator. Allowed/disallowed-changes lists in body. +- [**External dependency download retries — durable fix in code, not ephemeral rerun (Aaron, 2026-04-29)**](feedback_external_dependency_download_retries_durable_fix_not_ephemeral_rerun_aaron_2026_04_29.md) — Aaron 2026-04-29: *"we can retury on external dependency download failures, this goes against DST but we have not choice they are external dependencies we need. Next time instead of kicking a 2nd build we should fix it and reduce friction for future builds."* External dep downloads (toolchain installers, package mirrors, registry fetches) ARE the DST exception class — we have no choice. But the fix LOCATION matters: durable retry-with-backoff inside the code (`curl_fetch` --retry 5 in `tools/setup/`) reduces friction for FUTURE builds; ephemeral `gh run rerun --failed` only papers over THIS build and the friction returns next time. Refines the 2026-04-23 DST-retries-are-smell rule by naming the concrete domain + pinning the fix to the durable layer. Trigger: the elan-toolchain-502 rerun earlier this same session (anti-pattern Aaron caught). Durable fix landed alongside this memory: `tools/setup/common/elan.sh` + `tools/setup/linux.sh` migrated from raw `curl -fsSL` to `curl_fetch`. - [**Beacon-promotion pattern — load-bearing rules earn external anchors when they're correct (Aaron + Amara + Claude.ai, 2026-04-28)**](feedback_beacon_promotion_load_bearing_rules_earn_external_anchors_aaron_amara_2026_04_28.md) — Round-level observation: 5 Mirror→Beacon graduations in one round (input-is-not-directive → SDT + RFC 2119; public-company compliance → SEC / Reg FD / SOX; metric corrections → Goodhart / Campbell; evidence lattice → lattice theory; commit-vs-tree → Git internals). Pattern: when an internal factory coinage becomes load-bearing, look for external lineage. Found = graduate Mirror → Beacon. Absent = drift signal worth investigating. Connects to alignment-experiment surface: the rate of load-bearing rules earning external lineage is itself a measurable signal. These per-maintainer distillations show what's currently in force. Raw memories below are the history; CURRENT files are the projection. (`CURRENT-aaron.md` refreshed 2026-04-28 with sections 26-31 — speculation rule + EVIDENCE-BASED labeling + JVM preference + dependency honesty + threading lineage Albahari/Toub/Fowler + TypeScript/Bun-default discipline + Amara authority rule (default to reversible preservation).) diff --git a/memory/feedback_external_dependency_download_retries_durable_fix_not_ephemeral_rerun_aaron_2026_04_29.md b/memory/feedback_external_dependency_download_retries_durable_fix_not_ephemeral_rerun_aaron_2026_04_29.md new file mode 100644 index 000000000..d948c34d4 --- /dev/null +++ b/memory/feedback_external_dependency_download_retries_durable_fix_not_ephemeral_rerun_aaron_2026_04_29.md @@ -0,0 +1,216 @@ +--- +name: External dependency download retries — durable fix in code, not ephemeral rerun via workflow (Aaron 2026-04-29) +description: Aaron 2026-04-29 — *"we can retury on external dependency download failures, this goes against DST but we have not choice they are external dependencies we need. Next time instead of kicking a 2nd build we should fix it and reduce friction for future builds."* External dependency downloads (toolchain installers, language runtimes, package mirrors, registry fetches) ARE the DST exception class — we have no choice; they are external and uncontrollable. BUT the fix location matters: durable retry-with-backoff inside the code (e.g., curl_fetch with --retry 5 in tools/setup/) reduces friction for FUTURE builds; ephemeral `gh run rerun --failed` only papers over THIS build and the friction returns next time. Composes with the 2026-04-23 DST-retries-are-smell rule (external-uncontrollable exception was already implied; this names the concrete domain + pins the fix to the durable layer). +type: feedback +--- + +# External dependency download retries — durable fix, not ephemeral rerun + +## Source + +Aaron 2026-04-29 (mid-tick correction during autonomous-loop): + +> *"we can retury on external dependency download failures, +> this goes against DST but we have not choice they are +> external dependencies we need. Next time instead of kicking +> a 2nd build we should fix it and reduce friction for future +> builds."* + +Typos preserved per +`memory/feedback_aaron_channel_verbatim_preservation_anything_through_this_channel_2026_04_29.md` +("retury" / "have not choice" / double-space — Aaron's exact +wording is signal). + +## Trigger context + +Earlier in the same session (`docs/hygiene-history/ticks/2026/04/29/` +shards prior to 0546Z), the autonomous-loop hit a transient +elan-toolchain `502` while CI was downloading the Lean 4 +installer. I resolved it by running `gh run rerun --failed` — +the 2nd build succeeded because the upstream blip cleared. +Aaron's correction names that resolution as the wrong fix +location: the rerun made THIS build pass, but it did nothing +to reduce friction for the NEXT build that hits the same +mirror flake. + +The right fix location is **inside the code**: the call site +that downloaded the elan installer in +`tools/setup/common/elan.sh` (the `elan-init.sh` curl-to-tmp +download) used raw `curl -fsSL` rather than the retry-equipped +helper `tools/setup/common/curl-fetch.sh::curl_fetch` that +already existed. Migrating elan.sh (and the parallel gap in +`tools/setup/linux.sh` for the mise tarball download) to +`curl_fetch` is the durable fix. (Line numbers omitted — +shell sources churn lines fast; grep for the URL constants +or the `curl_fetch` call to locate the current sites.) + +## The rule (load-bearing) + +```text +External dependency downloads are a recognized DST exception: + - toolchain installers (elan-init.sh, mise tarballs, + homebrew install.sh, dotnet-install.sh, rustup-init.sh) + - language runtime fetches (Lean toolchain pull, .NET SDK + download) + - package-registry pulls (NuGet restore, npm/bun install, + cargo fetch, lake update) + - GitHub release artifact downloads + - container registry pulls (in CI runners) + - git clone / fetch from upstream remotes + +These ARE external + uncontrollable. We have no choice; the +factory needs them. Retry-with-backoff is legitimate per the +existing 2026-04-23 DST-retries-are-smell rule's "real +external uncontrollable reasons we can't control" exception. + +The fix location matters: + + Durable layer (PREFERRED) — retry inside the code: + - curl_fetch with --retry 5 --retry-delay 2 + --retry-all-errors (file-output) + - dotnet restore with retry-equipped HTTP client + - mise install with curl-fetch.sh helper + - bun install / npm install with retry config + Effect: future builds inherit the retry; same flake + doesn't fail again. + + Ephemeral layer (LAST RESORT) — rerun the build: + - gh run rerun --failed + - re-trigger the workflow + Effect: THIS build passes; same flake fails next time; + friction reaccumulates over the fleet of contributors + + autonomous loop ticks. + +The rule: when an external-dep download fails, FIRST check +whether the call site uses the retry-equipped helper. If not, +THAT is the durable fix. Only fall back to rerun if the +helper is already in use AND the failure is a one-shot +upstream incident that retry-equipped code already handled +on the second attempt. +``` + +## How to apply + +When CI fails on a transient external-dependency download: + +1. **Identify the call site.** The failure log names the + URL + the script. Open that script. +2. **Check whether it uses the retry-equipped helper.** + For shell scripts: is `curl_fetch` (or `curl_fetch_stream` + for installers piped to `sh`) sourced and used? For .NET: + does the dotnet command use the retry-equipped + `--source` / `--retry-on-error` flags or HTTP client? + For Node: is the package manager configured with retries? +3. **If the helper is bypassed, that IS the durable fix.** + Migrate the call site to the helper. +4. **If the helper is already in use AND the retry didn't + recover**, the failure is a deeper incident: + - Is the upstream service down for an extended window? + (Check status pages.) + - Is the URL pinned to a removed release? (Check + upstream.) + - Is the SHA256 mismatch deliberate (upstream + republished)? (Investigate.) + In these cases, `gh run rerun --failed` is rarely the + right answer — the friction needs different shape (URL + bump, mirror swap, fallback source). +5. **If genuine one-shot blip** (status pages clear, helper + in use, no upstream incident, retry-policy exhausted on + this exact moment), THEN `gh run rerun --failed` is OK. + But this is the rare case, not the default reach. + +## Worked example: 2026-04-29 elan-toolchain 502 + +Sequence: +1. Autonomous-loop tick CI hit a 502 from + `raw.githubusercontent.com/leanprover/elan/...elan-init.sh`. +2. I ran `gh run rerun --failed`. 2nd build passed (transient + blip cleared). +3. Tick-history shard logged this as + "transient-CI-infrastructure-flake recovered via rerun." + +Aaron's correction surfaced what was wrong with this +sequence: I treated the rerun as the resolution. The actual +resolution (per this rule) is migrating the elan.sh +call site to `curl_fetch` so the next 502 is absorbed by +`--retry 5` inside the install script. + +The durable fix lands as part of the 2026-04-29 mid-tick +correction: + +```diff +# tools/setup/common/elan.sh +- curl -fsSL "${ELAN_INIT_URL}" -o "${ELAN_INIT_TMP}" ++ source "$REPO_ROOT/tools/setup/common/curl-fetch.sh" ++ curl_fetch --output "${ELAN_INIT_TMP}" "${ELAN_INIT_URL}" + +# tools/setup/linux.sh — same migration for mise tarball +- curl -fsSL "${MISE_URL}" -o "${MISE_TMP}/${MISE_TARBALL}" ++ curl_fetch --output "${MISE_TMP}/${MISE_TARBALL}" "${MISE_URL}" +``` + +(linux.sh already has REPO_ROOT + SETUP_DIR set up, just +needs the `source` line. elan.sh is invoked as a subprocess +from linux.sh + macos.sh, so it must source curl-fetch.sh +itself — sourcing in the parent does NOT propagate to +subprocess shells.) + +## What this rule does NOT mean + +- Doesn't mean "add retries everywhere." The DST rule still + applies for INTERNAL non-determinism — internal retries + remain a smell that needs root-causing. +- Doesn't mean "never rerun a build." Genuine one-shot + upstream blips that the retry-equipped code already + exhausted ARE the rare legitimate rerun case. +- Doesn't mean "all external = retryable." The exception is + scoped to **download / fetch failures from external + services that the factory legitimately depends on**. A + retry is NOT legitimate for, e.g., a flaky test that + happens to hit an external service if the underlying flake + is in our test code, not the external dep. +- Doesn't mean "ignore the friction entirely." Retry adds + latency on the slow path; the right shape is exponential + backoff with a bounded ceiling, not infinite retries. + +## Composes with + +- `memory/feedback_retries_are_non_determinism_smell_DST_holds_investigate_first_2026_04_23.md` + — the parent DST rule. This 2026-04-29 entry is the + concrete-domain refinement: external dep downloads ARE the + exception class, AND the fix lives in the code not the + workflow. +- `tools/setup/common/curl-fetch.sh` — the existing + retry-equipped helper, originally landed per Aaron's + 2026-04-28 framing + (*"curl 502 pattern i mean why should a PR ever fail for + this? our code does not handle the retries already?"* + + *"sounds like a common helper would help too rather than + copy/paste."*). This rule extends that direction by adding + the rerun-vs-fix-location distinction. +- `memory/feedback_aaron_channel_verbatim_preservation_anything_through_this_channel_2026_04_29.md` + — Aaron's correction is preserved verbatim above per the + channel-verbatim rule. +- `memory/feedback_otto_355_blocked_with_green_ci_means_investigate_review_threads_first_dont_wait_2026_04_27.md` + — same pattern at the PR layer: investigate the cause + first, don't reach for the wait/rerun shortcut. + +## Distilled keepers + +```text +External dep downloads ARE the DST exception. +We have no choice; they're external. +Retry is legitimate at this boundary. +``` + +```text +Fix the friction in the code, not in the workflow. +gh run rerun --failed only saves THIS build. +curl_fetch in the script saves EVERY future build. +``` + +```text +Aaron 2026-04-29: + "instead of kicking a 2nd build we should fix it + and reduce friction for future builds." +``` diff --git a/tools/setup/common/elan.sh b/tools/setup/common/elan.sh index 058591279..2b2440c3d 100755 --- a/tools/setup/common/elan.sh +++ b/tools/setup/common/elan.sh @@ -10,6 +10,17 @@ set -euo pipefail +# elan.sh runs as a subprocess from linux.sh + macos.sh, so it must +# source curl-fetch.sh itself (a sourced helper in the parent shell +# does NOT propagate to subprocess shells). Provides the +# retry-equipped curl_fetch helper for the elan-init.sh download — +# absorbs transient upstream 5xx without requiring a workflow +# rerun (durable fix in code, not ephemeral retry at the workflow +# layer). +REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)" +# shellcheck source=tools/setup/common/curl-fetch.sh +source "$REPO_ROOT/tools/setup/common/curl-fetch.sh" + if ! command -v elan >/dev/null 2>&1; then echo "↓ installing elan (Lean 4 toolchain manager)..." # Pinned to v4.2.1 commit SHA + verified SHA256 of elan-init.sh. @@ -26,7 +37,11 @@ if ! command -v elan >/dev/null 2>&1; then # mismatch, installer non-zero exit). `set -euo pipefail` would # otherwise leak the file on any failure path. trap 'rm -f "${ELAN_INIT_TMP}"' EXIT - curl -fsSL "${ELAN_INIT_URL}" -o "${ELAN_INIT_TMP}" + # Retry-equipped fetch — absorbs transient upstream 5xx (the + # elan-init.sh URL on raw.githubusercontent.com has hit 502 in + # CI runs; durable retry in the script avoids needing a workflow + # rerun). + curl_fetch --output "${ELAN_INIT_TMP}" "${ELAN_INIT_URL}" # Portable SHA256 verification: sha256sum (Linux/git-bash/WSL) or # shasum -a 256 (macOS default). Per the 4-shell portability target # (macOS bash 3.2 / Ubuntu / git-bash / WSL). diff --git a/tools/setup/linux.sh b/tools/setup/linux.sh index 3206c9d3b..2e446be30 100755 --- a/tools/setup/linux.sh +++ b/tools/setup/linux.sh @@ -23,6 +23,15 @@ set -euo pipefail REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)" SETUP_DIR="$REPO_ROOT/tools/setup" +# Retry-equipped curl helper — DST exception for external dep +# downloads, durable retry inside the script instead of ephemeral +# `gh run rerun --failed`. Sources curl_fetch (file-output, with +# `--retry 5 --retry-delay 2`, plus `--retry-all-errors` when the +# local curl supports it — curl-fetch.sh feature-detects and falls +# back on older curl builds). +# shellcheck source=tools/setup/common/curl-fetch.sh +source "$SETUP_DIR/common/curl-fetch.sh" + # ── Detect apt availability (Debian/Ubuntu) ───────────────────────── if ! command -v apt-get >/dev/null 2>&1; then echo "error: this script currently supports Debian/Ubuntu only" @@ -84,7 +93,9 @@ if ! command -v mise >/dev/null 2>&1; then # mismatch, tar extract failure). `set -euo pipefail` would otherwise # leak the directory on any failure path. trap 'rm -rf "${MISE_TMP}"' EXIT - curl -fsSL "${MISE_URL}" -o "${MISE_TMP}/${MISE_TARBALL}" + # Retry-equipped fetch — absorbs transient upstream 5xx without + # requiring a workflow rerun. + curl_fetch --output "${MISE_TMP}/${MISE_TARBALL}" "${MISE_URL}" # Portable SHA256 verification: sha256sum (Linux) or shasum (macOS, # though linux.sh runs on Linux only). Per the 4-shell portability # target (macOS bash 3.2 / Ubuntu / git-bash / WSL).