diff --git a/docs/BACKLOG.md b/docs/BACKLOG.md
index 8581956fa4..0e361a6a5c 100644
--- a/docs/BACKLOG.md
+++ b/docs/BACKLOG.md
@@ -780,6 +780,7 @@ are closed (status: closed in frontmatter)._
 - [ ] **[B-0842](backlog/P2/B-0842-universal-basis-decomposition-pattern-fft-shazam-itron-disaggregation-reservoir-readout-zeta-substrate-aaron-2026-05-26.md)** universal basis-decomposition pattern — FFT + Picard/Shazam-style fingerprinting + Itron energy-disaggregation + reservoir-computing readout + Zeta substrate are all instances of `Σ ωᵢ sᵢ(t) ≈ y(t)` (Aaron 2026-05-26)
 - [ ] **[B-0843](backlog/P2/B-0843-source-honor-ledger-basis-royalty-pool-creator-paid-by-design-rights-lanes-discipline-for-b0841-b0842-amara-aaron-2026-05-26.md)** Source Honor Ledger / Basis Royalty Pool — creator-paid-by-design operational implementation + 5-rights-lane discipline + "Don't sell compression. Sell readout." product positioning (extends B-0841 + B-0842 per Amara 3rd ferry 2026-05-26)
 - [ ] **[B-0845](backlog/P2/B-0845-ts-cli-arg-parser-library-evaluation-citty-vs-commander-vs-clipanion-vs-manual-aaron-2026-05-26.md)** TS CLI arg-parser library evaluation — citty vs commander vs clipanion vs manual; choose canonical for Zeta's many shell-script-like TS tools (Aaron 2026-05-26)
+- [ ] **[B-0846](backlog/P2/B-0846-installer-wifi-reproducibility-cache-nixos-org-timeouts-closure-baking-extra-substituters-cachix-mirror-aaron-2026-05-26.md)** installer WiFi-reproducibility — cache.nixos.org timeouts hang nixos-install on same N derivations; closure-baking into ISO + extra-substituters + Cachix mirror for reproducible-over-WiFi target (Aaron 2026-05-26)
 - [ ] **[B-0847](backlog/P2/B-0847-each-ai-gets-own-github-identity-with-email-once-cluster-operational-substrate-honest-attribution-end-to-end-closes-enabledby-token-owner-not-actor-algo-wink-aaron-2026-05-26.md)** each Zeta AI gets own GitHub identity + email once cluster operational — substrate-honest attribution end-to-end (closes the `gh enabledBy = token-owner ≠ actor` algo-wink-attribution-gap; Ilyana review for public-surface name + email before any creation) (Aaron 2026-05-26)
 
 ## P3 — convenience / deferred
diff --git a/docs/backlog/P2/B-0846-installer-wifi-reproducibility-cache-nixos-org-timeouts-closure-baking-extra-substituters-cachix-mirror-aaron-2026-05-26.md b/docs/backlog/P2/B-0846-installer-wifi-reproducibility-cache-nixos-org-timeouts-closure-baking-extra-substituters-cachix-mirror-aaron-2026-05-26.md
new file mode 100644
index 0000000000..ab0fdea58c
--- /dev/null
+++ b/docs/backlog/P2/B-0846-installer-wifi-reproducibility-cache-nixos-org-timeouts-closure-baking-extra-substituters-cachix-mirror-aaron-2026-05-26.md
@@ -0,0 +1,156 @@
+---
+id: B-0846
+priority: P2
+status: open
+title: installer WiFi-reproducibility — cache.nixos.org timeouts hang nixos-install on same N derivations; closure-baking into ISO + extra-substituters + Cachix mirror for reproducible-over-WiFi target (Aaron 2026-05-26)
+effort: M
+ask: aaron 2026-05-26
+created: 2026-05-26
+last_updated: 2026-05-26
+depends_on:
+  - B-0834
+composes_with:
+  - B-0831
+  - B-0832
+  - B-0833
+  - B-0835
+tags: [installer, nixos-install, wifi-reproducibility, cache-nixos-org, closure-baking, substituters, cachix, network-resilience, physical-hardware-support-test, empirical-anchor, operator-ux]
+---
+
+## Problem
+
+Empirical 2026-05-26 physical hardware-support test (Aaron over WiFi):
+**`nixos-install` hits `Timeout was reached (28) Operation too slow` against `cache.nixos.org` on the SAME ~5 derivations TWICE IN A ROW, each retry burning the default 300s `stalled-download-timeout` before bailing.** The install is "moving forward" but each stalled-retry cycle costs ~10 min and the same 5 derivations keep timing out, suggesting a particular set of large/uncached/slow-route derivations rather than transient flake.
+
+Operator framing 2026-05-26:
+
+> *"any ideas is it stuck?"* → *"yeah i want to make it reproducable over wifi it's moving forward i just got 5 more warnings"* → *"i got timeouts on the same 5 files"* → *"twices in a row"* → *"after 300 seconds"*
+
+This is the OPERATOR-NAMED target: **WiFi-reproducible install**. The current bounded-fix (this PR's other commit, to the `zeta-install.sh` invocation) adds `--fallback` + tuned timeouts so the install no longer loops on the same files; it falls back to a local build when cache.nixos.org stalls. That fix UNBLOCKS the install but doesn't solve the underlying reproducibility-over-WiFi problem; for that we need closure-baking + alternate substituters.
+
+## Why the same 5 derivations keep timing out
+
+Hypotheses (preserve per default-to-both):
+
+1. **Specific derivations are large** (e.g., perl-5.40.0 + linux kernel + nvidia-driver + gcc) — large file size + WiFi MTU + cache CDN edge-routing combine into a consistent timeout on specific paths
+2. **Cache.nixos.org CDN routing is dropping packets to operator's specific edge** — WiFi network + ISP path + Fastly CDN routing all interact; some path-pairs are consistently slow
+3. **These derivations are NOT in any local store on the ISO** — they require a fresh download on every install, which amplifies any transient flake
+4. **`cache.nixos.org` is in degraded state today** (worth checking against [status.nixos.org](https://status.nixos.org/))
+
+The bounded-fix's `--fallback` makes nix build from source when substitution fails, which UNBLOCKS but is SLOWER per derivation (compile vs download). For WiFi-reproducibility long-term, the right shape is:
+
+- Bake the closure for typical full-ai-cluster node configurations INTO the ISO at build time, so install doesn't need internet at all for those derivations
+- Have fallback substituters (Cachix mirror, attic-served local mirror in the home lab, etc.) so cache.nixos.org timeouts don't gate progress
+- Tune `nix.conf` defaults at ISO build time so future installs benefit without per-invocation flags
+
+## Proposed mitigation (3 phases)
+
+### Phase 1 — Closure-baking into ISO (biggest win)
+
+At ISO build time (in `iso.nix`):
+
+1. Compute the closure for `nixosConfigurations.` — gives all derivations a typical full-ai-cluster install needs
+2. Include that closure in the ISO's Nix store via `pkgs.closureInfo` + `isoImage.storeContents` (not `isoImage.contents`, which copies plain files to fixed locations on the ISO filesystem)
+3. Result: install can run OFFLINE for the baked closure; only derivations that diverge from the baked closure need internet
+
+Cost: ISO grows from ~2 GB → ~6-8 GB (full closure size). Acceptable for the WiFi-reproducibility win + offline install ability.
+
+Reference: NixOS minimal installer ISO already does this for the minimal closure. Need to extend for full-ai-cluster closure.
+
+### Phase 2 — Alternate substituters in `nix.conf`
+
+In the installer NixOS config (illustrative shape; substituter URLs + pubkeys are PLACEHOLDERS that MUST be verified at implementation time per `.claude/rules/dep-pin-search-first-authority.md` — URLs can drift, pubkeys can rotate, mirrors can deprecate; copy/paste from this row at implementation time is the failure mode the dep-pin rule catches):
+
+```nix
+nix.settings = {
+  substituters = [
+    ""
+    ""
+    # Future: self-hosted Cachix or attic mirror in home lab
+    # ""
+  ];
+  trusted-public-keys = [
+    ""
+    ""
+  ];
+};
+```
+
+Result: nix queries the configured substituters in priority order (not in parallel); if cache.nixos.org stalls on a path, another substituter that also holds that path can serve it, while paths only cache.nixos.org has still depend on Phase 1 or `--fallback`. The implementer WebSearches current values + cites the source per dep-pin discipline.
+
+### Phase 3 — Self-hosted mirror in home lab
+
+Once the cluster is up:
+
+1. Run an [attic](https://github.com/zhaofengli/attic) or [harmonia](https://github.com/nix-community/harmonia) server on the cluster
+2. Pre-warm it with the full-ai-cluster closure
+3. Subsequent node-adds + iterations pull from local LAN at gigabit instead of WiFi-to-CDN
+4. Closes the home-lab reproducibility loop: cluster self-serves its own derivations
+
+This depends on B-0831 cascade #6 + the cluster being operational; sequencing matters.
+
+## Acceptance
+
+### Phase 1 (closure-baking) — load-bearing
+
+- [ ] ISO build time computes closure for canonical full-ai-cluster node config
+- [ ] `pkgs.closureInfo` + `isoImage.storeContents` include the closure
+- [ ] Fresh install with WiFi disabled mid-install completes for baked derivations (manual operator test)
+- [ ] ISO size growth documented; tradeoff acceptable
+
+### Phase 2 (extra-substituters) — composable with Phase 1
+
+- [ ] `nix.settings.substituters` includes nix-community.cachix.org
+- [ ] `trusted-public-keys` includes corresponding cachix key
+- [ ] Install proceeds when cache.nixos.org times out (validated via injected slow-route test)
+
+### Phase 3 (home-lab mirror) — long-horizon
+
+- [ ] attic or harmonia deployed on cluster
+- [ ] Pre-warm script populates mirror with full-ai-cluster closure
+- [ ] Subsequent node-adds prefer LAN mirror over WAN cache (substituter priority tuned)
+- [ ] Documented in install-runbook
+
+## Composes with
+
+- B-0831 cascade #6 (CI testing infrastructure — Phase 1 closure-baking test belongs in CI)
+- B-0832 (nmtui WiFi rescan — operator's first networking touchpoint)
+- B-0833 (gh auth — closure-baking eliminates gh download from install path; doesn't help with auth itself)
+- B-0834 (preserve install log — diagnostic surface for future cache timeouts)
+- B-0835 (Bug 1 + Bug 3b — install completes; this row makes install RESILIENT)
+- `.claude/rules/dep-pin-search-first-authority.md` — the Phase 2 implementer MUST WebSearch + verify current cachix.org URLs + pubkeys at implementation time
+
+## Bounded-fix shipped today (same PR, separate commit)
+
+This same PR (B-0846 row authored here) ALSO ships the bounded mitigation in `full-ai-cluster/usb-nixos-installer/zeta-install.sh`:
+
+- `--fallback` so nix builds from source when substitute download fails
+- `--option connect-timeout 10` so dead substituters drop fast (vs default 0 = no timeout)
+- `--option stalled-download-timeout 60` so the 5-minute-per-retry burn is cut to 1 minute
+- `--option download-attempts 3` so the retry loop is capped (default 5) and reaches fallback sooner
+
+That bounded-fix unblocks Aaron's IMMEDIATE WiFi install but doesn't solve the longer-term reproducibility-over-WiFi target. This row tracks the proper substrate.
+
+## Why P2
+
+- P0/P1 would be "install blocks completely on WiFi" — but the bounded-fix in this PR moves install from "looping forever on 5 files" to "slower but progressing"
+- P2 captures the substrate-engineering work to make WiFi-install FAST + RELIABLE rather than SLOW + RELIABLE
+- Sequenced after iter-5.4 install validation completes (which the bounded-fix unblocks)
+
+## Full reasoning
+
+Operator's 2026-05-26 verbatim test session:
+
+1. Initial install warnings → "any ideas is it stuck?"
+2. After explanation → "yeah i want to make it reproducable over wifi it's moving forward i just got 5 more warnings"
+3. Empirical narrowing → "i got timeouts on the same 5 files" + "twices in a row" + "after 300 seconds"
+
+The "same 5 files, twice in a row, each after 300s" empirical anchor is what makes this a structural problem rather than transient flake: a particular set of derivations is consistently failing the WiFi route, suggesting closure-baking or alternate-substituter is the right level of intervention rather than just retry tuning.
+
+Sub-rows likely needed:
+
+- B-0846.1: Phase 1 closure-baking implementation + ISO size tradeoff documentation
+- B-0846.2: Phase 2 nix-community.cachix.org substituter addition + pubkey verification (WebSearch per dep-pin rule at implementation time)
+- B-0846.3: Phase 3 home-lab attic/harmonia mirror (depends on cluster operational)
+
+To be filed as the row matures + Phase 1 is implementable.
diff --git a/full-ai-cluster/usb-nixos-installer/zeta-install.sh b/full-ai-cluster/usb-nixos-installer/zeta-install.sh
index ff84460d69..8f977cfa23 100755
--- a/full-ai-cluster/usb-nixos-installer/zeta-install.sh
+++ b/full-ai-cluster/usb-nixos-installer/zeta-install.sh
@@ -977,7 +977,30 @@ echo "Running nixos-install --flake /mnt/etc/zeta/full-ai-cluster#$HOST ..."
 # - initial-password.nix does NOT use builtins.readFile (per B-0835
 #   Bug 3b fix uses activation-script instead); its hash file (which
 #   IS a secret) doesn't transit the impure-eval path
-sudo nixos-install --impure --flake "/mnt/etc/zeta/full-ai-cluster#$HOST" --no-root-password
+#
+# WiFi-reproducibility (empirical 2026-05-26: cache.nixos.org timeouts
+# on same 5 derivations twice in a row over WiFi):
+#   --fallback: build from source if substitute download fails (don't bail
+#     — keeps the install moving even when cache is flaky)
+#   --option connect-timeout 10: drop dead substituter connections fast
+#     instead of waiting the default 0 (=no timeout)
+#   --option stalled-download-timeout 60: cut the 300s default by 5×; a
+#     stalled download is detected sooner so retry or fallback
+#     fires faster
+#   --option download-attempts 3: cap retries (default 5) so the loop
+#     bounded-progresses to fallback
+# Slower for the few stalled derivations (local build vs cache download)
+# but UNBLOCKS the install instead of looping on the same 5 files.
+# Full reproducibility work (closure-baking, Cachix mirror, extra-substituters)
+# tracked at B-0846.
+sudo nixos-install \
+  --impure \
+  --fallback \
+  --option connect-timeout 10 \
+  --option stalled-download-timeout 60 \
+  --option download-attempts 3 \
+  --flake "/mnt/etc/zeta/full-ai-cluster#$HOST" \
+  --no-root-password
 
 # Explicit cleanup at end (defense-in-depth; trap also handles this on
 # success OR failure exit paths).
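
The "same 5 derivations, twice in a row" observation that motivates B-0846 can be checked against a preserved install log (B-0834) rather than read off the console. The following is a minimal sketch, not part of this PR. It assumes each timeout is logged on a single line containing `unable to download '<url>': Timeout was reached (28)`, which should be confirmed against a real log. The function name `count_stalled` and any log path passed to it are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of zeta-install.sh): list download URLs that
# timed out more than once in an install log, most frequent first.
# Assumption: each timeout appears on one line of the form
#   ... unable to download '<url>': Timeout was reached (28) ...
count_stalled() {
  local log="$1"
  # Keep only timeout lines, extract the quoted URL, then count repeats.
  grep -F 'Timeout was reached (28)' "$log" \
    | grep -o "unable to download '[^']*'" \
    | sed -e "s/^unable to download '//" -e "s/'\$//" \
    | sort | uniq -c | sort -rn \
    | awk '$1 > 1 { print $1, $2 }'
}
```

Running `count_stalled <path-to-install-log>` prints one `count url` pair per URL that stalled at least twice. An identical list across two install attempts would support the "specific derivations" hypotheses over transient flake. A list that changes between attempts would point back toward general network instability.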