fix(B-0835)+feat(B-0846): zeta-install --fallback + nix-timeout tuning (WiFi cache.nixos.org timeout resilience; empirical 5-files-timeout-twice over WiFi)#5383
Merged
Pull request overview
This PR hardens the USB NixOS installer’s nixos-install step against flaky cache.nixos.org WiFi downloads by enabling fallback-to-local-build and tightening Nix download timeouts, and it adds a P2 backlog row tracking longer-horizon “WiFi-reproducible install” substrate work.
Changes:
- Update `zeta-install.sh` to run `nixos-install` with `--fallback` plus tuned `connect-timeout`, `stalled-download-timeout`, and `download-attempts`.
- Add backlog row B-0846 documenting the observed timeout behavior and a phased mitigation plan (closure baking + extra substituters + mirror).
- Add B-0846 entry to `docs/BACKLOG.md`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| full-ai-cluster/usb-nixos-installer/zeta-install.sh | Adds bounded Nix download resilience flags to prevent repeated cache timeouts from stalling installs. |
| docs/backlog/P2/B-0846-installer-wifi-reproducibility-cache-nixos-org-timeouts-closure-baking-extra-substituters-cachix-mirror-aaron-2026-05-26.md | New P2 row capturing the empirical WiFi timeout issue and outlining phased mitigation work. |
| docs/BACKLOG.md | Adds the B-0846 index entry under P2. |
… cache.nixos.org timeout resilience + B-0846 (WiFi-reproducibility substrate-engineering)
Empirical 2026-05-26 physical hardware-support test (Aaron over WiFi):
nixos-install hit `Timeout was reached (28) Operation too slow` against
cache.nixos.org on the SAME 5 derivations TWICE IN A ROW, each retry
burning the default 300s stalled-download-timeout before bailing.
Bounded fix (this commit):
sudo nixos-install --impure --fallback \
--option connect-timeout 10 \
--option stalled-download-timeout 60 \
--option download-attempts 3 \
--flake ... --no-root-password
- --fallback: build-from-source when substitute download fails (don't bail
— keeps install MOVING when cache is flaky)
- connect-timeout 10: drop dead connections fast (default 0 = infinity)
- stalled-download-timeout 60: cut 300s retry burn by 5×
- download-attempts 3: cap retries (default 5) so loop progresses to fallback
Tradeoff: slower for the few stalled derivations (local compile vs cache
download) but UNBLOCKS the install instead of looping forever.
Larger WiFi-reproducibility work tracked at B-0846 (this same PR):
- Phase 1: closure-baking the canonical full-ai-cluster node closure INTO
the ISO at build time (offline-install capability)
- Phase 2: extra-substituters in nix.conf (nix-community.cachix.org +
future self-hosted mirror)
- Phase 3: home-lab attic/harmonia mirror (cluster self-serves its own
derivations over LAN)
Operator framing: "yeah i want to make it reproducable over wifi" +
"i got timeouts on the same 5 files" + "twices in a row" + "after 300
seconds" (empirical anchor that's a structural problem, not transient).
Per `.claude/rules/dep-pin-search-first-authority.md`: Phase 2 substituter
URLs + pubkeys MUST WebSearch + verify current values at implementation
time (not training-data defaults).
Composes with: B-0832 + B-0833 + B-0834 + B-0835 + B-0831 cascade #6.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
…e typo (reproducable → reproducible) + Phase 2 pubkey placeholders + BACKLOG index regen

4 substantive Copilot findings, all verified + fixed:

1. zeta-install.sh:995 "B-0846+ (per backlog row to be filed in this PR)" → "B-0846." The row IS in this PR; "to be filed" was stale wording.
2. B-0846.md frontmatter title: "reproducable over wifi" → "reproducible-over-WiFi target". Misspelling reserved for verbatim operator quotes only (preserved in operator-framing quote block on line 27).
3. BACKLOG.md line 783: index entry inherited the misspelling from the title; regen now picks up the corrected title.
4. B-0846.md Phase 2 example listed concrete substituter URLs + pubkeys inline; these were converted to <VERIFY-AT-IMPL> placeholders with an explicit pointer to .claude/rules/dep-pin-search-first-authority.md so the implementer WebSearches + cites current values rather than copying potentially-stale strings from this row.

All findings verified via direct file inspection per .claude/rules/blocked-green-ci-investigate-threads.md verify-before-fix discipline (no FP-class catches; all 4 real).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
67b169d to a2fb697
AceHack added a commit that referenced this pull request on May 27, 2026
…--option fallback true (Nix-option pass-through) (#5410)

Empirical failure 2026-05-27 (Aaron USB boot, ISO ci26490417201 / commit 282648d):

    Running nixos-install --flake /mnt/etc/zeta/full-ai-cluster#control-plane --fallback ...
    /run/current-system/sw/bin/nixos-install: unknown option `--fallback`
    [zeta-first-boot] Install failed. See output above.

Dropped to interactive shell; install completely blocked.

Root cause: PR #5383 (`fix(B-0835)+feat(B-0846): zeta-install --fallback`) added `--fallback` as a TOP-LEVEL flag to nixos-install, assuming it was passed through to underlying nix. nixos-install does NOT accept --fallback; the supported form is `--option fallback true` (the Nix-option pass-through convention nixos-install already uses for connect-timeout / stalled-download-timeout / download-attempts in the same invocation block).

Fix: 1-line change `--fallback \` → `--option fallback true \` plus comment update noting the empirical anchor + correct pass-through form.

Operator unblock for the broken interactive shell on Aaron's USB:

    sed -i 's|^ --fallback \\| --option fallback true \\|' \
      /run/current-system/sw/bin/zeta-install
    zeta-install control-plane

Composes with:
- B-0835 (installer-config-bugs canonical bag): adds Bug 9 to the catalog
- B-0846 (WiFi-reproducibility substrate): this fix preserves the intent (build-from-source fallback) using the correct API

Successful empirical anchor for next ISO build will validate the fix.

Co-authored-by: Lior <lior@zeta.dev>
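The corrected invocation described in the follow-up commit can be sketched as a dry run. This is only an illustration under stated assumptions: `build_install_args` is a hypothetical helper that does not exist in `zeta-install.sh`, the flake reference is the one quoted in the failure log, and the flag values are the ones listed in this PR. The command is printed rather than executed.

```shell
# Sketch: pass fallback through --option instead of as a top-level flag.
# build_install_args is a hypothetical helper, not part of zeta-install.sh.
build_install_args() {
  # $1: flake reference (assumed to be a path#attribute string)
  printf '%s\n' \
    --impure \
    --option fallback true \
    --option connect-timeout 10 \
    --option stalled-download-timeout 60 \
    --option download-attempts 3 \
    --flake "$1" \
    --no-root-password
}

# Dry run: join the arguments onto one line and print the command.
install_cmd="sudo nixos-install $(build_install_args '/mnt/etc/zeta/full-ai-cluster#control-plane' | tr '\n' ' ')"
echo "$install_cmd"
```

Keeping every Nix setting in the `--option NAME VALUE` form means a single convention covers all four settings, which is the pass-through form the follow-up commit reports as supported.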
Summary
Aaron's USB install hit `cache.nixos.org` timeouts on the same 5 derivations twice in a row over WiFi, each after 300s. The default Nix invocation loops indefinitely; the bounded fix here adds `--fallback` so the install switches to a local build when a substitute download stalls.
Two commits in one PR:
Bounded fix to `full-ai-cluster/usb-nixos-installer/zeta-install.sh`:
```bash
sudo nixos-install --impure --fallback \
--option connect-timeout 10 \
--option stalled-download-timeout 60 \
--option download-attempts 3 \
--flake ... --no-root-password
```
Tradeoff: slower for the few stalled derivations (compile vs download) but UNBLOCKS the install instead of looping forever.
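The effect of the tuned values can be estimated with simple arithmetic. The sketch below assumes a rough upper bound of attempts × stalled-download-timeout per stalled derivation, where every attempt stalls for the full timeout; it ignores retry backoff and connection time, and `worst_case_stall_seconds` is a hypothetical helper used only for illustration.

```shell
# Rough upper bound on time spent waiting on one stalled derivation:
# attempts x stalled-download-timeout. Ignores retry backoff and connect time.
worst_case_stall_seconds() {
  # $1: download-attempts, $2: stalled-download-timeout in seconds
  echo $(( $1 * $2 ))
}

default_wait=$(worst_case_stall_seconds 5 300)  # defaults cited in this PR
tuned_wait=$(worst_case_stall_seconds 3 60)     # values set by this PR

echo "default: ${default_wait}s per stalled derivation"  # default: 1500s
echo "tuned:   ${tuned_wait}s per stalled derivation"    # tuned:   180s
```

Under these assumptions the wait before the fallback build can start drops from about 25 minutes to 3 minutes per stalled derivation.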
Substrate-engineering work tracked at B-0846:
Operator framing
The "same 5 files, twice in a row, after 300 seconds" empirical anchor is what makes this a structural problem rather than a transient flake.
Test plan
Composes with
Per `.claude/rules/dep-pin-search-first-authority.md`: B-0846 Phase 2 substituter URLs + pubkeys MUST WebSearch + verify current values at implementation time.
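One way the verify-at-implementation rule could be enforced is a guard that refuses to render substituter settings while a placeholder remains. This is a sketch only: `render_substituter_conf` is a hypothetical helper, the `<VERIFY-AT-IMPL>` marker is taken from the B-0846 row, and no real substituter URL or public key is supplied here.

```shell
# Sketch: emit nix.conf substituter lines only when both values are filled in.
# render_substituter_conf is hypothetical; URL and key come from the implementer.
render_substituter_conf() {
  # $1: substituter URL, $2: trusted public key
  case "$1 $2" in
    *VERIFY-AT-IMPL*)
      echo "refusing to render: unverified placeholder present" >&2
      return 1
      ;;
  esac
  printf 'extra-substituters = %s\n' "$1"
  printf 'extra-trusted-public-keys = %s\n' "$2"
}

# With placeholders left in place the helper fails instead of emitting config.
render_substituter_conf '<VERIFY-AT-IMPL>' '<VERIFY-AT-IMPL>' \
  || echo "placeholders still unfilled"
```

Failing closed keeps a stale or copied value from reaching `nix.conf` without someone having looked it up first.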
🤖 Generated with Claude Code