
fix(B-0835)+feat(B-0846): zeta-install --fallback + nix-timeout tuning (WiFi cache.nixos.org timeout resilience; empirical 5-files-timeout-twice over WiFi)#5383

Merged
AceHack merged 2 commits into main from feat-wifi-fallback-zeta-install-2026-05-26-2150z
May 27, 2026

@AceHack AceHack commented May 27, 2026

Summary

Aaron's USB install hit `cache.nixos.org` timeouts on the same 5 derivations twice in a row, after 300s each, over WiFi. The default nix invocation loops indefinitely; the bounded fix here adds `--fallback` so the install switches to a local build when a substitute download stalls.

Two commits in one PR:

  1. Bounded fix to `full-ai-cluster/usb-nixos-installer/zeta-install.sh`:

    ```bash
    sudo nixos-install --impure --fallback \
    --option connect-timeout 10 \
    --option stalled-download-timeout 60 \
    --option download-attempts 3 \
    --flake ... --no-root-password
    ```

    • `--fallback`: build-from-source when substitute download fails
    • `connect-timeout 10`: drop dead connections fast (default 0 = infinity)
    • `stalled-download-timeout 60`: cut 300s retry burn by 5×
    • `download-attempts 3`: cap retries (default 5) so loop progresses to fallback

    Tradeoff: slower for the few stalled derivations (compile vs download) but UNBLOCKS the install instead of looping forever.

  2. Substrate-engineering work tracked at B-0846:

    • Phase 1: closure-baking the canonical full-ai-cluster node closure INTO the ISO at build time (offline-install capability)
    • Phase 2: extra-substituters in nix.conf (nix-community.cachix.org + future self-hosted mirror)
    • Phase 3: home-lab attic/harmonia mirror (cluster self-serves its own derivations over LAN)
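The tuned invocation from item 1 could be wrapped roughly as follows. This is a minimal sketch, not the code in `zeta-install.sh`: `run_install` is a hypothetical helper, the flake reference is a placeholder because the summary elides the real one, and the fallback setting is written as `--option fallback true`, the pass-through form that the follow-up commit later in this thread reports `nixos-install` accepts.

```bash
#!/usr/bin/env bash
# Sketch only: run_install is a hypothetical helper, not code from zeta-install.sh.
set -eu

# run_install FLAKE_REF CMD...
# Appends the tuned options to CMD (e.g. `sudo nixos-install`, or
# `echo nixos-install` for a dry run) and runs it.
#   fallback true               -> build locally when substitution fails
#   connect-timeout 10          -> give up on dead connections after 10 s
#   stalled-download-timeout 60 -> cancel idle downloads after 60 s (default 300)
#   download-attempts 3         -> cap download attempts (default 5)
run_install() {
  flake_ref="$1"
  shift
  "$@" --impure \
    --option fallback true \
    --option connect-timeout 10 \
    --option stalled-download-timeout 60 \
    --option download-attempts 3 \
    --flake "$flake_ref" --no-root-password
}

# Dry run: print the command instead of executing it. The flake reference is a
# placeholder because the PR text elides the real one.
run_install "path/to/flake#host" echo nixos-install
```

Passing the command as trailing arguments keeps the option list in one place and lets the same function serve both a dry run and the real `sudo nixos-install` call.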

Operator framing

"yeah i want to make it reproducable over wifi"
"i got timeouts on the same 5 files"
"twices in a row"
"after 300 seconds"

The "same 5 files, twice in a row, after 300 seconds each" observation is what makes this a structural problem rather than a transient flake.

Test plan

  • Edit applied + commits clean
  • CI build-iso passes (ISO build itself doesn't exercise the install-time `--fallback` flag, but should not regress)
  • Operator validation on next USB flash: nixos-install no longer loops on the same 5 files; either downloads succeed faster (connect-timeout drops dead connections sooner) or fallback-build kicks in within 60s instead of 300s
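The retry budget behind the validation step can be estimated with simple arithmetic. The sketch below is a rough upper bound only: it assumes every attempt stalls for the full `stalled-download-timeout` window, that the 5 derivations are fetched one after another, and it ignores any retry backoff Nix adds. `worst_case_stall_seconds` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch only: a back-of-the-envelope bound, not a model of Nix's scheduler.
set -eu

# worst_case_stall_seconds ATTEMPTS STALL_TIMEOUT DERIVATIONS
# Upper bound on time spent waiting if every attempt stalls for the full window.
worst_case_stall_seconds() {
  echo $(( $1 * $2 * $3 ))
}

# Nix defaults: 5 attempts, 300 s stall window, 5 stuck derivations.
echo "default: $(worst_case_stall_seconds 5 300 5) s"   # default: 7500 s
# Tuned values from this PR: 3 attempts, 60 s stall window.
echo "tuned:   $(worst_case_stall_seconds 3 60 5) s"    # tuned:   900 s
```

Under these assumptions a single stalled derivation could wait up to 3 × 60 = 180 s before the fallback build starts, so the 60 s figure is best read as the per-attempt window.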

Composes with

Per `.claude/rules/dep-pin-search-first-authority.md`: B-0846 Phase 2 substituter URLs + pubkeys MUST WebSearch + verify current values at implementation time.
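One way to enforce the verify-at-implementation rule mechanically is to refuse to emit substituter settings while a placeholder is still present. The sketch below assumes the `<VERIFY-AT-IMPL>` placeholder convention mentioned in the review-fix commit; `render_substituter_conf` is a hypothetical helper, and the URL and key in the demo call are dummy values, not real cache coordinates.

```bash
#!/usr/bin/env bash
# Sketch only: render_substituter_conf is hypothetical; demo values are dummies.
set -eu

# render_substituter_conf URL PUBKEY
# Prints nix.conf lines for an extra substituter, or fails if either value is
# empty or still contains the <VERIFY-AT-IMPL> placeholder.
render_substituter_conf() {
  sub_url="$1"
  sub_key="$2"
  for value in "$sub_url" "$sub_key"; do
    case "$value" in
      ""|*"<VERIFY-AT-IMPL>"*)
        echo "refusing to render: unverified value '$value'" >&2
        return 1
        ;;
    esac
  done
  printf 'extra-substituters = %s\n' "$sub_url"
  printf 'extra-trusted-public-keys = %s\n' "$sub_key"
}

# Dummy values for illustration; real ones must be looked up and verified.
render_substituter_conf "https://mirror.example.invalid" "example-1:dummy-key"
```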

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 27, 2026 02:20

@AceHack AceHack enabled auto-merge (squash) May 27, 2026 02:21

Copilot AI left a comment


Pull request overview

This PR hardens the USB NixOS installer’s nixos-install step against flaky cache.nixos.org WiFi downloads by enabling fallback-to-local-build and tightening Nix download timeouts, and it adds a P2 backlog row tracking longer-horizon “WiFi-reproducible install” substrate work.

Changes:

  • Update zeta-install.sh to run nixos-install with --fallback plus tuned connect-timeout, stalled-download-timeout, and download-attempts.
  • Add backlog row B-0846 documenting the observed timeout behavior and a phased mitigation plan (closure baking + extra substituters + mirror).
  • Add B-0846 entry to docs/BACKLOG.md.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| full-ai-cluster/usb-nixos-installer/zeta-install.sh | Adds bounded Nix download resilience flags to prevent repeated cache timeouts from stalling installs. |
| docs/backlog/P2/B-0846-installer-wifi-reproducibility-cache-nixos-org-timeouts-closure-baking-extra-substituters-cachix-mirror-aaron-2026-05-26.md | New P2 row capturing the empirical WiFi timeout issue and outlining phased mitigation work. |
| docs/BACKLOG.md | Adds the B-0846 index entry under P2. |

Comment thread full-ai-cluster/usb-nixos-installer/zeta-install.sh Outdated
Comment thread docs/BACKLOG.md Outdated
Lior and others added 2 commits May 26, 2026 22:27
… cache.nixos.org timeout resilience + B-0846 (WiFi-reproducibility substrate-engineering)

Empirical 2026-05-26 physical hardware-support test (Aaron over WiFi):
nixos-install hit `Timeout was reached (28) Operation too slow` against
cache.nixos.org on the SAME 5 derivations TWICE IN A ROW, each retry
burning the default 300s stalled-download-timeout before bailing.

Bounded fix (this commit):

  sudo nixos-install --impure --fallback \
    --option connect-timeout 10 \
    --option stalled-download-timeout 60 \
    --option download-attempts 3 \
    --flake ... --no-root-password

- --fallback: build-from-source when substitute download fails (don't bail
  — keeps install MOVING when cache is flaky)
- connect-timeout 10: drop dead connections fast (default 0 = infinity)
- stalled-download-timeout 60: cut 300s retry burn by 5×
- download-attempts 3: cap retries (default 5) so loop progresses to fallback

Tradeoff: slower for the few stalled derivations (local compile vs cache
download) but UNBLOCKS the install instead of looping forever.

Larger WiFi-reproducibility work tracked at B-0846 (this same PR):

- Phase 1: closure-baking the canonical full-ai-cluster node closure INTO
  the ISO at build time (offline-install capability)
- Phase 2: extra-substituters in nix.conf (nix-community.cachix.org +
  future self-hosted mirror)
- Phase 3: home-lab attic/harmonia mirror (cluster self-serves its own
  derivations over LAN)

Operator framing: "yeah i want to make it reproducable over wifi" +
"i got timeouts on the same 5 files" + "twices in a row" + "after 300
seconds" (empirical anchor showing this is a structural problem, not a
transient one).

Per `.claude/rules/dep-pin-search-first-authority.md`: Phase 2 substituter
URLs + pubkeys MUST WebSearch + verify current values at implementation
time (not training-data defaults).

Composes with: B-0832 + B-0833 + B-0834 + B-0835 + B-0831 cascade #6.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…e typo (reproducable → reproducible) + Phase 2 pubkey placeholders + BACKLOG index regen

4 substantive Copilot findings, all verified + fixed:

1. zeta-install.sh:995 "B-0846+ (per backlog row to be filed in this PR)"
   → "B-0846." The row IS in this PR; "to be filed" was stale wording.

2. B-0846.md frontmatter title: "reproducable over wifi" → "reproducible-
   over-WiFi target". Misspelling reserved for verbatim operator quotes
   only (preserved in operator-framing quote block on line 27).

3. BACKLOG.md line 783: index entry inherited the misspelling from the
   title; regen now picks up the corrected title.

4. B-0846.md Phase 2 example listed concrete substituter URLs + pubkeys
   inline — converted to <VERIFY-AT-IMPL> placeholders with explicit
   pointer to .claude/rules/dep-pin-search-first-authority.md so the
   implementer WebSearches + cites current values rather than copying
   potentially-stale strings from this row.

All findings verified via direct file inspection per
.claude/rules/blocked-green-ci-investigate-threads.md verify-before-fix
discipline (no FP-class catches; all 4 real).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 27, 2026 02:27
@AceHack AceHack force-pushed the feat-wifi-fallback-zeta-install-2026-05-26-2150z branch from 67b169d to a2fb697 Compare May 27, 2026 02:27

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@AceHack AceHack merged commit 7c9447a into main May 27, 2026
31 checks passed
@AceHack AceHack deleted the feat-wifi-fallback-zeta-install-2026-05-26-2150z branch May 27, 2026 02:30
AceHack added a commit that referenced this pull request May 27, 2026
…--option fallback true (Nix-option pass-through) (#5410)

Empirical failure 2026-05-27 (Aaron USB boot, ISO ci26490417201 / commit 282648d):

  Running nixos-install --flake /mnt/etc/zeta/full-ai-cluster#control-plane --fallback ...
  /run/current-system/sw/bin/nixos-install: unknown option `--fallback`
  [zeta-first-boot] Install failed. See output above.

Dropped to interactive shell; install completely blocked.

Root cause: PR #5383 (`fix(B-0835)+feat(B-0846): zeta-install --fallback`)
added `--fallback` as a TOP-LEVEL flag to nixos-install, assuming it was
passed through to underlying nix. nixos-install does NOT accept --fallback;
the supported form is `--option fallback true` (the Nix-option pass-through
convention nixos-install already uses for connect-timeout / stalled-download-
timeout / download-attempts in the same invocation block).

Fix: 1-line change `--fallback \` → `--option fallback true \`
plus comment update noting the empirical anchor + correct pass-through form.

Operator unblock for the broken interactive shell on Aaron's USB:
  sed -i 's|^  --fallback \\|  --option fallback true \\|' \
    /run/current-system/sw/bin/zeta-install
  zeta-install control-plane
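
On NixOS, `/run/current-system/sw/bin` usually points into the Nix store, which is typically mounted read-only, so an in-place `sed -i` there may be refused. A hedged alternative is to patch a writable copy and run that instead. In the sketch below `patch_fallback_flag` is a hypothetical helper, and the pattern assumes (as the `sed` expression above does) that the script has `--fallback \` on its own line with a two-space indent.

```bash
#!/usr/bin/env bash
# Sketch only: patch_fallback_flag is hypothetical; the demo uses temp files.
set -eu

# patch_fallback_flag SRC DST
# Writes a copy of SRC to DST with the bare flag rewritten to the
# --option pass-through form, and fails if the rewrite did not happen.
patch_fallback_flag() {
  src="$1"
  dst="$2"
  sed 's|^  --fallback \\|  --option fallback true \\|' "$src" > "$dst"
  chmod +x "$dst"
  if ! grep -q -- '--option fallback true' "$dst"; then
    echo "pattern not found in $src; nothing was patched" >&2
    return 1
  fi
}

# Demo against a throwaway file that imitates the relevant script lines.
demo_src=$(mktemp)
demo_dst=$(mktemp)
printf 'nixos-install --impure \\\n  --fallback \\\n  --flake x\n' > "$demo_src"
patch_fallback_flag "$demo_src" "$demo_dst"
cat "$demo_dst"
rm -f "$demo_src" "$demo_dst"
```

On the affected machine the call might look like `patch_fallback_flag /run/current-system/sw/bin/zeta-install /tmp/zeta-install`, followed by `/tmp/zeta-install control-plane`; the `/tmp` path is an assumption.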

Composes with:
- B-0835 (installer-config-bugs canonical bag) — adds Bug 9 to the catalog
- B-0846 (WiFi-reproducibility substrate) — this fix preserves the intent
  (build-from-source fallback) using the correct API

A successful install from the next ISO build will validate the fix.

Co-authored-by: Lior <lior@zeta.dev>
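
A cheap static check could catch this class of regression before an ISO is built. The sketch below is an assumption about how such a check might look, not something in the repository: `lint_nixos_install_flags` is a hypothetical helper that flags a bare `--fallback` token on any non-comment line of a script. It is deliberately narrow; `--fallback` is a valid flag for other Nix commands, so the check only makes sense for files that call `nixos-install`.

```bash
#!/usr/bin/env bash
# Sketch only: lint_nixos_install_flags is a hypothetical pre-build check.
set -eu

# lint_nixos_install_flags FILE
# Fails (and prints the offending lines) if FILE contains a bare --fallback
# token outside of comment lines.
lint_nixos_install_flags() {
  if grep -nE -- '(^|[[:space:]])--fallback([[:space:]]|$)' "$1" \
      | grep -vE '^[0-9]+:[[:space:]]*#' >&2; then
    echo "error: bare --fallback in $1; use '--option fallback true'" >&2
    return 1
  fi
}

# Demo: a script using the pass-through form passes the check.
demo=$(mktemp)
printf '# old form was --fallback\nnixos-install \\\n  --option fallback true \\\n  --flake x\n' > "$demo"
lint_nixos_install_flags "$demo" && echo "lint passed"
rm -f "$demo"
```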