fix(B-0754 iter-3): per-device partprobe — bare partprobe was hitting the boot USB (/dev/sda) + bailing#5057
Merged
AceHack merged 3 commits intoMay 26, 2026
Conversation
… the USB we booted from (/dev/sda) + failing set -euo pipefail bail Iteration-2 cluster-node test (Aaron 2026-05-25, photo evidence on PC 1 first real-hardware boot of iter-2): - All steps PRIOR to partprobe succeeded end-to-end on first attempt: wifi, banner, greedy N-disk enumeration (correctly identified both Crucial CT1000P3PSSD8 NVMes with serials), ZETA_AUTO_CONFIRM=WIPE bypass, wipefs+sgdisk on both NVMes, GPT partition creation on both NVMes - Then: 'Error: Partition(s) 1 on /dev/sda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use' → script bailed via set -euo pipefail → drop_to_shell fired correctly with recovery hints Root cause: zeta-install.sh called bare (no args). partprobe with no args probes EVERY block device the kernel knows about. Linux exposes USB mass-storage devices as /dev/sda when there are no SATA disks (this cluster node has only NVMes + the boot USB). The booted live ISO has mounted partitions on /dev/sda; partprobe rightfully refuses to refresh those (it's the running root filesystem); returns non-zero; set -euo pipefail bails. The greedy N-disk enumeration ALREADY correctly excluded the USB (TRAN=usb filter in lsblk); we never partitioned /dev/sda. The partprobe call was the only blanket-all-devices invocation in the whole script. Fix: pass only the disks WE just partitioned. Per-device partprobe on BOOT_DISK + each DATA_DISKS entry. Never touch /dev/sda. This is the substrate-honest 'do it right so it's not ambiguous for future me/users' fix Aaron asked for — the script now operates on explicit-target devices throughout, never blanket-system-wide. Composes with B-0754 iter-2 (the systemd PATH fix that unblocked us reaching this far). Iter-1 surfaced PATH bug; iter-2 surfaced this partprobe bug; iter-3 fixes it; each iteration narrows the failure modes for the substrate operators will run forever. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
Pull request overview
This PR updates the AI-cluster USB NixOS installer script to avoid calling partprobe without arguments (which probes every block device, including the live-boot USB), and instead refreshes the kernel partition table only for the disks that were just partitioned.
Changes:
- Replace blanket
sudo partprobewith per-devicepartprobecalls forBOOT_DISKand eachDATA_DISKSentry. - Add inline rationale documenting why blanket probing is unsafe in this installer context.
… — silences Intel SoF audio probe error during boot Iter-2 PC 1 boot showed cosmetic but scary 'ASoC: failed to instantiate card -2 / snd_soc_register_card failed -2' from sof-audio-pci-intel-mtl (Intel Meteor Lake HD Audio + SoundWire). errno -2 = ENOENT = SoF firmware blobs not on the ISO. hardware.enableRedistributableFirmware = true pulls in linux-firmware which includes Intel SoF + WiFi + Bluetooth + NIC firmware. Redistributable-only license; no allowUnfree needed. ~80MB added to ISO. Operationally: audio is not load-bearing for cluster substrate, but per B-0759 first-time-CLI-user persona, scary ERROR lines in dmesg are UX noise we don't need. Also enables firmware probe to complete cleanly which is prerequisite for B-0771 (audio stack + NPU substrate for DAW + AI workloads scope) landing later. Bundled into iter-3 PR with the partprobe fix so we get ONE ISO rebuild instead of two. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xplicit partprobe failure handler per disk Two small clarity fixes from Copilot review on #5057: P3: '/dev/sda' isn't guaranteed across hardware/boot order — reworded comment to 'kernel typically exposes USB mass-storage as /dev/sdX — commonly /dev/sda on boards with no SATA disks; specific letter isn't guaranteed but failure mode is the same regardless of letter'. P2: bare partprobe with set -euo pipefail aborts install silently — added explicit '|| bail "partprobe failed for ..."' per disk that names the offending disk + suggests manual recovery commands. Operator sees actionable error instead of silent set -euo pipefail exit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Iteration 2 result (cluster node 1, real-hardware test, Aaron 2026-05-25)
Photo evidence on PC 1 shows iter-2 reached 98% of the install path on first try:
Error: Partition(s) 1 on /dev/sda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in useRoot cause
zeta-install.sh called bare
sudo partprobe(no args). partprobe with no args probes EVERY block device the kernel knows about. Linux exposes USB mass-storage as/dev/sdawhen no SATA disks present. The booted live ISO has mounted partitions on /dev/sda; partprobe refuses to refresh (rightfully); returns non-zero;set -euo pipefailbails.The greedy N-disk enum ALREADY correctly excluded the USB (TRAN=usb filter). We never partitioned /dev/sda. The partprobe call was the only blanket-all-devices invocation in the whole script.
Fix
Per-device partprobe on BOOT_DISK + each DATA_DISKS entry. Never blanket. Never touch /dev/sda.
Aaron 2026-05-25: 'i would rather do it right so it's not ambigious for future me / users' — the script now operates on explicit-target devices throughout, no blanket-system-wide invocations remaining.
Test plan
bash -nsyntax check