Skip to content

fix(B-0754 iter-3): per-device partprobe — bare partprobe was hitting the boot USB (/dev/sda) + bailing#5057

Merged
AceHack merged 3 commits into
mainfrom
otto-cli/b0754-iter3-partprobe-per-device-2026-05-25
May 26, 2026
Merged

fix(B-0754 iter-3): per-device partprobe — bare partprobe was hitting the boot USB (/dev/sda) + bailing#5057
AceHack merged 3 commits into
mainfrom
otto-cli/b0754-iter3-partprobe-per-device-2026-05-25

Conversation

@AceHack
Copy link
Copy Markdown
Member

@AceHack AceHack commented May 26, 2026

Iteration 2 result (cluster node 1, real-hardware test, Aaron 2026-05-25)

Photo evidence on PC 1 shows iter-2 reached 98% of the install path on first try:

  • Wifi connected
  • Banner shown
  • Greedy N-disk enum: both Crucial CT1000P3PSSD8 NVMes correctly identified with serials
  • Plan presented (BOOT nvme0n1: ESP 1G + root 256G + longhorn1 rest; DATA nvme1n1: whole disk longhorn2)
  • ZETA_AUTO_CONFIRM=WIPE bypass: WORKED
  • wipefs + sgdisk on both NVMes: SUCCESS
  • GPT partition creation on both NVMes: SUCCESS
  • Then: Error: Partition(s) 1 on /dev/sda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use
  • drop_to_shell fired correctly with recovery hints

Root cause

zeta-install.sh called bare sudo partprobe (no args). partprobe with no args probes EVERY block device the kernel knows about. Linux exposes USB mass-storage as /dev/sda when no SATA disks present. The booted live ISO has mounted partitions on /dev/sda; partprobe refuses to refresh (rightfully); returns non-zero; set -euo pipefail bails.

The greedy N-disk enum ALREADY correctly excluded the USB (TRAN=usb filter). We never partitioned /dev/sda. The partprobe call was the only blanket-all-devices invocation in the whole script.

Fix

Per-device partprobe on BOOT_DISK + each DATA_DISKS entry. Never blanket. Never touch /dev/sda.

Aaron 2026-05-25: 'i would rather do it right so it's not ambigious for future me / users' — the script now operates on explicit-target devices throughout, no blanket-system-wide invocations remaining.

Test plan

  • bash -n syntax check
  • CI rebuilds ISO via build-ai-cluster-iso.yml
  • Aaron reflashes via zflash + boots cluster node 1 + observes unattended install reaches end-to-end (cluster member after reboot)
  • CI green

… the USB we booted from (/dev/sda) + failing set -euo pipefail bail

Iteration-2 cluster-node test (Aaron 2026-05-25, photo evidence
on PC 1 first real-hardware boot of iter-2):

- All steps PRIOR to partprobe succeeded end-to-end on first
  attempt: wifi, banner, greedy N-disk enumeration (correctly
  identified both Crucial CT1000P3PSSD8 NVMes with serials),
  ZETA_AUTO_CONFIRM=WIPE bypass, wipefs+sgdisk on both NVMes,
  GPT partition creation on both NVMes
- Then: 'Error: Partition(s) 1 on /dev/sda have been written,
  but we have been unable to inform the kernel of the change,
  probably because it/they are in use' → script bailed via
  set -euo pipefail → drop_to_shell fired correctly with
  recovery hints

Root cause: zeta-install.sh called bare  (no
args). partprobe with no args probes EVERY block device the
kernel knows about. Linux exposes USB mass-storage devices as
/dev/sda when there are no SATA disks (this cluster node has
only NVMes + the boot USB). The booted live ISO has mounted
partitions on /dev/sda; partprobe rightfully refuses to refresh
those (it's the running root filesystem); returns non-zero;
set -euo pipefail bails.

The greedy N-disk enumeration ALREADY correctly excluded the USB
(TRAN=usb filter in lsblk); we never partitioned /dev/sda. The
partprobe call was the only blanket-all-devices invocation in the
whole script.

Fix: pass only the disks WE just partitioned. Per-device partprobe
on BOOT_DISK + each DATA_DISKS entry. Never touch /dev/sda.

This is the substrate-honest 'do it right so it's not ambiguous
for future me/users' fix Aaron asked for — the script now operates
on explicit-target devices throughout, never blanket-system-wide.

Composes with B-0754 iter-2 (the systemd PATH fix that unblocked
us reaching this far). Iter-1 surfaced PATH bug; iter-2 surfaced
this partprobe bug; iter-3 fixes it; each iteration narrows the
failure modes for the substrate operators will run forever.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 26, 2026 00:54
@AceHack AceHack enabled auto-merge (squash) May 26, 2026 00:54
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the AI-cluster USB NixOS installer script to avoid calling partprobe without arguments (which probes every block device, including the live-boot USB), and instead refreshes the kernel partition table only for the disks that were just partitioned.

Changes:

  • Replace blanket sudo partprobe with per-device partprobe calls for BOOT_DISK and each DATA_DISKS entry.
  • Add inline rationale documenting why blanket probing is unsafe in this installer context.

Comment thread full-ai-cluster/usb-nixos-installer/zeta-install.sh Outdated
Comment thread full-ai-cluster/usb-nixos-installer/zeta-install.sh Outdated
… — silences Intel SoF audio probe error during boot

Iter-2 PC 1 boot showed cosmetic but scary 'ASoC: failed to
instantiate card -2 / snd_soc_register_card failed -2' from
sof-audio-pci-intel-mtl (Intel Meteor Lake HD Audio +
SoundWire). errno -2 = ENOENT = SoF firmware blobs not on
the ISO.

hardware.enableRedistributableFirmware = true pulls in
linux-firmware which includes Intel SoF + WiFi + Bluetooth +
NIC firmware. Redistributable-only license; no allowUnfree
needed. ~80MB added to ISO.

Operationally: audio is not load-bearing for cluster substrate,
but per B-0759 first-time-CLI-user persona, scary ERROR lines in
dmesg are UX noise we don't need. Also enables firmware probe to
complete cleanly which is prerequisite for B-0771 (audio stack +
NPU substrate for DAW + AI workloads scope) landing later.

Bundled into iter-3 PR with the partprobe fix so we get ONE ISO
rebuild instead of two.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xplicit partprobe failure handler per disk

Two small clarity fixes from Copilot review on #5057:

P3: '/dev/sda' isn't guaranteed across hardware/boot order — reworded comment to 'kernel typically exposes USB mass-storage as /dev/sdX — commonly /dev/sda on boards with no SATA disks; specific letter isn't guaranteed but failure mode is the same regardless of letter'.

P2: bare partprobe with set -euo pipefail aborts install silently — added explicit '|| bail "partprobe failed for ..."' per disk that names the offending disk + suggests manual recovery commands. Operator sees actionable error instead of silent set -euo pipefail exit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 26, 2026 01:10
auto-merge was automatically disabled May 26, 2026 01:10

Pull Request is not mergeable

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@AceHack AceHack merged commit 35b3743 into main May 26, 2026
29 checks passed
@AceHack AceHack deleted the otto-cli/b0754-iter3-partprobe-per-device-2026-05-25 branch May 26, 2026 01:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants