Merge pull request #600 from akrzos/update_lab_rebuild_badfish

Add additional by-path disks for perflab and adjust the docs for fixi…
redhat-performance · Jan 29, 2025 · 81ba1dc
2 parents a14e364 + d6b0e13
Showing 2 changed files with 24 additions and 18 deletions.
diff --git a/docs/tips-and-vars.md b/docs/tips-and-vars.md
@@ -45,13 +45,15 @@ Performance lab chart is available [here](https://wiki.rdu3.labs.perfscale.redha
 
 ## Install disk by-path vars
 
+Setting the install disk to use a by-path link is required for multi-disk systems as a
+symbolic link can change which underlying disk is referenced and may refer to a
+non-bootable disk or disk later in the boot order of hard disks. If this occurs, the
+deployment will eventually fail as the installed OCP is unable to boot properly.
+
 > [!TIP]
-> For multi node deployment of OCP 4.13 or greater it is advisable to
-> set the extra vars for by-path reference for the installation as sometimes disk
-> names get swapped during boot discovery (e.g., sda and sdb). Using the PCI
-> paths (in a homogeneous Scale or Performance lab cloud) should be consistent across
-> all the machines, and isn't subject to change during discovery. Below are the
-> extra vars along with the hardware used.
+> Using the PCI paths (in a homogeneous Scale or Performance lab cloud) should be
+> consistent across all the machines, and isn't subject to change during discovery.
+> Below are the extra vars along with the hardware used.
 
 For 3-node MNO deployments you only need to set `control_plane_install_disk`, if your
 MNO deployment has worker nodes then you will also need to set `worker_install_disk`.
@@ -79,8 +81,14 @@ edit the inventory file to set appropriate install paths for each machine.
 
 | Hardware | Install disk path
 | - | - |
-| Dell r740xd | /dev/disk/by-path/pci-0000:86:00.0-scsi-0:2:0:0 |
-| Dell r750  | /dev/disk/by-path/pci-0000:05:00.0-ata-1 |
+| Dell r740xd (SL-N, SL-G, SL-U, CL-N) | /dev/disk/by-path/pci-0000:18:00.0-scsi-0:2:0:0 |
+| Dell r740xd (CL-U, CL-G) | /dev/disk/by-path/pci-0000:86:00.0-scsi-0:2:0:0 |
+| Dell r750 | /dev/disk/by-path/pci-0000:05:00.0-ata-1 |
+| Dell r7425 | /dev/disk/by-path/pci-0000:e2:00.0-scsi-0:2:0:0 |
+| Dell r7525 | /dev/disk/by-path/pci-0000:01:00.0-scsi-0:2:0:0 |
+| SuperMicro 6029p | /dev/disk/by-path/pci-0000:00:11.5-ata-5 |
+| Dell xe8640  | /dev/disk/by-path/pci-0000:01:00.0-nvme-1 |
+| Dell xe9680  | /dev/disk/by-path/pci-0000:01:00.0-nvme-1 |
 
 To find your machine's by-path reference:
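
The hunk ends before the documented steps, so the following is only a minimal sketch of one common way to inspect the links, not necessarily the method the docs describe. It assumes GNU `readlink -f` is available on the host; the function name `list_by_path` and its optional directory argument are hypothetical and exist only for illustration.

```shell
# Print every symlink in a by-path directory next to the device it resolves to.
# The optional first argument (default /dev/disk/by-path) is an illustrative
# assumption so the function can be pointed at another directory.
list_by_path() {
  local dir="${1:-/dev/disk/by-path}"
  local link
  for link in "$dir"/*; do
    # Skip anything that is not a symlink (including an unmatched glob).
    [ -L "$link" ] || continue
    printf '%s -> %s\n' "$link" "$(readlink -f "$link")"
  done
}
```

Calling `list_by_path` with no arguments on a booted node lists each PCI path alongside its current kernel device name (for example `/dev/sda`), which helps match a disk to a row in the table above.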
 

diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
@@ -25,7 +25,7 @@ _**Table of Contents**_
   - [Reset BMC / Resolving redfish connection error](#reset-bmc--resolving-redfish-connection-error)
   - [Missing Administrator IPMI privileges](#missing-administrator-ipmi-privileges)
   - [Failure of TASK SuperMicro Set Boot](#failure-of-task-supermicro-set-boot)
-- [Scalelab](#scalelab)
+- [Red Hat Labs](#red-hat-labs)
   - [Fix boot order of machines](#fix-boot-order-of-machines)
   - [Upgrade RHEL](#upgrade-rhel)
 <!-- /TOC -->
@@ -59,7 +59,7 @@ If the machines are reachable, but never registered with the assisted-installer,
 
 If some nodes correctly registered but some did not, then the missing nodes need to be individually diagnosed. On a missing node, check if the BMC actually mounted the virtual media. Typically the machine just requires a BMC reset due to not booting virtual media which is described in below sections. Another possibility includes non-functional hardware and thus the machine does not boot into the discovery image.
 
-## Failed on Wait for cluster to be ready 
+## Failed on Wait for cluster to be ready
 
 Check the "View cluster events" on the assisted-installer GUI to see if any validations
 are failing. If you see `Host xxxxx: validation sufficient-packet-loss-requirement-for-role that used to succeed is now failing`
@@ -400,21 +400,21 @@ This is caused by having an older BIOS version.
 
 When set to ignore the error, Jetlag can proceed, but you will need to manually unmount the ISO when the machines reboot the second time (as in not the reboot that happens immediately when Jetlag is run, but the one that happens after a noticeable delay). The unmount must be done as soon as the machines restart, as doing it too early can interrupt the process, and doing it after it boots into the ISO will be too late.
 
-# Scalelab
+# Red Hat Labs
 
 ## Fix boot order of machines
 
-If a machine needs to be rebuilt in the Scale Lab and refuses to correctly rebuild, it is likely a boot order issue. Using badfish, you can correct boot order issues by performing the following:
+If a machine needs to be rebuilt in the lab and refuses to correctly rebuild, it is likely a boot order issue. Using badfish, you can correct boot order issues by performing the following:
 
 > [!NOTE]
 > The process for the Performance Lab is similar, however the GitLab `config/idrac_interfaces.yml`
 > is specialized for the Scale Lab configurations, and needs to be modified for Performance Lab. The
 > necessary modifications are not covered here.
 
 ```console
-badfish -H mgmt-hostname -u user -p password -i config/idrac_interfaces.yml --boot-to-type foreman
-badfish -H mgmt-hostname -u user -p password -i config/idrac_interfaces.yml --check-boot
-badfish -H mgmt-hostname -u user -p password -i config/idrac_interfaces.yml --power-cycle
+podman run -it --rm quay.io/quads/badfish -H mgmt-hostname -u user -p password -i config/idrac_interfaces.yml --boot-to-type foreman
+podman run -it --rm quay.io/quads/badfish -H mgmt-hostname -u user -p password -i config/idrac_interfaces.yml --check-boot
+podman run -it --rm quay.io/quads/badfish -H mgmt-hostname -u user -p password -i config/idrac_interfaces.yml --power-cycle
 ```
 
 Substitute the user/password/hostname to allow the boot order to be fixed on the host machine. Note it will take a few minutes before the machine should reboot. If you previously triggered a rebuild, the machine will likely go straight into rebuild mode afterwards. You can learn more about [badfish here](https://github.com/redhat-performance/badfish).
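
When several machines need the same fix, the three commands above can be wrapped in a small helper. This is a sketch under stated assumptions: the function names `badfish_run` and `fix_boot_order` and the `RUNNER` override are hypothetical, while the image name and badfish flags are copied from the commands above.

```shell
# Run one badfish invocation in the container image used above.
# RUNNER is a hypothetical override (for example RUNNER=echo for a dry run).
badfish_run() {
  "${RUNNER:-podman}" run -it --rm quay.io/quads/badfish "$@"
}

# Apply the three steps in order and stop at the first failure.
# Arguments: management hostname, user, password.
fix_boot_order() {
  local host="$1" user="$2" password="$3"
  local action
  for action in "--boot-to-type foreman" "--check-boot" "--power-cycle"; do
    # $action is deliberately unquoted so "--boot-to-type foreman"
    # is passed as two separate arguments.
    badfish_run -H "$host" -u "$user" -p "$password" \
      -i config/idrac_interfaces.yml $action || return 1
  done
}
```

For example, `RUNNER=echo fix_boot_order mgmt-hostname user password` prints the three command lines without contacting the BMC. Stopping at the first failure avoids power-cycling a machine whose boot order could not be changed.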
@@ -434,10 +434,8 @@ The values in *config/idrac_interfaces.yml* are first of all for the Scale lab.
 - POLLING: [------------------->] 100% - Host state: On
 - INFO     - Command passed to On server, code return is 204.
 ```
-## Upgrade RHEL
 
-> [!TIP]
-> This applies to Scale lab and Performance lab.
+## Upgrade RHEL
 
 On the bastion machine: