Merge pull request #600 from akrzos/update_lab_rebuild_badfish
Add additional by-path disks for perflab and adjust the docs for fixi…
openshift-merge-bot[bot] authored Jan 29, 2025
2 parents a14e364 + d6b0e13 commit 81ba1dc
Showing 2 changed files with 24 additions and 18 deletions.
24 changes: 16 additions & 8 deletions docs/tips-and-vars.md
@@ -45,13 +45,15 @@ Performance lab chart is available [here](https://wiki.rdu3.labs.perfscale.redha

## Install disk by-path vars

Setting the install disk to use a by-path link is required for multi-disk systems as a
symbolic link can change which underlying disk is referenced and may refer to a
non-bootable disk or disk later in the boot order of hard disks. If this occurs, the
deployment will eventually fail as the installed OCP is unable to boot properly.

> [!TIP]
> For multi node deployment of OCP 4.13 or greater it is advisable to
> set the extra vars for by-path reference for the installation as sometimes disk
> names get swapped during boot discovery (e.g., sda and sdb). Using the PCI
> paths (in a homogeneous Scale or Performance lab cloud) should be consistent across
> all the machines, and isn't subject to change during discovery. Below are the
> extra vars along with the hardware used.
> Using the PCI paths (in a homogeneous Scale or Performance lab cloud) should be
> consistent across all the machines, and isn't subject to change during discovery.
> Below are the extra vars along with the hardware used.
For 3-node MNO deployments you only need to set `control_plane_install_disk`. If your
MNO deployment has worker nodes, you will also need to set `worker_install_disk`.
@@ -79,8 +81,14 @@ edit the inventory file to set appropriate install paths for each machine.

| Hardware | Install disk path |
| - | - |
| Dell r740xd | /dev/disk/by-path/pci-0000:86:00.0-scsi-0:2:0:0 |
| Dell r750 | /dev/disk/by-path/pci-0000:05:00.0-ata-1 |
| Dell r740xd (SL-N, SL-G, SL-U, CL-N) | /dev/disk/by-path/pci-0000:18:00.0-scsi-0:2:0:0 |
| Dell r740xd (CL-U, CL-G) | /dev/disk/by-path/pci-0000:86:00.0-scsi-0:2:0:0 |
| Dell r750 | /dev/disk/by-path/pci-0000:05:00.0-ata-1 |
| Dell r7425 | /dev/disk/by-path/pci-0000:e2:00.0-scsi-0:2:0:0 |
| Dell r7525 | /dev/disk/by-path/pci-0000:01:00.0-scsi-0:2:0:0 |
| SuperMicro 6029p | /dev/disk/by-path/pci-0000:00:11.5-ata-5 |
| Dell xe8640 | /dev/disk/by-path/pci-0000:01:00.0-nvme-1 |
| Dell xe9680 | /dev/disk/by-path/pci-0000:01:00.0-nvme-1 |
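
Before setting one of the paths above as an install disk, it can help to confirm which device node the by-path link currently resolves to on the target host. The following is a minimal sketch, not part of Jetlag: `resolve_install_disk` is a hypothetical helper name, and it assumes a `readlink` that supports `-f` (as GNU coreutils does).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Jetlag): print the device node that a
# /dev/disk/by-path symlink currently points to.
resolve_install_disk() {
    local link="$1"
    # by-path entries are symlinks; refuse anything else so a typo is caught early.
    if [ ! -L "$link" ]; then
        echo "not a symlink: $link" >&2
        return 1
    fi
    # -f follows the link chain and prints the canonical target path.
    readlink -f "$link"
}

# Example call, using the Dell r750 row from the table:
# resolve_install_disk /dev/disk/by-path/pci-0000:05:00.0-ata-1
```

The printed device name (for example `sda` or `sdb`) may differ between boots, which is why the by-path link rather than the device name is used for the install disk vars.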

To find your machine's by-path reference:

18 changes: 8 additions & 10 deletions docs/troubleshooting.md
@@ -25,7 +25,7 @@ _**Table of Contents**_
- [Reset BMC / Resolving redfish connection error](#reset-bmc--resolving-redfish-connection-error)
- [Missing Administrator IPMI privileges](#missing-administrator-ipmi-privileges)
- [Failure of TASK SuperMicro Set Boot](#failure-of-task-supermicro-set-boot)
- [Scalelab](#scalelab)
- [Red Hat Labs](#red-hat-labs)
- [Fix boot order of machines](#fix-boot-order-of-machines)
- [Upgrade RHEL](#upgrade-rhel)
<!-- /TOC -->
@@ -59,7 +59,7 @@ If the machines are reachable, but never registered with the assisted-installer,

If some nodes registered correctly but others did not, the missing nodes need to be diagnosed individually. On a missing node, check whether the BMC actually mounted the virtual media. Typically the machine failed to boot the virtual media and just requires a BMC reset, which is described in the sections below. Another possibility is non-functional hardware that prevents the machine from booting into the discovery image.

## Failed on Wait for cluster to be ready
## Failed on Wait for cluster to be ready

Check the "View cluster events" on the assisted-installer GUI to see if any validations
are failing. If you see `Host xxxxx: validation sufficient-packet-loss-requirement-for-role that used to succeed is now failing`
@@ -400,21 +400,21 @@ This is caused by having an older BIOS version.

When set to ignore the error, Jetlag can proceed, but you will need to manually unmount the ISO when the machines reboot the second time (as in not the reboot that happens immediately when Jetlag is run, but the one that happens after a noticeable delay). The unmount must be done as soon as the machines restart, as doing it too early can interrupt the process, and doing it after it boots into the ISO will be too late.

# Scalelab
# Red Hat Labs

## Fix boot order of machines

If a machine needs to be rebuilt in the Scale Lab and refuses to correctly rebuild, it is likely a boot order issue. Using badfish, you can correct boot order issues by performing the following:
If a machine needs to be rebuilt in the lab and refuses to correctly rebuild, it is likely a boot order issue. Using badfish, you can correct boot order issues by performing the following:

> [!NOTE]
> The process for the Performance Lab is similar, however the GitLab `config/idrac_interfaces.yml`
> is specialized for the Scale Lab configurations, and needs to be modified for Performance Lab. The
> necessary modifications are not covered here.
```console
badfish -H mgmt-hostname -u user -p password -i config/idrac_interfaces.yml --boot-to-type foreman
badfish -H mgmt-hostname -u user -p password -i config/idrac_interfaces.yml --check-boot
badfish -H mgmt-hostname -u user -p password -i config/idrac_interfaces.yml --power-cycle
podman run -it --rm quay.io/quads/badfish -H mgmt-hostname -u user -p password -i config/idrac_interfaces.yml --boot-to-type foreman
podman run -it --rm quay.io/quads/badfish -H mgmt-hostname -u user -p password -i config/idrac_interfaces.yml --check-boot
podman run -it --rm quay.io/quads/badfish -H mgmt-hostname -u user -p password -i config/idrac_interfaces.yml --power-cycle
```

Substitute your own user, password, and management hostname so that badfish can fix the boot order on the host machine. Note that it may take a few minutes before the machine reboots. If you previously triggered a rebuild, the machine will likely go straight into rebuild mode afterwards. You can learn more about [badfish here](https://github.com/redhat-performance/badfish).
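
If several machines need the same fix, the three commands above can be wrapped in a small function. This is only a sketch under stated assumptions: the function name `fix_boot_order` and the `DRY_RUN` variable are hypothetical conveniences, while the image name and badfish flags are copied unchanged from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: runs the three badfish steps shown above, in order,
# against one management hostname. Set DRY_RUN=1 to print the commands
# instead of executing them.
fix_boot_order() {
    local host="$1" user="$2" password="$3"
    local action
    for action in "--boot-to-type foreman" "--check-boot" "--power-cycle"; do
        # $action is intentionally unquoted so that "--boot-to-type foreman"
        # is split into two separate arguments.
        ${DRY_RUN:+echo} podman run -it --rm quay.io/quads/badfish \
            -H "$host" -u "$user" -p "$password" \
            -i config/idrac_interfaces.yml $action || return 1
    done
}

# Preview the commands for one host without touching it:
# DRY_RUN=1 fix_boot_order mgmt-hostname user password
```

The function stops at the first failing step, so a failed boot order change is not followed by a power cycle.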
@@ -434,10 +434,8 @@ The values in *config/idrac_interfaces.yml* are first of all for the Scale lab.
- POLLING: [------------------->] 100% - Host state: On
- INFO - Command passed to On server, code return is 204.
```
## Upgrade RHEL

> [!TIP]
> This applies to Scale lab and Performance lab.
## Upgrade RHEL

On the bastion machine:
