File headings and readme updates
nrockershousen committed Apr 15, 2024
1 parent 989d83e commit 5dd0f1f
Showing 10 changed files with 82 additions and 82 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -1,8 +1,8 @@
-# DOCS-SHS
+# shs-docs

## Overview

-The docs-shs repository holds the documentation and documentation publication tooling
+The shs-docs repository holds the documentation and documentation publication tooling
for the HPE Slingshot Host Software (SHS) product.

## Documentation Source
@@ -1,9 +1,9 @@

-### Required material
+# Required material

All material will be available via the source URLs provided below as part of the HPE Slingshot Release for manufacturing and internal development systems.

-#### Slingshot RPMs
+## Slingshot RPMs

| Name | Contains | Typical Install Target |
|-------------------------------|-----------------------------------------------------------------------------------------------------------------|------------------------------------|
@@ -18,7 +18,7 @@ All material will be available via the source URLs provided below as part of the

`libfabric-devel` is required on any host where a user may compile an application for use with `libfabric`.

-#### External vendor software
+## External vendor software

| Name | Contains | Typical Install Target | Recommended Version | URL |
|----------------------------|---------------------------------------------|-----------------------------------------|---------------------|------------------------------------------------------------------------------------------------|
14 changes: 7 additions & 7 deletions docs/portal/developer-portal/install/configure_softroce.md
@@ -1,19 +1,19 @@

-## Configure Soft-RoCE
+# Configure Soft-RoCE

Remote direct memory access (RDMA) over Converged Ethernet (RoCE) is a network protocol that enables RDMA over an Ethernet network.
RoCE can be implemented both in the hardware and in the software.
Soft-RoCE is the software implementation of the RDMA transport. RoCE v2 is used for HPE Slingshot 200Gbps NICs.

-### Soft-RoCE on HPE Slingshot 200Gbps NICs
+## Soft-RoCE on HPE Slingshot 200Gbps NICs

-#### Prerequisites
+### Prerequisites

1. `cray-cxi-driver` RPM package must be installed.
2. `cray-rxe-driver` RPM package must be installed.
3. HPE Slingshot 200Gbps NIC Ethernet must be configured and active.

-#### Configuration
+### Configuration

The following configuration is applied to the node image; how the node image is modified varies depending on the system management solution in use (HPE Cray EX or HPCM).

@@ -84,11 +84,11 @@ Follow the relevant procedures to achieve the needed configuration. Contact a sy
NOTE: Soft-RoCE device creation is not persistent across reboots.
The `rxe_init.sh` script must be run on every boot after the HPE Slingshot 200Gbps NIC Ethernet device is fully programmed with links up and AMAs assigned.
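Because the Soft-RoCE device does not persist, one way to automate re-creation at boot is a systemd unit that runs the script once networking is up. The unit name and script path below are illustrative assumptions, not part of the SHS packaging; a minimal sketch:

```
# /etc/systemd/system/rxe-init.service (hypothetical unit name and script path)
[Unit]
Description=Create Soft-RoCE (rxe) device after HSN Ethernet is up
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
# Adjust to wherever rxe_init.sh is installed on the node image.
ExecStart=/usr/bin/rxe_init.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable rxe-init.service`. Note that the script must not fire before links are up and AMAs are assigned; gating on `network-online.target` alone may not be sufficient on all systems.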

-### Lustre Network Driver (LND) ko2iblnd configuration
+## Lustre Network Driver (LND) ko2iblnd configuration

The `ko2iblnd.ko` changes are needed for better Soft-RoCE performance on LNDs.

-#### Compute Node tuning for Soft-RoCE
+### Compute Node tuning for Soft-RoCE

Tuning on compute nodes can be achieved in two ways. Follow the steps that work best for the system in use.

@@ -160,7 +160,7 @@ Tuning on compute node can be achieved in two ways. Follow the steps that work b
/sys/module/ko2iblnd/parameters/wrq_sge:1
```
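One common way to make such module parameter tuning persistent is a `modprobe.d` fragment. Only `wrq_sge=1` appears in the excerpt above; any other values would come from the full tuning steps, so treat this as a hedged sketch with a hypothetical file path:

```
# /etc/modprobe.d/ko2iblnd.conf (hypothetical path)
# Persist the Soft-RoCE tuning shown above; verify values against the full tuning steps.
options ko2iblnd wrq_sge=1
```

If the module is loaded from the initrd, regenerate the initrd after editing so the setting takes effect in early boot.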

-#### E1000 ko2iblnd tuning for Soft-RoCE
+### E1000 ko2iblnd tuning for Soft-RoCE

Configure clients to use Soft-RoCE and configure storage with MLX HCAs running HW RoCE.

10 changes: 5 additions & 5 deletions docs/portal/developer-portal/install/cxi_core_driver.md
@@ -1,19 +1,19 @@

-## CXI core driver
+# CXI core driver

-### GPU Direct RDMA overview
+## GPU Direct RDMA overview

GPU Direct RDMA allows a PCIe device (the HPE Slingshot 200GbE NIC in this case) to access memory located on a GPU device. The NIC driver interfaces with a GPU's driver API to get the physical pages for virtual memory allocated on the device.

-### Vendors supported
+## Vendors supported

- AMD - ROCm library, amdgpu driver
- Nvidia - Cuda library, nvidia driver
- Intel - Level Zero library, dmabuf kernel interface

-### Special considerations
+## Special considerations

-#### NVIDIA driver
+### NVIDIA driver

The NVIDIA driver contains a feature called Persistent Memory: pinned pages are not released when device memory is freed unless explicitly directed by the NIC driver or upon job completion.

@@ -1,9 +1,9 @@

-### Install 200Gbps NIC host software
+# Install 200Gbps NIC host software

The 200Gbps NIC software stack includes drivers and libraries to support standard Ethernet and libfabric RDMA interfaces.

-#### Prerequisites for compute node installs
+## Prerequisites for compute node installs

The 200Gbps NIC software stack must be installed after a base compute OS install has been completed. A list of supported distributions for the 200Gbps NIC can be found in the "Support Matrix" section under "Slingshot Host Software (SHS)" in the _HPE Slingshot Release Notes_ document. Once a supported distribution has been installed, proceed with the instructions for installing the 200Gbps NIC host software for that distribution.

@@ -45,7 +45,7 @@ manually loaded with the following commands:
To complete setup, follow the fabric management procedure for Algorithmic MAC
Address configuration.

-#### 200Gbps NIC support in early boot
+## 200Gbps NIC support in early boot

If traffic must be passed over the 200Gbps NIC prior to the root filesystem
being mounted (for example, for a network root filesystem using the 200Gbps NIC),
@@ -68,7 +68,7 @@ Due to these caveats, it is recommended that the `cray-libcxi-dracut` RPM only
be installed on systems whose configurations require 200Gbps NIC support in early
boot.

-#### Check 200Gbps NIC host software version
+## Check 200Gbps NIC host software version

Each 200Gbps NIC RPM has the HPE Slingshot version embedded in the release field of the
RPM metadata. This information can be queried using standard RPM commands. The
@@ -104,7 +104,7 @@ Distribution: (none)
The HPE Slingshot release for this version of `cray-libcxi` is 1.2.1 (SSHOT1.2.1).
This process can be repeated for all 200Gbps NIC RPMs.
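As a sketch of how the release field encodes the HPE Slingshot version, the `SSHOTx.y.z` token can be extracted with standard shell tools. The release string below is illustrative, not taken from a real RPM:

```shell
# Hypothetical release field, as 'rpm -q --qf "%{RELEASE}\n" cray-libcxi' might print it
release="SSHOT1.2.1_20230101_abcdef"

# Extract the HPE Slingshot version number from the leading SSHOTx.y.z token
version=$(printf '%s\n' "$release" | sed -n 's/^SSHOT\([0-9][0-9.]*\).*/\1/p')
echo "$version"
```

On a live node, substitute the real `rpm -q` output for the hypothetical string above.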

-#### Install validation
+## Install validation

The 200Gbps NIC software stack install procedure should make all 200Gbps NIC devices
available for Ethernet and RDMA. Perform the following steps to validate the
@@ -132,7 +132,7 @@ Check for 200Gbps NIC Ethernet network devices.
hsn0 is CXI interface
```
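The device check above can also be scripted, for example by counting HSN interfaces in `ip -br link` output. The sample output and device names below are assumptions for illustration:

```shell
# Sample output in the brief format that 'ip -br link show' prints (hypothetical devices)
sample='lo      UNKNOWN 00:00:00:00:00:00
hsn0    UP      02:00:00:00:00:11
eth0    UP      52:54:00:12:34:56'

# Count HSN (200Gbps NIC) Ethernet devices; on a live node, pipe from the real command
printf '%s\n' "$sample" | awk '$1 ~ /^hsn/ { n++ } END { print n+0 }'
```

A count of zero indicates the Ethernet driver did not bind to any 200Gbps NIC and the install should be revisited.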

-#### 200Gbps NIC firmware management
+## 200Gbps NIC firmware management

See the [Firmware Management](#firmware-management) section for more information on how to update firmware.

@@ -1,5 +1,5 @@

-### Install or upgrade compute nodes
+# Install or upgrade compute nodes

The installation method will depend on what type of NIC is installed on the system.
Select one of the following procedures depending on the NIC in use:
@@ -9,7 +9,7 @@ Select one of the following procedures depending on the NIC in use:

NOTE: The upgrade process is nearly identical to the installation, and the following instructions will note where the two processes diverge.

-#### Prerequisites for Mellanox-based system installation
+## Prerequisites for Mellanox-based system installation

1. Identify the target OS distribution, and distribution version for all compute targets in the cluster. Use this information to select the appropriate Mellanox OFED (MOFED) tar file to be used for install from the URL listed in the [External Vendor Software](install_metal.md#external-vendor-software) table above. The filename typically follows this pattern: `MLNX_OFED_LINUX-<version>-<OS distro>-<arch>.tgz`.

@@ -30,7 +30,7 @@ NOTE: The upgrade process is nearly identical to the installation, and the proce

NOTE: If the customer requires UCX on the system, then install the HPC-X solution using the recommended version provided by the [External Vendor Software](install_metal.md#external-vendor-software) table. Ensure that the HPC-X tarball matches the installed version of Mellanox OFED. Installation instructions are provided by Mellanox in the HPC-X package.

-#### Install via package managers (recommended)
+## Install via package managers (recommended)

1. For each distribution and distribution version collected in the first step of the prerequisites, download the RPMs listed in the Slingshot RPMs table above.

@@ -113,7 +113,7 @@ NOTE: The upgrade process is nearly identical to the installation, and the proce

c. If the host is both a compute node and a user access node, perform steps 1 and 2; otherwise, skip this step.

-#### Install via command line
+## Install via command line

1. For each distribution and distribution version as collected in the first step of the prerequisite install, download the RPMs mentioned in the previous section (Installation | Required Material | Source | RPMs).

@@ -1,11 +1,11 @@

-### Install or upgrade Slingshot Host Software (SHS) on HPCM compute nodes
+# Install or upgrade Slingshot Host Software (SHS) on HPCM compute nodes

This documentation provides step-by-step instructions to install and/or upgrade the Slingshot Host Software (SHS) on compute node images on an HPE Performance Cluster Manager (HPCM) system, using SLES15-SP4 as an example.

The procedure outlined here is applicable to SLES, RHEL, and COS distributions. Refer to the "System Software Requirements for Fabric Manager and Host Software" section in the _HPE Slingshot Release Notes_ for exact version support for the release.

-#### Process
+## Process

The installation and upgrade method will depend on what type of NIC is installed on the system.
Select one of the following procedures depending on the NIC in use:
@@ -15,7 +15,7 @@ Select one of the following procedures depending on the NIC in use:

NOTE: The upgrade process is nearly identical to installation, and the following instructions will note where the two processes diverge.

-##### Mellanox-based system install/upgrade procedure
+### Mellanox-based system install/upgrade procedure

This section is for systems using Mellanox NICs.
For systems using HPE Slingshot 200Gbps NICs, skip this section and instead proceed to the [HPE Slingshot 200Gbps CXI NIC system install/upgrade procedure](#hpe-slingshot-200gbps-cxi-nic-system-installupgrade-procedure).
@@ -205,7 +205,7 @@ For systems using HPE Slingshot 200Gbps NICs, skip this section and instead proc

13. Proceed directly to the [Firmware management](#firmware-management) and [ARP settings](#arp-settings) sections of this document to complete SHS compute install.

-##### HPE Slingshot 200Gbps CXI NIC system install/upgrade procedure
+### HPE Slingshot 200Gbps CXI NIC system install/upgrade procedure

This section is for systems using HPE Slingshot 200Gbps CXI NICs.
For systems using Mellanox NICs, skip this section and proceed to the [Mellanox-based system install procedure](#mellanox-based-system-installupgrade-procedure), followed by the [Firmware management](#firmware-management) section.
@@ -401,11 +401,11 @@ For systems using Mellanox NICs, skip this section and proceed to the [Mellanox-

10. Apply the post-boot firmware and firmware configuration. General instructions are in the "Install compute nodes" section of the _HPE Slingshot Installation Guide for Bare Metal_.

-### Firmware management
+# Firmware management

Mellanox NICs system firmware management is done through the `slingshot-firmware` utility.

-### ARP settings
+# ARP settings

The following settings are suggested for larger clusters to reduce the frequency of ARP cache misses during connection establishment when using the libfabric `verbs` provider, as the default ARP parameters will not scale to large systems.
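A typical way to raise the kernel's ARP neighbor-table limits is via sysctl. The appropriate thresholds depend on cluster size; the file name and values below are illustrative assumptions, not SHS-mandated settings:

```
# /etc/sysctl.d/90-arp-tuning.conf (hypothetical file name and example values)
# Raise neighbor (ARP) cache thresholds so entries for a large fabric are not evicted
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 16384
net.ipv4.neigh.default.gc_thresh3 = 32768
```

Apply with `sysctl --system` or a reboot, and size the thresholds to exceed the number of peers a node is expected to communicate with.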

