docs/portal/developer-portal/install/cxi_core_driver.md (+2 -2)
@@ -7,7 +7,7 @@ GPU Direct RDMA allows a PCIe device (the HPE Slingshot 200GbE NIC in this case)
## Vendors supported
- AMD - ROCm library, amdgpu driver
- - Nvidia - Cuda library, nvidia driver
+ - NVIDIA - Cuda library, NVIDIA driver
- Intel - Level Zero library, dmabuf kernel interface
## Special considerations
@@ -16,7 +16,7 @@ ___NVIDIA driver___
The NVIDIA driver contains a feature called Persistent Memory. It does not release pinned pages when device memory is freed unless explicitly directed by the NIC driver or upon job completion.
- A cxi-ss1 parameter `nv_p2p_persistent` is used to enable Persistent Memory. The default is enabled.
+ A `cxi-ss1` parameter `nv_p2p_persistent` is used to enable Persistent Memory. The default is enabled.
The `nv_p2p_persistent` parameter can be disabled by setting it to 0 in the `modprobe cxi-ss1` command.
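For instance, a minimal sketch of disabling Persistent Memory at module load time, mirroring the `modprobe cxi-ss1` command described above:

```screen
# Load the CXI driver with Persistent Memory disabled
modprobe cxi-ss1 nv_p2p_persistent=0
```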
docs/portal/developer-portal/install/install_or_upgrade_compute_nodes.md (+2 -2)
@@ -1,6 +1,6 @@
# Install compute nodes
- Perform this procedure to install SHS on compute nodes. This procedure can be used for systems that use either Mellanox NICs or HPE Slingshot 200Gpbs NICs.
+ Perform this procedure to install SHS on compute nodes. This procedure can be used for systems that use either Mellanox NICs or HPE Slingshot 200Gbps NICs.
The installation method will depend on what type of NIC is installed on the system.
Select one of the following procedures depending on the NIC in use:
@@ -39,7 +39,7 @@ NOTE: The upgrade process is nearly identical to the installation, and the proce
a. The RPMs should be copied or moved to a location accessible to one or more hosts where the RPMs will be installed. This can be a network file share, a physically backed location such as a disk drive on the host, or a remotely accessible location such as a web server that hosts the RPMs.
- b. The host or host OS image should be modified to add a repository for the newly downloaded RPMs for the package manager used in the OS distribution. Select the RPMs from the distribution file for your environment (slingshot_compute_cos-2.4... for COS 2.4, slingshot_compute_sle15_sp4 for SLE15_sp4, and so on)
+ b. The host or host OS image should be modified to add a repository for the newly downloaded RPMs for the package manager used in the OS distribution. Select the RPMs from the distribution file for your environment (`slingshot_compute_cos-2.4...` for COS 2.4, `slingshot_compute_sle15_sp4` for SLE15_sp4, and so on)
For SLE 15, `zypper` is used as the package manager for the host. A Zypper repository should be added which provides the path to where the RPMs are hosted. An example for this could be the following:
Assume that the RPMs were downloaded and added to a web server that is external to the host.
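A hedged sketch of registering that repository with `zypper` (the URL and repository alias below are illustrative, not actual SHS locations):

```screen
# Register the remote RPM repository (hypothetical URL and alias)
zypper addrepo https://webserver.example.com/shs/sle15_sp4 shs-rpms
# Refresh repository metadata so the RPMs become installable
zypper refresh
```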
docs/portal/developer-portal/operations/Cassini_Retry_Handler_cxi_rh.md (+1 -1)
@@ -13,4 +13,4 @@ The retry handler identifies these scenarios as "Timeouts" or "NACKs" respective
- **NACKs**: Indicates that the target NIC observed some issue with a packet it received.
A lack of space to land the packet could result in various NACKs being sent back to the source (depending on which resource was lacking).
The most common NACK that is typically seen is a SEQUENCE_ERROR NACK. This simply indicates that a packet with an incorrect sequence number arrived. This is not an unusual situation.
- A prior packet being lost (say sequence number X) will lead to subsequent packets (all with sequence numbers greater than X) getting a SEQUENCE_ERROR NACK in response.
+ A prior packet being lost (say sequence number X) will lead to subsequent packets (all with sequence numbers greater than X) getting a SEQUENCE_ERROR NACK in response.
docs/portal/developer-portal/operations/compute_node_configuration.md (+1 -1)
@@ -26,7 +26,7 @@ The `sat bootprep` input file should contain sections similar to the following t
For the examples below,
- Replace `<version>` with the version of SHS desired
- - Replace `<playbook>` with the SHS ansible playbook that should be used
+ - Replace `<playbook>` with the SHS Ansible playbook that should be used
- Replace `ims_require_dkms: true` with `ims_require_dkms: false` if pre-built kernel binaries should be used instead of DKMS kernel packages. NOTE: This setting only exists with CSM 1.5 and later deployments.
**Note:** `shs_mellanox_install.yml` should be used if the Mellanox NIC is installed. `shs_cassini_install.yml` should be used if the HPE Slingshot 200Gbps NIC is installed.
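A hedged sketch of the corresponding `sat bootprep` configuration layer, assuming the standard bootprep schema; the configuration and layer names below are illustrative:

```yaml
configurations:
- name: shs-config                    # illustrative configuration name
  layers:
  - name: slingshot-host-software     # illustrative layer name
    playbook: shs_cassini_install.yml # use shs_mellanox_install.yml for Mellanox NICs
    product:
      name: slingshot-host-software
      version: <version>              # replace with the desired SHS version
```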
docs/portal/developer-portal/operations/configure_qos_shs.md (+5 -5)
@@ -1,6 +1,6 @@
# Configure QoS for Slingshot Host Software (SHS)
- The cxi-driver includes multiple QoS profiles for SHS. This includes PCP to DSCP mappings and other settings that must match the Rosetta side configs, as well as which internal HPE Slingshot 200Gbps NIC resources are made available to each traffic class in a profile.
+ The cxi-driver includes multiple QoS profiles for SHS. This includes PCP to DSCP mappings and other settings that must match the Rosetta side configurations, as well as which internal HPE Slingshot 200Gbps NIC resources are made available to each traffic class in a profile.
An admin will be able to choose from one of the profiles that is made available. See the following subsections for guidance on viewing and selecting QoS profiles on the host.
@@ -10,7 +10,7 @@ For general information on QoS outside the context of SHS, see "Configure Qualit
QoS profile names on the host match those on the switch. On the host there will be an integer value associated with each QoS Profile. This value is used to select the QoS Profile that the driver should load.
- Starting in the Slingshot 2.2 release, the following profiles will be supported on the host:
+ Starting in the HPE Slingshot 2.2 release, the following profiles will be supported on the host:
- 1 - HPC
- 2 - LL_BE_BD_ET
@@ -34,7 +34,7 @@ parm: active_qos_profile:QoS Profile to load. Must match fabric QoS Pr
## Select QoS profile on the host
- The `active_qos_profile` module parameter to the cxi-ss1 driver allows admins to choose a QoS profile. As with any module parameter, there are multiple ways for an admin to apply the change, such as the following:
+ The `active_qos_profile` module parameter to the `cxi-ss1` driver allows admins to choose a QoS profile. As with any module parameter, there are multiple ways for an admin to apply the change, such as the following:
- Directly via `insmod`/`modprobe`
- Kernel Command Line
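As a minimal sketch of the first approach, loading the HPC profile (integer value 1 in the list above) at module load time:

```screen
# Load the CXI driver with QoS profile 1 (HPC)
modprobe cxi-ss1 active_qos_profile=1
```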
@@ -49,7 +49,7 @@ For example, to load the LL_BE_BD_ET profile via `modprobe`:
Important notes:
- All nodes _must_ use the same QoS Profile on a particular fabric. See "Configure Quality of Service (QoS)" in the _HPE Slingshot Installation Guide_ for the environment in use.
- - QoS Profile change cannot be done "live", as the cxi-ss1 driver must be reloaded. To change profiles, reboot nodes with the desired QoS profile specified.
+ - QoS Profile change cannot be done "live", as the `cxi-ss1` driver must be reloaded. To change profiles, reboot nodes with the desired QoS profile specified.
## Query QoS information on the host
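A hedged sketch of one way to query this on a node, using `modinfo` to list the driver's QoS-related module parameters (descriptions appear in `parm:` lines):

```screen
# Show cxi-ss1 module parameters; the active_qos_profile parm lists valid profiles
modinfo cxi-ss1 | grep -i qos
```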
@@ -100,7 +100,7 @@ The following error message on the host can be reported if the 200Gbps NIC and H
**Note:** The above errors, specifically `pfc_fifo_oflw` errors, can also occur if the Fabric Manager is not configured with 200Gbps NIC QoS settings.
- The PCP to utilize for non-VLAN tagged Ethernet frames is defined in a QoS profile. The CXI Driver (cxi-ss1) defines a kernel module parameter, `untagged_eth_pcp`, to optionally change this value. The default value of -1 means the value defined in the QoS profile will be used.
+ The PCP to utilize for non-VLAN tagged Ethernet frames is defined in a QoS profile. The CXI Driver (`cxi-ss1`) defines a kernel module parameter, `untagged_eth_pcp`, to optionally change this value. The default value of -1 means the value defined in the QoS profile will be used.
The following is an example of how to override the value defined in the profile via modprobe:
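(A minimal sketch; the PCP value 0 below is illustrative.)

```screen
# Override the profile-defined PCP for untagged Ethernet frames (0 is illustrative)
modprobe cxi-ss1 untagged_eth_pcp=0
```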
- - Else if there are no `integration-*` branches, but there is an integration branch with no `-<RELEASE>` suffix, determine what the release integration was based on by running the `git log` command. This finds the newest commit in the output (the commit closest to the top), which contains a message similar to "Import of 'slingshot-host-software' product version `<OLD-RELEASE>`".
+ - Else if there are no `integration-*` branches, but there is an integration branch with no `-<RELEASE>` suffix, determine what the release integration was based on by running the `git log` command. This finds the newest commit in the output (the commit closest to the top), which contains a message similar to "Import of 'slingshot-host-software' product version `<OLD-RELEASE>`".
```screen
ncn-m001# git log --topo-order refs/remotes/origin/integration | less
```
@@ -110,11 +110,11 @@ Failure to define any of the three variables above may result in install, upgrad
If group variable files are used, then a file must be defined for each target node type. Three groups of nodes are supported:
| Node Type | Product | Target Kernel Distribution | Group Variable File Name |
| --- | --- | --- | --- |
| Compute | COS | COS (see COS installation for target OS kernel) | `Compute/default.yml` |
| User Access/Login | UAN | COS (see COS installation for target OS kernel) | `Application/default.yml` |
| Non-compute Worker | CSM | CSM (see CSM installation for target OS kernel) | `Management_Worker/default.yml` |
An example configuration for a Compute node (`ansible/group_vars/Compute/default.yml`) on HPE Cray EX System Software 1.5 using COS 2.4 and CSM 1.3 might be the following:
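A hedged sketch of such a group variable file; the variable names below are hypothetical placeholders, and the actual required variables are defined by the SHS release in use:

```yaml
# ansible/group_vars/Compute/default.yml
# The variable names below are hypothetical placeholders; substitute the
# variables required by the SHS release in use.
shs_target_distro: cos-2.4 # hypothetical: target kernel distribution
shs_csm_version: "1.3"     # hypothetical: CSM version of the deployment
```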
docs/portal/developer-portal/troubleshoot/cassini/RDMA_interface_troubleshooting.md (+2 -2)
@@ -34,9 +34,9 @@ The availability of a CXI interface indicates several key signs of health. An in
- The interface retry handler is running
- A matching L2 interface is available
- The L1 interface has a temporary, locally administered, unicast address assigned to it. This is presumed to be an AMA applied by the fabric manager.
- - The L1 link state is reported if verbosity is enabled. L1 link state reported by `fi_info` will match the state reported by the L2 device through the ip tool.
+ - The L1 link state is reported if verbosity is enabled. L1 link state reported by `fi_info` will match the state reported by the L2 device through the `ip` tool.
- All these checks together make `fi_inf0` an excellent first tool to use to check the general health of 200Gbps NIC RDMA interfaces.
+ All these checks together make `fi_info` an excellent first tool to use to check the general health of 200Gbps NIC RDMA interfaces.
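A hedged sketch of running that first check against the CXI libfabric provider, using standard `fi_info` flags (`-v` enables the verbose output that includes link state):

```screen
# Query libfabric for CXI interfaces; -p selects the provider, -v adds link state
fi_info -p cxi -v
```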