Multi GPU Support

Compiling for Multi-GPU

To enable multi-GPU support, you need to set either QUDA_MPI=ON or QUDA_QMP=ON to enable either the MPI or QMP communications back end.

QMP is the USQCD QCD communications layer, which provides compatibility with other USQCD software packages. To enable QUDA to use QIO directly, you need to enable QMP. QUDA supports automatically downloading and compiling verified versions of QMP and QIO via the cmake flag QUDA_DOWNLOAD_USQCD=ON.

In the case of MPI, cmake should detect the MPI compiler and libraries by default. If need be, the paths to an MPI installation (root directory, libraries, includes, and binaries) can be set manually. This is most easily done with the interactive ccmake interface: the MPI flags are under the advanced options, which are toggled by pressing t. If you are using OpenMPI, we recommend using version 4.0.x.
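As a minimal sketch, the configure step for either back end might look like the following. The helper name quda_configure_cmd and the source path /path/to/quda are placeholders introduced here for illustration; the helper only prints the command so it can be inspected before running, and it uses only the cmake options named above.

```shell
#!/bin/bash
# Hypothetical helper: print a cmake configure command for the chosen back end.
# Only QUDA_MPI, QUDA_QMP and QUDA_DOWNLOAD_USQCD (named above) are used.
quda_configure_cmd() {
  backend=$1                 # "mpi" or "qmp"
  src=${2:-/path/to/quda}    # placeholder source directory
  case "$backend" in
    mpi) echo "cmake -DQUDA_MPI=ON $src" ;;
    qmp) echo "cmake -DQUDA_QMP=ON -DQUDA_DOWNLOAD_USQCD=ON $src" ;;
    *)   echo "unknown back end: $backend" >&2; return 1 ;;
  esac
}

quda_configure_cmd mpi
quda_configure_cmd qmp
```

Any further options your build needs (GPU architecture, install prefix, and so on) would be appended to the printed command.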

Compiling QMP and QIO

We recommend using QUDA's automated download and compile feature, documented here.

If a custom compilation is needed or desired (possibly with Cray CC), you can compile QMP and QIO + c-lime manually. We advise using commit 3010fef of QMP and version qio3-0-0 of QIO.

QMP compilation instructions can be found here.
QIO compilation requires c-lime as a dependency. This is most easily obtained using recursive cloning: git clone --recursive git@github.com:usqcd-software/qio.git. Navigate to qio/ and execute the command autoreconf -f -i. You can now run configure for your preferences, including the --with-qmp=[...] flag to specify a QMP install directory, then make and finally make install.

Running

Running on multiple GPUs is similar to running any other MPI application. In general, one process will be assigned to each GPU. Make sure that all of QUDA's environment variables are propagated to all processes, since these control some of QUDA's internal control flow. The run/bind scripts given below handle this automatically. Alternatively, one can set the broadcast of environment variables using the job launcher, e.g., with OpenMPI's mpirun using -x QUDA_RESOURCE_PATH=/path/to/somewhere.
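When several QUDA variables are set, the -x flags can be generated rather than typed by hand. The sketch below assumes OpenMPI's mpirun, whose -x NAME form forwards an already-exported variable; the helper name quda_export_flags is hypothetical, and the final line only echoes the resulting command.

```shell
#!/bin/bash
# Hypothetical helper: emit one "-x NAME" pair per exported QUDA_* variable.
quda_export_flags() {
  env | sed -n 's/^\(QUDA_[A-Za-z0-9_]*\)=.*/\1/p' | sort | while read -r name; do
    printf '%s %s ' -x "$name"
  done
  echo
}

export QUDA_RESOURCE_PATH=/path/to/somewhere
export QUDA_ENABLE_GDR=1

# Dry run: show the launch line instead of executing it
echo mpirun $(quda_export_flags) ./run.sh
```

Other launchers have their own mechanisms for forwarding the environment, so the flag syntax here should not be assumed to carry over.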

Of note, if strictly necessary QUDA can be run under MPS (enabling GPU oversubscription with minimal overheads). This can be done via the QUDA_ENABLE_MPS environment variable. This is only for reference; in general you never need to do this.

Running QUDA's tests

When running QUDA through a host application, typically the host application is responsible for setting the process topology and local problem size. For QUDA's internal tests, these parameters are set using the following command-line parameters:

--dim x y z t           # x y z t is the local (per process) problem size
--gridsize X Y Z T      # X Y Z T is the process topology
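Since --dim is the per-process size, the global lattice is the element-wise product of the two argument lists, and the number of MPI processes to launch is the product of the --gridsize entries. A small sketch of that arithmetic (the helper name quda_global_dims is hypothetical):

```shell
#!/bin/bash
# Hypothetical helper: global lattice size and rank count.
# Arguments: x y z t X Y Z T  (local size followed by process topology)
quda_global_dims() {
  echo "global: $(( $1 * $5 )) $(( $2 * $6 )) $(( $3 * $7 )) $(( $4 * $8 ))"
  echo "ranks: $(( $5 * $6 * $7 * $8 ))"
}

# --dim 24 24 24 24 --gridsize 2 2 2 6
quda_global_dims 24 24 24 24 2 2 2 6
# prints:
#   global: 48 48 48 144
#   ranks: 48
```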

Multi-GPU emulation

To aid performance modelling and debugging, it is possible to switch on "emulated" communication in a given dimension (using rank-local pack/unpack kernels), even if in actuality that dimension is local to a given GPU. The command line flag --partition N facilitates this feature, where N is a 4-bit number, with bits 0,1,2,3 used to switch on/off communication in dimensions x,y,z,t (respectively). For example:

dslash_test --partition 1     ## enable x dimension communication
dslash_test --partition 6     ## enable y and z dimension communication
dslash_test --partition 15    ## enable full communication
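To check which dimensions a given value of N switches on, the bit mask can be decoded as in the sketch below (decode_partition is a hypothetical helper, not part of QUDA):

```shell
#!/bin/bash
# Hypothetical helper: list the dimensions enabled by a --partition value.
# Bit 0 -> x, bit 1 -> y, bit 2 -> z, bit 3 -> t.
decode_partition() {
  n=$1
  out=""
  bit=0
  for d in x y z t; do
    if [ $(( (n >> bit) & 1 )) -eq 1 ]; then
      out="$out$d "
    fi
    bit=$(( bit + 1 ))
  done
  echo "${out% }"   # strip the trailing space
}

decode_partition 1    # x
decode_partition 6    # y z
decode_partition 15   # x y z t
```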

Peer-to-peer communication

QUDA will automatically detect multiple GPUs in the same node and use direct peer-to-peer communication where available. For GPUs to be peer-to-peer capable, they need to be either on the same PCIe root complex (e.g., connected to the same CPU socket or PCIe switch) or be directly connected with an NVLink connection. While peer-to-peer communication will lead to much improved performance versus leaving MPI to handle the inter-GPU communication, it can be useful for benchmarking and/or debugging to disable it. This can be done by setting the environment variable QUDA_ENABLE_P2P=0.

GPU Direct RDMA and CUDA-aware MPI

QUDA can optionally support GPU-aware MPI and GPU Direct RDMA (GDR), i.e., where data is passed directly to MPI without first copying it to the host, or conversely data is received directly into GPU memory. By default this option is disabled, since passing a GPU pointer to an MPI library that is unaware of GPUs will lead to undefined behaviour (most likely a segmentation fault). This can be enabled by setting the environment variable QUDA_ENABLE_GDR=1.

When doing so, you should ensure that the MPI library you are using is also GPU enabled and the network drivers support it. For Mellanox Infiniband, this means OFED v2.1 or v3.1 and later (which depends on which IB card is in your system). Details for Mellanox can be found here. As another note, GDR sometimes does not work if the GPUs are in exclusive mode.

On systems that do not support GDR, but are running a CUDA-aware MPI library, e.g., OpenMPI, MVAPICH2, then the MPI library can automatically stage the MPI buffers in CPU memory if provided with a GPU pointer. Typically letting the MPI library take care of this staging is slower than having QUDA do it since it introduces unnecessary synchronization. However, we note that on systems that do not have a NIC available, enabling GDR support and using this in combination with GPU-aware MPI can be beneficial for debugging, if not performance.

It should be noted that enabling GDR will never make the performance worse, since the dslash policy autotuner will automatically test all enabled policies, e.g., basic, GDR-enabled, etc., and pick the best one for each given precision, volume, etc. Details on the dslash policy tuning are given below.

OpenMPI

We recommend taking advantage of OpenMPI provided by your system administrator and working with them if you see sub-par performance. Below we give an example run script of how to use OpenMPI with GDR support and instructions for optimal process placement.

For reference, e.g., for local workstation experiments, instructions for building CUDA-aware OpenMPI can be found here and instructions for running CUDA-aware OpenMPI can be found here.
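Before setting QUDA_ENABLE_GDR=1 it can be worth confirming that the OpenMPI installation on your PATH was built with CUDA support. The sketch below uses ompi_info for this; the helper name ompi_is_cuda_aware is hypothetical. Note that this check is necessary but not sufficient, since it says nothing about whether the network drivers support GDR.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 if Open MPI reports CUDA support, and 1
# otherwise (including when ompi_info is not on the PATH).
ompi_is_cuda_aware() {
  command -v ompi_info >/dev/null 2>&1 || return 1
  ompi_info --parsable --all 2>/dev/null \
    | grep -q 'mpi_built_with_cuda_support:value:true'
}

if ompi_is_cuda_aware; then
  export QUDA_ENABLE_GDR=1
else
  export QUDA_ENABLE_GDR=0   # keep QUDA staging through host memory
fi
echo "QUDA_ENABLE_GDR=$QUDA_ENABLE_GDR"
```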

MVAPICH2

Instructions for running the current GDR-enabled MVAPICH2 can be found here. MVAPICH2-GDR is only available as a binary, but the source code for the regular CUDA-aware MVAPICH2 (with host message staging) is available.

To enable CUDA-awareness in MVAPICH2, if building from source you must set --enable-cuda when running configure. When running, you must set the environment variable MV2_USE_CUDA=1. Specific GDR-related instructions are below.

Cray MPI

The exact details of configuring QUDA for Cray MPI can vary from machine to machine. Instructions for Perlmutter are available here.

More legacy instructions that have not been tested recently are as follows:

To enable GPU-awareness on Cray's MPI you need to set the environment variable MPICH_RDMA_ENABLED_CUDA=1. At present Cray's implementation provides no user control over which messages will be exchanged using RDMA versus using host staging. This means that MPI exchange can go through the CPU memory with no means to force enable RDMA. The end result is that only very small volumes will utilize RDMA on Cray's XC platform, which most likely means only coarse grids with multigrid.

We note that performance on Cray systems may be improved by enabling MPICH_NEMESIS_ASYNC_PROGRESS=1, which results in the MPI library spawning threads to ensure the forward progress of asynchronous MPI calls (which QUDA utilizes).

Maximizing Communications Performance

On systems with multiple GPUs and multiple NICs, to ensure maximum GPU-NIC throughput, care must be taken to ensure that GPUs communicate with the closest NIC. This can be done by querying the topology of the machine you are running on, and then instrumenting your MPI and / or run script to ensure correct placement.

General Example

For example, consider running on DGX-1, a system with 4x EDR NICs and 8x P100 GPUs. Each pair of GPUs shares a NIC, so we need to ensure that the NIC local to each pair is used for all non-peer-to-peer communication.

First of all, we query the node topology with nvidia-smi topo -m

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx5_0	mlx5_2	mlx5_1	mlx5_3	CPU Affinity
GPU0	 X 	NV1	NV1	NV1	NV1	SOC	SOC	SOC	PIX	SOC	PHB	SOC	0-19
GPU1	NV1	 X 	NV1	NV1	SOC	NV1	SOC	SOC	PIX	SOC	PHB	SOC	0-19
GPU2	NV1	NV1	 X 	NV1	SOC	SOC	NV1	SOC	PHB	SOC	PIX	SOC	0-19
GPU3	NV1	NV1	NV1	 X 	SOC	SOC	SOC	NV1	PHB	SOC	PIX	SOC	0-19
GPU4	NV1	SOC	SOC	SOC	 X 	NV1	NV1	NV1	SOC	PIX	SOC	PHB	20-39
GPU5	SOC	NV1	SOC	SOC	NV1	 X 	NV1	NV1	SOC	PIX	SOC	PHB	20-39
GPU6	SOC	SOC	NV1	SOC	NV1	NV1	 X 	NV1	SOC	PHB	SOC	PIX	20-39
GPU7	SOC	SOC	SOC	NV1	NV1	NV1	NV1	 X 	SOC	PHB	SOC	PIX	20-39
mlx5_0	PIX	PIX	PHB	PHB	SOC	SOC	SOC	SOC	 X 	SOC	PHB	SOC	
mlx5_2	SOC	SOC	SOC	SOC	PIX	PIX	PHB	PHB	SOC	 X 	SOC	PHB	
mlx5_1	PHB	PHB	PIX	PIX	SOC	SOC	SOC	SOC	PHB	SOC	 X 	SOC	
mlx5_3	SOC	SOC	SOC	SOC	PHB	PHB	PIX	PIX	SOC	PHB	SOC	 X 	

Legend:

  X   = Self
  SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

We see that there are eight GPUs and four NICs as expected. The critical point is that GPU0 and GPU1 are both connected directly to mlx5_0 on the same PCIe switch, with GPU2 and GPU3 on mlx5_1, etc. So when launching our job on multiple nodes we need to ensure that processes mapped to these GPUs are instructed to use these NICs.
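The GPU-to-NIC pairing can also be read off the matrix mechanically. The sketch below is one way to do that with awk, assuming the first line of input is the column header and that NIC columns are named mlx*, as in the DGX-1 output above; the helper name gpu_local_nics is hypothetical, and the sample input is a subset of rows copied from that matrix.

```shell
#!/bin/bash
# Hypothetical helper: print "GPUn NIC" for every GPU/NIC pair that is
# connected through a single PCIe switch (PIX).
gpu_local_nics() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) hdr[i] = $i; next }  # column names
    $1 ~ /^GPU[0-9]+$/ {
      # row field j lines up with header field j-1, because each row
      # starts with its own device name
      for (j = 2; j <= NF; j++)
        if ($j == "PIX" && hdr[j-1] ~ /^mlx/) print $1, hdr[j-1]
    }'
}

# On a real node: nvidia-smi topo -m | gpu_local_nics
gpu_local_nics <<'EOF'
     GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_2 mlx5_1 mlx5_3 CPU Affinity
GPU0  X   NV1  NV1  NV1  NV1  SOC  SOC  SOC  PIX    SOC    PHB    SOC    0-19
GPU3 NV1  NV1  NV1   X   SOC  SOC  SOC  NV1  PHB    SOC    PIX    SOC    0-19
GPU4 NV1  SOC  SOC  SOC   X   NV1  NV1  NV1  SOC    PIX    SOC    PHB    20-39
GPU7 SOC  SOC  SOC  NV1  NV1  NV1  NV1   X   SOC    PHB    SOC    PIX    20-39
EOF
```

For the sample rows this pairs GPU0 with mlx5_0, GPU3 with mlx5_1, GPU4 with mlx5_2 and GPU7 with mlx5_3, matching the reading given above. Newer driver versions label the columns NIC0, NIC1, ... and list the device names in a separate legend, in which case the name pattern would need adjusting.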

Open MPI <= 3 (without UCX)

Moved to the legacy section here. Best practice is to use the most up-to-date version of OpenMPI with UCX.

OpenMPI 4 with UCX

While the general discussion given for OpenMPI 3 still applies, the scripts need to be slightly modified to use the correct environment variables for UCX when using OpenMPI 4 with UCX. The run.sh script for UCX looks like:

#!/bin/bash

# QUDA specific-environment variables

# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.
export QUDA_ENABLE_TUNING=1
# enable GDR support
export QUDA_ENABLE_GDR=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# this is the list of GPUs we have
GPUS=(0 1 2 3 4 5 6 7)

# This is the list of NICs we should use for each GPU
# e.g., associate GPU0,1 with MLX0, GPU2,3 with MLX1, GPU4,5 with MLX2 and GPU6,7 with MLX3
NICS=(mlx5_0:1 mlx5_0:1 mlx5_1:1 mlx5_1:1 mlx5_2:1 mlx5_2:1 mlx5_3:1 mlx5_3:1)

# This is the list of CPU cores we should use for each GPU
# e.g., 2x20 core CPUs split into 4 threads per process with correct NUMA assignment
CPUS=(1-4 5-8 10-13 15-18 21-24 25-28 30-33 35-38)

# Number of physical CPU cores per GPU
export OMP_NUM_THREADS=4

# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)
REORDER=(0 1 2 3 4 5 6 7)

# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]},${GPUS[${REORDER[6]}]},${GPUS[${REORDER[7]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]} ${NICS[${REORDER[6]}]} ${NICS[${REORDER[7]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]} ${CPUS[${REORDER[6]}]} ${CPUS[${REORDER[7]}]})

APP="$EXE $ARGS"

lrank=$OMPI_COMM_WORLD_LOCAL_RANK

export UCX_NET_DEVICES=${NIC_REORDER[lrank]}

numactl --physcpubind=${CPU_REORDER[$lrank]} \
        $APP
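The three reordered lists in the script above are written out element by element. As an alternative sketch (not the form used in the original script), the same lists can be built with a loop, which avoids editing three long lines when the GPU count changes. GPU_REORDER is a name introduced here for illustration, and the REORDER value shown is just an example permutation.

```shell
#!/bin/bash
GPUS=(0 1 2 3 4 5 6 7)
NICS=(mlx5_0:1 mlx5_0:1 mlx5_1:1 mlx5_1:1 mlx5_2:1 mlx5_2:1 mlx5_3:1 mlx5_3:1)
CPUS=(1-4 5-8 10-13 15-18 21-24 25-28 30-33 35-38)
REORDER=(0 1 2 3 6 7 4 5)   # example permutation; the default is the identity

# Apply the permutation to each list in turn
GPU_REORDER=(); NIC_REORDER=(); CPU_REORDER=()
for i in "${REORDER[@]}"; do
  GPU_REORDER+=("${GPUS[i]}")
  NIC_REORDER+=("${NICS[i]}")
  CPU_REORDER+=("${CPUS[i]}")
done

# Join the GPU list with commas for CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=$(IFS=,; echo "${GPU_REORDER[*]}")
export CUDA_VISIBLE_DEVICES

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
echo "local rank 4 uses NIC ${NIC_REORDER[4]} and cores ${CPU_REORDER[4]}"
```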

In the example above, the REORDER variable tells us the order we want the GPUs to map to the local MPI process. Here we have only used the default ordering, i.e., REORDER=(0 1 2 3 4 5 6 7), which would produce an optimal mapping for a local 1x2x2x2 process topology (e.g., given the NVLink topology of DGX-1, GPU 0 can communicate with GPUs 1, 2 and 4, which are the only GPUs needed for this 3-d topology). However, if we were running with a 1x1x2x4 local process topology (given that the default MPI process topology is ((pt*Nz + pz)Ny + py)Nx + px), then process 0 would need to be able to communicate with processes 1 (Z +/-), 2 (T+) and 6 (T-), but GPU 0 only has connections to GPUs 1, 2, 3, and 4.** So in this case, we would want to use REORDER=(0 1 2 3 6 7 4 5), which would map GPU 4 to process 6, providing the optimal peer-to-peer connectivity matrix.

** This is the default for QUDA and MILC. BQCD on the other hand uses the inverse of this mapping ((px*Ny + py)Nz + pz)Nt + pt. In this case, BQCD mapping would actually provide the optimal peer-to-peer mapping with the default GPU order.
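The neighbor ranks quoted above follow directly from the rank formula. The sketch below evaluates it for process 0 in a 1x1x2x4 local topology and looks up which GPU each neighbor lands on under REORDER=(0 1 2 3 6 7 4 5). The helper name rank_of is hypothetical, and the lookup assumes each local rank uses the device at its own index in CUDA_VISIBLE_DEVICES.

```shell
#!/bin/bash
# Hypothetical helper: rank of process (px,py,pz,pt), with x running fastest.
# Arguments: px py pz pt Nx Ny Nz
rank_of() {
  local px=$1 py=$2 pz=$3 pt=$4 Nx=$5 Ny=$6 Nz=$7
  echo $(( ((pt * Nz + pz) * Ny + py) * Nx + px ))
}

Nx=1; Ny=1; Nz=2; Nt=4
REORDER=(0 1 2 3 6 7 4 5)

# Neighbors of process 0 at (0,0,0,0), with periodic wrap-around
zp=$(rank_of 0 0 $(( 1 % Nz )) 0 $Nx $Ny $Nz)          # Z+ (same as Z- for Nz=2)
tp=$(rank_of 0 0 0 $(( 1 % Nt )) $Nx $Ny $Nz)          # T+
tm=$(rank_of 0 0 0 $(( (Nt - 1) % Nt )) $Nx $Ny $Nz)   # T-

for r in $zp $tp $tm; do
  echo "neighbor rank $r -> GPU ${REORDER[r]}"
done
# prints:
#   neighbor rank 1 -> GPU 1
#   neighbor rank 2 -> GPU 2
#   neighbor rank 6 -> GPU 4
```

All three neighbors then sit on GPUs that share an NVLink connection with GPU 0.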

Given the above binding script, the corresponding MPI launch command is then (note: update with latest version of UCX):

export UCX_TLS=rc,sm,cuda_copy,gdr_copy,cuda_ipc                           # select transports
export UCX_MEMTYPE_CACHE=n                                                 # see https://github.com/openucx/ucx/wiki/NVIDIA-GPU-Support
export UCX_RNDV_SCHEME=get_zcopy                                           # improves GPUDirectRDMA performance
export UCX_RNDV_THRESH=131304                                              # your mileage may vary

mpirun
 -np 48                                                                     # total number of processes
 -npernode 6                                                                # number of processes per node
 --bind-to none                                                             # lets the user overrule binding using numactl
 -hostfile ./hostfile                                                       # list of hosts we want to run on
 -x EXE="./dslash_test"                                                     # executable
 -x ARGS="--gridsize 2 2 2 6 --dim 24 24 24 24 --prec double --niter 10000" # executable run-time options
 ./run.sh

MVAPICH2

For completeness, we give the equivalent scripts from above when using MVAPICH2-GDR with QUDA. Testing with the latest version of MVAPICH2-GDR (2.3alpha; installation instructions are here) has shown issues with intra-node communication where either the send or the receive buffer is a GPU pointer, leading to MVAPICH2-GDR seg faulting. This isn't a performance issue, since QUDA's low-level handling of peer-to-peer communication within the node will almost certainly be superior to that of an MPI implementation. However, since the Dslash policy tuner by default will test policies that include just handing a GPU pointer to MPI, we need to explicitly disable these policies from being tested. We can do this by setting the environment variable QUDA_ENABLE_P2P=7, which will enable both copy-engine and direct store dslash communication policies but will disable the non-explicit P2P policies.

The equivalent launch and run scripts for DGX-1 are shown below. The main difference between OpenMPI and MVAPICH is that the latter relies on environment variables for setting MPI parameters.

#!/bin/bash

# MVAPICH environment variables

# Enable CUDA-aware MPI
export MV2_USE_CUDA=1

# Enable GDR
export MV2_USE_GPUDIRECT=1

# Set maximum GDR message size 
export MV2_GPUDIRECT_LIMIT=4194304

# Enable GDRCOPY library (set to 0 if not installed on the system)
export MV2_USE_GPUDIRECT_GDRCOPY=1
export MV2_GPUDIRECT_GDRCOPY_LIB=/gpfs/sw/gdrdrv/install/lib64/libgdrapi.so

# Disable MVAPICH's internal affinity setting (since we'll do it manually using numactl)
export MV2_ENABLE_AFFINITY=0
export MV2_USE_MCAST=1

# Where $MPI_HOME should be the path to the MVAPICH installation
LD_PRELOAD+=:$MPI_HOME/lib64/libmpi.so

export EXE="./dslash_test"                                                     # executable
export ARGS="--gridsize 2 2 4 8 --dim 24 24 24 24 --prec double --niter 10000"

mpirun -np 128 -f hostfile ./run.sh

The equivalent run.sh is shown below, with the only difference being the environment variable for binding the NIC.

#!/bin/bash

# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.

# enable GDR support
export QUDA_ENABLE_GDR=1

# disable non-P2P communication in the node
export QUDA_ENABLE_P2P=7

export CUDA_DEVICE_MAX_CONNECTIONS=1

# this is the list of GPUs we have
GPUS=(0 1 2 3 4 5 6 7)

# This is the list of NICs we should use for each GPU
# e.g., associate GPU0,1 with MLX0, GPU2,3 with MLX1, GPU4,5 with MLX2 and GPU6,7 with MLX3
NICS=(mlx5_0 mlx5_0 mlx5_1 mlx5_1 mlx5_2 mlx5_2 mlx5_3 mlx5_3)

# This is the list of CPU cores we should use for each GPU
# e.g., 2x20 core CPUs split into 4 threads per process with correct NUMA assignment
CPUS=(1-4 5-8 10-13 15-18 21-24 25-28 30-33 35-38)

# Number of physical CPU cores per GPU
export OMP_NUM_THREADS=4

# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)
REORDER=(0 1 2 3 4 5 6 7)

# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]},${GPUS[${REORDER[6]}]},${GPUS[${REORDER[7]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]} ${NICS[${REORDER[6]}]} ${NICS[${REORDER[7]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]} ${CPUS[${REORDER[6]}]} ${CPUS[${REORDER[7]}]})

APP="$EXE $ARGS"

lrank=$MV2_COMM_WORLD_LOCAL_RANK

export MV2_IBA_HCA=${NIC_REORDER[lrank]}

numactl --physcpubind=${CPU_REORDER[$lrank]} $APP

SpectrumMPI

A reference run script and binding script for Summit is given below. The script assumes that the environment variable APP has been defined as the full executable plus arguments.

Run script:

#!/bin/bash -v
#BSUB -P XXX
#BSUB -W 2:00
#BSUB -nnodes 432
#BSUB -J jobname
#BSUB -o jobOut.%J
#BSUB -e jobErr.%J
##### -cn_cu 'maxcus=48' # Set to num nodes / 18 to constrain racks; reduces throughput
#BSUB -alloc_flags "smt4"

# submit with
# bsub run.lsf

nodes=432
ranks=$((nodes * 6))

export QUDA_ENABLE_GDR=1
export QUDA_RESOURCE_PATH=`pwd`/tunecache
mkdir -p $QUDA_RESOURCE_PATH

# Generally HISQ MG only
#export QUDA_ENABLE_DEVICE_MEMORY_POOL=0
#export QUDA_ENABLE_MANAGED_MEMORY=1
#export QUDA_ENABLE_MANAGED_PREFETCH=1

# Prepare executable name
EXE=...
ARGS=...
export APP="${EXE} ${ARGS}"

# Setup for jsrun
export OMP_NUM_THREADS=7

echo "START_RUN: `date`"

# each resource set is one entire node, with 6 total MPI ranks each,
# with complete visibility of all 6 GPUs and 42 CPU cores (both sockets)
jsrun --nrs ${nodes} -a6 -g6 -c42 -dpacked -b packed:7 --latency_priority gpu-cpu --smpiargs="-gpu" ./bind-6gpu.sh

echo "FINISH_RUN: `date`"

Binding script:

#!/bin/bash

lrank=$(($PMIX_RANK % 6))

echo $APP

case ${lrank} in
 [0])
 export PAMI_IBV_DEVICE_NAME=mlx5_0:1
 numactl --physcpubind=0,4,8,12,16,20,24 --membind=0 $APP
 ;;

 [1])
 export PAMI_IBV_DEVICE_NAME=mlx5_0:1
 numactl --physcpubind=28,32,36,40,44,48,52 --membind=0 $APP
 ;;

 [2])
 export PAMI_IBV_DEVICE_NAME=mlx5_0:1
 numactl --physcpubind=56,60,64,68,72,76,80 --membind=0 $APP
 ;;

 [3])
 export PAMI_IBV_DEVICE_NAME=mlx5_3:1
 numactl --physcpubind=88,92,96,100,104,108,112 --membind=8 $APP
 ;;

 [4])
 export PAMI_IBV_DEVICE_NAME=mlx5_3:1
 numactl --physcpubind=116,120,124,128,132,136,140 --membind=8 $APP
 ;;

 [5])
 export PAMI_IBV_DEVICE_NAME=mlx5_3:1
 numactl --physcpubind=144,148,152,156,160,164,168 --membind=8 $APP
 ;;
esac
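The core lists in the case statement above follow a regular pattern: seven cores per rank with a stride of 4, starting at 28 times the local rank, plus an extra offset of 4 for the three ranks on the second socket. As an illustrative sketch only (summit_cpu_list is a hypothetical helper, and the pattern is inferred from the table above rather than from Summit documentation), the lists can be generated as:

```shell
#!/bin/bash
# Hypothetical helper: reproduce the --physcpubind list for a local rank 0..5.
summit_cpu_list() {
  lrank=$1
  start=$(( lrank * 28 ))
  if [ "$lrank" -ge 3 ]; then
    start=$(( start + 4 ))    # second-socket ranks start 4 higher in the table
  fi
  list=$start
  i=1
  while [ "$i" -lt 7 ]; do    # 7 cores per rank, stride 4
    list="$list,$(( start + 4 * i ))"
    i=$(( i + 1 ))
  done
  echo "$list"
}

summit_cpu_list 0   # 0,4,8,12,16,20,24
summit_cpu_list 3   # 88,92,96,100,104,108,112
```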

GH200 Superchip: 1x Superchip per node

In the case of a node with a single superchip and one NIC, nvidia-smi topo -m may return something along the lines of:

$ nvidia-smi topo -m
        GPU0    NIC0 CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX  0-71    0               1
NIC0    PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

This is representative and may not be exact. For this setup, an appropriate binding script is:

#!/bin/bash

# QUDA specific-environment variables

# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.
export QUDA_ENABLE_TUNING=1
# enable GDR support
export QUDA_ENABLE_GDR=1
export CUDA_DEVICE_MAX_CONNECTIONS=1

APP="$EXE $ARGS"

lrank=$OMPI_COMM_WORLD_LOCAL_RANK

export UCX_NET_DEVICES=mlx5_0:1

numactl --cpunodebind=0 --membind=0 $APP

Memory binding is critically important on 1xGH200 superchip nodes.

4xGH200 Superchip Node

OpenMPI/UCX (Jupiter)

This set of instructions is relevant for FZJ/Jupiter, though the details of the NIC output below may not be exact. The output of nvidia-smi topo -m may look like:

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV6     NV6     NV6     SYS     SYS     SYS     SYS     SYS     SYS     0-71    0               4
GPU1    NV6      X      NV6     NV6     SYS     SYS     SYS     SYS     SYS     SYS     72-143  1               12
GPU2    NV6     NV6      X      NV6     SYS     SYS     SYS     SYS     SYS     SYS     144-215 2               20
GPU3    NV6     NV6     NV6      X      SYS     SYS     SYS     SYS     SYS     SYS     216-287 3               28
NIC0    SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS
NIC1    SYS     SYS     SYS     SYS     SYS      X      PIX     SYS     SYS     SYS
NIC2    SYS     SYS     SYS     SYS     SYS     PIX      X      SYS     SYS     SYS
NIC3    SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5

In this case, a representative binding script would be:

#!/bin/bash

# QUDA specific-environment variables

# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.
export QUDA_ENABLE_TUNING=1
# enable GDR support
export QUDA_ENABLE_GDR=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# this is the list of GPUs we have
GPUS=(0 1 2 3)

# This is the list of NICs we should use for each GPU
# e.g., associate GPU0 with MLX0, GPU1 with MLX1, GPU2 with MLX2 and GPU3 with MLX3
# The other NICs are included for completeness but are ignored
NICS=(mlx5_0:1 mlx5_1:1 mlx5_2:1 mlx5_3:1 mlx5_4:1 mlx5_5:1)

# This is the list of NUMA regions we should use for each MPI rank <-> GPU
# e.g., 4x72 core Grace CPUs
CPUS=(0 1 2 3)

# The number of threads to use for the calling app; may vary based on the CPU app.
# Grace has 72 physical cores.
export OMP_NUM_THREADS=16

# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)
REORDER=(0 1 2 3)

# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]})

APP="$EXE $ARGS"

lrank=$OMPI_COMM_WORLD_LOCAL_RANK

export UCX_NET_DEVICES=${NIC_REORDER[lrank]}

numactl --cpunodebind=${CPU_REORDER[$lrank]} --membind=${CPU_REORDER[$lrank]} $APP

Memory binding is critically important on 4xGH200 superchip nodes.

Dependence on CUDA_DEVICE_MAX_CONNECTIONS

The environment variable CUDA_DEVICE_MAX_CONNECTIONS controls how many hardware channels the GPU should use, i.e., how much work can be launched independently from different streams without any false dependencies. QUDA gets optimum performance at CUDA_DEVICE_MAX_CONNECTIONS=1, since this gives the lowest latency and you still get overlap between kernel launches and memory copies in general due to the order in which these are issued. So in general, the advice is to set this parameter equal to 1, and this will provide optimal scaling.

Low-level Details

Dslash Policy Tuning

Since the optimum strategy for overlapping Dslash computation and communication varies depending on the machine you are running on, the size of the problem you are running, the precision, etc., QUDA implements multiple dslash execution policies and utilizes the autotuner to identify the optimal strategy for a given parameter set and use that policy for all subsequent invocations (dslash_policy.hpp). At present the following policies are enabled in QUDA:

QUDA_DSLASH=0: bandwidth optimized - aim for maximum compute and comms overlap (one halo kernel per dimension)
QUDA_FUSED_DSLASH=1: kernel latency optimized - use a single halo update kernel for all dimensions
QUDA_GDR_DSLASH=2: GDR-enabled variant of QUDA_DSLASH
QUDA_FUSED_GDR_DSLASH=3: GDR-enabled variant of QUDA_FUSED_DSLASH
QUDA_GDR_RECV_DSLASH=4: variant of QUDA_DSLASH which only enables GDR for the receive
QUDA_FUSED_GDR_RECV_DSLASH=5: variant of QUDA_FUSED_DSLASH which only enables GDR for the receive
QUDA_ZERO_COPY_PACK_DSLASH=6: write the non-p2p packed halo buffers directly to CPU memory for minimum MPI_Send latency
QUDA_FUSED_ZERO_COPY_PACK_DSLASH=7: write non-p2p packed halo buffers directly to CPU memory and use fused halo kernel
QUDA_ZERO_COPY_DSLASH=8: write non-p2p halo buffer directly to CPU memory and read halos directly from CPU memory in halo update kernels
QUDA_FUSED_ZERO_COPY_DSLASH=9: write non-p2p halo buffer directly to CPU memory and read halos directly from CPU memory in a single halo update kernel
QUDA_ZERO_COPY_PACK_GDR_RECV_DSLASH=10
QUDA_FUSED_ZERO_COPY_PACK_GDR_RECV_DSLASH=11
QUDA_DSLASH_FUSED_PACK=12: fused dslash and halo packer kernel, the first thread blocks in the grid will pack the halo buffer and subsequent blocks will apply the interior dslash
QUDA_DSLASH_FUSED_PACK_FUSED_HALO=13: fused dslash and halo packer kernel and fused halo update kernel
QUDA_SHMEM_UBER_PACKINTRA_DSLASH=14: NVSHMEM policy using an Uber-kernel with packing for intra-node, interior and exterior fused into a single kernel, separate kernel for inter-node packing
QUDA_SHMEM_UBER_PACKFULL_DSLASH=15: NVSHMEM policy using an Uber-kernel with packing, interior and exterior fused into a single kernel
QUDA_SHMEM_PACKINTRA_DSLASH=16: NVSHMEM policy with intra-node packing and interior fused into a single kernel, separate inter-node packing and exterior kernels
QUDA_SHMEM_PACKFULL_DSLASH=17: NVSHMEM policy with packing and interior fused into a single kernel, separate exterior kernel

To facilitate faster autotuning, by default only policies 0 and 1 are enabled. To enable the zero-copy policies (those with ZERO_COPY in their name in the list above) use QUDA_ENABLE_ZERO_COPY=1, and for the GDR policies (those with GDR in their name) set QUDA_ENABLE_GDR=1. In most instances, you will just want to let the autotuner pick the best policy for your parameter set. However, you can restrict the set of policies to tune over by setting the environment variable QUDA_ENABLE_DSLASH_POLICY, e.g., setting QUDA_ENABLE_DSLASH_POLICY=1,3,5 would restrict the policy tuning to a subset of the "fused" variants only.

By default all policies will use peer-to-peer communication if available. To disable peer-to-peer, you set QUDA_ENABLE_P2P=0.
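Since QUDA_ENABLE_DSLASH_POLICY takes the numeric values, a small lookup can make run scripts easier to read. The sketch below covers only the first six policies from the list above; policy_id and policy_list are hypothetical helpers, not part of QUDA.

```shell
#!/bin/bash
# Hypothetical helper: numeric value of a policy name (subset of the list above).
policy_id() {
  case "$1" in
    QUDA_DSLASH)                echo 0 ;;
    QUDA_FUSED_DSLASH)          echo 1 ;;
    QUDA_GDR_DSLASH)            echo 2 ;;
    QUDA_FUSED_GDR_DSLASH)      echo 3 ;;
    QUDA_GDR_RECV_DSLASH)       echo 4 ;;
    QUDA_FUSED_GDR_RECV_DSLASH) echo 5 ;;
    *) echo "unknown policy: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: comma-separated list of values for the given names.
policy_list() {
  list=""
  for name in "$@"; do
    id=$(policy_id "$name") || return 1
    list="${list:+$list,}$id"
  done
  echo "$list"
}

export QUDA_ENABLE_GDR=1   # policies 3 and 5 are GDR variants
QUDA_ENABLE_DSLASH_POLICY=$(policy_list QUDA_FUSED_DSLASH QUDA_FUSED_GDR_DSLASH QUDA_FUSED_GDR_RECV_DSLASH)
export QUDA_ENABLE_DSLASH_POLICY
echo "QUDA_ENABLE_DSLASH_POLICY=$QUDA_ENABLE_DSLASH_POLICY"   # 1,3,5
```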

Dslash Component Benchmarking

In order to benchmark the components of the Dslash in isolation, QUDA can selectively disable portions of the Dslash computation. This is useful, for example, for benchmarking NIC performance, or to test kernel performance in the absence of communication. The dslash computation is broken down into multiple steps:

packing: prepare contiguous halo buffers to be handed off to MPI / P2P communication
comms: p2p cudaMemcpy within the node and MPI between nodes
interior: apply the dslash stencil on the interior while the halo regions are being communicated
exterior: once the comms have finished we finish the calculation with the application of the halo on the boundary elements
copy: when GDR / P2P is not available between a set of GPUs, then we have the additional D2H/H2D memcpys for staging the MPI buffers in CPU memory

The following set of environment variables can be used to disable the various parts of the computation and/or the communication. All of the below variables default to 1 (i.e., do the full calculation), but can be disabled by setting to 0 (the result will then, of course, be wrong).

QUDA_ENABLE_DSLASH_PACK - enable / disable initial packing kernel
QUDA_ENABLE_DSLASH_COMMS - enable/disable P2P memcpys and / or MPI exchange
QUDA_ENABLE_DSLASH_INTERIOR - enable/disable interior kernel computation
QUDA_ENABLE_DSLASH_EXTERIOR - enable/disable exterior kernel computation
QUDA_ENABLE_DSLASH_COPY - enable/disable host staging copies for MPI if GDR/P2P not enabled

By combining the explicit policy choice with the above variables, we can benchmark in isolation any computation or communication pattern.
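As a sketch of how these variables might be combined in a run script, the hypothetical helper below switches every component off and then re-enables only the ones named. Which components to keep for a particular measurement is a judgment call; the comms-plus-copy choice shown here is just one example.

```shell
#!/bin/bash
# Hypothetical helper: disable all dslash components, then enable those listed.
# Valid names are PACK, COMMS, INTERIOR, EXTERIOR and COPY.
quda_dslash_only() {
  for part in PACK COMMS INTERIOR EXTERIOR COPY; do
    export "QUDA_ENABLE_DSLASH_$part=0"
  done
  for part in "$@"; do
    export "QUDA_ENABLE_DSLASH_$part=1"
  done
}

# Example: keep only the communication and host-staging copies
quda_dslash_only COMMS COPY
env | grep '^QUDA_ENABLE_DSLASH_' | sort
```

With these set, a subsequent dslash_test run exercises only the selected components, and its numerical result is not expected to be correct.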

For communication benchmarking, the dslash_test and staggered_dslash_test programs will report the effective bi-directional bandwidth sustained by the algorithm (just grep the output for “bi”). See the below results for an example taken between two GPUs connected using PCIe peer-to-peer. With the full computation enabled we are unable to see what the actual achieved bi-directional bandwidth is, i.e., it plateaus once all the communications are hidden by the local computation, but when you only do the communications we see the expected behavior, with the bi-directional bandwidth saturating at 19 GB/s. What we can also see is that the bandwidth doesn't saturate until a relatively large local volume: this is the motivation for future work using a SHMEM-style programming model where all peer-to-peer reads and writes will be done by reading and writing directly to neighboring GPUs, which has significantly lower latency.

Legacy Information

Asymmetric Topologies

On asymmetric systems where some GPUs are on one side of the QPI bus and the NIC is on the other, care must be taken since the QPI bus cannot efficiently forward the memory traffic between attached PCIe devices. For example, the following system has four GPUs and a NIC on one socket, with two GPUs and no NIC on the other socket. This system will only give efficient GDR support for the first four GPUs, with the other two needing to stage their inter-node memory traffic explicitly through CPU memory.

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	mlx5_0	CPU Affinity
GPU0	 X 	PIX	PHB	PHB	SOC	SOC	PHB	0-9
GPU1	PIX	 X 	PHB	PHB	SOC	SOC	PHB	0-9
GPU2	PHB	PHB	 X 	PIX	SOC	SOC	PHB	0-9
GPU3	PHB	PHB	PIX	 X 	SOC	SOC	PHB	0-9
GPU4	SOC	SOC	SOC	SOC	 X 	PHB	SOC	10-19
GPU5	SOC	SOC	SOC	SOC	PHB	 X 	SOC	10-19
mlx5_0	PHB	PHB	PHB	PHB	SOC	SOC	 X

To enable such a setup, the environment variable QUDA_ENABLE_GDR_BLACKLIST can be used to exclude a given list of GPUs from using GDR; the excluded GPUs instead fall back to explicit staging through CPU memory. Below is an example of how to do this for the above topology using OpenMPI.

mpirun
 -np 48                                                                     # total number of processes
 -npernode 6                                                                # number of processes per node
 --bind-to none                                                             # lets the user overrule binding using numactl
 -hostfile ./hostfile                                                       # list of hosts we want to run on
 --mca btl sm,self,openib                                                   # enable intra-node, loop back to self, and IB
 --mca btl_openib_want_cuda_gdr 1                                           # enable GDR for MPI
 --mca btl_openib_cuda_rdma_limit 1000000000                                # set the largest message size for GDR
 -x EXE="./dslash_test"                                                     # executable
 -x ARGS="--gridsize 2 2 2 6 --dim 24 24 24 24 --prec double --niter 10000" # executable run-time options
 ./run.sh

where run.sh would be as given below

#!/bin/bash

# QUDA specific-environment variables

# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.

# enable GDR support
export QUDA_ENABLE_GDR=1

# exclude GPUs 4 and 5 from GDR since it's across QPI
export QUDA_ENABLE_GDR_BLACKLIST="4,5"

export CUDA_DEVICE_MAX_CONNECTIONS=1

# this is the list of GPUs we have
GPUS=(0 1 2 3 4 5)

# This is the list of NICs we should use for each GPU
NICS=(mlx5_0 mlx5_0 mlx5_0 mlx5_0 mlx5_0 mlx5_0)

# This is the list of CPU cores we should use for each GPU
# e.g., 2x10 core CPUs split into 2 threads per process with correct NUMA assignment
CPUS=(1-2 3-4 5-6 7-8 10-11 15-16)

# Number of physical CPU cores per GPU
export OMP_NUM_THREADS=2

# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)
REORDER=(0 1 2 3 4 5)

# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]})

APP="$EXE $ARGS"

lrank=$OMPI_COMM_WORLD_LOCAL_RANK

export OMPI_MCA_btl_openib_if_include=${NIC_REORDER[lrank]}

numactl --physcpubind=${CPU_REORDER[$lrank]} $APP
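The blacklist in the run.sh above was written by hand from the topology matrix. As a rough sketch, the same list can be derived from a saved copy of that matrix by picking out the GPUs whose link to the NIC is reported as SOC. The function name gdr_blacklist is hypothetical, and the sketch assumes the matrix layout and the SOC label shown above; other driver versions may print different labels or extra columns.

```bash
#!/bin/bash
# Hypothetical helper (not part of QUDA): read a topology matrix on stdin and
# print a comma-separated list of GPU indices whose link to the given NIC is
# "SOC", i.e. GPUs that sit across the socket interconnect from that NIC.
gdr_blacklist() {
  local nic="$1"
  awk -v nic="$nic" '
    # First non-empty line is the header. Data rows carry a leading row
    # label, so the NIC column in a data row is the header index plus one.
    !hdr && NF {
      for (i = 1; i <= NF; i++) if ($i == nic) col = i + 1
      hdr = 1
      next
    }
    # Only GPU rows are considered; the NIC row and any legend are skipped.
    col && $1 ~ /^GPU[0-9]+$/ && $col == "SOC" {
      id = $1
      sub(/^GPU/, "", id)
      list = (list == "" ? id : list "," id)
    }
    END { print list }
  '
}

# Example use, assuming the matrix comes from nvidia-smi:
#   export QUDA_ENABLE_GDR_BLACKLIST="$(nvidia-smi topo -m | gdr_blacklist mlx5_0)"
```

For the six-GPU matrix above this selects GPU4 and GPU5, matching the hand-written value in run.sh.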

Open MPI <= 3 (without UCX)

The script below (for OpenMPI) sets up the GPU ordering, NIC assignment, and CPU binding for each local rank. To use this script with QUDA's dslash_test, running on 16 nodes of DGX-1, it would be launched with something like

mpirun 
 -np 128                                                                    # total number of processes
 -npernode 8                                                                # number of processes per node
 --bind-to none                                                             # lets the user overrule binding using numactl
 -hostfile ./hostfile                                                       # list of hosts we want to run on
 --mca btl sm,self,openib                                                   # enable intra-node, loop back to self, and IB
 --mca btl_openib_want_cuda_gdr 1                                           # enable GDR for MPI
 --mca btl_openib_cuda_rdma_limit 1000000000                                # set the largest message size for GDR
 -x EXE="./dslash_test"                                                     # executable
 -x ARGS="--gridsize 2 2 4 8 --dim 24 24 24 24 --prec double --niter 10000" # executable run-time options
 ./run.sh                                                                   # name of the below script

In the run.sh script we set the order in which the CUDA devices are mapped to the local MPI ranks (the REORDER variable). Given this order, we ensure that the closest NIC for a given process is assigned, and furthermore we set the CPU cores available for each process to obtain the correct non-overlapping NUMA mapping.

#!/bin/bash                                                                                                                                                                                                                                                

# QUDA specific-environment variables                                                                                                                                                                                                                      

# set the QUDA tunecache path                                                                                                                                                                                                                              
export QUDA_RESOURCE_PATH=.

# enable GDR support                                                                                                                                                                                                                                       
export QUDA_ENABLE_GDR=1

export CUDA_DEVICE_MAX_CONNECTIONS=1

# this is the list of GPUs we have                                                                                                                                                                                                                         
GPUS=(0 1 2 3 4 5 6 7)

# This is the list of NICs we should use for each GPU                                                                                                                                                                                                      
# e.g., associate GPU0,1 with MLX0, GPU2,3 with MLX1, GPU4,5 with MLX2 and GPU6,7 with MLX3                                                                                                                                                                
NICS=(mlx5_0 mlx5_0 mlx5_1 mlx5_1 mlx5_2 mlx5_2 mlx5_3 mlx5_3)

# This is the list of CPU cores we should use for each GPU                                                                                                                                                                                                 
# e.g., 2x20 core CPUs split into 4 threads per process with correct NUMA assignment                                                                                                                                                                       
CPUS=(1-4 5-8 10-13 15-18 21-24 25-28 30-33 35-38)

# Number of physical CPU cores per GPU                                                                                                                                                                                                                     
export OMP_NUM_THREADS=4

# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)                                                                                                                                                                      
REORDER=(0 1 2 3 4 5 6 7)

# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]},${GPUS[${REORDER[6]}]},${GPUS[${REORDER[7]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]} ${NICS[${REORDER[6]}]} ${NICS[${REORDER[7]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]} ${CPUS[${REORDER[6]}]} ${CPUS[${REORDER[7]}]})

APP="$EXE $ARGS"

lrank=$OMPI_COMM_WORLD_LOCAL_RANK

export OMPI_MCA_btl_openib_if_include=${NIC_REORDER[lrank]}

numactl --physcpubind=${CPU_REORDER[$lrank]} $APP
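The three reorder assignments in the script above spell out every index by hand, so they must be edited whenever the number of GPUs changes. A minimal sketch of the same mapping written as a loop is given below. The function name apply_reorder is hypothetical; the array contents are copied from the script above.

```bash
#!/bin/bash
# Device, NIC and core lists, copied from the run.sh above.
GPUS=(0 1 2 3 4 5 6 7)
NICS=(mlx5_0 mlx5_0 mlx5_1 mlx5_1 mlx5_2 mlx5_2 mlx5_3 mlx5_3)
CPUS=(1-4 5-8 10-13 15-18 21-24 25-28 30-33 35-38)
REORDER=(0 1 2 3 4 5 6 7)

# Hypothetical helper: rebuild CUDA_VISIBLE_DEVICES, NIC_REORDER and
# CPU_REORDER from the current REORDER array, for any number of GPUs.
apply_reorder() {
  local i visible=""
  NIC_REORDER=()
  CPU_REORDER=()
  for i in "${REORDER[@]}"; do
    # Append a comma only when the list already has an entry.
    visible+="${visible:+,}${GPUS[$i]}"
    NIC_REORDER+=("${NICS[$i]}")
    CPU_REORDER+=("${CPUS[$i]}")
  done
  export CUDA_VISIBLE_DEVICES="$visible"
}

apply_reorder
```

The rest of the script (lrank, OMPI_MCA_btl_openib_if_include and the numactl call) can be used unchanged, since it only reads NIC_REORDER and CPU_REORDER.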
