
Getting things running on Chicoma #70

Open
Yurlungur opened this issue Feb 5, 2022 · 18 comments

@Yurlungur
Collaborator

Yurlungur commented Feb 5, 2022

I will use this issue to document the steps I've taken to get things running on Chicoma. This will be a living document.

Getting source code

First, download the code from git. Phoebus relies heavily on submodules, and so you must use a recursive clone:

git clone --recursive git@github.com:lanl/phoebus.git

To transfer the code to Chicoma, I found the easiest approach was to clone recursively to my desktop and then rsync, excluding version-control directories:

rsync --progress --exclude '.git' --exclude 'build' -rLpt phoebus -e 'ssh ${USER}@wtrw ssh' ch-fe:~

(Note: do not exclude bin directories, as the Kokkos bin directory is needed.)

We need to do this because the whole machine is behind a gateway server and you can't access the web.
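
Since everything goes through the gateway, a ProxyJump entry in ~/.ssh/config can collapse the two-hop connection into one name. This is only a sketch; the hostnames wtrw and ch-fe are taken from the rsync line above and may differ for your setup:

Host chicoma
    HostName ch-fe
    ProxyJump wtrw

after which the transfer shortens to: rsync --progress --exclude '.git' --exclude 'build' -rLpt phoebus chicoma:~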

Getting a node

Since Chicoma is a hybrid machine, request the gpu partition with -p gpu and request the 80GB A100s with -C gpu80. (These are flags for slurm.)

Full command:

salloc -p gpu -C gpu80 --time=8:00:00

To get a debug node (only the 40GB GPUs available), use:

salloc --qos=debug --reservation=gpu_debug --partition=gpu_debug --time=2:00:00

Note the debug nodes are available only for 2h at a time.
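
The same flags also work non-interactively in a batch script instead of salloc. A minimal sketch (hypothetical file, submitted with sbatch run_torus.sh):

#!/bin/bash
#SBATCH -p gpu
#SBATCH -C gpu80
#SBATCH --nodes=1
#SBATCH --time=8:00:00
# the srun launch line from the Running section goes here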

Coding environment

We need to enable a programming environment. Here is one possible path:

module unload cray-libsci
module load PrgEnv-gnu cudatoolkit/24.7_12.5
module load craype-accel-nvidia80
unset OMP_NUM_THREADS
unset OMP_PROC_BIND
unset OMP_PLACES
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED=1
C_NATIVE=cc
CXX_NATIVE=CC
module load cray-hdf5-parallel
module load cmake
export NVCC_WRAPPER_DEFAULT_COMPILER=CC

The module order shouldn't matter, but it does: always load cmake last, otherwise it does not correctly resolve the install paths of the other loaded modules.
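
Before configuring, it's worth a quick sanity check that the environment took (nothing Phoebus-specific, just standard commands):

module list                        # cmake should be the last module loaded
which cc CC                        # should resolve to the Cray compiler wrappers
nvcc --version                     # should work once the CUDA toolkit module is loaded
echo $MPICH_GPU_SUPPORT_ENABLED    # should print 1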

Compiling

Make a build directory

mkdir -p build && cd build

Here's the line required to build the torus problem

cmake -DPHOEBUS_GEOMETRY=MonopoleCart -DCMAKE_BUILD_TYPE=Release -DPHOEBUS_ENABLE_CUDA=ON -DPHOEBUS_ENABLE_MPI=ON -DPARTHENON_DISABLE_HDF5_COMPRESSION=ON ..
make -j

If you want to build on a frontend node, add the following flag

-DKokkos_ARCH_AMPERE80=ON 
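
For concreteness, a frontend configure line is then the same command as above with that flag appended (a sketch combining the flags already listed):

cmake -DPHOEBUS_GEOMETRY=MonopoleCart -DCMAKE_BUILD_TYPE=Release -DPHOEBUS_ENABLE_CUDA=ON -DPHOEBUS_ENABLE_MPI=ON -DPARTHENON_DISABLE_HDF5_COMPRESSION=ON -DKokkos_ARCH_AMPERE80=ON ..
make -j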

Running

A 2d sim on one GPU works as expected:

./src/phoebus -i ../inputs/torus.pin

When using MPI, use srun to launch the job; mpirun does not work as expected. A 3D sim across 2 nodes might be launched as:

srun -p gpu -C gpu80 --time=0:30:00 --nodes 2 --ntasks 8 --ntasks-per-node 4  ~/phoebus/external/parthenon/external/Kokkos/bin/hpcbind -- phoebus -i torus.pin

Note the hpcbind call, which prevents MPI ranks and their GPU assignments from migrating; such migration can cause problems.
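
If you suspect binding problems, a quick check is to launch something trivial through the same wrapper. Assuming hpcbind exports CUDA_VISIBLE_DEVICES per rank (its usual behavior), each rank should report a distinct device:

srun -p gpu -C gpu80 --nodes 2 --ntasks 8 --ntasks-per-node 4 ~/phoebus/external/parthenon/external/Kokkos/bin/hpcbind -- bash -c 'echo "rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'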

@Yurlungur
Collaborator Author

Chicoma appears to be extremely finicky. If you deviate from this recipe at all, you may experience problems. For example, cmake warns that you may want to set CRAYPE_LINK_TYPE=dynamic. DO NOT DO THIS. Your code will build, but srun will segfault when you try to run your code.

@Yurlungur
Collaborator Author

Update: It works! But the instructions are very different from before. In particular, PrgEnv-nvidia no longer works; instead I must use PrgEnv-gnu. Also the cuda-aware MPI instructions are different, as the OFI transport layer is broken and UCX must be used. See
https://hpc.lanl.gov/platforms/chicoma/gpu-best-practices.html

@Yurlungur Yurlungur pinned this issue Jul 29, 2022
@Yurlungur Yurlungur added the documentation Improvements or additions to documentation label Jul 29, 2022
@Yurlungur
Collaborator Author

From Shengtai with AthenaPK:

module load cpe-cuda cuda cmake cray-hdf5-parallel
setenv CRAY_ACCEL_TARGET nvidia80
setenv MPICH_GPU_SUPPORT_ENABLED 1
cmake -S. -Bbuild-gpu -DCMAKE_CXX_COMPILER=CC

@Yurlungur
Collaborator Author

Yurlungur commented Aug 8, 2022

My experimentation confirms that the OFI transport layer works now. It is much faster than UCX. Main post updated.

@Yurlungur
Collaborator Author

Since these instructions are being used by other parthenon-based codes, please see this Parthenon issue:
parthenon-hpc-lab/parthenon#720

I have added the relevant Kokkos config line to the Phoebus CMakeLists.txt but it is not in Parthenon at this time. (And it's not clear it should be.)

@Yurlungur Yurlungur self-assigned this Aug 22, 2022
@brryan
Collaborator

brryan commented Oct 24, 2022

It seems that PrgEnv-cray is no longer loaded by default, so instead of the above line

module swap PrgEnv-cray PrgEnv-gnu 

I just do

module load PrgEnv-gnu

@Yurlungur
Collaborator Author

Thanks @brryan, instructions updated.

@brryan
Collaborator

brryan commented Oct 27, 2022

If backspace isn't working in vim on Chicoma, add this to your ~/.vimrc:

:set backspace=indent,eol,start

@Yurlungur
Collaborator Author

The Chicoma environment has changed. Notably, cuda-aware MPI has been removed/disabled. I am updating the top-level comment. But I include the old procedure below for posterity.

Coding environment

We need to enable a programming environment. There's an Nvidia environment, but it seems broken, and is missing cuda headers. I used the GNU backend:

module load PrgEnv-gnu

if the cray environment is already loaded, you may need to do

module swap PrgEnv-cray PrgEnv-gnu 

I also needed to load cpe-cuda which does... something:

module load cpe-cuda

and then we can load relevant modules:

module load cuda cray-hdf5-parallel cmake

Note that I put cmake last in the module list. This shouldn't matter, but it does. Always add cmake last.

MPI

There are two transport layers available on Chicoma: UCX and OFI. UCX is ethernet, OFI is high-speed fiber, so the latter is recommended. (I notice a significant performance degradation with UCX.) For debugging, however, it's useful to include both, as OFI appears to be more finicky.

Regardless, set

export CRAY_ACCEL_TARGET=nvidia80

to enable cuda-aware MPI.

OFI

You must set the following environment variable for OFI

export MPICH_GPU_SUPPORT_ENABLED=1

UCX

To swap to the UCX transport layer:

module swap craype-network-ofi craype-network-ucx
module swap cray-mpich cray-mpich-ucx

To enable CUDA-aware MPI we additionally need to set the following environment variable

export UCX_TLS="self,ud,cuda,sm"

Compiling

Make a build directory

mkdir -p build && cd build

Here's the line required to build the torus problem

cmake -DPHOEBUS_ENABLE_CUDA=ON -DPHOEBUS_GEOMETRY=FMKS -DPHOEBUS_CACHE_GEOMETRY=ON -DPARTHENON_DISABLE_HDF5_COMPRESSION=ON -DCMAKE_CXX_COMPILER=CC ..
make -j

Note the -DPARTHENON_DISABLE_HDF5_COMPRESSION=ON line. The version of hdf5 available here still doesn't support compression with MPI. Also note -DCMAKE_CXX_COMPILER=CC. This is required to force cmake to use the cray compiler wrappers.

Running

A 2d sim on one GPU works as expected:

./src/phoebus -i ../inputs/torus.pin

When using MPI, use srun to launch the job; mpirun does not work as expected. A 3D sim across 2 nodes might be launched as:

srun -p gpu -C gpu80 --time=0:30:00 --nodes 2 --ntasks 8 --ntasks-per-node 4  phoebus -i torus.pin

@Yurlungur
Collaborator Author

Need to update the instructions again. Currently the top-level instructions require PR #177; however, it should be merged soon. Old instructions saved below for posterity.

I will use this issue to document the steps I've taken to get things running on Chicoma. This will be a living document.

Getting source code

First, download the code from git. Phoebus relies heavily on submodules, and so you must use a recursive clone:

git clone --recursive git@github.com:lanl/phoebus.git

To transfer the code to Chicoma, I found the easiest approach was to clone recursively to my desktop and then rsync, excluding version-control directories:

rsync --progress --exclude '.git' --exclude 'build' -rLpt phoebus -e 'ssh ${USER}@wtrw ssh' ch-fe:~

(Note: do not exclude bin directories, as the Kokkos bin directory is needed.)

We need to do this because the whole machine is behind a gateway server and you can't access the web.

Getting a node

Note: for compilation to work, you must be on a backend node; I could not get a build to work on a frontend node.

Since Chicoma is a hybrid machine, request the gpu partition with -p gpu and request the 80GB A100s with -C gpu80. (These are flags for slurm.)

Full command:

salloc -p gpu -C gpu80 --time=8:00:00

To get a debug node (only the 40GB GPUs available), use:

salloc --qos=debug --reservation=gpu_debug --partition=gpu_debug --time=2:00:00

Note the debug nodes are available only for 2h at a time.

Coding environment

We need to enable a programming environment. There are two approaches one can follow.

The NVHPC Path

Credit to @bprather for finding this path. nvhpc is NVIDIA's newer compiler toolchain, and Chicoma has a "programming environment" for this toolchain in particular. Here's the build approach that worked:

module purge
module load PrgEnv-nvhpc
export CRAY_CPU_TARGET="x86-64"

The GNU + Cuda approach

This approach is closest to the old workflow and worked for me:

module swap PrgEnv-cray PrgEnv-gnu 

I also needed to load cpe-cuda which does... something:

module load cpe-cuda

and then finally we load cuda:

module load cuda

MPI

Cuda-aware MPI has been disabled on Chicoma for some reason. The only path currently is to use CPU-side MPI with host-pinned memory. This can be enabled by adding the

-DPARTHENON_ENABLE_HOST_COMM_BUFFERS=ON

flag to your cmake configuration line. You'll see this in the config line below.

No need to manually load MPI. CPU-capable cray-mpich is loaded automatically when you load one of the PrgEnv-* modules.

HDF5

There is a cray-hdf5-parallel module. Cray will let you load it, no matter what programming environment you're in. If you use the GNU+CUDA path above, then this is fine. Just write

module load cray-hdf5-parallel

to get working parallel HDF5.

However, if you used the nvhpc path above, this combination of modules is broken and you must compile parallel HDF5 yourself in your home directory. Fortunately this isn't too hard. Just download the HDF5 source tarball, decompress it, enter the source directory, and then write

CC=CC ./configure --prefix=/my/local/install/path --enable-build-mode=production --enable-hl --enable-symbols=yes --enable-parallel
make -j
make install
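
To make the Phoebus configure step pick up this hand-built HDF5 rather than a system copy, point CMake at the install prefix (HDF5_ROOT is a standard find_package hint; the path below is whatever prefix you passed to --prefix):

export HDF5_ROOT=/my/local/install/path
# or equivalently add -DHDF5_ROOT=/my/local/install/path to the cmake line below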

CMake

Finally load the cmake module last:

module load cmake

This shouldn't matter, but it does. Always load cmake last. Otherwise it does not correctly resolve install paths for loaded modules.

Compiling

Make a build directory

mkdir -p build && cd build

Here's the line required to build the torus problem

cmake -DPARTHENON_ENABLE_HOST_COMM_BUFFERS=ON -DPHOEBUS_ENABLE_CUDA=ON -DPHOEBUS_GEOMETRY=FMKS -DPHOEBUS_CACHE_GEOMETRY=ON -DPARTHENON_DISABLE_HDF5_COMPRESSION=ON -DCMAKE_CXX_COMPILER=CC ..
make -j

Note a few aspects of this command:

  • The -DPARTHENON_ENABLE_HOST_COMM_BUFFERS=ON flag as discussed above.
  • The -DPARTHENON_DISABLE_HDF5_COMPRESSION=ON line. The version of hdf5 available here still doesn't support compression with MPI.
  • The -DCMAKE_CXX_COMPILER=CC flag. This is required to force cmake to use the cray compiler wrappers.

Running

A 2d sim on one GPU works as expected:

./src/phoebus -i ../inputs/torus.pin

When using MPI, use srun to launch the job; mpirun does not work as expected. A 3D sim across 2 nodes might be launched as:

srun -p gpu -C gpu80 --time=0:30:00 --nodes 2 --ntasks 8 --ntasks-per-node 4  phoebus -i torus.pin

@ajdittmann
Collaborator

> Note for compilation to work, you must be on a backend node. I could not get a frontend node to build.

I was able to compile on a frontend node by adding the -DKokkos_ARCH_AMPERE80=ON flag.

@Yurlungur
Collaborator Author

Ah, thanks for that @ajdittmann, I've updated the instructions.

@Yurlungur
Collaborator Author

I am currently having trouble compiling singularity-eos and singularity-opac with the same HDF5 version as the one found by Phoebus: CMake seems to have changed behaviour and re-runs the find_package inconsistently. However, the singularity-eos and singularity-opac libraries are both header-only and do not need to pull in HDF5. For now I recommend simply commenting out the lines about HDF5 in both submodules (see the sketch below for locating them).
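
To find the lines in question, something like this should work (a sketch; it assumes the submodules are checked out under external/, so adjust the paths if yours differ):

grep -n -i hdf5 external/singularity-eos/CMakeLists.txt external/singularity-opac/CMakeLists.txt
# then comment out the matching HDF5 find_package / link lines in each file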

@Yurlungur
Collaborator Author

singularity-eos and singularity-opac and HDF5 now all seem to behave with my updated branch.

@Yurlungur
Collaborator Author

MR #177 now in main. Top level instructions should be the ground truth for Chicoma at this time.

@gshipman

On Chicoma, try:
git config --global http.proxy http://proxyout.lanl.gov:8080
and
git config --global https.proxy https://proxyout.lanl.gov:8080
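
And to undo them when working outside the lab network:

git config --global --unset http.proxy
git config --global --unset https.proxy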

@mari2895
Collaborator

mari2895 commented Mar 27, 2024

We should update this to:

module swap PrgEnv-cray PrgEnv-nvhpc
module load craype-accel-nvidia80
export CRAY_ACCEL_TARGET=nvidia80
export MPICH_GPU_SUPPORT_ENABLED=1
export NVCC_WRAPPER_DEFAULT_COMPILER=CC
export CC=$(which cc) # not sure why these are necessary but they appear to be
export CXX=$(which CC)
export FC=$(which ftn)
module load cray-hdf5-parallel
module load cmake/3.25.1
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_RDZV_THRESHOLD=64000

And then
srun -n ${SLURM_NTASKS} ~/phoebus/external/parthenon/external/Kokkos/bin/hpcbind -- phoebus -i problem.pin

@Yurlungur
Collaborator Author

Old version for archival

I will use this issue to document the steps I've taken to get things running on Chicoma. This will be a living document.

Getting source code

First, download the code from git. Phoebus relies heavily on submodules, and so you must use a recursive clone:

git clone --recursive git@github.com:lanl/phoebus.git

To transfer the code to Chicoma, I found the easiest approach was to clone recursively to my desktop and then rsync, excluding version-control directories:

rsync --progress --exclude '.git' --exclude 'build' -rLpt phoebus -e 'ssh ${USER}@wtrw ssh' ch-fe:~

(Note: do not exclude bin directories, as the Kokkos bin directory is needed.)

We need to do this because the whole machine is behind a gateway server and you can't access the web.

Getting a node

Since Chicoma is a hybrid machine, request the gpu partition with -p gpu and request the 80GB A100s with -C gpu80. (These are flags for slurm.)

Full command:

salloc -p gpu -C gpu80 --time=8:00:00

To get a debug node (only the 40GB GPUs available), use:

salloc --qos=debug --reservation=gpu_debug --partition=gpu_debug --time=2:00:00

Note the debug nodes are available only for 2h at a time.

Coding environment

We need to enable a programming environment. Currently, the only code path which supports GPUDirect RDMA is the following:

The NVHPC Path

Credit to @bprather for finding this path. nvhpc is NVIDIA's newer compiler toolchain, and Chicoma has a "programming environment" for this toolchain in particular. Here's the build approach that worked:

module swap PrgEnv-cray PrgEnv-nvhpc
module load craype-accel-nvidia80
export CRAY_ACCEL_TARGET=nvidia80
export MPICH_GPU_SUPPORT_ENABLED=1
export NVCC_WRAPPER_DEFAULT_COMPILER=CC
export CC=$(which cc) # not sure why these are necessary but they appear to be
export CXX=$(which CC)
export FC=$(which ftn)
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_RDZV_THRESHOLD=64000

HDF5

There is a cray-hdf5-parallel module. Cray will let you load it, no matter what programming environment you're in. You can load it with

module load cray-hdf5-parallel

At the moment it appears to work with the PrgEnv-nvhpc environment. Note that for cmake to find HDF5, an additional hint to cmake is required; it is described below.

CMake

Finally load the cmake module last:

module load cmake/3.25.1 # version may need to be changed later

This shouldn't matter, but it does. Always load cmake last. Otherwise it does not correctly resolve install paths for loaded modules.

Compiling

Make a build directory

mkdir -p build && cd build

Here's the line required to build the torus problem

cmake  -DPHOEBUS_ENABLE_CUDA=ON -DPHOEBUS_GEOMETRY=FMKS -DPHOEBUS_CACHE_GEOMETRY=ON -DPARTHENON_DISABLE_HDF5_COMPRESSION=ON -DCMAKE_CXX_COMPILER=/path/to/phoebus/scripts/bash/nvcc_wrapper -DHDF5_INCLUDE_DIR=${HDF5_ROOT}/include ..
make -j

Note a few aspects of this command:

  • The -DPARTHENON_DISABLE_HDF5_COMPRESSION=ON line. The version of hdf5 available here still doesn't support compression with MPI.
  • The -DCMAKE_CXX_COMPILER=/path/to/phoebus/scripts/bash/nvcc_wrapper flag. Note you should put in the appropriate full path to the wrapper stored in phoebus/scripts/bash/nvcc_wrapper. This is a modified version of the nvcc_wrapper shipped with Kokkos; it forces nvcc (rather than nvc++) to be used for CUDA compilation and the Cray compiler wrapper (rather than nvc++) for host-side compilation, and it forcibly passes through important flags. Ideally this won't be needed down the road, but currently it is required.
  • The -DHDF5_INCLUDE_DIR=${HDF5_ROOT}/include line is required as a hint to cmake so that it finds the HDF5 installation provided by the system modules. ${HDF5_ROOT} is a variable automatically provided by the module system.

If you want to build on a frontend node, add the following flag

-DKokkos_ARCH_AMPERE80=ON 

You may also need these flags (YMMV)

-DCMAKE_CXX_FLAGS="${PE_MPICH_GTL_DIR_nvidia80} ${PE_MPICH_GTL_LIBS_nvidia80}"

which should be implied by the compiler wrappers, but seem not to be appropriately passed through on the frontend.

Running

A 2d sim on one GPU works as expected:

./src/phoebus -i ../inputs/torus.pin

When using MPI, use srun to launch the job; mpirun does not work as expected. A 3D sim across 2 nodes might be launched as:

srun -p gpu -C gpu80 --time=0:30:00 --nodes 2 --ntasks 8 --ntasks-per-node 4  ~/phoebus/external/parthenon/external/Kokkos/bin/hpcbind -- phoebus -i torus.pin

Note the hpcbind call, which prevents MPI ranks and their GPU assignments from migrating; such migration can cause problems.
