Argonne 2021 GPU Hackathon

Building

For general instructions, see the ThetaGPU documentation. If you use bash, check the bottom of that document, as it contains special instructions.

Log in to one of the GPU service nodes (thetagpusn1 or thetagpusn2):

ssh thetagpusn1

Put the following in your ~/.bashrc (or similar), or execute it after each login.

# proxies so that you can clone Parthenon from the internet (and get other data if required)
export http_proxy=http://proxy.tmi.alcf.anl.gov:3128
export https_proxy=http://proxy.tmi.alcf.anl.gov:3128

# A more recent CMake version is currently not available from the module system, so we set the path manually
export PATH=/soft/buildtools/cmake/3.14.5/bin:$PATH

# Parthenon machine file that includes all required paths and options
export MACHINE_CFG=/grand/gpu_hack/parthenon/ref/ThetaGPU.cmake

Compiling needs to be done on the DGX nodes, e.g., in an interactive session via

qsub -t 60 -n 1 -q single-gpu -A gpu_hack -I

Get Parthenon

git clone https://github.com/lanl/parthenon.git
cd parthenon
git submodule init
git submodule update

Build Parthenon

mkdir build
cd build

# builds for cuda with mpi (default)
cmake ..
make

# OR cuda and no mpi
cmake -DMACHINE_VARIANT=cuda ..
make

# OR host(gcc) with mpi
cmake -DMACHINE_VARIANT=mpi ..
make

Test problems

Two input files are located in the /grand/gpu_hack/parthenon/ref/ folder

parthinput.block32 with 32^3 blocks
parthinput.block16 with 16^3 blocks (adds more stress to the AMR part)

Some high level numbers (no profiling data) and sample output

A sample output may look like

./example/advection/advection-example -i /grand/gpu_hack/parthenon/ref/parthinput.block16
...
cycle=3 time=8.7890624999999991e-04 dt=2.9296874999999999e-04 zone-cycles/wsec_step=1.81e+07 wsec_step=1.72e+00 zone-cycles/wsec=4.08e+06 wsec_AMR=5.92e+00
-------------- New Mesh structure after (de)refinement -------------
Root grid = 16 x 16 x 16 MeshBlocks
Total number of MeshBlocks = 7687
Number of physical refinement levels = 2
Number of logical  refinement levels = 6
  Physical level = 0 (logical level = 4): 3753 MeshBlocks, cost = 3753
  Physical level = 1 (logical level = 5): 2574 MeshBlocks, cost = 2574
  Physical level = 2 (logical level = 6): 1360 MeshBlocks, cost = 1360
--------------------------------------------------------------------
cycle=4 time=1.1718750000000000e-03 dt=2.9296874999999999e-04 zone-cycles/wsec_step=1.80e+07 wsec_step=1.75e+00 zone-cycles/wsec=1.80e+07 wsec_AMR=4.92e-04
cycle=5 time=1.4648437500000000e-03 dt=2.9296874999999999e-04 zone-cycles/wsec_step=2.09e+07 wsec_step=1.50e+00 zone-cycles/wsec=2.09e+07 wsec_AMR=5.09e-04

Driver completed.
time=1.46e-03 cycle=5
tlim=1.00e+00 nlim=5

Number of MeshBlocks = 7687; 3591  created, 0 destroyed during this simulation.

The relevant information here is:

the "New Mesh structure after (de)refinement" message, which indicates that load balancing and/or mesh refinement happened
the performance per cycle, e.g., cycle=3 time=8.7890624999999991e-04 dt=2.9296874999999999e-04 zone-cycles/wsec_step=1.81e+07 wsec_step=1.72e+00 zone-cycles/wsec=4.08e+06 wsec_AMR=5.92e+00, where the last number shows how much time (in wall seconds) was spent just on AMR/load balancing. In this example, this is 5.92 seconds compared to the 1.72 seconds (wsec_step=1.72e+00) required for the timestep itself (without load balancing/AMR). Note that the timestep number is artificially bad because not all parts of the test problem have been converted to the "pack of blocks" (MeshBlockPack) approach.
Thus, the main goal (or one of the main goals) of the Hackathon is to reduce the wsec_AMR time, as wsec_step will automatically improve along the way with existing approaches.
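
To track these two numbers over a longer run, the per-cycle lines can be pulled out of a saved log. The following is a minimal sketch in Python, assuming the line format shown in the sample output above; the helper name parse_cycle_line is hypothetical and not part of Parthenon.

```python
import re

# Matches "key=value" pairs whose value is a number such as 3, 1.72e+00 or 8.79e-04.
PAIR = re.compile(r"([\w/-]+)=([-+0-9.eE]+)")

def parse_cycle_line(line):
    """Return the numeric fields of one per-cycle line, or None for other lines."""
    if not line.startswith("cycle="):
        return None
    return {key: float(value) for key, value in PAIR.findall(line)}

# The cycle=3 line from the sample output above.
line = ("cycle=3 time=8.7890624999999991e-04 dt=2.9296874999999999e-04 "
        "zone-cycles/wsec_step=1.81e+07 wsec_step=1.72e+00 "
        "zone-cycles/wsec=4.08e+06 wsec_AMR=5.92e+00")

fields = parse_cycle_line(line)
print(f"cycle {int(fields['cycle'])}: "
      f"wsec_step={fields['wsec_step']:.2f} s, wsec_AMR={fields['wsec_AMR']:.2f} s")
```

Lines that do not start with cycle= (mesh structure reports, warnings) are skipped, so the function can be applied to every line of the captured output.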

A first reference

For the config files provided above on ThetaGPU

Config	`wsec_AMR` (s)	`wsec_step` (s)	ratio (`wsec_AMR` / `wsec_step`)
1 GPU 16^3	5.92	1.72	3.44
1 Host core 16^3	1.1	2.0	0.55
1 GPU 32^3	2.20	0.89	2.47
1 Host core 32^3	1.57	3.25	0.48

Note that the ratio should be interpreted with care: load balancing and refinement involve little "computation" and lots of (host) "management" tasks, so it is not a fair comparison.
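
The ratio column is wsec_AMR divided by wsec_step. As a quick check, the values can be recomputed from the two timing columns; this is only an arithmetic sketch in Python using the numbers from the table.

```python
# (config, wsec_AMR in seconds, wsec_step in seconds), copied from the table above
rows = [
    ("1 GPU 16^3", 5.92, 1.72),
    ("1 Host core 16^3", 1.1, 2.0),
    ("1 GPU 32^3", 2.20, 0.89),
    ("1 Host core 32^3", 1.57, 3.25),
]

# ratio > 1 means a refinement/load-balancing step costs more than a timestep
ratios = {name: round(amr / step, 2) for name, amr, step in rows}

for name, ratio in ratios.items():
    print(f"{name:18s} {ratio:.2f}")
```

On the GPU the ratio is above 1 for both block sizes, whereas on a single host core it stays below 1.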

You may also notice that the startup time on GPUs is significantly longer than on host only, which is likely also related to the mesh initialization that also includes refinement and creating lots of blocks (and thus memory allocations).

Collecting profiling data

Build the connector (which allows named regions and kernel names to be properly shown in the profiler rather than the long names deduced from the templates)

git clone https://github.com/kokkos/kokkos-tools.git
cd kokkos-tools/profiling/nvprof-connector
export CUDA_ROOT=/usr/local/cuda
make

Now there should be a kp_nvprof_connector.so file.

Collect the data (using Nsight Systems as nvprof and nvvp are deprecated)

# "enable" the Kokkos profiling tool
export KOKKOS_PROFILE_LIBRARY=/PATH/TO/kokkos-tools/profiling/nvprof-connector/kp_nvprof_connector.so


# collect the actual data
nsys profile -o my_profile ./example/advection/advection-example -i /grand/gpu_hack/parthenon/ref/parthinput.block16

Now there should be a my_profile.qdrep file (which should be copied to a system with a desktop environment).

Analyze the data

Start the GUI (nsys-ui)
Go to Tools - Options and set "Rename CUDA Kernels by NVTX" to "Yes" to get the Kokkos labels shown on the GUI.

Multi variable test file (single GPU)

<parthenon/job>
problem_id = advection

<parthenon/mesh>
refinement = adaptive
numlevel = 4 

nx1 = 128 
x1min = -1.50
x1max = 1.50
ix1_bc = periodic
ox1_bc = periodic

nx2 = 128 
x2min = -1.50
x2max = 1.50
ix2_bc = periodic
ox2_bc = periodic

nx3 = 128 
x3min = -1.50
x3max = 1.50
ix3_bc = periodic
ox3_bc = periodic

<parthenon/meshblock>
nx1 = 32
nx2 = 32
nx3 = 32

<parthenon/time>
tlim = 1.0 
integrator = rk1 
nlim = 10
perf_cycle_offset = 1 
ncycle_out_mesh=-10

<Advection>
cfl = 0.30
vx = 1.0 
vy = 1.0 
vz = 1.0 
profile = smooth_gaussian
ang_2 = 0.0 
ang_3 = 0.0 
ang_2_vert = false
ang_3_vert = false
amp = 1.0 

refine_tol = 1.050    # control the package specific refinement tagging function
derefine_tol = 1.001
compute_error = false

num_vars = 10 # number of variables
vec_size = 1 # size of each variable
fill_derived = false # whether to fill one-copy test vars
buffer_send_pack = true  # send all buffers using packs
buffer_recv_pack = true  # receive buffers using packs
buffer_set_pack =  true  # set received buffers using packs

This configuration should use about 14 GB of memory.
The output should look like:

#Variables in use:
# Package: parthenon::resolved_state
# ---------------------------------------------------
# Variables:
# Name	Metadata flags
# ---------------------------------------------------
advected_4                Provides,Cell,Independent,FillGhost
advected_5                Provides,Cell,Independent,FillGhost
advected_6                Provides,Cell,Independent,FillGhost
advected_3                Provides,Cell,Independent,FillGhost
advected_1                Provides,Cell,Independent,FillGhost
advected_8                Provides,Cell,Independent,FillGhost
advected_7                Provides,Cell,Independent,FillGhost
advected_9                Provides,Cell,Independent,FillGhost
advected                  Provides,Cell,Independent,FillGhost
advected_2                Provides,Cell,Independent,FillGhost
# ---------------------------------------------------
# Sparse Variables:
# Name	sparse id	Metadata flags
# ---------------------------------------------------
# ---------------------------------------------------
# Swarms:
# Swarm	Value	metadata
# ---------------------------------------------------

### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.

Setup complete, executing driver...

cycle=0 time=0.0000000000000000e+00 dt=8.7890624999999991e-04 zone-cycles/wsec_step=0.00e+00 wsec_step=5.40e-03 zone-cycles/wsec=0.00e+00 wsec_AMR=0.00e+00
---------------------- Current Mesh structure ----------------------
Root grid = 4 x 4 x 4 MeshBlocks
Total number of MeshBlocks = 232
Number of physical refinement levels = 3
Number of logical  refinement levels = 5
  Physical level = 0 (logical level = 2): 56 MeshBlocks, cost = 56
  Physical level = 1 (logical level = 3): 56 MeshBlocks, cost = 56
  Physical level = 2 (logical level = 4): 56 MeshBlocks, cost = 56
  Physical level = 3 (logical level = 5): 64 MeshBlocks, cost = 64
--------------------------------------------------------------------
cycle=1 time=8.7890624999999991e-04 dt=8.7890624999999991e-04 zone-cycles/wsec_step=0.00e+00 wsec_step=7.78e-01 zone-cycles/wsec=0.00e+00 wsec_AMR=4.65e-06
cycle=2 time=1.7578124999999998e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=2.05e+07 wsec_step=3.71e-01 zone-cycles/wsec=2.05e+07 wsec_AMR=3.87e-06
cycle=3 time=2.6367187499999997e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=2.05e+07 wsec_step=3.70e-01 zone-cycles/wsec=2.05e+07 wsec_AMR=3.49e-06
cycle=4 time=3.5156249999999997e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=2.06e+07 wsec_step=3.70e-01 zone-cycles/wsec=2.06e+07 wsec_AMR=3.61e-06
cycle=5 time=4.3945312500000000e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=2.06e+07 wsec_step=3.68e-01 zone-cycles/wsec=2.06e+07 wsec_AMR=3.54e-06
cycle=6 time=5.2734375000000003e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=2.06e+07 wsec_step=3.69e-01 zone-cycles/wsec=2.06e+07 wsec_AMR=3.63e-06
cycle=7 time=6.1523437500000007e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=2.05e+07 wsec_step=3.70e-01 zone-cycles/wsec=2.34e+06 wsec_AMR=2.88e+00
-------------- New Mesh structure after (de)refinement -------------
Root grid = 4 x 4 x 4 MeshBlocks
Total number of MeshBlocks = 484
Number of physical refinement levels = 3
Number of logical  refinement levels = 5
  Physical level = 0 (logical level = 2): 44 MeshBlocks, cost = 44
  Physical level = 1 (logical level = 3): 140 MeshBlocks, cost = 140
  Physical level = 2 (logical level = 4): 140 MeshBlocks, cost = 140
  Physical level = 3 (logical level = 5): 160 MeshBlocks, cost = 160
--------------------------------------------------------------------
cycle=8 time=7.0312500000000010e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=1.18e+07 wsec_step=1.35e+00 zone-cycles/wsec=1.18e+07 wsec_AMR=1.11e-05
cycle=9 time=7.9101562500000014e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=1.96e+07 wsec_step=8.08e-01 zone-cycles/wsec=1.96e+07 wsec_AMR=9.08e-06
cycle=10 time=8.7890625000000017e-03 dt=8.7890624999999991e-04 zone-cycles/wsec_step=1.97e+07 wsec_step=8.05e-01 zone-cycles/wsec=1.97e+07 wsec_AMR=8.55e-06
---------------------- Current Mesh structure ----------------------
Root grid = 4 x 4 x 4 MeshBlocks
Total number of MeshBlocks = 484
Number of physical refinement levels = 3
Number of logical  refinement levels = 5
  Physical level = 0 (logical level = 2): 44 MeshBlocks, cost = 44
  Physical level = 1 (logical level = 3): 140 MeshBlocks, cost = 140
  Physical level = 2 (logical level = 4): 140 MeshBlocks, cost = 140
  Physical level = 3 (logical level = 5): 160 MeshBlocks, cost = 160
--------------------------------------------------------------------

Driver completed.
time=8.79e-03 cycle=10
tlim=1.00e+00 nlim=10

Number of MeshBlocks = 484; 420  created, 0 destroyed during this simulation.

walltime used = 8.06e+00
zone-cycles/wallsecond = 1.16e+07
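
The per-cycle throughput numbers in this output are consistent with a simple relation: number of MeshBlocks times cells per block, divided by the wall time of the cycle. The sketch below checks this in Python against three of the lines above. The relation is inferred from the printed numbers rather than taken from the Parthenon source, and the function name is hypothetical.

```python
def zone_cycles_per_second(num_blocks, block_nx, wall_seconds):
    """Zones updated per wall second, assuming cubic blocks of block_nx^3 cells."""
    return num_blocks * block_nx**3 / wall_seconds

# cycle=2: 232 blocks of 32^3 cells, wsec_step=3.71e-01
before = zone_cycles_per_second(232, 32, 3.71e-01)

# cycle=9: 484 blocks after refinement, wsec_step=8.08e-01
after = zone_cycles_per_second(484, 32, 8.08e-01)

# cycle=7: 232 blocks, but counting wsec_step + wsec_AMR (3.70e-01 + 2.88e+00)
with_amr = zone_cycles_per_second(232, 32, 3.70e-01 + 2.88e+00)

print(f"{before:.2e} {after:.2e} {with_amr:.2e}")
# prints: 2.05e+07 1.96e+07 2.34e+06
```

These match zone-cycles/wsec_step for cycles 2 and 9 and zone-cycles/wsec for cycle 7. The cycle=7 case shows how a single refinement step lowers the effective throughput of that cycle by roughly an order of magnitude.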
