NERSC GPU hackathon (Dec 2021)
This page summarises preparations for the hackathon on the 2nd/8th/9th/10th December 2021. We will use CoreNEURON+NMODL with a few sets of test model data.
NOTE: Any issues that need to be handled after the hackathon should be reported in https://github.com/neuronsimulator/gpuhackathon/issues
- Olli - Install Caliper, Ninja, NMODL and NMODL's Python dependencies on Ascent.
- Pramod - Update Ascent instructions to enable NMODL (same as NERSC).
- Olli - Channel benchmark failing on Ascent while loading the input dataset. Should we re-generate the data on Ascent?
- Ioannis - Generate & copy the input dataset for solver analysis and optimisation during the hackathon, and add the relevant info to this wiki page.
- Omar - List all OpenACC directives and APIs used in CoreNEURON + NMODL. See https://github.com/neuronsimulator/gpuhackathon/blob/main/overview.md
- Olli - Common profiling/benchmarking script which can compare & plot OpenMP vs OpenACC runtimes.
First Day Presentation Preparation
- Pramod + Olli: 3 min Introduction
- Ioannis + Christos: Solver introduction to mentors
- Pramod + Omar + Olli: Compute loop + DSL code generation introduction to mentors
- Alex: Check the neuron-gpu-nightly execution on Perlmutter. Does it work with/without MPI?
- Alex: Check that the NEURON + CoreNEURON (without InterViews) and NEURON + CoreNEURON + NMODL (without InterViews) builds work on Perlmutter & Ascent.
- Alex: Should we automatically disable InterViews on Cray and IBM Power systems via CMake detection?
- Alex: Check whether any improvements are needed for building from source on the hackathon systems.
- Nico + Alex: Make the usage of OpenACC APIs (`acc_memcpy_to_device` and `acc_copyin`) portable to OpenMP as well as OpenACC. In short, https://github.com/BlueBrain/CoreNeuron/blob/master/coreneuron/gpu/nrn_acc_manager.cpp should support OpenACC as well as OpenMP; see the first sketch after this list.
- Nico + Alex: Test first the ringtest and then the channel-benchmark for the correctness of the migration.
- Pramod: Review OpenACC/OpenMP interoperability via API calls, see e.g. https://forums.developer.nvidia.com/t/openacc-and-openmp-data-interoperability/181866.
- Omar: Add an OpenMP offload example like https://bbpcode.epfl.ch/reps/user/kumbhar/cunrnran123/tree/mainacc.c (still using CUDA Random123) -- the repository moved to GitLab: https://bbpgitlab.epfl.ch/hpc/user/cunrnran123
- Omar: Try Random123 directly in OpenMP offload regions (see the second sketch after this list).
- Omar: Anything to discuss with / ask the mentors about Random123 portability on different GPUs?
- Add an option in CoreNEURON that only executes the Hines solver?
- Find a way to dynamically set a good value for the `nwarp` variable (used to distribute the cells).
- Ioannis + Christos: Profile a realistic model and analyse different performance metrics with the mentors to understand the limitations.
- Ioannis + Christos: With the current algorithm, investigate possible performance improvement opportunities.
- Ioannis + Christos: Performance comparison of the OpenACC vs OpenMP vs CUDA implementations.
- Ioannis + Christos: Would the solver benefit from using special memory types?
- Ioannis + Christos: Look into algorithmic improvements to expose more parallelism (if suitable)
- Olli - Update the codegen backend to emit OpenMP constructs alongside OpenACC (ongoing).
- Olli + Omar - Review performance aspects with the mentors (e.g. `present` vs `map`).
- Build system changes to enable OpenMP target offload as well as OpenACC
- Olli: Try simple OpenMP + OpenACC offload test (e.g. with inbuilt ring test)
- Olli: Update the Hines solver in CoreNEURON with OpenMP (ongoing)
  - Christos + Ioannis to review: https://github.com/BlueBrain/CoreNeuron/pull/693
  - Confirm whether we need an OpenMP equivalent to `#pragma acc loop seq` (illustrated in the first sketch after this list)
- Pramod: Measure performance difference between OpenMP and OpenACC runs
- Olli: In the CoreNEURON event communication code, start tackling the OpenACC pragmas and API calls that are not yet converted to OpenMP.
- Pramod: Discuss compiler bug / error with NVIDIA colleague (Brent): https://forums.developer.nvidia.com/t/enabling-openmp-offload-breaks-openacc-code/196643/6
- Pramod: Prepare a slide describing the goals - what we currently do and what we would like to do.
- Pramod: Organise/plan a discussion with one or a few NVIDIA compiler engineers. The central question is ABI compatibility across different CUDA + OpenMP + OpenACC runtimes/compilers.
- First, check/fix https://github.com/BlueBrain/CoreNeuron/issues/594
- Re-measure the slowdown from using unified memory (ringtest + channel-benchmark)
- Identify which data structures are causing the slowdown. How can we identify this easily with the tools?
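For the OpenACC-to-OpenMP migration items above, here is a minimal sketch of how the data-management directives/APIs could be paired up, and of the `loop seq` question. Everything in it is hypothetical rather than CoreNEURON code: the guard macro `CORENRN_SKETCH_OPENACC`, the helper names and the toy kernel are illustrative assumptions only.

```cpp
#include <cstddef>

void enter_data(double* a, std::size_t n) {
#ifdef CORENRN_SKETCH_OPENACC
    // Directive form of acc_copyin(a, n * sizeof(double)).
    #pragma acc enter data copyin(a[0:n])
#else
    // OpenMP 4.5+ analogue: allocate on the device and copy host -> device.
    #pragma omp target enter data map(to: a[0:n])
#endif
}

void update_device(double* a, std::size_t n) {
#ifdef CORENRN_SKETCH_OPENACC
    // Roughly the effect of acc_memcpy_to_device() for an already-mapped array.
    #pragma acc update device(a[0:n])
#else
    #pragma omp target update to(a[0:n])
#endif
}

// Toy kernel: the outer loop is parallel, the inner prefix sum carries a
// dependence across iterations and must stay sequential.
void kernel(double* a, std::size_t n, std::size_t m) {
#ifdef CORENRN_SKETCH_OPENACC
    #pragma acc parallel loop present(a[0:n*m])
    for (std::size_t i = 0; i < n; ++i) {
        #pragma acc loop seq
        for (std::size_t j = 1; j < m; ++j)
            a[i * m + j] += a[i * m + j - 1];
    }
#else
    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < n; ++i) {
        // No worksharing directive on the inner loop: each thread runs it
        // sequentially, the usual OpenMP stand-in for `acc loop seq`.
        for (std::size_t j = 1; j < m; ++j)
            a[i * m + j] += a[i * m + j - 1];
    }
#endif
}
```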
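For the Random123-in-OpenMP-offload item, a second, unverified sketch (the function name and mapping are made up; whether the Random123 headers compile cleanly for the device under `nvc++ -mp=gpu` is exactly what the task needs to find out). The idea is that philox is a pure, inline function of (counter, key), so wrapping the header in a declare-target region should in principle make it callable inside a target loop.

```cpp
#include <cstddef>
#include <cstdint>

// Wrap the header so the inline philox functions are compiled for the device too.
#pragma omp declare target
#include <Random123/philox.h>
#pragma omp end declare target

void fill_uniform(double* out, std::size_t n, std::uint32_t seed) {
    #pragma omp target teams distribute parallel for map(from: out[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        // Counter-based: element i is a pure function of (counter, key),
        // so no generator state needs to live on the device.
        philox4x32_ctr_t c = {{static_cast<std::uint32_t>(i), 0u, 0u, 0u}};
        philox4x32_key_t k = {{seed, 0u}};
        philox4x32_ctr_t r = philox4x32(c, k);  // default 10 rounds
        out[i] = r.v[0] * (1.0 / 4294967296.0); // map uint32 to [0, 1)
    }
}
```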
These are instructions to build + run on Ascent (`login1.ascent.olcf.ornl.gov`).
module load nvhpc/21.9 python/3.7.0 cmake flex bison
module swap cuda/10.1.243 cuda/11.0.2
module use /ccsopen/proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja py-pytest py-sympy py-jinja2 py-pyyaml boost
export NVLOCALRC=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/localrc
export PATH=/sw/ascent/gcc/6.4.0/bin:$PATH
Note that Boost is an optional dependency, but it has an apparently significant impact on both initialisation time and profiling time when using Nsight Compute.
# clone repository
cd $HOME
git clone --branch hackathon_main https://github.com/BlueBrain/CoreNeuron # or: git clone --branch hackathon_main [email protected]:BlueBrain/CoreNeuron.git
cd CoreNeuron
git checkout 96498142ce60d15edbd4c5161c7153f41eda6b20 # for the mentors
git submodule update -f --init --recursive
mkdir -p build && cd build
cmake .. -G Ninja \
  -DCORENRN_ENABLE_CALIPER_PROFILING=ON \
  -DCORENRN_ENABLE_GPU=ON \
  -DCORENRN_ENABLE_NMODL=ON \
  -DCORENRN_EXTERNAL_BENCHMARK_DATA=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/ \
  -DCMAKE_INSTALL_PREFIX=../install \
  -DCMAKE_CUDA_COMPILER=nvcc \
  -DCMAKE_CUDA_ARCHITECTURES=70 \
  -DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0 -gpu=cuda11.0" \
  -DCORENRN_ENABLE_UNIT_TESTS=OFF
cmake --build . --parallel 8
As $HOME is not writable, create your own directory inside the project directory:
mkdir -p /ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/$USER
cd /ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/$USER
Now run a tiny, functional test on the GPU:
- Allocate a node
bsub -P GEN170 -J neuron -W 90 -nnodes 1 -alloc_flags "gpudefault" -Is $SHELL
Note: `gpumps` here would enable MPS, but that is incompatible with the CUDA 11.0 version we are using here.
- Make sure the necessary modules are loaded:
module load nvhpc/21.9 python/3.7.0 cmake flex bison
module swap cuda/10.1.243 cuda/11.0.2
module use /ccsopen/proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja py-pytest py-sympy py-jinja2 py-pyyaml boost
- Run a simple functional test on the GPU:
OMP_NUM_THREADS=1 jsrun --gpu_per_rs 1 -n 1 $HOME/CoreNeuron/build/bin/ppc64le/special-core -e 1 -d $HOME/CoreNeuron/tests/integration/ring --gpu --mpi
- How to generate an Nsight Systems report with the simple ringtest:
module load nvhpc/21.9
module swap cuda cuda/11.0.2
module use /autofs/nccsopen-svm1_proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper
module load nsight-systems/2021.5.1.117
OMP_NUM_THREADS=1 jsrun -E LD_PRELOAD=/opt/ibm/spectrum_mpi/lib/pami_471/libpami.so --gpu_per_rs 1 -n 1 nsys profile --env-var NSYS_NVTX_PROFILER_REGISTER_ONLY=0,CALI_CONFIG=nvtx,OMP_NUM_THREADS=$OMP_NUM_THREADS --trace=cuda,nvtx,openacc,openmp --capture-range=nvtx --nvtx-capture=simulation -o ring_report $HOME/CoreNeuron/build/bin/ppc64le/special-core -e 1 -d $HOME/CoreNeuron/tests/integration/ring --gpu --mpi --cell-permute 2
- How to generate a full Nsight Compute report for a single kernel with the simple ringtest:
module load nvhpc/21.9
module swap cuda cuda/11.0.2
module use /autofs/nccsopen-svm1_proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper
module load nsight-compute/2021.2.2
OMP_NUM_THREADS=1 jsrun --gpu_per_rs 1 -n 1 ncu -k <kernel_name_you_want_to_analyze> -f --profile-from-start=off --target-processes all --set full -o ring_report $HOME/CoreNeuron/build/bin/ppc64le/special-core -e 1 -d $HOME/CoreNeuron/tests/integration/ring --gpu --mpi --cell-permute 2
For the purposes of the hackathon we have provided two datasets that cover the basic use cases for the coming days.
For running small tests you should use the following data set:
/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-1440-cells-1-rank
This data set has a circuit of 1440 cells and is meant to be simulated with 1 MPI rank on 1 GPU.
To run CoreNEURON with this data set you may use the following command:
NVCOMPILER_ACC_SYNCHRONOUS=1 OMP_NUM_THREADS=1 \
jsrun --gpu_per_rs 1 -n 1 $HOME/CoreNeuron/build/benchmark/ppc64le/special-core \
--datpath=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-1440-cells-1-rank \
--mpi --gpu --cell-permute=2 --tstop=5
For running the simulator to gather profiling or other performance measurements you should use the following data set:
/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-8800-cells-6-ranks
This data set resembles a production-level simulation with a circuit large enough to fill up most of the GPU memory (~14 GB). It consists of 8800 cells and is meant to be executed with 6 MPI ranks or 6 threads on 1 GPU. Note that MPS needs to be enabled for pure MPI execution, and MPS is not currently working on Ascent with CUDA 11.0, so we need to run this with 1 rank / 6 threads / 1 GPU.
To run CoreNEURON with this data set you may use the following command:
NVCOMPILER_ACC_SYNCHRONOUS=1 OMP_NUM_THREADS=6 \
jsrun -n 1 --cpu_per_rs 6 --tasks_per_rs 1 --gpu_per_rs 1 --bind rs $HOME/CoreNeuron/build/benchmark/ppc64le/special-core \
--datpath=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-8800-cells-6-ranks \
--mpi --gpu --cell-permute=2 --tstop=5
Note: The datasets were generated with 1k synapses/cell and 100 stimuli/cell. The numbers of synapses and stimuli are lower than in the original model.
- The system-wide modules only go up to NVHPC 21.7, which has known issues with NEURON. We have installed NVHPC 21.9 ourselves (module `nvidia/21.9`) into the Cray Programming Environment setup.
- Note that NVHPC 21.9 is configured to use the system GCC 7.5 standard library, while we have built some other dependencies with GCC 9.2. This seems to be "close enough".
- `-tp haswell` or `-DR123_USE_SSE=0` is required because `nvc++` defaults to `-tp zen` on the Perlmutter nodes, which defines `__ABM__` and causes Random123 to try and include `intrin.h`, which fails.
# clone repository
git clone --branch hackathon_main [email protected]:BlueBrain/CoreNeuron.git # or git clone --branch hackathon_main https://github.com/BlueBrain/CoreNeuron.git
cd CoreNeuron
git submodule update -f --init --recursive
mkdir -p build && cd build
# allocate node
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus 1 --account=ntrain9_g -n 1 --cpus-per-task 6
# Use our own hand-crafted modules for cuda 11.4 (from nvhpc/21.9) and PrgEnv-nvidia for nvhpc 21.9
module use /global/cfs/cdirs/ntrain9/neuron/modules
# Also, spack-generated modules for dependencies (caliper, ninja, py-*)
module use /global/cfs/cdirs/ntrain9/neuron/spack_modules/cray-sles15-zen2
# Load modules: prefer CUDA 11.4 from NVHPC/21.9
module load cudatoolkit/21.9_11.4
module load cmake nvidia/21.9 python caliper ninja py-pytest py-pyyaml py-jinja2 py-sympy boost/1.77.0
# Build CoreNEURON
cmake .. -G Ninja \
-DCORENRN_ENABLE_CALIPER_PROFILING=ON \
-DCORENRN_ENABLE_GPU=ON \
-DCORENRN_ENABLE_NMODL=ON \
-DCORENRN_EXTERNAL_BENCHMARK_DATA=$CFS/ntrain9/neuron/nersc-gpu-hackathon-dec-2021 \
-DCMAKE_INSTALL_PREFIX=../install \
-DCMAKE_CUDA_COMPILER=nvcc \
-DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0" \
-DCMAKE_CXX_COMPILER=CC \
-DCMAKE_CUDA_ARCHITECTURES=80 \
-DCORENRN_ENABLE_UNIT_TESTS=OFF
cmake --build . --parallel
ctest --output-on-failure -j 16 # optional
If you haven't already allocated a session:
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus 1 --account=ntrain9_g -n 1 --cpus-per-task 6
Note that we are setting `NVCOMPILER_ACC_SYNCHRONOUS=1` below so that we get correct timings for individual kernels. Otherwise kernel timings are incorrect due to async launch.
# NVCOMPILER_ACC_SYNCHRONOUS=1  # prepend if necessary
OMP_NUM_THREADS=6 \
CALI_CONFIG=runtime-report,calc.inclusive \
srun -n 1 benchmark/x86_64/special-core \
--datpath=$CFS/ntrain9/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-8800-cells-6-ranks \
--mpi --gpu --cell-permute=2 --tstop=100
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 de4e433 (2021-11-26 08:50:58 +0100)
Additional mechanisms from files
Ca.mod CaDynamics_DC0.mod CaDynamics_E2.mod Ca_HVA.mod Ca_HVA2.mod Ca_LVAst.mod CoreConfig.mod Ih.mod Im.mod K_Pst.mod K_Tst.mod KdShu2007.mod NaTa_t.mod NaTg.mod NaTs2_t.mod Nap_Et2.mod ProbAMPANMDA_EMS.mod ProbGABAAB_EMS.mod ProfileHelper.mod SK_E2.mod SKv3_1.mod TTXDynamicsSwitch.mod VecStim.mod cacumm.mod cacummb.mod cagk.mod cal2.mod can2.mod cat.mod exp2syn.mod expsyn.mod gap.mod h.mod halfgap.mod hh.mod kadist.mod kaprox.mod kca.mod kd.mod kd2.mod kdb.mod kdrbca1.mod kdrca1.mod km.mod kmb.mod na3n.mod naxn.mod netstim.mod netstim_inhpoisson.mod new_calcium_channels.mod passive.mod pattern.mod stim.mod svclmp.mod
num_mpi=6
num_omp_thread=1
....
Solver Time : 16.2698
Simulation Statistics
Number of cells: 1320
Number of compartments: 648360
Number of presyns: 3961320
Number of input presyns: 0
Number of synapses: 3960002
Number of point processes: 7921322
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 11969
Number of spikes with non negative gid-s: 11969
Path Min time/rank Max time/rank Avg time/rank Time %
main 40.376340 40.378758 40.377549 99.399072
checkpoint 0.000001 0.000002 0.000002 0.000004
output-spike 0.004200 0.004211 0.004205 0.010353
simulation 16.269795 16.269813 16.269804 40.052045
spike-exchange 0.019603 0.041487 0.030545 0.075194
spike-exchange 0.019578 0.041452 0.030515 0.075120
communication 0.000886 0.000906 0.000896 0.002206
imbalance 0.018654 0.040551 0.029603 0.072874
timestep 16.225238 16.247191 16.236214 39.969356
state-update 4.167419 4.176959 4.172189 10.270849
state-SKv3_1 0.084673 0.086011 0.085342 0.210090
state-SK_E2 0.086975 0.087625 0.087300 0.214910
state-ProbGABAAB_EMS 0.176391 0.182205 0.179298 0.441385
state-ProbAMPANMDA_EMS 0.512746 0.521091 0.516919 1.272520
state-nax 0.135361 0.136370 0.135866 0.334466
state-NaTg 0.121694 0.121713 0.121704 0.299603
state-Nap_Et2 0.091850 0.092020 0.091935 0.226320
state-na3 0.126686 0.127354 0.127020 0.312690
state-K_Tst 0.083087 0.083141 0.083114 0.204605
state-K_Pst 0.086288 0.086488 0.086388 0.212665
state-kmb 0.078860 0.079250 0.079055 0.194613
state-KdShu2007 0.077877 0.078125 0.078001 0.192018
state-kdr 0.094169 0.095289 0.094729 0.233198
state-kdrb 0.092489 0.092745 0.092617 0.227999
state-kdb 0.089317 0.089467 0.089392 0.220060
state-kd2 0.087586 0.092668 0.090127 0.221869
state-kca 0.363553 0.364731 0.364142 0.896423
state-kap 0.109070 0.109326 0.109198 0.268817
state-kad 0.112513 0.113974 0.113244 0.278776
state-Ih 0.098592 0.100036 0.099314 0.244485
state-hd 0.093146 0.094326 0.093736 0.230754
state-cat 0.121305 0.121554 0.121430 0.298928
state-can 0.121067 0.122852 0.121960 0.300233
state-Ca_LVAst 0.129896 0.130772 0.130334 0.320849
state-cal 0.108788 0.109597 0.109192 0.268804
state-Ca_HVA2 0.142881 0.143095 0.142988 0.351999
state-cagk 0.097391 0.098042 0.097716 0.240553
state-cacum 0.234768 0.235131 0.234950 0.578385
state-cacumb 0.175342 0.176663 0.176002 0.433273
state-IClamp 0.002348 0.002558 0.002453 0.006039
state-CaDynamics_DC0 0.113226 0.113724 0.113475 0.279346
state-pas 0.002896 0.003032 0.002964 0.007297
update 0.116486 0.117232 0.116859 0.287677
second-order-cur 0.001943 0.002165 0.002054 0.005056
matrix-solver 6.071176 6.123533 6.097355 15.010108
setup-tree-matrix 4.745456 4.748895 4.747175 11.686317
cur-SKv3_1 0.104343 0.105509 0.104926 0.258301
cur-SK_E2 0.097890 0.098530 0.098210 0.241768
cur-ProbGABAAB_EMS 0.252629 0.255678 0.254154 0.625660
cur-ProbAMPANMDA_EMS 0.903469 0.903876 0.903672 2.224608
cur-nax 0.113687 0.115076 0.114382 0.281578
cur-NaTg 0.113117 0.113994 0.113556 0.279544
cur-Nap_Et2 0.084684 0.084767 0.084725 0.208572
cur-na3 0.095460 0.096301 0.095880 0.236033
cur-K_Tst 0.076803 0.076846 0.076825 0.189122
cur-K_Pst 0.083091 0.083296 0.083193 0.204801
cur-kmb 0.084998 0.085193 0.085095 0.209483
cur-KdShu2007 0.079360 0.079602 0.079481 0.195662
cur-kdr 0.084818 0.085221 0.085019 0.209296
cur-kdrb 0.083571 0.083837 0.083704 0.206058
cur-kdb 0.074649 0.075234 0.074942 0.184487
cur-kd2 0.074704 0.075905 0.075305 0.185380
cur-kca 0.102095 0.102144 0.102119 0.251392
cur-kap 0.093734 0.094259 0.093996 0.231395
cur-kad 0.096892 0.096943 0.096917 0.238586
cur-Ih 0.099069 0.099392 0.099230 0.244280
cur-hd 0.079998 0.080359 0.080179 0.197379
cur-cat 0.103624 0.103911 0.103767 0.255449
cur-can 0.118858 0.119161 0.119010 0.292971
cur-Ca_LVAst 0.123353 0.126097 0.124725 0.307041
cur-cal 0.122882 0.123123 0.123003 0.302800
cur-Ca_HVA2 0.142775 0.143094 0.142934 0.351868
cur-cagk 0.122285 0.123171 0.122728 0.302125
cur-cacum 0.081801 0.082285 0.082043 0.201969
cur-cacumb 0.074288 0.075040 0.074664 0.183803
cur-IClamp 0.085479 0.088503 0.086991 0.214149
cur-CaDynamics_DC0 0.064449 0.064505 0.064477 0.158726
cur-ttx_ion 0.063089 0.063417 0.063253 0.155713
cur-ca_ion 0.100195 0.100529 0.100362 0.247065
cur-k_ion 0.063816 0.064154 0.063985 0.157515
cur-na_ion 0.066107 0.066272 0.066190 0.162941
cur-pas 0.109251 0.109293 0.109272 0.268999
deliver-events 0.982300 1.005465 0.993883 2.446681
net-receive-ProbGABAAB_EMS 0.000950 0.001073 0.001012 0.002490
net-receive-ProbAMPANMDA_EMS 0.003695 0.003919 0.003807 0.009372
net-buf-receive-ExpSyn 0.003570 0.003671 0.003621 0.008913
net-buf-receive-Exp2Syn 0.003701 0.003835 0.003768 0.009276
net-buf-receive-ProbGABAAB_EMS 0.068577 0.070341 0.069459 0.170990
net-buf-receive-ProbAMPANMDA_EMS 0.090366 0.094045 0.092206 0.226986
update-net-receive-buf 0.399891 0.406366 0.403129 0.992398
net-receive-buf-cpu2gpu 0.379896 0.385496 0.382696 0.942098
net-receive-buf-order 0.003375 0.003533 0.003454 0.008503
check-threshold 0.182114 0.184026 0.183070 0.450671
finitialize 2.094235 2.094272 2.094254 5.155510
spike-exchange 0.000035 0.105358 0.052696 0.129725
spike-exchange 0.000031 0.105354 0.052693 0.129715
communication 0.000019 0.000022 0.000020 0.000050
imbalance 0.000006 0.105331 0.052669 0.129656
cur-SKv3_1 0.000031 0.000032 0.000031 0.000078
cur-SK_E2 0.000029 0.000031 0.000030 0.000074
cur-ProbGABAAB_EMS 0.000072 0.000073 0.000073 0.000178
cur-ProbAMPANMDA_EMS 0.000265 0.000267 0.000266 0.000655
cur-nax 0.000034 0.000034 0.000034 0.000084
cur-NaTg 0.000032 0.000034 0.000033 0.000081
cur-Nap_Et2 0.000025 0.000026 0.000025 0.000063
cur-na3 0.000029 0.000030 0.000029 0.000073
cur-K_Tst 0.000023 0.000024 0.000024 0.000058
cur-K_Pst 0.000025 0.000026 0.000025 0.000063
cur-kmb 0.000026 0.000027 0.000027 0.000065
cur-KdShu2007 0.000024 0.000024 0.000024 0.000059
cur-kdr 0.000026 0.000026 0.000026 0.000064
cur-kdrb 0.000025 0.000025 0.000025 0.000062
cur-kdb 0.000023 0.000023 0.000023 0.000057
cur-kd2 0.000024 0.000024 0.000024 0.000059
cur-kca 0.000031 0.000032 0.000031 0.000078
cur-kap 0.000027 0.000028 0.000027 0.000068
cur-kad 0.000028 0.000030 0.000029 0.000071
cur-Ih 0.000029 0.000030 0.000029 0.000073
cur-hd 0.000024 0.000025 0.000024 0.000060
cur-cat 0.000030 0.000032 0.000031 0.000076
cur-can 0.000035 0.000035 0.000035 0.000086
cur-Ca_LVAst 0.000035 0.000036 0.000035 0.000087
cur-cal 0.000036 0.000037 0.000036 0.000090
cur-Ca_HVA2 0.000039 0.000041 0.000040 0.000098
cur-cagk 0.000037 0.000038 0.000037 0.000092
cur-cacum 0.000025 0.000026 0.000025 0.000063
cur-cacumb 0.000024 0.000025 0.000024 0.000060
cur-IClamp 0.000027 0.000028 0.000027 0.000068
cur-CaDynamics_DC0 0.000021 0.000022 0.000022 0.000053
cur-ttx_ion 0.000026 0.000026 0.000026 0.000064
cur-ca_ion 0.000028 0.000029 0.000029 0.000070
cur-k_ion 0.000018 0.000019 0.000018 0.000046
cur-na_ion 0.000022 0.000023 0.000022 0.000055
cur-pas 0.000034 0.000034 0.000034 0.000084
update-net-receive-buf 0.000021 0.000023 0.000022 0.000054
load-model 21.857876 21.860913 21.859395 53.812170
To produce a useful profile with Nsight Systems we need to configure Caliper to emit NVTX markers (`CALI_CONFIG=nvtx`) and tell Nsight Systems to record regions whose names are not registered strings (`NSYS_NVTX_PROFILER_REGISTER_ONLY=0`).
To avoid profiling model initialisation and setup, you may want to record only the actual simulation (`--capture-range=nvtx --nvtx-capture=simulation`).
Additionally, Nsight Systems seems to have trouble profiling multiple OpenMP host threads launching GPU kernels at once, so you may want to disable that (`OMP_NUM_THREADS=1`).
Taken together, an example prefix could be:
CALI_CONFIG=nvtx OMP_NUM_THREADS=1 nsys profile --env-var NSYS_NVTX_PROFILER_REGISTER_ONLY=0 --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --trace=cuda,nvtx,openacc,openmp --capture-range=nvtx --nvtx-capture=simulation ./x86_64/special-core ...
Some other notes:
- During development, just use the simple ring test (for quick iteration).
- Install nmodl master into the project space `$CFS/ntrain9/neuron` so that mentors and other people can just use a standard version for profiling or other non-codegen-related tasks.
- Repository and branch which include lots of tests of OpenACC and OpenMP interoperability: https://github.com/iomaganaris/openMP_GPU/tree/magkanar/example. Also see the code mentioned in https://forums.developer.nvidia.com/t/openacc-and-openmp-data-interoperability/181866/2.
- With the hackathon_main branch, if we build with `-DCMAKE_BUILD_TYPE=Debug`, `ring_permute2_TEST` and `ring_gap_permute2_TEST` are failing.