Try to speed up compilation of params under nvhpc #1007

Yurlungur · 2024-02-21T01:00:06Z

… performance I hope

PR Summary

For some time @bprather has been reporting very slow compilation times on nvhpc, with src/interface/params.cpp as the major bottleneck. I've seen the same thing when compiling Phoebus. The reason for this is that we have to compile every type that we want to read or write in params. And when I wrote that code, I enumerated quite a few we might want.

This is an attempt to reduce compile times on device by splitting out some of those compilations. I do this by explicitly instantiating some of the template specializations for parthenon::HDF5::HDF5ReadAttribute and parthenon::HDF5::HDF5WriteAttribute in separate files.

I only do this for the Kokkos::View types and the primitive types, and only for device memory. Only the kokkos views are needed, as under the hood that's how the ParArray*D read/writes happen. Only device views because if you compile with the serial backend, host and device have the same type, and the extern declaration fails. This is fine though---the device-backend ones are the ones that take a long time to compile anyway; and splitting things out into multiple files helps regardless.

@bprather is my guinea pig and will report improvements to compile times. Anecdotally, compiling with the serial backend on my desktop is slightly slower, but I didn't measure carefully. I suspect this is due to the fact that parthenon_hdf5.hpp actually contains quite a few new lines declaring extern void function<type> to tell the compiler that the template instantiations are available elsewhere.

@pdmullen , @pgrete and/or @forrestglines : I'd appreciate if you all would try this out downstream as well and let me know if compile times are improved.

PR Checklist

… performance I hope

bprather · 2024-02-21T16:30:37Z

Baseline: params.cpp as it stands in dev takes ~13:54 to compile on an ARM/Ampere node of Darwin, where I do a bunch of my testing. The numbers are slightly better on x86, but I think Darwin will illustrate the differences better.

This change reduces the overall compile time to ~6.5 minutes (to compile parthenon_hdf5_attributes_read.cpp) with an additional 5 minutes (params.cpp) able to be parallelized. These are still the two longest files to compile by a large margin, the next worst is swarm.cpp at 3-4 minutes or so. So, there's probably an additional factor of ~2 compile time improvement available here.

Note this is assuming enough compile jobs to process every file in Parthenon at once! The improvement if the two new files have to be serialized is negligible. Since NVHPC tends only to be required by Cray software stacks, i.e. on supercomputers, I think it's fair to fix this with an eye to an 80-process compilation -- just felt it needs mentioned this will likely slow down other compiles.

Yurlungur · 2024-02-23T01:03:00Z

@bprather can you try again? I reduced the number of types params compiles slightly and I also pre-instantiate several kokkos views. Also add -DPARTHENON_PRE_INSTANTIATE_KOKKOS_VIEWS=ON to your cmake line.

bprather · 2024-02-23T17:08:20Z

Good news and bad news. We're down to 4:30 or so for the attributes_read compile, so the overall KHARMA compile has gone from an initial 25 minutes to "just" 14:30. params.cpp is now below the next few files at a svelte 2:30 or so, whereas swarm.cpp and update.cpp still sit around 3:30.

However, with the PRE_INSTANTIATE flag, every other file in Parthenon now takes much longer to compile, on the order of ~2min for files which were previously 15 seconds or something. I think this is also true of some KHARMA files even. So, the overall timing output looks like:
old:

real    25m1.992s
user    113m8.034s

new:

real    14m34.853s
user    293m32.396s

Yurlungur · 2024-02-23T19:59:40Z

Good news and bad news. We're down to 4:30 or so for the attributes_read compile, so the overall KHARMA compile has gone from an initial 25 minutes to "just" 14:30. params.cpp is now below the next few files at a svelte 2:30 or so, whereas swarm.cpp and update.cpp still sit around 3:30.

However, with the PRE_INSTANTIATE flag, every other file in Parthenon now takes much longer to compile, on the order of ~2min for files which were previously 15 seconds or something. I think this is also true of some KHARMA files even. So, the overall timing output looks like: old:
real    25m1.992s
user    113m8.034s
new:
real    14m34.853s
user    293m32.396s

Ok one more tweak. I've shrunk the pre-instantiated views by a factor of 3 or so... and I also split up swarm.cpp. How's the build time now? Also can you try with and without PARTHENON_PRE_INSTANTIATE_KOKKOS_VIEWS just to see the impact of that option? It's possible that that just makes things worse and I should remove it.

…tiated.

Yurlungur · 2024-02-23T20:01:50Z

@brryan please give your OK for the split to swarms.cpp that I did

bprather · 2024-02-23T20:43:03Z

PRE_INSTANTIATE ON:

real    13m10.451s
user    294m27.067s

OFF:

real    13m18.610s
user    294m32.551s

So I was wrong thinking it was PRE_INSTANTIATE spreading the compiling work around, it must have been something else moved to a header in that round of changes...

Yurlungur · 2024-02-23T21:12:25Z

Hmm... Try now.

bprather · 2024-02-23T21:55:35Z

ON

real    13m21.531s
user    294m27.996s

OFF

real    13m7.706s
user    294m31.277s

Implying that any differences with/without pre-compile are within measurement error, and that the most recent changes are also within measurement error. I'm not sure the extra work wasn't in 77a4254, I can try that too.

Yurlungur · 2024-02-23T22:01:39Z

ON
real    13m21.531s
user    294m27.996s
OFF
real    13m7.706s
user    294m31.277s
Implying that any differences with/without pre-compile are within measurement error, and that the most recent changes are also within measurement error. I'm not sure the extra work wasn't in 77a4254, I can try that too.

Sure---please give that a shot. Not sure what's going on here.

…clude params

Yurlungur · 2024-02-23T22:51:00Z

Ok I found the issue and I think I resolved it. Problem was that many may lines of template class blah were getting included into basically every file through the following chain of includes:
parthenon_hdf5.hpp -> interface/params.hpp -> basically_every_file.cpp
To resolve, I split parthenon_hdf5.hpp into two files, one containing a minimum implementation of HDF5 stuff needed for the params header and for restarts.hpp. So the full evil parthenon_hdf5.hpp is only included in a select few implementation files.

Yurlungur · 2024-02-24T00:14:37Z

I still need to try on nvhpc but I'm seeing modest improvements on nvcc and no degredation in compile times for other files:

CPU, gcc, old

real	3m56.872s
user	46m40.763s

CPU, gcc, new

real	2m28.920s
user	47m54.812s

A100, nvcc, old

real	8m59.111s
user	116m49.777s

A100, nvcc, new

real	6m59.280s
user	117m34.953s

brryan

This will be a quality of life improvement for me downstream, thanks for looking at this @Yurlungur . Also no concerns from me about splitting up swarm.cpp, didn't realize that file took so much time to compile.

brryan · 2024-02-26T16:18:01Z

doc/sphinx/src/building.rst

+|| Kokkos\_ROOT                               || unset                         || String || Path to a Kokkos source directory (containing CMakeLists.txt)                                                                                               |
+|| PARTHENON\_IMPORT\_KOKKOS                  || ON/OFF                        || Option || If ON, attempt to link to an external Kokkos library. If OFF, build Kokkos from source and package with Parthenon                                           |
+|| BUILD\_SHARED\_LIBS                        || OFF                           || Option || If installing Parthenon, whether to build as shared rather than static                                                                                      |
+---------------------------------------------+--------------------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+


Weird that github can't (?) render this diff correctly since it should just be one added line

I think it's all lines because the first column increased in number of characters.

brryan · 2024-02-26T16:21:33Z

src/outputs/parthenon_hdf5.hpp

@@ -307,7 +211,7 @@ void HDF5ReadAttribute(hid_t location, const std::string &name, T &view) {
  auto *pdata = view.data();
  auto view_h = Kokkos::create_mirror_view(view);
  if constexpr (!std::is_same<typename T::memory_space, Kokkos::HostSpace>::value) {
-    Kokkos::deep_copy(view_h, view);
+    //Kokkos::deep_copy(view_h, view); // JMM: Pretty sure this deep copy is unnecessary


Are you sure? I thought Kokkos::create_mirror_view populated the new view with zeros and so the deep_copy is necessary

Alternatively, there's a Kokkos::create_mirror_view_and_copy which should handle the case automatically.

This call is made to check type before reading the param... so I believe a mirror view is required, but copying is not needed. That said, it doesn't hurt (other than maybe hurting performance/compile times slightly). I think if restart tests pass, we can assume we don't need this deep copy. But if you're nervous I'm happy to put it back in.

pgrete · 2024-02-29T09:19:17Z

General question: with the culprit pointing to all the different types around HDF5, should we expect another significant increase in compile time for other (say ADIOS2) outputs where we also eventually want to be able to store params?

pgrete · 2024-02-29T10:18:20Z

Here are some AthenaPK numbers for HIP on Frontier (using 16 processes to build):

baseline (`develop`)

real	6m1.186s
user	61m6.505s
sys	0m46.940s

this branch

real	5m30.193s
user	61m10.902s
sys	0m46.334s

this branch with `PRE_INSTANTIATE=ON`

real	6m42.812s
user	61m12.670s
sys	0m46.967s

Yurlungur · 2024-02-29T17:09:44Z

Broadly it looks like PRE_INSTANTIATE doesn't really help and sometimes hurts, so I'll remove it.

General question: with the culprit pointing to all the different types around HDF5, should we expect another significant increase in compile time for other (say ADIOS2) outputs where we also eventually want to be able to store params?

I think adding ADIOS2 will introduce an additional file that will be expensive to compile and will require similar care.

Yurlungur · 2024-03-01T02:22:59Z

@pgrete please approve given the discussion today.

jonahm-LANL added 2 commits February 20, 2024 17:14

pre-instantiate some parthenon attribute read/write functionality for…

aa39eff

… performance I hope

clang format

770fadd

Yurlungur added help wanted Extra attention is needed build configuration experiment Outcome unknown. Some testing/eval needed. labels Feb 21, 2024

Yurlungur requested review from bprather and pgrete February 21, 2024 01:00

Yurlungur self-assigned this Feb 21, 2024

CHANGELOG

77a4254

jonahm-LANL added 2 commits February 22, 2024 17:24

add new compile option to pre-instantiate kokkos views

ee92f26

only do for float and double

cb39824

split up swarm.cpp and shrink the number of views that are pre-instan…

f43bca9

…tiated.

Yurlungur requested a review from brryan February 23, 2024 20:01

fix formatting in tasking docs

0392d43

don't pre-compile ParArrayGeneric

03e7c98

split the parthenon hdf5 header to reduce LOC included whenever we in…

6ff098f

…clude params

remove extranoues deep copy

d7304e1

brryan approved these changes Feb 26, 2024

View reviewed changes

jonahm-LANL added 3 commits February 29, 2024 19:15

remove pre-instantiation of views

bf282a7

explanatory comment

9c500f8

oops lint fail

4667b32

Yurlungur enabled auto-merge March 1, 2024 02:23

pgrete approved these changes Mar 1, 2024

View reviewed changes

Yurlungur merged commit 0aa104d into develop Mar 1, 2024
49 checks passed

Yurlungur mentioned this pull request Mar 4, 2024

[trivial] [hotfix] Remove some code that somehow got copied #1012

Merged

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Try to speed up compilation of params under nvhpc #1007

Try to speed up compilation of params under nvhpc #1007

Yurlungur commented Feb 21, 2024 •

edited

Loading

bprather commented Feb 21, 2024

Yurlungur commented Feb 23, 2024

bprather commented Feb 23, 2024

Yurlungur commented Feb 23, 2024

Yurlungur commented Feb 23, 2024

bprather commented Feb 23, 2024

Yurlungur commented Feb 23, 2024

bprather commented Feb 23, 2024

Yurlungur commented Feb 23, 2024

Yurlungur commented Feb 23, 2024

Yurlungur commented Feb 24, 2024

brryan left a comment

brryan Feb 26, 2024

pgrete Feb 29, 2024

brryan Feb 26, 2024

pgrete Feb 29, 2024

Yurlungur Feb 29, 2024

pgrete commented Feb 29, 2024

pgrete commented Feb 29, 2024

Yurlungur commented Feb 29, 2024

Yurlungur commented Mar 1, 2024

Try to speed up compilation of params under nvhpc #1007

Try to speed up compilation of params under nvhpc #1007

Conversation

Yurlungur commented Feb 21, 2024 • edited Loading

PR Summary

PR Checklist

bprather commented Feb 21, 2024

Yurlungur commented Feb 23, 2024

bprather commented Feb 23, 2024

Yurlungur commented Feb 23, 2024

Yurlungur commented Feb 23, 2024

bprather commented Feb 23, 2024

Yurlungur commented Feb 23, 2024

bprather commented Feb 23, 2024

Yurlungur commented Feb 23, 2024

Yurlungur commented Feb 23, 2024

Yurlungur commented Feb 24, 2024

CPU, gcc, old

CPU, gcc, new

A100, nvcc, old

A100, nvcc, new

brryan left a comment

Choose a reason for hiding this comment

brryan Feb 26, 2024

Choose a reason for hiding this comment

pgrete Feb 29, 2024

Choose a reason for hiding this comment

brryan Feb 26, 2024

Choose a reason for hiding this comment

pgrete Feb 29, 2024

Choose a reason for hiding this comment

Yurlungur Feb 29, 2024

Choose a reason for hiding this comment

pgrete commented Feb 29, 2024

pgrete commented Feb 29, 2024

baseline (develop)

this branch

this branch with PRE_INSTANTIATE=ON

Yurlungur commented Feb 29, 2024

Yurlungur commented Mar 1, 2024

Yurlungur commented Feb 21, 2024 •

edited

Loading

baseline (`develop`)

this branch with `PRE_INSTANTIATE=ON`