Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to speed up compilation of params under nvhpc #1007

Merged
merged 13 commits into from
Mar 1, 2024

Conversation

Yurlungur
Copy link
Collaborator

@Yurlungur Yurlungur commented Feb 21, 2024

… performance I hope

PR Summary

For some time @bprather has been reporting very slow compilation times on nvhpc, with src/interface/params.cpp as the major bottleneck. I've seen the same thing when compiling Phoebus. The reason for this is that we have to compile every type that we want to read or write in params. And when I wrote that code, I enumerated quite a few we might want.

This is an attempt to reduce compile times on device by splitting out some of those compilations. I do this by explicitly instantiating some of the template specializations for parthenon::HDF5::HDF5ReadAttribute and parthenon::HDF5::HDF5WriteAttribute in separate files.

I only do this for the Kokkos::View types and the primitive types, and only for device memory. Only the kokkos views are needed, as under the hood that's how the ParArray*D read/writes happen. Only device views because if you compile with the serial backend, host and device have the same type, and the extern declaration fails. This is fine though---the device-backend ones are the ones that take a long time to compile anyway; and splitting things out into multiple files helps regardless.

@bprather is my guinea pig and will report improvements to compile times. Anecdotally, compiling with the serial backend on my desktop is slightly slower, but I didn't measure carefully. I suspect this is due to the fact that parthenon_hdf5.hpp actually contains quite a few new lines declaring extern void function<type> to tell the compiler that the template instantiations are available elsewhere.

@pdmullen , @pgrete and/or @forrestglines : I'd appreciate if you all would try this out downstream as well and let me know if compile times are improved.

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@Yurlungur Yurlungur added help wanted Extra attention is needed build configuration experiment Outcome unknown. Some testing/eval needed. labels Feb 21, 2024
@Yurlungur Yurlungur self-assigned this Feb 21, 2024
@bprather
Copy link
Collaborator

Baseline: params.cpp as it stands in dev takes ~13:54 to compile on an ARM/Ampere node of Darwin, where I do a bunch of my testing. The numbers are slightly better on x86, but I think Darwin will illustrate the differences better.

This change reduces the overall compile time to ~6.5 minutes (to compile parthenon_hdf5_attributes_read.cpp) with an additional 5 minutes (params.cpp) able to be parallelized. These are still the two longest files to compile by a large margin, the next worst is swarm.cpp at 3-4 minutes or so. So, there's probably an additional factor of ~2 compile time improvement available here.

Note this is assuming enough compile jobs to process every file in Parthenon at once! The improvement if the two new files have to be serialized is negligible. Since NVHPC tends only to be required by Cray software stacks, i.e. on supercomputers, I think it's fair to fix this with an eye to an 80-process compilation -- just felt it needs mentioned this will likely slow down other compiles.

@Yurlungur
Copy link
Collaborator Author

@bprather can you try again? I reduced the number of types params compiles slightly and I also pre-instantiate several kokkos views. Also add -DPARTHENON_PRE_INSTANTIATE_KOKKOS_VIEWS=ON to your cmake line.

@bprather
Copy link
Collaborator

Good news and bad news. We're down to 4:30 or so for the attributes_read compile, so the overall KHARMA compile has gone from an initial 25 minutes to "just" 14:30. params.cpp is now below the next few files at a svelte 2:30 or so, whereas swarm.cpp and update.cpp still sit around 3:30.

However, with the PRE_INSTANTIATE flag, every other file in Parthenon now takes much longer to compile, on the order of ~2min for files which were previously 15 seconds or something. I think this is also true of some KHARMA files even. So, the overall timing output looks like:
old:

real    25m1.992s
user    113m8.034s

new:

real    14m34.853s
user    293m32.396s

@Yurlungur
Copy link
Collaborator Author

Good news and bad news. We're down to 4:30 or so for the attributes_read compile, so the overall KHARMA compile has gone from an initial 25 minutes to "just" 14:30. params.cpp is now below the next few files at a svelte 2:30 or so, whereas swarm.cpp and update.cpp still sit around 3:30.

However, with the PRE_INSTANTIATE flag, every other file in Parthenon now takes much longer to compile, on the order of ~2min for files which were previously 15 seconds or something. I think this is also true of some KHARMA files even. So, the overall timing output looks like: old:

real    25m1.992s
user    113m8.034s

new:

real    14m34.853s
user    293m32.396s

Ok one more tweak. I've shrunk the pre-instantiated views by a factor of 3 or so... and I also split up swarm.cpp. How's the build time now? Also can you try with and without PARTHENON_PRE_INSTANTIATE_KOKKOS_VIEWS just to see the impact of that option? It's possible that that just makes things worse and I should remove it.

@Yurlungur
Copy link
Collaborator Author

@brryan please give your OK for the split to swarms.cpp that I did

@bprather
Copy link
Collaborator

PRE_INSTANTIATE ON:

real    13m10.451s
user    294m27.067s

OFF:

real    13m18.610s
user    294m32.551s

So I was wrong thinking it was PRE_INSTANTIATE spreading the compiling work around, it must have been something else moved to a header in that round of changes...

@Yurlungur
Copy link
Collaborator Author

Hmm... Try now.

@bprather
Copy link
Collaborator

ON

real    13m21.531s
user    294m27.996s

OFF

real    13m7.706s
user    294m31.277s

Implying that any differences with/without pre-compile are within measurement error, and that the most recent changes are also within measurement error. I'm not sure the extra work wasn't in 77a4254, I can try that too.

@Yurlungur
Copy link
Collaborator Author

ON

real    13m21.531s
user    294m27.996s

OFF

real    13m7.706s
user    294m31.277s

Implying that any differences with/without pre-compile are within measurement error, and that the most recent changes are also within measurement error. I'm not sure the extra work wasn't in 77a4254, I can try that too.

Sure---please give that a shot. Not sure what's going on here.

@Yurlungur
Copy link
Collaborator Author

Ok I found the issue and I think I resolved it. Problem was that many may lines of template class blah were getting included into basically every file through the following chain of includes:
parthenon_hdf5.hpp -> interface/params.hpp -> basically_every_file.cpp
To resolve, I split parthenon_hdf5.hpp into two files, one containing a minimum implementation of HDF5 stuff needed for the params header and for restarts.hpp. So the full evil parthenon_hdf5.hpp is only included in a select few implementation files.

@Yurlungur
Copy link
Collaborator Author

I still need to try on nvhpc but I'm seeing modest improvements on nvcc and no degredation in compile times for other files:

CPU, gcc, old

real	3m56.872s
user	46m40.763s

CPU, gcc, new

real	2m28.920s
user	47m54.812s

A100, nvcc, old

real	8m59.111s
user	116m49.777s

A100, nvcc, new

real	6m59.280s
user	117m34.953s

Copy link
Collaborator

@brryan brryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a quality of life improvement for me downstream, thanks for looking at this @Yurlungur . Also no concerns from me about splitting up swarm.cpp, didn't realize that file took so much time to compile.

|| Kokkos\_ROOT || unset || String || Path to a Kokkos source directory (containing CMakeLists.txt) |
|| PARTHENON\_IMPORT\_KOKKOS || ON/OFF || Option || If ON, attempt to link to an external Kokkos library. If OFF, build Kokkos from source and package with Parthenon |
|| BUILD\_SHARED\_LIBS || OFF || Option || If installing Parthenon, whether to build as shared rather than static |
+---------------------------------------------+--------------------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird that github can't (?) render this diff correctly since it should just be one added line

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's all lines because the first column increased in number of characters.

@@ -307,7 +211,7 @@ void HDF5ReadAttribute(hid_t location, const std::string &name, T &view) {
auto *pdata = view.data();
auto view_h = Kokkos::create_mirror_view(view);
if constexpr (!std::is_same<typename T::memory_space, Kokkos::HostSpace>::value) {
Kokkos::deep_copy(view_h, view);
//Kokkos::deep_copy(view_h, view); // JMM: Pretty sure this deep copy is unnecessary
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? I thought Kokkos::create_mirror_view populated the new view with zeros and so the deep_copy is necessary

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, there's a Kokkos::create_mirror_view_and_copy which should handle the case automatically.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call is made to check type before reading the param... so I believe a mirror view is required, but copying is not needed. That said, it doesn't hurt (other than maybe hurting performance/compile times slightly). I think if restart tests pass, we can assume we don't need this deep copy. But if you're nervous I'm happy to put it back in.

@pgrete
Copy link
Collaborator

pgrete commented Feb 29, 2024

General question: with the culprit pointing to all the different types around HDF5, should we expect another significant increase in compile time for other (say ADIOS2) outputs where we also eventually want to be able to store params?

@pgrete
Copy link
Collaborator

pgrete commented Feb 29, 2024

Here are some AthenaPK numbers for HIP on Frontier (using 16 processes to build):

baseline (develop)

real	6m1.186s
user	61m6.505s
sys	0m46.940s

this branch

real	5m30.193s
user	61m10.902s
sys	0m46.334s

this branch with PRE_INSTANTIATE=ON

real	6m42.812s
user	61m12.670s
sys	0m46.967s

@Yurlungur
Copy link
Collaborator Author

Broadly it looks like PRE_INSTANTIATE doesn't really help and sometimes hurts, so I'll remove it.

General question: with the culprit pointing to all the different types around HDF5, should we expect another significant increase in compile time for other (say ADIOS2) outputs where we also eventually want to be able to store params?

I think adding ADIOS2 will introduce an additional file that will be expensive to compile and will require similar care.

@Yurlungur
Copy link
Collaborator Author

@pgrete please approve given the discussion today.

@Yurlungur Yurlungur enabled auto-merge March 1, 2024 02:23
@Yurlungur Yurlungur merged commit 0aa104d into develop Mar 1, 2024
49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build configuration experiment Outcome unknown. Some testing/eval needed. help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants