Try to speed up compilation of params under nvhpc #1007
… performance I hope
Baseline: This change reduces the overall compile time to ~6.5 minutes (to compile Note this is assuming enough compile jobs to process every file in Parthenon at once! The improvement if the two new files have to be serialized is negligible. Since NVHPC tends only to be required by Cray software stacks, i.e. on supercomputers, I think it's fair to fix this with an eye to an 80-process compilation; I just felt it needed to be mentioned that this will likely slow down other compiles.
@bprather can you try again? I reduced the number of types params compiles slightly and I also pre-instantiate several Kokkos views. Also add
Good news and bad news. We're down to 4:30 or so for the However, with the
new:
Ok one more tweak. I've shrunk the pre-instantiated views by a factor of 3 or so... and I also split up swarm.cpp. How's the build time now? Also can you try with and without
@brryan please give your OK for the split to
OFF:
So I was wrong thinking it was
Hmm... Try now.
ON
OFF
Implying that any differences with/without pre-compile are within measurement error, and that the most recent changes are also within measurement error. I'm not sure the extra work wasn't in 77a4254; I can try that too.
Sure, please give that a shot. Not sure what's going on here.
Ok I found the issue and I think I resolved it. Problem was that many, many lines of
I still need to try on CPU, gcc, old
CPU, gcc, new
A100, nvcc, old
A100, nvcc, new
This will be a quality of life improvement for me downstream, thanks for looking at this @Yurlungur. Also no concerns from me about splitting up swarm.cpp; I didn't realize that file took so much time to compile.
doc/sphinx/src/building.rst
| Kokkos\_ROOT | unset | String | Path to a Kokkos source directory (containing CMakeLists.txt) |
| PARTHENON\_IMPORT\_KOKKOS | ON/OFF | Option | If ON, attempt to link to an external Kokkos library. If OFF, build Kokkos from source and package with Parthenon |
| BUILD\_SHARED\_LIBS | OFF | Option | If installing Parthenon, whether to build as shared rather than static |
+---------------------------------------------+--------------------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
Weird that github can't (?) render this diff correctly since it should just be one added line
I think it's all lines because the first column increased in number of characters.
src/outputs/parthenon_hdf5.hpp
@@ -307,7 +211,7 @@ void HDF5ReadAttribute(hid_t location, const std::string &name, T &view) {
   auto *pdata = view.data();
   auto view_h = Kokkos::create_mirror_view(view);
   if constexpr (!std::is_same<typename T::memory_space, Kokkos::HostSpace>::value) {
-    Kokkos::deep_copy(view_h, view);
+    //Kokkos::deep_copy(view_h, view); // JMM: Pretty sure this deep copy is unnecessary
Are you sure? I thought Kokkos::create_mirror_view populated the new view with zeros, and so the deep_copy is necessary.
Alternatively, there's a Kokkos::create_mirror_view_and_copy which should handle the case automatically.
This call is made to check type before reading the param... so I believe a mirror view is required, but copying is not needed. That said, it doesn't hurt (other than maybe hurting performance/compile times slightly). I think if restart tests pass, we can assume we don't need this deep copy. But if you're nervous I'm happy to put it back in.
General question: with the culprit pointing to all the different types around HDF5, should we expect another significant increase in compile time for other (say ADIOS2) outputs where we also eventually want to be able to store
Here are some AthenaPK numbers for HIP on Frontier (using 16 processes to build): baseline (
Broadly it looks like
I think adding
@pgrete please approve given the discussion today.
PR Summary
For some time @bprather has been reporting very slow compilation times on nvhpc, with src/interface/params.cpp as the major bottleneck. I've seen the same thing when compiling Phoebus. The reason is that we have to compile every type that we want to read or write in params, and when I wrote that code I enumerated quite a few we might want.
This is an attempt to reduce compile times on device by splitting out some of those compilations. I do this by explicitly instantiating some of the template specializations for parthenon::HDF5::HDF5ReadAttribute and parthenon::HDF5::HDF5WriteAttribute in separate files.
I only do this for the Kokkos::View types and the primitive types, and only for device memory. Only the Kokkos views are needed because, under the hood, that's how the ParArray*D reads and writes happen. Only device views are covered because, if you compile with the serial backend, host and device have the same type and the extern declaration fails. This is fine, though: the device-backend ones are the ones that take a long time to compile anyway, and splitting things out into multiple files helps regardless.
@bprather is my guinea pig and will report improvements to compile times. Anecdotally, compiling with the serial backend on my desktop is slightly slower, but I didn't measure carefully. I suspect this is because parthenon_hdf5.hpp now contains quite a few new lines declaring extern void function<type> to tell the compiler that the template instantiations are available elsewhere.
@pdmullen, @pgrete and/or @forrestglines: I'd appreciate it if you all would try this out downstream as well and let me know if compile times are improved.
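For readers unfamiliar with the technique, here is a minimal sketch of explicit template instantiation with extern declarations in standard C++. It is not the Parthenon code: AttributeBytes, TotalBytes, and the file names in the comments are hypothetical stand-ins, and all three pieces are shown in one translation unit only so the sketch compiles on its own. In a real project the definitions would sit in their own source files so they can build in parallel.

```cpp
#include <cstddef>
#include <vector>

// ---- Would live in a header (hypothetical: attr.hpp) ----
// A stand-in for a heavily templated read/write routine.
template <typename T>
std::size_t AttributeBytes(const std::vector<T> &data) {
  return data.size() * sizeof(T);
}

// Explicit instantiation declarations: every file that includes the header
// is told NOT to instantiate these specializations itself, because a
// definition is provided elsewhere.
extern template std::size_t AttributeBytes<int>(const std::vector<int> &);
extern template std::size_t AttributeBytes<double>(const std::vector<double> &);

// ---- Would live in a separate source file (hypothetical: attr_inst.cpp) ----
// Explicit instantiation definitions: the specializations are compiled
// exactly once, here.
template std::size_t AttributeBytes<int>(const std::vector<int> &);
template std::size_t AttributeBytes<double>(const std::vector<double> &);

// ---- Would live in a caller (hypothetical: params_like.cpp) ----
// Uses the specializations without re-instantiating them.
std::size_t TotalBytes() {
  std::vector<int> a(4);
  std::vector<double> b(2);
  return AttributeBytes(a) + AttributeBytes(b);
}
```

The trade-off matches what is described above: each including file still has to parse the extern lines, so a serial build can get slightly slower, while a highly parallel build benefits from the instantiations being spread over several files.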
PR Checklist