Merge ompoffload devturbo#2
Conversation
…me + updates test (NOAA-GFDL#1150)
…ead of the module variable (NOAA-GFDL#1160)
…s + minor bug fixes (NOAA-GFDL#1176)
…l + subregional files update (NOAA-GFDL#1261)
* add multi gpu support * address review comments, add helpful comment for the acc/mp runbtime call
To enable this, had to be removed - otherwise segfaults happen on the GPU.
This reverts commit 0cc2a77. Having both the CPU and GPU OpenMP directives compiled caused a significant slowdown in GPU packing/unpacking performance - even if parallelism is controlled using OpenMP "if" clause.
Some very minor changes to the OpenMP target MPI PR: * use_device_ptr -> use_device_addr This appears to be the updated form (or at least nvfortran says it is) * Whitespace added to `!$ use omp_lib` Does not seem crucial but from our previous discussion it appears more correct. * Removal of some trailing whitespace.
This patch refactors several lines to keep within the 121-character line length limit prescribed by the FMS style guidelines.
The no-comm (no MPI) interface has been updated to support the new omp_offload argument.
This ensures that (un)packing steps in do_group_update is performed with openmp cpu parallelism if ompoffload=.false.. Previously it would only do serial. This is implemented by undefining the GPU macro (currently __NVCOMPILER_OPENMP_GPU) and re-including the (un)packing files. To make this work, the default(shared) was used in all the relevant OpenMP directives. If default(none) is used, the loops would hang or segfault.
|
@mnlevy1981 Wow, that is a lot of changes! Would it be possible to identify changes made by Ed to support the GPU work versus other changes? I started to look at the changed files but quickly became overwhelmed by changes that had nothing to do with GPU enablement. |
|
@mnlevy1981 I just looked at all of the commit messages in this PR. Only the last 13 or 14 appear to have been performed by Ed and Jorge. Can we cherry-pick those changes to the FMS2? |
We could, or we could merge |
|
ugh, I just noticed what repo I'm in... I guess we'd need to bring FMS up to date with |
* add gpu2gpu mpi transer with flag for do_group_update * add missing collapse(3) clauses * Use __NVCOMPILER macro for target regions * add back old omp directive wrapped in #ifndef __NVCOMPILER * port remaining un/pack loops * add multi gpu support (#2) * add multi gpu support * address review comments, add helpful comment for the acc/mp runbtime call * sub __NVCOMPILER with __NVCOMPILER_OPENMP_GPU * allow choice of gpu or cpu parallel To enable this, had to be removed - otherwise segfaults happen on the GPU. * fix omp set device call * Revert "allow choice of gpu or cpu parallel" This reverts commit 0cc2a77. Having both the CPU and GPU OpenMP directives compiled caused a significant slowdown in GPU packing/unpacking performance - even if parallelism is controlled using OpenMP "if" clause. * OMP MPI: Minor cleanups Some very minor changes to the OpenMP target MPI PR: * use_device_ptr -> use_device_addr This appears to be the updated form (or at least nvfortran says it is) * Whitespace added to `!$ use omp_lib` Does not seem crucial but from our previous discussion it appears more correct. * Removal of some trailing whitespace. * OMP target MPI: line length compliance This patch refactors several lines to keep within the 121-character line length limit prescribed by the FMS style guidelines. * OMP MPI: Update nocomm interface The no-comm (no MPI) interface has been updated to support the new omp_offload argument. * use openmp cpu if ompoffload=.false. This ensures that (un)packing steps in do_group_update is performed with openmp cpu parallelism if ompoffload=.false.. Previously it would only do serial. This is implemented by undefining the GPU macro (currently __NVCOMPILER_OPENMP_GPU) and re-including the (un)packing files. To make this work, the default(shared) was used in all the relevant OpenMP directives. If default(none) is used, the loops would hang or segfault. * Linting clean-up Removed trailing whitespace, replaced tabs with spaces, and kept all lines at 121 lines or fewer * One more lint clean-up Missed one of the long lines in cesm_constants.fh * One more linting commit I thought I shortened all the long lines in cesm_constants.fh but my awk command to find the line length was off-by-one * Update github actions Use a container with gcc 15.1.0 * Turn off autoconf CI testing Also, don't define CESM_CONSTANTS macro (will let us drop CESM share code from the turbo-stack submodules, and is bit-for-bit) * Update containers for intel and coupler CI tests * Drop FMS coupler CI test I think our version of FMS is much older than what the FMS coupler is expecting, so tests are failing... but we don't want to update our branch to include newer FMS code so we should drop the test instead * Drop YAML from configure.ac * Drop SKIP_PARSER_TESTS from makefiles I'm trying to make FMS configure look more like TIM configure, and some I removed some stuff from configure.ac that is now breaking Makefiles * Add mppnccombine.c to diag_manager makefile This c file is included in the TIM version of Makefile.ac but not the FMS2 version, and may be responsible for failing CI build * Remove bad test test_gather2DV does not exist in TIM or in more recent FMS commits, and it has some code that doesn't compile... I suspect that's related to why the test was removed in subsequent PRs * Increase filename size test_simple_domain.nc does not fit into character(len=20); increased to len=25 just to have some breathing room * Revert "Drop SKIP_PARSER_TESTS from makefiles" This reverts commit b018921. * Revert "Drop YAML from configure.ac" This reverts commit f88a095. --------- Co-authored-by: Edward Yang <edward_yang_125@hotmail.com> Co-authored-by: Edward Yang <edward.yang@anu.edu.au> Co-authored-by: Jorge Luis Gálvez Vallejo <jorgegalvez1694@gmail.com> Co-authored-by: Marshall Ward <marshall.ward@noaa.gov>
Description
Merged the
ompoffloadbranch from @edoyango intodev/turbo. I have verified that it has built (my initial conflict-resolution was bad and it didn't built at first :), and I can run the code in TURBO-ESM/MOM6#20 with this version of FMS.How Has This Been Tested?
I built an updated version of MOM6 with this FMS branch, the
double_gyreexample runs and is bit-for-bit with what is currently available on turbo-stack