Cherrypick ompoffload devturbo take2#4
Conversation
* add multi gpu support * address review comments, add helpful comment for the acc/mp runbtime call
To enable this, had to be removed - otherwise segfaults happen on the GPU.
This reverts commit 0cc2a77. Having both the CPU and GPU OpenMP directives compiled caused a significant slowdown in GPU packing/unpacking performance - even if parallelism is controlled using OpenMP "if" clause.
Some very minor changes to the OpenMP target MPI PR: * use_device_ptr -> use_device_addr This appears to be the updated form (or at least nvfortran says it is) * Whitespace added to `!$ use omp_lib` Does not seem crucial but from our previous discussion it appears more correct. * Removal of some trailing whitespace.
This patch refactors several lines to keep within the 121-character line length limit prescribed by the FMS style guidelines.
The no-comm (no MPI) interface has been updated to support the new omp_offload argument.
This ensures that (un)packing steps in do_group_update is performed with openmp cpu parallelism if ompoffload=.false.. Previously it would only do serial. This is implemented by undefining the GPU macro (currently __NVCOMPILER_OPENMP_GPU) and re-including the (un)packing files. To make this work, the default(shared) was used in all the relevant OpenMP directives. If default(none) is used, the loops would hang or segfault.
Removed trailing whitespace, replaced tabs with spaces, and kept all lines at 121 lines or fewer
Missed one of the long lines in cesm_constants.fh
There was a problem hiding this comment.
Pull request overview
This PR cherry-picks the ompoffload-related changes onto dev/turbo, adding an optional OpenMP-offload path for mpp_transmit (MPI) and domain group-update pack/unpack, plus a set of whitespace/formatting cleanups across diagnostics, NetCDF IO, and constants.
Changes:
- Add
omp_offloadoptional plumbing through scalarmpp_send/mpp_recvand into the MPIMPP_TRANSMIT_implementation using!$omp target data use_device_addr(...). - Add OpenMP-offloaded pack/unpack implementations for group updates (via
group_update_pack.inc/group_update_unpack.inc) and integrate them intoMPP_DO_GROUP_UPDATE_. - Apply minor formatting/whitespace cleanup in
diag_manager,netcdf_io, and constants headers.
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| mpp/include/mpp_transmit.inc | Adds omp_offload optional to scalar send/recv wrappers and forwards it into mpp_transmit. |
| mpp/include/mpp_transmit_nocomm.fh | Extends MPP_TRANSMIT_ signature to accept omp_offload (unused in nocomm). |
| mpp/include/mpp_transmit_mpi.fh | Uses omp_offload to optionally wrap MPI ISEND/IRECV with use_device_addr in OpenMP target data regions. |
| mpp/include/mpp_group_update.fh | Adds omp_offload option to MPP_DO_GROUP_UPDATE_ and toggles inclusion of GPU vs CPU pack/unpack implementations. |
| mpp/include/group_update_pack.inc | Implements GPU offloaded packing loops under __NVCOMPILER_OPENMP_GPU. |
| mpp/include/group_update_unpack.inc | Implements GPU offloaded unpacking loops under __NVCOMPILER_OPENMP_GPU. |
| mpp/include/mpp_comm_mpi.inc | Sets OpenMP/OpenACC default device during mpp_init to support multi-GPU. |
| fms2_io/netcdf_io.F90 | Whitespace cleanup only. |
| diag_manager/mppnccombine.c | Whitespace cleanup only. |
| diag_manager/diag_util.F90 | Minor formatting cleanup (line splitting/whitespace). |
| diag_manager/diag_manager.F90 | Minor formatting cleanup (continuations/whitespace). |
| constants/cesm_constants.fh | Formatting and comment wrapping cleanup. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| call mpp_clock_begin(group_recv_clock) | ||
| #ifdef __NVCOMPILER_OPENMP_GPU | ||
| !$omp target enter data map(alloc: buffer) if(use_device_ptr) | ||
| #endif |
| !$omp target exit data map(release: buffer) if(use_device_ptr) | ||
| #else | ||
| #include <group_update_unpack.inc> | ||
| #endif | ||
| endif |
| !$omp target teams distribute parallel do collapse(3) default(shared) & | ||
| !$omp private(i,j,k,idx) shared(ksize,js,je,is,ie,pos,nj,ni,ptr_fieldx,ptr) & | ||
| !$omp map(to: buffer(pos+1:pos+ksize*nj*ni)) & | ||
| !$omp map(from: fieldx(is:ie,js:je,1:ksize)) if(use_device_ptr) | ||
| #endif |
| !$omp target teams distribute parallel do collapse(3) default(shared) & | ||
| !$omp private(i,j,k,idx) shared(ksize,js,je,is,ie,pos,nj,ni,ptr_fieldy,ptr) & | ||
| !$omp map(to: buffer(pos+1:pos+ksize*nj*ni)) & | ||
| !$omp map(from: fieldy(is:ie,js:je,1:ksize)) if(use_device_ptr) | ||
| #endif |
| !$omp target teams distribute parallel do collapse(3) if(use_device_ptr) default(shared) & | ||
| !$omp private(i,j,k,idx) shared(ksize,js,je,is,ie,pos,nj,ni,ptr_field,ptr) & | ||
| !$omp map(to: buffer(pos+1:pos+ksize*nj*ni)) & | ||
| !$omp map(from: field(is:ie,js:je,1:ksize)) |
There was a problem hiding this comment.
In the offload path, map(to: buffer(...)) will copy the host buffer slice to the device at region entry. If MPI has already received into the device-mapped buffer (via mpp_recv(..., omp_offload=.true.) + use_device_addr), this copy-in can overwrite the freshly received device data before unpacking. Use a no-copy mapping (e.g., map(present: buffer(...)) / use_device_addr / avoid mapping the already-present buffer).
| !$omp target teams distribute parallel do collapse(2) default(shared) & | ||
| !$omp private(i,j,idx) shared(k,js,je,is,ie,pos,ni,ptr_field,ptr) & | ||
| !$omp map(to: buffer(pos+1:pos+nj*ni)) map(from: field(is:ie,js:je,k)) if(use_device_ptr) | ||
| #endif |
| !$omp target teams distribute parallel do collapse(2) default(shared) & | ||
| !$omp private(i,j,idx) shared(k,js,je,is,ie,pos,ni,ptr_fieldy,ptr) & | ||
| !$omp map(to: buffer(pos+1:pos+nj*ni)) map(from: fieldy(is:ie,js:je,k)) if(use_device_ptr) | ||
| #endif |
| ! set default device to enable multi GPU parallelism | ||
| ! calls to both OpenACC and OpenMP runtimes are needed | ||
| ! because we use both do-concurrent and openmp | ||
| ! if you remove either, the code will run multiple | ||
| ! ranks on a _single_ GPU. Be careful out there! | ||
| !$ call omp_set_default_device(pe) | ||
| !$acc set device_num(pe) | ||
|
|
| !$omp target teams distribute parallel do collapse(2) default(shared) & | ||
| !$omp private(i,j,idx) shared(k,js,je,is,ie,pos,ni,ptr_fieldx,ptr) & | ||
| !$omp map(to: buffer(pos+1:pos+nj*ni)) map(from: fieldx(is:ie,js:je,k)) if(use_device_ptr) | ||
| #endif |
There was a problem hiding this comment.
This offload loop maps buffer with map(to: ...), which copies host values to the device at entry. If buffer is intended to be device-resident and populated by MPI when use_device_ptr is enabled, the copy-in can overwrite valid received data. Prefer map(present: buffer(...)) / avoid mapping buffer here.
I thought I shortened all the long lines in cesm_constants.fh but my awk command to find the line length was off-by-one
johnmauff
left a comment
There was a problem hiding this comment.
@mnlevy1981 I am concerned that this PR fails all CI except for the lint test. I was discussing this with ChatGPT, and it stated that use_device_addr() is supported under GCC but not under NVHPC. It sounds like NVHPC does not have a complete OpenMP 5.0 standard yet. Instead, it suggests using the use_device_ptr() clause instead, which should work for both GCC and NVHPC. Not quite sure how @marshallward has gotten this to work. Could you see if the use of use_device_ptr() would fix the CI errors?
I brought up the failing CI on slack, it's concerning to me as well. Our tests are all using gcc, but I think we're testing with a pretty old version: this one says GCC 9.3 (which is 6 years old) and OpenMP 4.5. Rather than touching the code we imported from Ed et al. I'd rather look into updating the environment used for the CI |
|
AFAIK no one has ever tried to do device transfer with GCC. In our experience, OpenMP device support for GCC is severely lagging. GCC in general was almost clueless until around v15 or so. We had to disable OpenMP for our Linux Ubuntu builds, but we use latest GCC in our macOS builds. These are very perfunctory tests to verify OpenMP syntax and threading. In practice, NVIDIA is all that really matters.
@edoyango and @JorgeG94 are the ones to ask about these issues though. |
|
@marshallward Thanks for your comments. We are using a very old version of GCC, which explains why all the CI fails. We are not testing on NVHPC for functionality yet. The _addr() version is present in the code that @mnlevy1981 used to create this PR. |
just adding on that I used NB the LLMs are trained on slightly outdated info wrt nvhpc. |
|
@edoyango, thanks for the clarification on use_device_ptr() versus use_device_addr(). This all makes sense now. |
Use a container with gcc 15.1.0
Also, don't define CESM_CONSTANTS macro (will let us drop CESM share code from the turbo-stack submodules, and is bit-for-bit)
I think our version of FMS is much older than what the FMS coupler is expecting, so tests are failing... but we don't want to update our branch to include newer FMS code so we should drop the test instead
I'm trying to make FMS configure look more like TIM configure, and some I removed some stuff from configure.ac that is now breaking Makefiles
This c file is included in the TIM version of Makefile.ac but not the FMS2 version, and may be responsible for failing CI build
test_gather2DV does not exist in TIM or in more recent FMS commits, and it has some code that doesn't compile... I suspect that's related to why the test was removed in subsequent PRs
test_simple_domain.nc does not fit into character(len=20); increased to len=25 just to have some breathing room
This reverts commit b018921.
This reverts commit f88a095.
I needed to clean up the autoconf build system to get a couple of tests to pass with intel, maybe this will fix whatever issue was happening with gnu 15 as well? If not, I'll revert this commit and call the PR good
Try replacing build_ubuntu_gnu.yml with github_autotools_gnu.yml from GFDL main branch
The "Configure the build" step contained the command
./configure ${DISTCHECK_CONFIGURE_FLAGS} FCFLAGS="$FCFLAGS $DEBUG_FLAGS" || cat config.log
And the "||" meant this line would always return status 0 (either from
configure or from the cat command). This hides the true cause of the failure
(not finding netcdf)
Maybe the reason configure is failing is something that changed in this file? Grasping at straws...
Went back to dev/turbo configure.ac file, then grabbed a few specific changes from main that looked appropriate
Grabbed what GFDL uses in github_autotools_intel_openapi.yml (but I skip building with YAML support for now)
I'm not sure why configure is failing; it's complaining about ./configure: line 19069: syntax error near unexpected token `FCFLAGS="$FCFLAGS $FC_MOD_CASE_FLAG"' ./configure: line 19069: `GX_FC_MOD_CASE_FLAG(FCFLAGS="$FCFLAGS $FC_MOD_CASE_FLAG")' But I didn't change anything in the GX_FC_MOD_CASE part of the file
I think I found where GX_FC_MOD_CASE_FLAG should be set
The main branch of FMS doesn't have the mosaic/ directory (just mosaic2/); I'm trying to get this to compile without completely removing this subdirectory because I think the directory name shows up quite a bit in autoconf files
|
closed because it went off the rails; #5 picks up from last successful CI commit |
Description
This replaces #3, which was an attempt to replace the merge in #2 with a cherry-pick but was configured incorrectly.
Details:
@edoyango's
ompoffloadbranch started from the4fdd530commit of GFDL'smain. For this PR, I started a new branch from the head ofdev/turboand ran(In #3 I had run
git cherry-pick 19fedef..b4fae8dbecause19fedefis the first ompoffload commit). It is worth noting that there were no conflicts to resolve on this branch, unlike my first attempt.How Has This Been Tested?
Via
turbo-stack, I ran the MOM6 infrastructure unit tests (which all passed) and also the double gyre example to verify it is bit-for-bit.Checklist: