Skip to content

Cherrypick ompoffload devturbo take2#4

Closed
mnlevy1981 wants to merge 45 commits into
TURBO-ESM:dev/turbofrom
mnlevy1981:cherrypick_ompoffload_devturbo_take2
Closed

Cherrypick ompoffload devturbo take2#4
mnlevy1981 wants to merge 45 commits into
TURBO-ESM:dev/turbofrom
mnlevy1981:cherrypick_ompoffload_devturbo_take2

Conversation

@mnlevy1981
Copy link
Copy Markdown

Description
This replaces #3, which was an attempt to replace the merge in #2 with a cherry-pick but was configured incorrectly.

Details:

@edoyango's ompoffload branch started from the 4fdd530 commit of GFDL's main. For this PR, I started a new branch from the head of dev/turbo and ran

git cherry-pick  4fdd530..b4fae8d

(In #3 I had run git cherry-pick 19fedef..b4fae8d because 19fedef is the first ompoffload commit). It is worth noting that there were no conflicts to resolve on this branch, unlike my first attempt.

How Has This Been Tested?
Via turbo-stack, I ran the MOM6 infrastructure unit tests (which all passed) and also the double gyre example to verify it is bit-for-bit.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

edoyango and others added 14 commits May 6, 2026 12:11
* add multi gpu support

* address review comments, add helpful comment for the acc/mp runbtime call
To enable this,  had to be removed -
otherwise segfaults happen on the GPU.
This reverts commit 0cc2a77.
Having both the CPU and GPU OpenMP directives compiled caused
a significant slowdown in GPU packing/unpacking performance -
even if parallelism is controlled using OpenMP "if" clause.
Some very minor changes to the OpenMP target MPI PR:

* use_device_ptr -> use_device_addr

  This appears to be the updated form (or at least nvfortran says it is)

* Whitespace added to `!$ use omp_lib`

  Does not seem crucial but from our previous discussion it appears more
  correct.

* Removal of some trailing whitespace.
This patch refactors several lines to keep within the 121-character line
length limit prescribed by the FMS style guidelines.
The no-comm (no MPI) interface has been updated to support the new
omp_offload argument.
This ensures that (un)packing steps in do_group_update is performed
with openmp cpu parallelism if ompoffload=.false.. Previously it
would only do serial. This is implemented by undefining the GPU
macro (currently __NVCOMPILER_OPENMP_GPU) and re-including the
(un)packing files.

To make this work, the default(shared) was used in all the relevant
OpenMP directives. If default(none) is used, the loops would hang
or segfault.
@mnlevy1981 mnlevy1981 requested a review from johnmauff May 6, 2026 18:20
Removed trailing whitespace, replaced tabs with spaces, and kept all lines at
121 lines or fewer
Missed one of the long lines in cesm_constants.fh
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR cherry-picks the ompoffload-related changes onto dev/turbo, adding an optional OpenMP-offload path for mpp_transmit (MPI) and domain group-update pack/unpack, plus a set of whitespace/formatting cleanups across diagnostics, NetCDF IO, and constants.

Changes:

  • Add omp_offload optional plumbing through scalar mpp_send/mpp_recv and into the MPI MPP_TRANSMIT_ implementation using !$omp target data use_device_addr(...).
  • Add OpenMP-offloaded pack/unpack implementations for group updates (via group_update_pack.inc / group_update_unpack.inc) and integrate them into MPP_DO_GROUP_UPDATE_.
  • Apply minor formatting/whitespace cleanup in diag_manager, netcdf_io, and constants headers.

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
mpp/include/mpp_transmit.inc Adds omp_offload optional to scalar send/recv wrappers and forwards it into mpp_transmit.
mpp/include/mpp_transmit_nocomm.fh Extends MPP_TRANSMIT_ signature to accept omp_offload (unused in nocomm).
mpp/include/mpp_transmit_mpi.fh Uses omp_offload to optionally wrap MPI ISEND/IRECV with use_device_addr in OpenMP target data regions.
mpp/include/mpp_group_update.fh Adds omp_offload option to MPP_DO_GROUP_UPDATE_ and toggles inclusion of GPU vs CPU pack/unpack implementations.
mpp/include/group_update_pack.inc Implements GPU offloaded packing loops under __NVCOMPILER_OPENMP_GPU.
mpp/include/group_update_unpack.inc Implements GPU offloaded unpacking loops under __NVCOMPILER_OPENMP_GPU.
mpp/include/mpp_comm_mpi.inc Sets OpenMP/OpenACC default device during mpp_init to support multi-GPU.
fms2_io/netcdf_io.F90 Whitespace cleanup only.
diag_manager/mppnccombine.c Whitespace cleanup only.
diag_manager/diag_util.F90 Minor formatting cleanup (line splitting/whitespace).
diag_manager/diag_manager.F90 Minor formatting cleanup (continuations/whitespace).
constants/cesm_constants.fh Formatting and comment wrapping cleanup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 483 to +486
call mpp_clock_begin(group_recv_clock)
#ifdef __NVCOMPILER_OPENMP_GPU
!$omp target enter data map(alloc: buffer) if(use_device_ptr)
#endif
Comment on lines +550 to +554
!$omp target exit data map(release: buffer) if(use_device_ptr)
#else
#include <group_update_unpack.inc>
#endif
endif
Comment on lines +59 to +63
!$omp target teams distribute parallel do collapse(3) default(shared) &
!$omp private(i,j,k,idx) shared(ksize,js,je,is,ie,pos,nj,ni,ptr_fieldx,ptr) &
!$omp map(to: buffer(pos+1:pos+ksize*nj*ni)) &
!$omp map(from: fieldx(is:ie,js:je,1:ksize)) if(use_device_ptr)
#endif
Comment on lines +78 to +82
!$omp target teams distribute parallel do collapse(3) default(shared) &
!$omp private(i,j,k,idx) shared(ksize,js,je,is,ie,pos,nj,ni,ptr_fieldy,ptr) &
!$omp map(to: buffer(pos+1:pos+ksize*nj*ni)) &
!$omp map(from: fieldy(is:ie,js:je,1:ksize)) if(use_device_ptr)
#endif
Comment on lines +40 to +43
!$omp target teams distribute parallel do collapse(3) if(use_device_ptr) default(shared) &
!$omp private(i,j,k,idx) shared(ksize,js,je,is,ie,pos,nj,ni,ptr_field,ptr) &
!$omp map(to: buffer(pos+1:pos+ksize*nj*ni)) &
!$omp map(from: field(is:ie,js:je,1:ksize))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the offload path, map(to: buffer(...)) will copy the host buffer slice to the device at region entry. If MPI has already received into the device-mapped buffer (via mpp_recv(..., omp_offload=.true.) + use_device_addr), this copy-in can overwrite the freshly received device data before unpacking. Use a no-copy mapping (e.g., map(present: buffer(...)) / use_device_addr / avoid mapping the already-present buffer).

Comment on lines +112 to +115
!$omp target teams distribute parallel do collapse(2) default(shared) &
!$omp private(i,j,idx) shared(k,js,je,is,ie,pos,ni,ptr_field,ptr) &
!$omp map(to: buffer(pos+1:pos+nj*ni)) map(from: field(is:ie,js:je,k)) if(use_device_ptr)
#endif
Comment on lines +144 to +147
!$omp target teams distribute parallel do collapse(2) default(shared) &
!$omp private(i,j,idx) shared(k,js,je,is,ie,pos,ni,ptr_fieldy,ptr) &
!$omp map(to: buffer(pos+1:pos+nj*ni)) map(from: fieldy(is:ie,js:je,k)) if(use_device_ptr)
#endif
Comment on lines +58 to +65
! set default device to enable multi GPU parallelism
! calls to both OpenACC and OpenMP runtimes are needed
! because we use both do-concurrent and openmp
! if you remove either, the code will run multiple
! ranks on a _single_ GPU. Be careful out there!
!$ call omp_set_default_device(pe)
!$acc set device_num(pe)

Comment on lines +128 to +131
!$omp target teams distribute parallel do collapse(2) default(shared) &
!$omp private(i,j,idx) shared(k,js,je,is,ie,pos,ni,ptr_fieldx,ptr) &
!$omp map(to: buffer(pos+1:pos+nj*ni)) map(from: fieldx(is:ie,js:je,k)) if(use_device_ptr)
#endif
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This offload loop maps buffer with map(to: ...), which copies host values to the device at entry. If buffer is intended to be device-resident and populated by MPI when use_device_ptr is enabled, the copy-in can overwrite valid received data. Prefer map(present: buffer(...)) / avoid mapping buffer here.

I thought I shortened all the long lines in cesm_constants.fh but my awk
command to find the line length was off-by-one
Copy link
Copy Markdown

@johnmauff johnmauff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mnlevy1981 I am concerned that this PR fails all CI except for the lint test. I was discussing this with ChatGPT, and it stated that use_device_addr() is supported under GCC but not under NVHPC. It sounds like NVHPC does not have a complete OpenMP 5.0 standard yet. Instead, it suggests using the use_device_ptr() clause instead, which should work for both GCC and NVHPC. Not quite sure how @marshallward has gotten this to work. Could you see if the use of use_device_ptr() would fix the CI errors?

@mnlevy1981
Copy link
Copy Markdown
Author

@mnlevy1981 I am concerned that this PR fails all CI except for the lint test. I was discussing this with ChatGPT, and it stated that use_device_addr() is supported under GCC but not under NVHPC. It sounds like NVHPC does not have a complete OpenMP 5.0 standard yet. Instead, it suggests using the use_device_ptr() clause instead, which should work for both GCC and NVHPC. Not quite sure how @marshallward has gotten this to work. Could you see if the use of use_device_ptr() would fix the CI errors?

I brought up the failing CI on slack, it's concerning to me as well. Our tests are all using gcc, but I think we're testing with a pretty old version: this one says GCC 9.3 (which is 6 years old) and OpenMP 4.5. Rather than touching the code we imported from Ed et al. I'd rather look into updating the environment used for the CI

@marshallward
Copy link
Copy Markdown

marshallward commented May 6, 2026

AFAIK no one has ever tried to do device transfer with GCC. In our experience, OpenMP device support for GCC is severely lagging.

GCC in general was almost clueless until around v15 or so. We had to disable OpenMP for our Linux Ubuntu builds, but we use latest GCC in our macOS builds. These are very perfunctory tests to verify OpenMP syntax and threading. In practice, NVIDIA is all that really matters.

use_device_addr() and use_device_ptr() were interchangeable on the versions of NVIDIA that we were using, but newer versions were warning us to switch to use_device_addr(). I'm not sure where this version of FMS is up to, but I had thought that we swapped to use_device_addr(). (Maybe this branch is out of date with the candidate PR to FMS?)

@edoyango and @JorgeG94 are the ones to ask about these issues though.

@johnmauff
Copy link
Copy Markdown

@marshallward Thanks for your comments. We are using a very old version of GCC, which explains why all the CI fails. We are not testing on NVHPC for functionality yet. The _addr() version is present in the code that @mnlevy1981 used to create this PR.

@edoyango
Copy link
Copy Markdown

edoyango commented May 6, 2026

@mnlevy1981 I am concerned that this PR fails all CI except for the lint test. I was discussing this with ChatGPT, and it stated that use_device_addr() is supported under GCC but not under NVHPC. It sounds like NVHPC does not have a complete OpenMP 5.0 standard yet. Instead, it suggests using the use_device_ptr() clause instead, which should work for both GCC and NVHPC. Not quite sure how @marshallward has gotten this to work. Could you see if the use of use_device_ptr() would fix the CI errors?

just adding on that I used use_device_ptr() initially because it didn't do the right thing with an older nvhpc (I think 24.11?), but as marshall said, use_device_addr works for current versions.

NB the LLMs are trained on slightly outdated info wrt nvhpc.

@johnmauff
Copy link
Copy Markdown

@edoyango, thanks for the clarification on use_device_ptr() versus use_device_addr(). This all makes sense now.

mnlevy1981 added 2 commits May 8, 2026 14:04
Use a container with gcc 15.1.0
Also, don't define CESM_CONSTANTS macro (will let us drop CESM share code from
the turbo-stack submodules, and is bit-for-bit)
mnlevy1981 added 26 commits May 8, 2026 16:05
I think our version of FMS is much older than what the FMS coupler is
expecting, so tests are failing... but we don't want to update our branch to
include newer FMS code so we should drop the test instead
I'm trying to make FMS configure look more like TIM configure, and some I
removed some stuff from configure.ac that is now breaking Makefiles
This c file is included in the TIM version of Makefile.ac but not the FMS2
version, and may be responsible for failing CI build
test_gather2DV does not exist in TIM or in more recent FMS commits, and it has
some code that doesn't compile... I suspect that's related to why the test was
removed in subsequent PRs
test_simple_domain.nc does not fit into character(len=20); increased to len=25
just to have some breathing room
I needed to clean up the autoconf build system to get a couple of tests to pass
with intel, maybe this will fix whatever issue was happening with gnu 15 as
well? If not, I'll revert this commit and call the PR good
Try replacing build_ubuntu_gnu.yml with github_autotools_gnu.yml from GFDL main
branch
The "Configure the build" step contained the command

./configure ${DISTCHECK_CONFIGURE_FLAGS} FCFLAGS="$FCFLAGS $DEBUG_FLAGS" || cat config.log

And the "||" meant this line would always return status 0 (either from
configure or from the cat command). This hides the true cause of the failure
(not finding netcdf)
Maybe the reason configure is failing is something that changed in this file?
Grasping at straws...
Went back to dev/turbo configure.ac file, then grabbed a few specific changes
from main that looked appropriate
Grabbed what GFDL uses in github_autotools_intel_openapi.yml (but I skip
building with YAML support for now)
I'm not sure why configure is failing; it's complaining about

./configure: line 19069: syntax error near unexpected token `FCFLAGS="$FCFLAGS $FC_MOD_CASE_FLAG"'
./configure: line 19069: `GX_FC_MOD_CASE_FLAG(FCFLAGS="$FCFLAGS $FC_MOD_CASE_FLAG")'

But I didn't change anything in the GX_FC_MOD_CASE part of the file
I think I found where GX_FC_MOD_CASE_FLAG should be set
The main branch of FMS doesn't have the mosaic/ directory (just mosaic2/); I'm
trying to get this to compile without completely removing this subdirectory
because I think the directory name shows up quite a bit in autoconf files
@mnlevy1981
Copy link
Copy Markdown
Author

closed because it went off the rails; #5 picks up from last successful CI commit

@mnlevy1981 mnlevy1981 closed this May 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants