Cherrypick ompoffload devturbo take2 by mnlevy1981 · Pull Request #4 · TURBO-ESM/FMS

mnlevy1981 · 2026-05-06T18:20:50Z

Description
This replaces #3, which was an attempt to replace the merge in #2 with a cherry-pick but was configured incorrectly.

Details:

@edoyango's ompoffload branch started from the 4fdd530 commit of GFDL's main. For this PR, I started a new branch from the head of dev/turbo and ran

git cherry-pick  4fdd530..b4fae8d

(In #3 I had run git cherry-pick 19fedef..b4fae8d because 19fedef is the first ompoffload commit). It is worth noting that there were no conflicts to resolve on this branch, unlike my first attempt.

How Has This Been Tested?
Via turbo-stack, I ran the MOM6 infrastructure unit tests (which all passed) and also the double gyre example to verify it is bit-for-bit.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
My changes generate no new warnings
Any dependent changes have been merged and published in downstream modules

* add multi gpu support * address review comments, add helpful comment for the acc/mp runbtime call

To enable this, had to be removed - otherwise segfaults happen on the GPU.

This reverts commit 0cc2a77. Having both the CPU and GPU OpenMP directives compiled caused a significant slowdown in GPU packing/unpacking performance - even if parallelism is controlled using OpenMP "if" clause.

Some very minor changes to the OpenMP target MPI PR: * use_device_ptr -> use_device_addr This appears to be the updated form (or at least nvfortran says it is) * Whitespace added to `!$ use omp_lib` Does not seem crucial but from our previous discussion it appears more correct. * Removal of some trailing whitespace.

This patch refactors several lines to keep within the 121-character line length limit prescribed by the FMS style guidelines.

The no-comm (no MPI) interface has been updated to support the new omp_offload argument.

This ensures that (un)packing steps in do_group_update is performed with openmp cpu parallelism if ompoffload=.false.. Previously it would only do serial. This is implemented by undefining the GPU macro (currently __NVCOMPILER_OPENMP_GPU) and re-including the (un)packing files. To make this work, the default(shared) was used in all the relevant OpenMP directives. If default(none) is used, the loops would hang or segfault.

Removed trailing whitespace, replaced tabs with spaces, and kept all lines at 121 lines or fewer

Missed one of the long lines in cesm_constants.fh

Copilot

Pull request overview

This PR cherry-picks the ompoffload-related changes onto dev/turbo, adding an optional OpenMP-offload path for mpp_transmit (MPI) and domain group-update pack/unpack, plus a set of whitespace/formatting cleanups across diagnostics, NetCDF IO, and constants.

Changes:

Add omp_offload optional plumbing through scalar mpp_send/mpp_recv and into the MPI MPP_TRANSMIT_ implementation using !$omp target data use_device_addr(...).
Add OpenMP-offloaded pack/unpack implementations for group updates (via group_update_pack.inc / group_update_unpack.inc) and integrate them into MPP_DO_GROUP_UPDATE_.
Apply minor formatting/whitespace cleanup in diag_manager, netcdf_io, and constants headers.

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
mpp/include/mpp_transmit.inc	Adds `omp_offload` optional to scalar send/recv wrappers and forwards it into `mpp_transmit`.
mpp/include/mpp_transmit_nocomm.fh	Extends `MPP_TRANSMIT_` signature to accept `omp_offload` (unused in nocomm).
mpp/include/mpp_transmit_mpi.fh	Uses `omp_offload` to optionally wrap MPI ISEND/IRECV with `use_device_addr` in OpenMP target data regions.
mpp/include/mpp_group_update.fh	Adds `omp_offload` option to `MPP_DO_GROUP_UPDATE_` and toggles inclusion of GPU vs CPU pack/unpack implementations.
mpp/include/group_update_pack.inc	Implements GPU offloaded packing loops under `__NVCOMPILER_OPENMP_GPU`.
mpp/include/group_update_unpack.inc	Implements GPU offloaded unpacking loops under `__NVCOMPILER_OPENMP_GPU`.
mpp/include/mpp_comm_mpi.inc	Sets OpenMP/OpenACC default device during `mpp_init` to support multi-GPU.
fms2_io/netcdf_io.F90	Whitespace cleanup only.
diag_manager/mppnccombine.c	Whitespace cleanup only.
diag_manager/diag_util.F90	Minor formatting cleanup (line splitting/whitespace).
diag_manager/diag_manager.F90	Minor formatting cleanup (continuations/whitespace).
constants/cesm_constants.fh	Formatting and comment wrapping cleanup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  call mpp_clock_begin(group_recv_clock)
+#ifdef __NVCOMPILER_OPENMP_GPU
+  !$omp target enter data map(alloc: buffer) if(use_device_ptr)
+#endif


+  !$omp target exit data map(release: buffer) if(use_device_ptr)
+#else
+#include <group_update_unpack.inc>
+#endif
+  endif


+           !$omp target teams distribute parallel do collapse(3) default(shared) &
+           !$omp   private(i,j,k,idx) shared(ksize,js,je,is,ie,pos,nj,ni,ptr_fieldx,ptr) &
+           !$omp   map(to: buffer(pos+1:pos+ksize*nj*ni)) &
+           !$omp   map(from: fieldx(is:ie,js:je,1:ksize)) if(use_device_ptr)
+#endif


+           !$omp target teams distribute parallel do collapse(3) default(shared) &
+           !$omp   private(i,j,k,idx) shared(ksize,js,je,is,ie,pos,nj,ni,ptr_fieldy,ptr) &
+           !$omp   map(to: buffer(pos+1:pos+ksize*nj*ni)) &
+           !$omp   map(from: fieldy(is:ie,js:je,1:ksize)) if(use_device_ptr)
+#endif


Copilot · 2026-05-06T18:41:47Z

+           !$omp target teams distribute parallel do collapse(3) if(use_device_ptr) default(shared) &
+           !$omp   private(i,j,k,idx) shared(ksize,js,je,is,ie,pos,nj,ni,ptr_field,ptr) &
+           !$omp   map(to: buffer(pos+1:pos+ksize*nj*ni)) &
+           !$omp   map(from: field(is:ie,js:je,1:ksize))


In the offload path, map(to: buffer(...)) will copy the host buffer slice to the device at region entry. If MPI has already received into the device-mapped buffer (via mpp_recv(..., omp_offload=.true.) + use_device_addr), this copy-in can overwrite the freshly received device data before unpacking. Use a no-copy mapping (e.g., map(present: buffer(...)) / use_device_addr / avoid mapping the already-present buffer).

+           !$omp target teams distribute parallel do collapse(2) default(shared) &
+           !$omp   private(i,j,idx) shared(k,js,je,is,ie,pos,ni,ptr_field,ptr) &
+           !$omp   map(to: buffer(pos+1:pos+nj*ni)) map(from: field(is:ie,js:je,k)) if(use_device_ptr)
+#endif


+           !$omp target teams distribute parallel do collapse(2) default(shared) &
+           !$omp   private(i,j,idx) shared(k,js,je,is,ie,pos,ni,ptr_fieldy,ptr) &
+           !$omp   map(to: buffer(pos+1:pos+nj*ni)) map(from: fieldy(is:ie,js:je,k)) if(use_device_ptr)
+#endif


+  ! set default device to enable multi GPU parallelism
+  ! calls to both OpenACC and OpenMP runtimes are needed
+  ! because we use both do-concurrent and openmp
+  ! if you remove either, the code will run multiple
+  ! ranks on a _single_ GPU. Be careful out there!
+  !$ call omp_set_default_device(pe)
+  !$acc set device_num(pe)
+


Copilot · 2026-05-06T18:41:47Z

+           !$omp target teams distribute parallel do collapse(2) default(shared) &
+           !$omp   private(i,j,idx) shared(k,js,je,is,ie,pos,ni,ptr_fieldx,ptr) &
+           !$omp   map(to: buffer(pos+1:pos+nj*ni)) map(from: fieldx(is:ie,js:je,k)) if(use_device_ptr)
+#endif


This offload loop maps buffer with map(to: ...), which copies host values to the device at entry. If buffer is intended to be device-resident and populated by MPI when use_device_ptr is enabled, the copy-in can overwrite valid received data. Prefer map(present: buffer(...)) / avoid mapping buffer here.

I thought I shortened all the long lines in cesm_constants.fh but my awk command to find the line length was off-by-one

johnmauff

@mnlevy1981 I am concerned that this PR fails all CI except for the lint test. I was discussing this with ChatGPT, and it stated that use_device_addr() is supported under GCC but not under NVHPC. It sounds like NVHPC does not have a complete OpenMP 5.0 standard yet. Instead, it suggests using the use_device_ptr() clause instead, which should work for both GCC and NVHPC. Not quite sure how @marshallward has gotten this to work. Could you see if the use of use_device_ptr() would fix the CI errors?

mnlevy1981 · 2026-05-06T20:45:45Z

@mnlevy1981 I am concerned that this PR fails all CI except for the lint test. I was discussing this with ChatGPT, and it stated that use_device_addr() is supported under GCC but not under NVHPC. It sounds like NVHPC does not have a complete OpenMP 5.0 standard yet. Instead, it suggests using the use_device_ptr() clause instead, which should work for both GCC and NVHPC. Not quite sure how @marshallward has gotten this to work. Could you see if the use of use_device_ptr() would fix the CI errors?

I brought up the failing CI on slack, it's concerning to me as well. Our tests are all using gcc, but I think we're testing with a pretty old version: this one says GCC 9.3 (which is 6 years old) and OpenMP 4.5. Rather than touching the code we imported from Ed et al. I'd rather look into updating the environment used for the CI

marshallward · 2026-05-06T20:58:36Z

AFAIK no one has ever tried to do device transfer with GCC. In our experience, OpenMP device support for GCC is severely lagging.

GCC in general was almost clueless until around v15 or so. We had to disable OpenMP for our Linux Ubuntu builds, but we use latest GCC in our macOS builds. These are very perfunctory tests to verify OpenMP syntax and threading. In practice, NVIDIA is all that really matters.

use_device_addr() and use_device_ptr() were interchangeable on the versions of NVIDIA that we were using, but newer versions were warning us to switch to use_device_addr(). I'm not sure where this version of FMS is up to, but I had thought that we swapped to use_device_addr(). (Maybe this branch is out of date with the candidate PR to FMS?)

@edoyango and @JorgeG94 are the ones to ask about these issues though.

johnmauff · 2026-05-06T21:35:47Z

@marshallward Thanks for your comments. We are using a very old version of GCC, which explains why all the CI fails. We are not testing on NVHPC for functionality yet. The _addr() version is present in the code that @mnlevy1981 used to create this PR.

edoyango · 2026-05-06T22:23:16Z

@mnlevy1981 I am concerned that this PR fails all CI except for the lint test. I was discussing this with ChatGPT, and it stated that use_device_addr() is supported under GCC but not under NVHPC. It sounds like NVHPC does not have a complete OpenMP 5.0 standard yet. Instead, it suggests using the use_device_ptr() clause instead, which should work for both GCC and NVHPC. Not quite sure how @marshallward has gotten this to work. Could you see if the use of use_device_ptr() would fix the CI errors?

just adding on that I used use_device_ptr() initially because it didn't do the right thing with an older nvhpc (I think 24.11?), but as marshall said, use_device_addr works for current versions.

NB the LLMs are trained on slightly outdated info wrt nvhpc.

johnmauff · 2026-05-07T15:18:11Z

@edoyango, thanks for the clarification on use_device_ptr() versus use_device_addr(). This all makes sense now.

Use a container with gcc 15.1.0

Also, don't define CESM_CONSTANTS macro (will let us drop CESM share code from the turbo-stack submodules, and is bit-for-bit)

I think our version of FMS is much older than what the FMS coupler is expecting, so tests are failing... but we don't want to update our branch to include newer FMS code so we should drop the test instead

I'm trying to make FMS configure look more like TIM configure, and some I removed some stuff from configure.ac that is now breaking Makefiles

This c file is included in the TIM version of Makefile.ac but not the FMS2 version, and may be responsible for failing CI build

test_gather2DV does not exist in TIM or in more recent FMS commits, and it has some code that doesn't compile... I suspect that's related to why the test was removed in subsequent PRs

test_simple_domain.nc does not fit into character(len=20); increased to len=25 just to have some breathing room

This reverts commit b018921.

This reverts commit f88a095.

I needed to clean up the autoconf build system to get a couple of tests to pass with intel, maybe this will fix whatever issue was happening with gnu 15 as well? If not, I'll revert this commit and call the PR good

Try replacing build_ubuntu_gnu.yml with github_autotools_gnu.yml from GFDL main branch

The "Configure the build" step contained the command ./configure ${DISTCHECK_CONFIGURE_FLAGS} FCFLAGS="$FCFLAGS $DEBUG_FLAGS" || cat config.log And the "||" meant this line would always return status 0 (either from configure or from the cat command). This hides the true cause of the failure (not finding netcdf)

Maybe the reason configure is failing is something that changed in this file? Grasping at straws...

Went back to dev/turbo configure.ac file, then grabbed a few specific changes from main that looked appropriate

Grabbed what GFDL uses in github_autotools_intel_openapi.yml (but I skip building with YAML support for now)

I'm not sure why configure is failing; it's complaining about ./configure: line 19069: syntax error near unexpected token `FCFLAGS="$FCFLAGS $FC_MOD_CASE_FLAG"' ./configure: line 19069: `GX_FC_MOD_CASE_FLAG(FCFLAGS="$FCFLAGS $FC_MOD_CASE_FLAG")' But I didn't change anything in the GX_FC_MOD_CASE part of the file

I think I found where GX_FC_MOD_CASE_FLAG should be set

The main branch of FMS doesn't have the mosaic/ directory (just mosaic2/); I'm trying to get this to compile without completely removing this subdirectory because I think the directory name shows up quite a bit in autoconf files

mnlevy1981 · 2026-05-11T21:40:25Z

closed because it went off the rails; #5 picks up from last successful CI commit

edoyango and others added 14 commits May 6, 2026 12:11

add gpu2gpu mpi transer with flag for do_group_update

7d77de7

add missing collapse(3) clauses

5a85ab4

Use __NVCOMPILER macro for target regions

aa8f22b

add back old omp directive wrapped in #ifndef __NVCOMPILER

1e82747

port remaining un/pack loops

dc36e57

add multi gpu support (ESCOMP#2)

972a97c

* add multi gpu support * address review comments, add helpful comment for the acc/mp runbtime call

sub __NVCOMPILER with __NVCOMPILER_OPENMP_GPU

78b436d

allow choice of gpu or cpu parallel

5ac547f

To enable this, had to be removed - otherwise segfaults happen on the GPU.

fix omp set device call

85ae7ef

Revert "allow choice of gpu or cpu parallel"

b4993f5

This reverts commit 0cc2a77. Having both the CPU and GPU OpenMP directives compiled caused a significant slowdown in GPU packing/unpacking performance - even if parallelism is controlled using OpenMP "if" clause.

OMP target MPI: line length compliance

8d83224

This patch refactors several lines to keep within the 121-character line length limit prescribed by the FMS style guidelines.

OMP MPI: Update nocomm interface

236aef3

The no-comm (no MPI) interface has been updated to support the new omp_offload argument.

mnlevy1981 requested a review from johnmauff May 6, 2026 18:20

Linting clean-up

4152115

Removed trailing whitespace, replaced tabs with spaces, and kept all lines at 121 lines or fewer

johnmauff requested a review from Copilot May 6, 2026 18:34

Copilot started reviewing on behalf of johnmauff May 6, 2026 18:34 View session

One more lint clean-up

d26f787

Missed one of the long lines in cesm_constants.fh

Copilot AI reviewed May 6, 2026

View reviewed changes

One more linting commit

6b71b2d

I thought I shortened all the long lines in cesm_constants.fh but my awk command to find the line length was off-by-one

johnmauff requested changes May 6, 2026

View reviewed changes

mnlevy1981 mentioned this pull request May 8, 2026

API changes needed to run dev/gpu TURBO-ESM/TIM#14

Merged

4 tasks

mnlevy1981 added 2 commits May 8, 2026 14:04

Update github actions

1d377f2

Use a container with gcc 15.1.0

Turn off autoconf CI testing

695a88b

Also, don't define CESM_CONSTANTS macro (will let us drop CESM share code from the turbo-stack submodules, and is bit-for-bit)

mnlevy1981 added 26 commits May 8, 2026 16:05

Update containers for intel and coupler CI tests

a440407

Drop FMS coupler CI test

207a465

I think our version of FMS is much older than what the FMS coupler is expecting, so tests are failing... but we don't want to update our branch to include newer FMS code so we should drop the test instead

Drop YAML from configure.ac

f88a095

Drop SKIP_PARSER_TESTS from makefiles

b018921

I'm trying to make FMS configure look more like TIM configure, and some I removed some stuff from configure.ac that is now breaking Makefiles

Add mppnccombine.c to diag_manager makefile

67cafb5

This c file is included in the TIM version of Makefile.ac but not the FMS2 version, and may be responsible for failing CI build

Remove bad test

9369bf0

test_gather2DV does not exist in TIM or in more recent FMS commits, and it has some code that doesn't compile... I suspect that's related to why the test was removed in subsequent PRs

Increase filename size

9a76ece

test_simple_domain.nc does not fit into character(len=20); increased to len=25 just to have some breathing room

Revert "Drop SKIP_PARSER_TESTS from makefiles"

e68d003

This reverts commit b018921.

Revert "Drop YAML from configure.ac"

5284362

This reverts commit f88a095.

Autoconf tests might work with gnu again

5fb51e2

I needed to clean up the autoconf build system to get a couple of tests to pass with intel, maybe this will fix whatever issue was happening with gnu 15 as well? If not, I'll revert this commit and call the PR good

Add exclude flags to ubuntu tests

79e7cbc

More CI cleanup

f8696f5

Try replacing build_ubuntu_gnu.yml with github_autotools_gnu.yml from GFDL main branch

Bring in new configure.ac

2a87ebb

Maybe the reason configure is failing is something that changed in this file? Grasping at straws...

Take two with configure.ac

2ec746f

Went back to dev/turbo configure.ac file, then grabbed a few specific changes from main that looked appropriate

More mixed-mode -> r8_default cleanup

a8b0038

Update intel versions

2e25a51

Grabbed what GFDL uses in github_autotools_intel_openapi.yml (but I skip building with YAML support for now)

Run multiple commands for configure

630d6c3

Update gx_fortran_options

fed3dfa

I think I found where GX_FC_MOD_CASE_FLAG should be set

Drop skip_mosaic_tests stuff

a7a762b

drop nc_strerror

6acf0a2

The main branch of FMS doesn't have the mosaic/ directory (just mosaic2/); I'm trying to get this to compile without completely removing this subdirectory because I think the directory name shows up quite a bit in autoconf files

Remove mosaic (leave mosaic2)

6d60cc3

remove one more mosaic

7384450

More mosaic clean-up

2920505

Strip out a bunch of netcdf ifdefs

b7cb57a

mnlevy1981 mentioned this pull request May 11, 2026

Cherrypick ompoffload devturbo take3 #5

Merged

4 tasks

mnlevy1981 closed this May 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cherrypick ompoffload devturbo take2#4

Cherrypick ompoffload devturbo take2#4
mnlevy1981 wants to merge 45 commits into
TURBO-ESM:dev/turbofrom
mnlevy1981:cherrypick_ompoffload_devturbo_take2

mnlevy1981 commented May 6, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI May 6, 2026

Uh oh!

Copilot AI May 6, 2026

Uh oh!

johnmauff left a comment

Uh oh!

mnlevy1981 commented May 6, 2026

Uh oh!

marshallward commented May 6, 2026 •

edited

Loading

Uh oh!

johnmauff commented May 6, 2026

Uh oh!

edoyango commented May 6, 2026

Uh oh!

johnmauff commented May 7, 2026

Uh oh!

mnlevy1981 commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

mnlevy1981 commented May 6, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI May 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 6, 2026

Choose a reason for hiding this comment

Uh oh!

johnmauff left a comment

Choose a reason for hiding this comment

Uh oh!

mnlevy1981 commented May 6, 2026

Uh oh!

marshallward commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

johnmauff commented May 6, 2026

Uh oh!

edoyango commented May 6, 2026

Uh oh!

johnmauff commented May 7, 2026

Uh oh!

mnlevy1981 commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

marshallward commented May 6, 2026 •

edited

Loading