Skip to content

Merge devgpu to turbo#20

Merged
mnlevy1981 merged 585 commits into
TURBO-ESM:dev/turbofrom
mnlevy1981:merge_devgpu_to_turbo
May 19, 2026
Merged

Merge devgpu to turbo#20
mnlevy1981 merged 585 commits into
TURBO-ESM:dev/turbofrom
mnlevy1981:merge_devgpu_to_turbo

Conversation

@mnlevy1981
Copy link
Copy Markdown

This PR brings the latest changes from @marshallward and the NCI Australia / ACCESS-NRI team to dev/turbo. We were pretty far behind mom-ocean:main, and there are answer changes that came through that branch -- you can reproduce our previous answers by adding the following to MOM_override but I don't think we have any need to preserve these specific bugs (perhaps we should set ENABLE_BUGS_BY_DEFAULT = False and not re-enable any specific bugs?).

ENABLE_BUGS_BY_DEFAULT = False
VISC_REM_BUG = True
VISC_REM_TIMESTEP_BUG = True
RHO_PGF_REF_BUG = True
FRICTWORK_BUG = True
VISC_REM_BT_WEIGHT_BUG = True

In my testing with turbo-stack, I needed to update FMS2 to @edoyango's ompoffload branch, so I suspect the CI testing will fail when I first open this PR. I'll link to the necessary FMS PR once it is open (we will also need to update TIM, though that's basically cherry-picking some of Ed's FMS changes). I'll leave this as a draft until the infrastructure changes are in place and the CI testing works.

awallcraft and others added 30 commits September 7, 2025 21:02
SHALLOW_ALE_RESOLUTION implements a HYBGEN-style Z-sigma-Z near surface
fixed coordinate for HYCOM1.  For example the US Navy's GOFS 3.1 HYCOM
setup has 41 layers, with the top 14 layers in a Z-sigma-Z configuration.
For MOM6 HYCOM1 this is: SHALLOW_ALE_RESOLUTION = 14*1.0,27*0.0 for 14
1m "shallow" layers.

Let N_SIGMA be the number of consecutive non-zero entries, typically < NK.
When rest depth is shallower than SUM(SHALLOW_ALE_RESOLUTION(1:N_SIGMA))
use SHALLOW_ALE_RESOLUTION.  When rest depth is deeper than
SUM(SHALLOW_ALE_RESOLUTION(1:N_SIGMA)) use ALE_RESOLUTION.  Otherwise
use a linear sum of the two weighted by rest depth.

The default of all zeros turns this option off, and when off answers are
unchanged.  The new parmeter SHALLOW_ALE_RESOLUTION is only present when
using HYCOM1.
The 2-d REAL map array in HYBRID_MAP usually contains integer values
each referencing one profile.  It can instead contain non-integer
values of the form I+frac, which indicate a weighted sum of profiles:
(1-frac) p(I) + (frac) p(I+1).  The same profile can be used multiple
times, e.g. if 1st profile is also 4th can get profiles between 1 and 2
and between 1 and 3.

HYBRID_3D is more general, but HYBRID_MAP covers most practical uses.
  Added the new runtime option RESCALE_STRONG_DRAG, that can be set to true to
reduce the barotropic contribution to the layer accelerations to account for the
difference between the forces that can be counteracted by the stronger drag with
BT_STRONG_DRAG and the average of the layer viscous remnants after a baroclinic
timestep.  In testing, this new capability eliminates some of the growing
instabilities that can occur with an ice shelf and BT_STRONG_DRAG set to true.
This commit also adds new diagnostics of the barotropic step viscous
remnants and the eta anomalies contributing to barotropic pressure forces,
either averaged over the barotropic step or at each barotropic step.  By
default all answers are bitwise identical, but there is a new runtime parameter
and 4 new diagnostics.
mom-ocean#967)

* Add option to horizontally homogenize the Stokes drift when used via the dataoverride surfbands procedure.

* Add variable description in new method for horizontally averaging Stokes drift.

---------

Co-authored-by: brandon.reichl <brandon.reichl@noaa.gov>
  Corrected a horizontal indexing bug in the calculation of the CAv_Stokes
diagnostic, making it rotationally consistent and consistent with the
calculation of CAu_Stokes.  This bug has been there since the CAv_Stokes
diagnostic was originally added.  The loop range over which qS is calculated was
also reduced to the range over which it is used.  All solutions are bitwise
identical, but this commit does change the values of a (perhaps infrequently
used) diagnostic.
The interpreter directive ("shebang") of makedep is updated to
`python3`, rather than the version-agnostic `python`.

Although we never invoke the shebang of the script, there are OS
environments out there which will object to any presence of a
versionless python.  PEP 394 also strongly recommends the adoption of
python3 as the executable name, regardless of Py2 support.
* present_vhbt_or_set_bt_cont: merge couple of loops

* meridional_flux_thickness: cpu optimize a bit

* meridional_flux_adjust: back to jki

* set_merid_BT_cont: pull out meridional_flux_adjst

* set_meridional_BT_cont: optimize cpu

* rm remaining merid/zonal_flux_layer

* zonal_flux_layere: improve naming a bit

* zonal_flux_layere_OBC: improve naming a bit

* improve flux_elem line wraps

* optimize data transfers a bit

* pass elem not arr

* add some missing documentation

* fix trailing spaces and missing var docs

* add last param docs

* meridional_flux_adjust: fix fpe err

* fix another fpe

* cleanup args for new helper subroutines

* clean up enter/exit data

* fix passing h twice

* meridional_mass_flux: do concurrent

* zonal_flux_adjust: use a few 3d tmp arrays to mirror meridional_flux_adjust

* move target update out of continuity

* initialize pbv%por_face_area[U/V] on GPU

* cleanup some transfers from continuity_PPM

* clean up a few minor things

* zonal/meridional_flux_adjust: use scalar u/v_new

* declare vp,up,h_tmp on gpu

* remove h update

* continuity_PPM: minimize mapping stmts

* zonal_flux_adjust: minimize mapping stmts

* set_zonal_bt_cont: minimize mapping stmts

* merional_flux_adjust: minimize mapping stmts

* set_merid_bt_cont: minimize mapping stmts

* zonal/meridional_flux_adjust: tmp vars duhdu/dvhdv 3d -> scalar

* separate alloc of private variables for gcc

* u/vh_aux: 3d->2d

* target teams loop recognizable by amdflang

* Continuity CS outside of init

This moves the Continuity CS to the dycore init function.  For some
reason, this avoids an answer change with CPU.  (Possibly because alloc
inside of a function doesn't quite match the CS outside of it?)

A few minor data transfers are also added to fix up differences in the
chksum log output.

* zonal_mass_flux: isolate zonal_flux_layer

* zonal_mass_flux: seperate local_specified_BC block loop

* zonal_mass_flux: add j dim to tmp vars

* zonal_mass_flux: seperate visc_rem_max init loops

* zonal_mass_flux: seperate du_min/max_CFL init loops

* zonal_mass_flux: separate duhdu/uh_tot_0 init loop

* zonal_mass_flux: separate du_min/max_CFL aggress_adjust update case (untested)

* zonal_mass_flux: separate du_min/max_CFL non-aggress_adjust update case

* zonal_mass_flux: separate du_min/max_CFL non-use_visc_rem update case (untested)

* zonal_mass_flux: separate du_min/max_CFL 0-clamp loop

* zonal_mass_flux: separate do_I local_specified_BC init loop (untested)

* copy zonal_flux_adjust that accepts 2d args

* zonal_mass_flux move j loop into zonal_flux_adjust copy

* zonal_flux_adjust_fused: use 2d internal arr

* zonal_flux_adjusted_fused: separate all loops

* zonal_mass_flux: separate u/du_cor update (untested)

* zonal_mass_flux: replicate former present(uhbt) control flow

* copy set_zonal_BT_cont with 2d args

* zonal_mass_flux: move j-loop into set_zonal_BT_cont_fused

* set_zonal_BT_cont_fused: use zonal_flux_adjust_fused

* set_zonal_BT_cont_fused: separate init loop

* set_zonal_BT_cont_fused: separate duR/L init loop

* set_zonal_BT_cont_fused: separate u_0/L/R init loop

* set_zonal_BT_cont_fused: use zonal_flux_layer_fused

* set_zonal_BT_cont_fused: separate last 2 loops

* copy merid_flux_layer that accepts entire arrs

* merid_flux_layer_fused: separate loops

* merididional_mass_flux: separate local_specified_BC loop (untested)

* merididional_mass_flux: separate tmp variable init loops

* merididional_mass_flux: separate dv_min/max_CFL calc loops

* merididional_mass_flux: separate dv_min/max_CFL 0 clamp loop

* merididional_mass_flux: separate simple_OBC_pt init loop (untested)

* copy meridional_flux_adjust that accepts entire arrs

* meridional_mass_flux: separate meridional_flux_adjust_fused

* meridional_mass_flux: move j-loop into meridional_flux_adjust_fused

* meridional_flux_adjust_fused: separate vh_aux,dvhdv init loops

* meridional_flux_adjust_fused: 1d->2d tmp arrs

* meridional_flux_adjust_fused: separate all arrs

* meridional_mass_flux: separate (d)v_cor assignment loops

* copy set_merid_BT_cont that accepts entire arrs

* meridional_mass_flux: move j-loop into set_merid_BT_cont_fused

* set_merid_BT_cont_fused: use meridional_flux_adjust_fused

* set_merid_BT_cont_fused: separate tmp var init

* set_merid_BT_cont_fused: separate short circuit loop

* set_merid_BT_cont_fused: rm redundant k loop in short circuit

* set_merid_BT_cont_fused: separate dvL/R init loop

* set_merid_BT_cont_fused: make remaining tmp arrs 3d

* set_merid_BT_cont_fused: use merid_flux_layer_fused

* set_merid_BT_cont_fused: separate last loop

* meridional_mass_flux: separate any_simple_OBC loop

* zonal_edge_thickness: move k loop->PPM_reconstruction_x

* meridional_edge_thickness: move k loop->PPM_reconstruction_y

* PPM_reconstruction_x/y: move k loop->PPM_limit_pos

* PPM_reconstruction_x/y: move k loop->PPM_limit_cw84 (untested)

* zonal_BT_mass_flux: separate all loops (untested)

* meridional_BT_mass_flux: separate all loops (untested)

* set_zonal_BT_cont_fused: clean up var defs

* use SZJB_(G) in do_I dclrns in merid* subroutines

* set_merid_BT_cont_fused: clean up var defs

* set_merid/zonal_BT_cont_fused -> set_merid/zonal_BT_cont

* zonal_flux_adjust_fused: clean up var defs

* zonal_flux_adjust_fused -> zonal_flux_adjust

* zonal_flux_layer_fused -> zonal_flux_layer

* meridional_flux_adjust_fused: clean up var defs

* meridional_flux_adjust_fused -> meridional_flux_adjust

* merid_flux_layer_fused: clean up var defs

* merid_flux_layer_fused -> merid_flux_layer

* fix call merid/zonal_flux_layer line conts

* zonal_mass_flux: use visc_rem_u_tmp

* zonal_mass_flux: separate du_min/max_CFL non-use_visc_rem update case (untested)

* copy set_zonal_BT_cont with 2d args

* meridional_mass_flux: use visc_rem_v_tmp

* copy meridional_flux_adjust that accepts entire arrs

* copy set_merid_BT_cont that accepts entire arrs

* set_merid_BT_cont_fused: rm redundant k loop in short circuit

* set_merid_BT_cont_fused: rm redundant k loop in short circuit

* zonal_flux_adjust_fused -> zonal_flux_adjust

* meridional_flux_adjust_fused -> meridional_flux_adjust

* zonal_flux_layer: move GV in var dclrn

* zonal/meridional_mass_flux: remove old visc_rem vars

* remove redundant kloop

* remove problematic omp directives

* port PPM_limit_pos/CW84, PPM_reconstruction_x, zonal_edge_thickness

* PPM_reconstruction_x: add enter/exit data stmts

* continuity_zonal_convergence: save loop range for porting convenience

* port continuity_zonal_convergence

* zonal_mass_flux: array init -> loop init

* zonal_mass_flux: port init loops

* port zonal_flux_layer

* zonal_flux_layer: add enter/exit data stmts

* port zonal_flux_adjust

* port set_zonal_bt_cont

* zonal_flux_adjust: add more arrs in enter/exit data

* set_zonal_BT_cont: add enter/exit data stmts

* port zonal_flux_thickness loops

* zonal_flux_thickness add enter/exit data stmts

* zonal_mass_flux prepare obc for porting

* zonal_mass_flux: merge any_somple_OBC loops

* zonal_mass_flux port main loops

* zonal_mass_flux port OBC loops

* zonal_mass_flux: add enter/exit data stmts

* continuity_ppm: add initial enter/exit data stmts

* port PPM_reconstruction_y, meridional_edge_thickness

* port continuity_merdional_convergence

* ppm_reconstruction_y: add enter/exit data stmts

* port merid_flux_layer

* meridional_flux_adjust: port loops

* meridional_flux_adjust: add enter/exit data stmts

* meridional_mass_flux: port core loops

* meridional_mass_flux: port OBC loops (untested)

* meridional_flux_thickness: port core loops

* meridional_flux_thickness: attempt port OBC loops (untested)

* meridional_flux_thickness: add enter/exit data stmts

* port set_merid_BT_cont loops

* set_merid_bt_cont: add enter/exit data stmts

* meridional_mass_flux: add enter/exit data stmts

* zonal/meridional_mass_flux: add missing vars in enter/exit stmts

* continuity_PPM: complete enter/exit data stmts

* *_edge_thickness: do concurrent

* zonal_mass_flux: do concurrent-ify

* set_zonal_BT_cont: do concurrent-ify

* zonal_flux_adjust: do concurrent-ify

* zonal_flux_layer: do concurrent

* zonal_flux_thickness: do concurrent

* continuity_zonal_convergence: do concurrent

* meridional_mass_flux: do concurrent

* set_merid_bt_cont: do concurrent

* meridional_flux_adjust: do concurrent

* merid_flux_layer: do concurrent

* meridional_flux_thickness: do concurrent

* continuity_merdional_convergence: do concurrent

* formatting

* continuity_PPM: update LB

* meridional/zonal_flux_adjust: separate ij-reduction

* zonal_flux_adjust: a couple jki loops

* set_zonal_bt_cont: some jki loops

* zonal_mass_flux: some jki loops

* optimise loops by using scalar zonal_flux_layer

* zonal_flux_adjust: duhdu -> scalar

* elemental zonal_flux_layere and separated OBC

* zonal_flux_adjust: back to jki

* zonal_flux_thickness: improve cpu perf a bit

* zonal_flux_adjust: add some comments and guard early exit

* zonal_flux_adjust: use omp target loop for private arrs

* set_zonal_BT_cont: use omp target loop for private arrs

* set_zonal_BT_cont: do conc jki loop

* zonal_mass_flux: rm useless do_i init

* zonal_flux_layere: precalc g_dy_Cu*por_face_areaU

* zonal_flux_layere: precalc dh

* zonal_flux_thickness: precalc dh

* zonal_flux_adjust: make uhbt optional

* zonal_flux_thickness: assign to outputs directly

* mv zonal_flux_adjust from set_zonal_bt_cont->zonal_mass_flux

* zonal_mass_flux: reuse visc_rem_u_tmp more

* zonal_mass_flux: remove redundant if stmt

* zonal_mass_flux: force inline zonal_flux_layere

* zonal_flux_layere: inlinable gfortran -O3 but slower for ifort

* zonal_flux_layere: make a bit "smaller"

* zonal_mass_flux: move present(uhbt) or set_BT_cont to new subroutine

* Revert "zonal_flux_layere: make a bit "smaller""

This reverts commit 316152e.

* Revert "zonal_flux_layere: inlinable gfortran -O3 but slower for ifort"

This reverts commit 51ee896.

* fix gcc line truncation

* pass doxygen tests

* remove forceinline dirs

* attempt document vars

* a couple long lines

* last trailing space

* meridional_mass_flux: use zonal_flux_layere

* zonal_flux_layere_OBC: make elemental

* meridional_mass_flux: move dv_min/max_CFL calc into j-loop

* meridional_mass_flux: move big chunk into separate subroutine

* present_vhbt_or_set_bt_cont: merge couple of loops

* meridional_flux_thickness: cpu optimize a bit

* meridional_flux_adjust: back to jki

* set_merid_BT_cont: pull out meridional_flux_adjst

* set_meridional_BT_cont: optimize cpu

* rm remaining merid/zonal_flux_layer

* zonal_flux_layere: improve naming a bit

* zonal_flux_layere_OBC: improve naming a bit

* improve flux_elem line wraps

* optimize data transfers a bit

* pass elem not arr

* add some missing documentation

* fix trailing spaces and missing var docs

* add last param docs

* meridional_flux_adjust: fix fpe err

* fix another fpe

* cleanup args for new helper subroutines

* clean up enter/exit data

* fix passing h twice

* meridional_mass_flux: do concurrent

* zonal_flux_adjust: use a few 3d tmp arrays to mirror meridional_flux_adjust

* move target update out of continuity

* initialize pbv%por_face_area[U/V] on GPU

* cleanup some transfers from continuity_PPM

* clean up a few minor things

* zonal/meridional_flux_adjust: use scalar u/v_new

* declare vp,up,h_tmp on gpu

* remove h update

* continuity_PPM: minimize mapping stmts

* zonal_flux_adjust: minimize mapping stmts

* set_zonal_bt_cont: minimize mapping stmts

* merional_flux_adjust: minimize mapping stmts

* set_merid_bt_cont: minimize mapping stmts

* zonal/meridional_flux_adjust: tmp vars duhdu/dvhdv 3d -> scalar

* separate alloc of private variables for gcc

* u/vh_aux: 3d->2d

* target teams loop recognizable by amdflang

* Continuity CS outside of init

This moves the Continuity CS to the dycore init function.  For some
reason, this avoids an answer change with CPU.  (Possibly because alloc
inside of a function doesn't quite match the CS outside of it?)

A few minor data transfers are also added to fix up differences in the
chksum log output.

* Continuity: Add locality to do concurrent

Do concurrent inside of !$omp target teams loop seems to fail standard
openmp tests if locality is not correctly set.

This patch adds the correct locality to the four `!$omp target teams
loop` directives.

The domore argument has also been removed, and replaced with a
`.not.any(do_i(:))` test.

---------

Co-authored-by: Marshall Ward <marshall.ward@noaa.gov>
update MOM6 to its main repo. 20250818 updating (default parameter changes)
Added the new argument dyn_h_stencil to initialize_dyn_split_RK2 and the
other 3 dynamic core initialization routines to return the size of the stencil
for thicknesses as used by the dynamic core, depending on the options that are
being used for the Coriolis and continuity schemes, and then used this in a set
of halo updates in step_MOM_dynamics.  With this change some additional halo
updates that have recently been added inside of step_MOM_dyn_split_RK2 and the
other 3 dynamic core time stepping routines could be eliminated.  All answers
are bitwise identical, but there is a new argument to 4 public interfaces.

Co-authored-by: Marshall Ward <marshall.ward@gmail.com>
Modified the conditions that determine when to increment the count of velocity
truncations to use the thickness as interpolated to velocity points to determine
whether layers are thick enough to be counted, rather than the arithmetic mean
thickness, and only count truncations that occur in the non-symmetric
computational domain to avoid double counting.  The filtering thicknesses
should be very similar in the ocean interior, but they will differ at open
boundary condition points.  The corrected counting was verified by running the
sloshing/layer test case with a maximum CFL set to 0.01 to create lots of
truncations, and then verifying that the truncation count is now the same on 1
and 10 PEs, whereas before it was not.  All solutions are bitwise identical, but
the reported truncation counts in the ocean.stats files can change in
multiple-PE cases with velocity truncations.
* Add MOM_ANN module

* Mesoscale momentum parameterization with ANN

- Computes subgrid stress using ANN in MOM_Zanna_Bolton
- Uses MOM_ANN module for ANN inference

Equivalent MOM_override for defaults
```
USE_ZB2020 = True
ZB2020_USE_ANN = True
USE_CIRCULATION_IN_HORVISC = True
ZB2020_ANN_FILE_TALL = /path/to/ocean3d/subfilter/FGR3/EXP1/model/Tall.nc
```

* Mesoscale momentum parameterization with ANN (#2)

Blank commit after squash/rebase was handled on command line

* Moved MOM_ANN.F90 to src/framework/

* Minor refactor of MOM_ANN

- Removed unused modules
- Removed unused MOM_memory.h
- Added input and output means which default to 0 and
  do not need to be present in the weights file
- Gave defaults to means, norms, tests so that they do
  no need to be present in file
- Added missing array notation "(:)"
- Minor formatting

* Adds unit tests and timing test to MOM_ANN

- Added ANN_allocate, set_layer, set_input_normalization, and
  set_output_normalization methods to allow reconfiguration during
  unit tests
- Added ANN_unit_tests with some simple constructed-by-code
  networks with known solutions
- Added config_src/drivers/unit_tests/test_MOM_ANN.F90 to drive
  unit tests
- Added  config_src/drivers/timing_tests/time_MOM_ANN.F90 as
  rudimentary for timing inference

* Adding multiple forms of inference

- Adds inference operating on array (instead of single vector of
  features)
- Implements several different versions of inference with various
  loop orders
  - Involves storing the transpose of A in the type
  - Tested by checking inference on same inputs is identical between
    variants
    - Added randomizers to assist in unit testing
- Adds timing of variants to config_src/drivers/timing/time_MOM_ANN.F90
- Adds an interface (MOM_apply) to select preferred version of
  inference subroutine
- Added command line args to time_MOM_ANN.F90 to allow more rapid
  evaluation of performance

Variants explored, timed with gfortran (13.2) -O3 on Xeon:
- vector_v1:
  - original inference from Pavel
- vector_v2:
  - allocate work arrays just once, using widest layer
  - loop over layers in 2's to avoid pointer calculations and copies
  - speed up, x0.8 relative to v1
- vector_v3:
  - transpose loops
  - slow down, x1.54 relative to v1
- vector_v4:
  - transpose weights with same loop order as v1
  - slow down, x1.03 relative to v1
- array_v1:
  - same structure as v2, working on x(space,feature) input/outputs
  - speed up, x0.41 relative to v1
- array_v2:
  - as for array_v1 but with transposed loop order
  - apply activation function on vector of first index while in cache
  - speed up, x0.35 relative to v1
- array_v3:
  - same structure as v2, working on x(feature,space) input/outputs
  - speed up, x0.58 relative to v1

* Renamed ANN variants and added some module documentation

- Added module dox
- Renamed _v1, _v2 etc to labels
- Added ANN_apply_array_sio to ANN_apply interface
- Replaced "flops" with "MBps" in timing output

* Removed alternative variants of ANN in favor of optimized

- Deleted variants of ANN that did not perform as well as the two
  versions that remain.

* Apply array_sio function in ANN inference for momentum fluxes (#5)

* Apply array_sio ANN inference for computation of momentum fluxes

* remove trailing space

* Initial commit

* address Robert Hallberg code review

* Restore deafult value of ZB_SCALING coefficient

---------

Co-authored-by: Alistair Adcroft <Alistair.Adcroft@noaa.gov>
Co-authored-by: Alistair Adcroft <adcroft@users.noreply.github.com>
- fixed the restart trouble
Update 1: The accumulator of FtF is now associated with each FtSSH,
instead of shared between all FtSSH, such that both are updated at
the same model time steps, eliminating the potential inconsistency
when different FtSSH are called at different time steps.
Update 2: HA_register are called inside HA_init. This should be done
for all variables/fields.
Update 3: A logical flag is set to control whether harmonic analysis
is to be enabled.
Update 4: time_ref and const_name are defined in HA_init, instead
of being copied from MOM_tidal_forcing. This commit prepares for
the separation of MOM_harmonic_analysis and MOM_tidal_forcing.
Update 5: MOM_harmonic_analysis is now independent of MOM_tidal_forcing,
providing more flexibility for performing harmonic analyses on tidal
constituents not available in MOM_tidal_forcing (e.g., MK3, M4).
Update 6: The frequencies of 8 overtides/compound tides (MK3, MN4,
M4, MS4, S4, M6, S6, M8) have been added for the harmonic analysis.
Fixed the inconsistency for defining the reference time of tides in
MOM_tidal_forcing and MOM_open_boundary.
  Use the I0 format that was introduced with Fortran 95 in 155 lines scattered
across 40 files to simplify or shorten some error messages.  In 21 cases,
adjustl() calls that are no longer necessary for intended formatting were also
eliminated.  These changes have the effect of ensuring that there are still
appropriate messages if there are, for example, more than 99 vertical layers or
9999 points (total) in a horizontal directions or more than 9999 PEs.  In 15
cases, this change allowed for the elimination or reduction of if tests that
formatted output based on the size of an integer.  All answers are bitwise
identical but there there may be some minor formatting changes in some error
messages.
This patch fixes two class of index errors in multiple functions of
`MOM_vert_friction.F90`:

* `j=G%isc,G%jec` had been incorrectly applied to multiple loops.  This
  went undetected because we almost exclusively use local indexing where
  `G%isc == G%jsc`, but is nonetheless a serious error.  Thanks to Jorge
  Luis Gálvez Vallejo for reporting.

* One errant loop in the shelf code had `i=is,je`.  This was undetected
  due to poor ice shelf coverage testing.  Thanks to Claire Yung for
  reporting.
This patch moves the k-column loops inside of ji-layer loops, rather
than outer-k loops of layers.

The primary motivation is to restore performance at high-bandwidth runs,
which were insufficiently tested during development of the k-j-i form.

The inner-column loops show improved performance for both low and
high-bandwidth runs.

The high-bandwidth benchmark case: (128-core, 256x128 x 75 layer)
```
                                   Profile   Reference
      (Ocean vertical viscosity):   7.158s,  15.047s (-52.4%)
```
The low bandwidth case: (1-core, 32x32 x 75 layer)
```
                                   Profile   Reference
      (Ocean vertical viscosity):   3.911s,   4.788s (-18.3%)
```
For the GFDL OM5 production configuration at 503, runtimes of the slowest ranks
were reduced in proportion to the high-bandwidth case above.

For the reference dev/gfdl,
```
                                      hits          tmin          tmax          tavg
(Ocean vertical viscosity)             288      4.303819     21.483670     14.452196
```
After apply this patch, times reduce ~40%
```
                                      hits          tmin          tmax          tavg
(Ocean vertical viscosity)             288      0.976130     13.398768      8.689331
```

* Moving to columns allowed for removal of many `do_i` tests, since the test is
  applied before starting the loop.

* The `touch_ij` dummy function was removed, since we're no longer trying to
  force an IPO optimization.

* The shelf requires a re-calculation of the various thickness averages
  (h_arith, etc).  These could be saved as 1D if it becomes a problem.

* In addition to the usual regression testing, I also found no regressions in
  selected ice shelf configurations.
* Linear wave drag is limited to be only applied to land points, using
velocity point masks mask2dC[uv].

* Rayleigh_[uv] calculation and bt_rem_[uv] update from linear wave drag
 is limited for Htot>0 only.

This patch eliminates potential NaN in Rayleigh_[uv] in an unusual
scenario that Htot==0.0 and lin_drag_[uv]/=0. The changes do not change
answers: bt_rem_[uv] is zero at land points regardless. Rayleigh_[uv]
is added to [uv]_accel_bt which is masked before updating velocity.
In MOM_barotropic and non-Boussinesq mode, warning message on negative
eta is now only issued at wet points, consistently with Boussinesq.
In MOM_dynamics_split_RK2, now accleration chksum is printed before
velocity with debug on, so that we could know which accleration term is
responsible for a NaN in velocity.
* Vertvisc: First part of loop

Moving gradually

* add nvtx ranges and module warpper

* port loop to openmp for now -> will move to do concurrent. Focused on porting to GPU now

* more gpu

* port the vertical velocity to GPUs

* port the following vertical velocity loops to GPUs

* more loops using omp target teams, validated

* fix compute sanitizer death

* meridional velcoity component ported

* finish porting vertvisc

* offload another loop

* revert the last thing

* add collapse(2) to the expensive loops

* fix

* fix validation

* updates to porting: find coupling coef

* almost done, just missingt he killer loop

* lable the block of death

* duct taped version of all funcs ported to GPUs

* back to format and delete block constructs

* remove the maps ins remnant yay!

* remove need for local vertvisc_u/v variables

* death loop of death has been ported

* vertvisc coef does not have explicit mappings anymore

* the maps are gone!

* limit vel using single data transfer

* do concurrent

* some memory cleanup and code cleanup - profiling time

* update and lots of nvtx markers for opt

* limit vel betterments and notes for optimization

* make limit vel good

* h_ml transfer

* Vertfrict: Move CS allocations to init

Allocations of the vert frictions and visc control structures were moved
from the dycore subroutine loops to the model initialization.

Redundant grid (G) transfers were removed (although they were not
triggering any transfers).

Most (but not all) visc arrays have also been moved into the
initialization.  Kv_shear and Ray_[uv] need to be added.

Trailing whitespace also (inadvertently) got removed, hopefully not a
distraction.

* coupling coeff fixes

This appears to improve the CPU/GPU repro of the latest merge of dev/gpu
into dev_gpu_vertvisc.

* Resolves some of the u/v/h/dz field states between the barotropic and
  vertvisc functions.  (Development of each assumed that the other was
  not complete).

* This seems to fix some issues with a_[uv], Kv_tot, and dz inside of
  find_coupling_coef and vertvisc coef

  * chksum transfers were added to ensure consistent

* The diff has moved to frhat[uv] in btcalc.  The fields are all zero on
  GPU, so they are probably not transferred.

Work in progress...

* Remove redundant BT_cont transfer

The BT_cont fields are now on GPU, so they don't need to be transferred
before the btcalc call.

Checksum transfers for h_ml were also added.  Still assuming that it's
computed on the CPU.

* delete nvtcx

* test to see if guarding and reintroducing the check helps

* an ugly fix for a simple problem: just want to be sure before I dive

* spaces

* Vertvisc: formatting/style cleanup

This patch reduces the formatting changes to this branch and brings it
closer to both dev/gfdl and the MOM6 style guide.

One minor non-cosmetic change is the reversion of ADp%dv_dt_str(i,J,1)
calculation into one of the main Thomas loop.  This returns it to a
separate loop.

* Vert friction: Explicit CS alloc in dycore init

This patch removes the NVIDIA compiler preprocessing and explicitly
compiles the `CS` in the dynamic core initilization.

`vertvisc_init` is no longer responsible for allocation.  The allocation
test has also been removed.  We just have to trust each other now (or
segfault when it refs an unallocated field).

The previous error in the CI was the absense of `CS` allocation in the
other timestep methods (unsplit, unsplit RK, split RK2b).

* Vert friction: Data transfer reduction

Several modifications to the data transfer statements

* Some redundant transfers were removed

* Input/output transfers were moved up to the dycore loop

* Vert friction: Cosmetic cleanup

Remove some redundant comments, and an unused array loop.

* Vert friction: nk_in_ml bugfix

---------

Co-authored-by: Marshall Ward <marshall.ward@noaa.gov>
  Added the new runtime parameter RESOLN_FUNCTION_OBC_BUG that can be set to
false to take open boundary conditions into account when calculating the
resolution functions at u-, v- or q-points.  By default the wave speeds used to
calculate resolution functions do not take OBCs into account and all answers are
bitwise identical.
edoyango and others added 12 commits April 27, 2026 14:50
Some odd behaviour with num_srt where if it was map alloc'ed
upon entering the last big loop, the loop would segfault.
Flattening these arrays halves time when compiling for
GPU. Lots of time was being spent "attaching" and
"detaching" the member arrays to/from each struct on
the GPU. The disadvantage is that now a bit more memory
is required - in benchmark, the most elements allocated
per array/struct in any timestep has increased from
~700k to ~830k (~20% increase).
NVHPC 25.11 didn't like the early exit and would
give wrong answers.
Port loops in find_eta to GPUs.

Thanks to Ed Yang (@edoyango) for catching some errors in the the
cumulant sums.

The ssh_rint loop in the MOM CS was also migrated to GPU.  Several
derived types inside of MOM CS were redefined as allocatable in order to
properly decouple the array from changes in these types.
Improper managemement of domore_k was leading to
anwer changes on multiple GPUs. This fixes the
issue by proper specification of reductions in the
do concurrents that reduce domore_k(k). To make
this work, a new tmp scalar was added since do
concurrent can't use array elems yet
was map(alloc: Reg%Tr(:)) when it should be map(to...

Co-authored-by: Jorge Luis Gálvez Vallejo <jorgegalvez1694@gmail.com>
@marshallward
Copy link
Copy Markdown

marshallward commented May 3, 2026

If you reference (or pull from) the latest MOM6-examples and update HOR_VISC_ANSWER_DATE to 20250101 then you might be able to work out the paramter changes.

Edit: I think I misspoke. The GPU solvers required an update HOR_VISC_ANSWER_DATE in order to preserve CPU-GPU reproducibility. You may want everything but HOR_VISC_ANSWER_DATE to recover answers, but I would recommend updating HOR_VISC_ANSWER_DATE to 2025.

@mnlevy1981
Copy link
Copy Markdown
Author

I think we are expecting answer changes, so it's okay the regression tests failed. I was able to reproduce double_gyre answers with

VISC_REM_TIMESTEP_BUG = True
VISC_REM_BT_WEIGHT_BUG = True

but don't know if we want to keep those bugs in the code just for testing purposes...

@mnlevy1981 mnlevy1981 marked this pull request as ready for review May 12, 2026 18:35
@mnlevy1981
Copy link
Copy Markdown
Author

mnlevy1981 commented May 15, 2026

The dev/gpu specific changes should be visible in this comparison -- still 51 files changed, though some are specific to the test system and a lot of the changes are pretty minor. They might also be some discrepancies in how I resolved conflicts from the merge, though I've tried to clean that up already.

Also, just to document the process in case we need to do something similar in the future... I did the following:

  1. Find the latest dev/gfdl commit on dev/gpu
    $ git merge-base dev/gfdl dev/gpu
    ea47b2096fb0bf899e9cdc2b9d94cfa8c47d1d59
    
  2. Merge that commit onto the head of dev/turbo (this is how I created merge_devgfdl_to_turbo; try to resolve conflicts consistent with what I did when creating the branch this PR is merging in
  3. Compare the dev/gpu -> dev/turbo branch to this newly created branch

@johnmauff johnmauff self-requested a review May 16, 2026 18:50
@johnmauff
Copy link
Copy Markdown

@mnlevy1981, @alperaltuntas, I took a look at the difference files that Mike created. While the number of lines is smaller, only 6K lines were added, and 4K lines were deleted, it is still a massive number of changes. I did a quick examination of the changes, and I think I understand approximately 70-80% of the changes. There are some that involve more significant loop transformations and subroutine rearrangements that I would need to spend more time with before I can say that I totally understand what is going on. Honestly, I don't really know how to review something this large. I am advocating for accepting the PR as is if we can still build turbo-stack in OpenACC mode and it passes the CI. As part of the conversion to AMReX, we will be looking at almost every loop. So we will be visiting all the changes that @marshallward and NCI Australia / ACCESS-NRI have made, just not upfront before the PR is accepted.

Copy link
Copy Markdown

@johnmauff johnmauff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a massive PR that involves changes that have been reviewed by colleagues at GFDL and ACCESS AI. While I fundamentally understand ~70-80% of the changes, that still leaves a rather large number of changes that would require additional time. Because I anticipate looking at every loop as part of the AMReX conversion process, I am approving this PR as is.

Copy link
Copy Markdown

@alperaltuntas alperaltuntas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a sanity check, I tested this with TURBO-ESM/turbo-stack#231 and confirmed it's working in GPU offload mode.

@mnlevy1981 mnlevy1981 merged commit 440a38e into TURBO-ESM:dev/turbo May 19, 2026
50 of 52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.