Merge devgpu to turbo#20
Conversation
SHALLOW_ALE_RESOLUTION implements a HYBGEN-style Z-sigma-Z near surface fixed coordinate for HYCOM1. For example the US Navy's GOFS 3.1 HYCOM setup has 41 layers, with the top 14 layers in a Z-sigma-Z configuration. For MOM6 HYCOM1 this is: SHALLOW_ALE_RESOLUTION = 14*1.0,27*0.0 for 14 1m "shallow" layers. Let N_SIGMA be the number of consecutive non-zero entries, typically < NK. When rest depth is shallower than SUM(SHALLOW_ALE_RESOLUTION(1:N_SIGMA)) use SHALLOW_ALE_RESOLUTION. When rest depth is deeper than SUM(SHALLOW_ALE_RESOLUTION(1:N_SIGMA)) use ALE_RESOLUTION. Otherwise use a linear sum of the two weighted by rest depth. The default of all zeros turns this option off, and when off answers are unchanged. The new parmeter SHALLOW_ALE_RESOLUTION is only present when using HYCOM1.
The 2-d REAL map array in HYBRID_MAP usually contains integer values each referencing one profile. It can instead contain non-integer values of the form I+frac, which indicate a weighted sum of profiles: (1-frac) p(I) + (frac) p(I+1). The same profile can be used multiple times, e.g. if 1st profile is also 4th can get profiles between 1 and 2 and between 1 and 3. HYBRID_3D is more general, but HYBRID_MAP covers most practical uses.
Added the new runtime option RESCALE_STRONG_DRAG, that can be set to true to reduce the barotropic contribution to the layer accelerations to account for the difference between the forces that can be counteracted by the stronger drag with BT_STRONG_DRAG and the average of the layer viscous remnants after a baroclinic timestep. In testing, this new capability eliminates some of the growing instabilities that can occur with an ice shelf and BT_STRONG_DRAG set to true. This commit also adds new diagnostics of the barotropic step viscous remnants and the eta anomalies contributing to barotropic pressure forces, either averaged over the barotropic step or at each barotropic step. By default all answers are bitwise identical, but there is a new runtime parameter and 4 new diagnostics.
mom-ocean#967) * Add option to horizontally homogenize the Stokes drift when used via the dataoverride surfbands procedure. * Add variable description in new method for horizontally averaging Stokes drift. --------- Co-authored-by: brandon.reichl <brandon.reichl@noaa.gov>
Corrected a horizontal indexing bug in the calculation of the CAv_Stokes diagnostic, making it rotationally consistent and consistent with the calculation of CAu_Stokes. This bug has been there since the CAv_Stokes diagnostic was originally added. The loop range over which qS is calculated was also reduced to the range over which it is used. All solutions are bitwise identical, but this commit does change the values of a (perhaps infrequently used) diagnostic.
The interpreter directive ("shebang") of makedep is updated to
`python3`, rather than the version-agnostic `python`.
Although we never invoke the shebang of the script, there are OS
environments out there which will object to any presence of a
versionless python. PEP 394 also strongly recommends the adoption of
python3 as the executable name, regardless of Py2 support.
* present_vhbt_or_set_bt_cont: merge couple of loops * meridional_flux_thickness: cpu optimize a bit * meridional_flux_adjust: back to jki * set_merid_BT_cont: pull out meridional_flux_adjst * set_meridional_BT_cont: optimize cpu * rm remaining merid/zonal_flux_layer * zonal_flux_layere: improve naming a bit * zonal_flux_layere_OBC: improve naming a bit * improve flux_elem line wraps * optimize data transfers a bit * pass elem not arr * add some missing documentation * fix trailing spaces and missing var docs * add last param docs * meridional_flux_adjust: fix fpe err * fix another fpe * cleanup args for new helper subroutines * clean up enter/exit data * fix passing h twice * meridional_mass_flux: do concurrent * zonal_flux_adjust: use a few 3d tmp arrays to mirror meridional_flux_adjust * move target update out of continuity * initialize pbv%por_face_area[U/V] on GPU * cleanup some transfers from continuity_PPM * clean up a few minor things * zonal/meridional_flux_adjust: use scalar u/v_new * declare vp,up,h_tmp on gpu * remove h update * continuity_PPM: minimize mapping stmts * zonal_flux_adjust: minimize mapping stmts * set_zonal_bt_cont: minimize mapping stmts * merional_flux_adjust: minimize mapping stmts * set_merid_bt_cont: minimize mapping stmts * zonal/meridional_flux_adjust: tmp vars duhdu/dvhdv 3d -> scalar * separate alloc of private variables for gcc * u/vh_aux: 3d->2d * target teams loop recognizable by amdflang * Continuity CS outside of init This moves the Continuity CS to the dycore init function. For some reason, this avoids an answer change with CPU. (Possibly because alloc inside of a function doesn't quite match the CS outside of it?) A few minor data transfers are also added to fix up differences in the chksum log output. * zonal_mass_flux: isolate zonal_flux_layer * zonal_mass_flux: seperate local_specified_BC block loop * zonal_mass_flux: add j dim to tmp vars * zonal_mass_flux: seperate visc_rem_max init loops * zonal_mass_flux: seperate du_min/max_CFL init loops * zonal_mass_flux: separate duhdu/uh_tot_0 init loop * zonal_mass_flux: separate du_min/max_CFL aggress_adjust update case (untested) * zonal_mass_flux: separate du_min/max_CFL non-aggress_adjust update case * zonal_mass_flux: separate du_min/max_CFL non-use_visc_rem update case (untested) * zonal_mass_flux: separate du_min/max_CFL 0-clamp loop * zonal_mass_flux: separate do_I local_specified_BC init loop (untested) * copy zonal_flux_adjust that accepts 2d args * zonal_mass_flux move j loop into zonal_flux_adjust copy * zonal_flux_adjust_fused: use 2d internal arr * zonal_flux_adjusted_fused: separate all loops * zonal_mass_flux: separate u/du_cor update (untested) * zonal_mass_flux: replicate former present(uhbt) control flow * copy set_zonal_BT_cont with 2d args * zonal_mass_flux: move j-loop into set_zonal_BT_cont_fused * set_zonal_BT_cont_fused: use zonal_flux_adjust_fused * set_zonal_BT_cont_fused: separate init loop * set_zonal_BT_cont_fused: separate duR/L init loop * set_zonal_BT_cont_fused: separate u_0/L/R init loop * set_zonal_BT_cont_fused: use zonal_flux_layer_fused * set_zonal_BT_cont_fused: separate last 2 loops * copy merid_flux_layer that accepts entire arrs * merid_flux_layer_fused: separate loops * merididional_mass_flux: separate local_specified_BC loop (untested) * merididional_mass_flux: separate tmp variable init loops * merididional_mass_flux: separate dv_min/max_CFL calc loops * merididional_mass_flux: separate dv_min/max_CFL 0 clamp loop * merididional_mass_flux: separate simple_OBC_pt init loop (untested) * copy meridional_flux_adjust that accepts entire arrs * meridional_mass_flux: separate meridional_flux_adjust_fused * meridional_mass_flux: move j-loop into meridional_flux_adjust_fused * meridional_flux_adjust_fused: separate vh_aux,dvhdv init loops * meridional_flux_adjust_fused: 1d->2d tmp arrs * meridional_flux_adjust_fused: separate all arrs * meridional_mass_flux: separate (d)v_cor assignment loops * copy set_merid_BT_cont that accepts entire arrs * meridional_mass_flux: move j-loop into set_merid_BT_cont_fused * set_merid_BT_cont_fused: use meridional_flux_adjust_fused * set_merid_BT_cont_fused: separate tmp var init * set_merid_BT_cont_fused: separate short circuit loop * set_merid_BT_cont_fused: rm redundant k loop in short circuit * set_merid_BT_cont_fused: separate dvL/R init loop * set_merid_BT_cont_fused: make remaining tmp arrs 3d * set_merid_BT_cont_fused: use merid_flux_layer_fused * set_merid_BT_cont_fused: separate last loop * meridional_mass_flux: separate any_simple_OBC loop * zonal_edge_thickness: move k loop->PPM_reconstruction_x * meridional_edge_thickness: move k loop->PPM_reconstruction_y * PPM_reconstruction_x/y: move k loop->PPM_limit_pos * PPM_reconstruction_x/y: move k loop->PPM_limit_cw84 (untested) * zonal_BT_mass_flux: separate all loops (untested) * meridional_BT_mass_flux: separate all loops (untested) * set_zonal_BT_cont_fused: clean up var defs * use SZJB_(G) in do_I dclrns in merid* subroutines * set_merid_BT_cont_fused: clean up var defs * set_merid/zonal_BT_cont_fused -> set_merid/zonal_BT_cont * zonal_flux_adjust_fused: clean up var defs * zonal_flux_adjust_fused -> zonal_flux_adjust * zonal_flux_layer_fused -> zonal_flux_layer * meridional_flux_adjust_fused: clean up var defs * meridional_flux_adjust_fused -> meridional_flux_adjust * merid_flux_layer_fused: clean up var defs * merid_flux_layer_fused -> merid_flux_layer * fix call merid/zonal_flux_layer line conts * zonal_mass_flux: use visc_rem_u_tmp * zonal_mass_flux: separate du_min/max_CFL non-use_visc_rem update case (untested) * copy set_zonal_BT_cont with 2d args * meridional_mass_flux: use visc_rem_v_tmp * copy meridional_flux_adjust that accepts entire arrs * copy set_merid_BT_cont that accepts entire arrs * set_merid_BT_cont_fused: rm redundant k loop in short circuit * set_merid_BT_cont_fused: rm redundant k loop in short circuit * zonal_flux_adjust_fused -> zonal_flux_adjust * meridional_flux_adjust_fused -> meridional_flux_adjust * zonal_flux_layer: move GV in var dclrn * zonal/meridional_mass_flux: remove old visc_rem vars * remove redundant kloop * remove problematic omp directives * port PPM_limit_pos/CW84, PPM_reconstruction_x, zonal_edge_thickness * PPM_reconstruction_x: add enter/exit data stmts * continuity_zonal_convergence: save loop range for porting convenience * port continuity_zonal_convergence * zonal_mass_flux: array init -> loop init * zonal_mass_flux: port init loops * port zonal_flux_layer * zonal_flux_layer: add enter/exit data stmts * port zonal_flux_adjust * port set_zonal_bt_cont * zonal_flux_adjust: add more arrs in enter/exit data * set_zonal_BT_cont: add enter/exit data stmts * port zonal_flux_thickness loops * zonal_flux_thickness add enter/exit data stmts * zonal_mass_flux prepare obc for porting * zonal_mass_flux: merge any_somple_OBC loops * zonal_mass_flux port main loops * zonal_mass_flux port OBC loops * zonal_mass_flux: add enter/exit data stmts * continuity_ppm: add initial enter/exit data stmts * port PPM_reconstruction_y, meridional_edge_thickness * port continuity_merdional_convergence * ppm_reconstruction_y: add enter/exit data stmts * port merid_flux_layer * meridional_flux_adjust: port loops * meridional_flux_adjust: add enter/exit data stmts * meridional_mass_flux: port core loops * meridional_mass_flux: port OBC loops (untested) * meridional_flux_thickness: port core loops * meridional_flux_thickness: attempt port OBC loops (untested) * meridional_flux_thickness: add enter/exit data stmts * port set_merid_BT_cont loops * set_merid_bt_cont: add enter/exit data stmts * meridional_mass_flux: add enter/exit data stmts * zonal/meridional_mass_flux: add missing vars in enter/exit stmts * continuity_PPM: complete enter/exit data stmts * *_edge_thickness: do concurrent * zonal_mass_flux: do concurrent-ify * set_zonal_BT_cont: do concurrent-ify * zonal_flux_adjust: do concurrent-ify * zonal_flux_layer: do concurrent * zonal_flux_thickness: do concurrent * continuity_zonal_convergence: do concurrent * meridional_mass_flux: do concurrent * set_merid_bt_cont: do concurrent * meridional_flux_adjust: do concurrent * merid_flux_layer: do concurrent * meridional_flux_thickness: do concurrent * continuity_merdional_convergence: do concurrent * formatting * continuity_PPM: update LB * meridional/zonal_flux_adjust: separate ij-reduction * zonal_flux_adjust: a couple jki loops * set_zonal_bt_cont: some jki loops * zonal_mass_flux: some jki loops * optimise loops by using scalar zonal_flux_layer * zonal_flux_adjust: duhdu -> scalar * elemental zonal_flux_layere and separated OBC * zonal_flux_adjust: back to jki * zonal_flux_thickness: improve cpu perf a bit * zonal_flux_adjust: add some comments and guard early exit * zonal_flux_adjust: use omp target loop for private arrs * set_zonal_BT_cont: use omp target loop for private arrs * set_zonal_BT_cont: do conc jki loop * zonal_mass_flux: rm useless do_i init * zonal_flux_layere: precalc g_dy_Cu*por_face_areaU * zonal_flux_layere: precalc dh * zonal_flux_thickness: precalc dh * zonal_flux_adjust: make uhbt optional * zonal_flux_thickness: assign to outputs directly * mv zonal_flux_adjust from set_zonal_bt_cont->zonal_mass_flux * zonal_mass_flux: reuse visc_rem_u_tmp more * zonal_mass_flux: remove redundant if stmt * zonal_mass_flux: force inline zonal_flux_layere * zonal_flux_layere: inlinable gfortran -O3 but slower for ifort * zonal_flux_layere: make a bit "smaller" * zonal_mass_flux: move present(uhbt) or set_BT_cont to new subroutine * Revert "zonal_flux_layere: make a bit "smaller"" This reverts commit 316152e. * Revert "zonal_flux_layere: inlinable gfortran -O3 but slower for ifort" This reverts commit 51ee896. * fix gcc line truncation * pass doxygen tests * remove forceinline dirs * attempt document vars * a couple long lines * last trailing space * meridional_mass_flux: use zonal_flux_layere * zonal_flux_layere_OBC: make elemental * meridional_mass_flux: move dv_min/max_CFL calc into j-loop * meridional_mass_flux: move big chunk into separate subroutine * present_vhbt_or_set_bt_cont: merge couple of loops * meridional_flux_thickness: cpu optimize a bit * meridional_flux_adjust: back to jki * set_merid_BT_cont: pull out meridional_flux_adjst * set_meridional_BT_cont: optimize cpu * rm remaining merid/zonal_flux_layer * zonal_flux_layere: improve naming a bit * zonal_flux_layere_OBC: improve naming a bit * improve flux_elem line wraps * optimize data transfers a bit * pass elem not arr * add some missing documentation * fix trailing spaces and missing var docs * add last param docs * meridional_flux_adjust: fix fpe err * fix another fpe * cleanup args for new helper subroutines * clean up enter/exit data * fix passing h twice * meridional_mass_flux: do concurrent * zonal_flux_adjust: use a few 3d tmp arrays to mirror meridional_flux_adjust * move target update out of continuity * initialize pbv%por_face_area[U/V] on GPU * cleanup some transfers from continuity_PPM * clean up a few minor things * zonal/meridional_flux_adjust: use scalar u/v_new * declare vp,up,h_tmp on gpu * remove h update * continuity_PPM: minimize mapping stmts * zonal_flux_adjust: minimize mapping stmts * set_zonal_bt_cont: minimize mapping stmts * merional_flux_adjust: minimize mapping stmts * set_merid_bt_cont: minimize mapping stmts * zonal/meridional_flux_adjust: tmp vars duhdu/dvhdv 3d -> scalar * separate alloc of private variables for gcc * u/vh_aux: 3d->2d * target teams loop recognizable by amdflang * Continuity CS outside of init This moves the Continuity CS to the dycore init function. For some reason, this avoids an answer change with CPU. (Possibly because alloc inside of a function doesn't quite match the CS outside of it?) A few minor data transfers are also added to fix up differences in the chksum log output. * Continuity: Add locality to do concurrent Do concurrent inside of !$omp target teams loop seems to fail standard openmp tests if locality is not correctly set. This patch adds the correct locality to the four `!$omp target teams loop` directives. The domore argument has also been removed, and replaced with a `.not.any(do_i(:))` test. --------- Co-authored-by: Marshall Ward <marshall.ward@noaa.gov>
update MOM6 to its main repo. 20250818 updating (default parameter changes)
Added the new argument dyn_h_stencil to initialize_dyn_split_RK2 and the other 3 dynamic core initialization routines to return the size of the stencil for thicknesses as used by the dynamic core, depending on the options that are being used for the Coriolis and continuity schemes, and then used this in a set of halo updates in step_MOM_dynamics. With this change some additional halo updates that have recently been added inside of step_MOM_dyn_split_RK2 and the other 3 dynamic core time stepping routines could be eliminated. All answers are bitwise identical, but there is a new argument to 4 public interfaces. Co-authored-by: Marshall Ward <marshall.ward@gmail.com>
Modified the conditions that determine when to increment the count of velocity truncations to use the thickness as interpolated to velocity points to determine whether layers are thick enough to be counted, rather than the arithmetic mean thickness, and only count truncations that occur in the non-symmetric computational domain to avoid double counting. The filtering thicknesses should be very similar in the ocean interior, but they will differ at open boundary condition points. The corrected counting was verified by running the sloshing/layer test case with a maximum CFL set to 0.01 to create lots of truncations, and then verifying that the truncation count is now the same on 1 and 10 PEs, whereas before it was not. All solutions are bitwise identical, but the reported truncation counts in the ocean.stats files can change in multiple-PE cases with velocity truncations.
* Add MOM_ANN module * Mesoscale momentum parameterization with ANN - Computes subgrid stress using ANN in MOM_Zanna_Bolton - Uses MOM_ANN module for ANN inference Equivalent MOM_override for defaults ``` USE_ZB2020 = True ZB2020_USE_ANN = True USE_CIRCULATION_IN_HORVISC = True ZB2020_ANN_FILE_TALL = /path/to/ocean3d/subfilter/FGR3/EXP1/model/Tall.nc ``` * Mesoscale momentum parameterization with ANN (#2) Blank commit after squash/rebase was handled on command line * Moved MOM_ANN.F90 to src/framework/ * Minor refactor of MOM_ANN - Removed unused modules - Removed unused MOM_memory.h - Added input and output means which default to 0 and do not need to be present in the weights file - Gave defaults to means, norms, tests so that they do no need to be present in file - Added missing array notation "(:)" - Minor formatting * Adds unit tests and timing test to MOM_ANN - Added ANN_allocate, set_layer, set_input_normalization, and set_output_normalization methods to allow reconfiguration during unit tests - Added ANN_unit_tests with some simple constructed-by-code networks with known solutions - Added config_src/drivers/unit_tests/test_MOM_ANN.F90 to drive unit tests - Added config_src/drivers/timing_tests/time_MOM_ANN.F90 as rudimentary for timing inference * Adding multiple forms of inference - Adds inference operating on array (instead of single vector of features) - Implements several different versions of inference with various loop orders - Involves storing the transpose of A in the type - Tested by checking inference on same inputs is identical between variants - Added randomizers to assist in unit testing - Adds timing of variants to config_src/drivers/timing/time_MOM_ANN.F90 - Adds an interface (MOM_apply) to select preferred version of inference subroutine - Added command line args to time_MOM_ANN.F90 to allow more rapid evaluation of performance Variants explored, timed with gfortran (13.2) -O3 on Xeon: - vector_v1: - original inference from Pavel - vector_v2: - allocate work arrays just once, using widest layer - loop over layers in 2's to avoid pointer calculations and copies - speed up, x0.8 relative to v1 - vector_v3: - transpose loops - slow down, x1.54 relative to v1 - vector_v4: - transpose weights with same loop order as v1 - slow down, x1.03 relative to v1 - array_v1: - same structure as v2, working on x(space,feature) input/outputs - speed up, x0.41 relative to v1 - array_v2: - as for array_v1 but with transposed loop order - apply activation function on vector of first index while in cache - speed up, x0.35 relative to v1 - array_v3: - same structure as v2, working on x(feature,space) input/outputs - speed up, x0.58 relative to v1 * Renamed ANN variants and added some module documentation - Added module dox - Renamed _v1, _v2 etc to labels - Added ANN_apply_array_sio to ANN_apply interface - Replaced "flops" with "MBps" in timing output * Removed alternative variants of ANN in favor of optimized - Deleted variants of ANN that did not perform as well as the two versions that remain. * Apply array_sio function in ANN inference for momentum fluxes (#5) * Apply array_sio ANN inference for computation of momentum fluxes * remove trailing space * Initial commit * address Robert Hallberg code review * Restore deafult value of ZB_SCALING coefficient --------- Co-authored-by: Alistair Adcroft <Alistair.Adcroft@noaa.gov> Co-authored-by: Alistair Adcroft <adcroft@users.noreply.github.com>
- fixed the restart trouble
Update 1: The accumulator of FtF is now associated with each FtSSH, instead of shared between all FtSSH, such that both are updated at the same model time steps, eliminating the potential inconsistency when different FtSSH are called at different time steps.
Update 2: HA_register are called inside HA_init. This should be done for all variables/fields.
Update 3: A logical flag is set to control whether harmonic analysis is to be enabled.
Update 4: time_ref and const_name are defined in HA_init, instead of being copied from MOM_tidal_forcing. This commit prepares for the separation of MOM_harmonic_analysis and MOM_tidal_forcing.
Update 5: MOM_harmonic_analysis is now independent of MOM_tidal_forcing, providing more flexibility for performing harmonic analyses on tidal constituents not available in MOM_tidal_forcing (e.g., MK3, M4).
Update 6: The frequencies of 8 overtides/compound tides (MK3, MN4, M4, MS4, S4, M6, S6, M8) have been added for the harmonic analysis.
Fixed the inconsistency for defining the reference time of tides in MOM_tidal_forcing and MOM_open_boundary.
Use the I0 format that was introduced with Fortran 95 in 155 lines scattered across 40 files to simplify or shorten some error messages. In 21 cases, adjustl() calls that are no longer necessary for intended formatting were also eliminated. These changes have the effect of ensuring that there are still appropriate messages if there are, for example, more than 99 vertical layers or 9999 points (total) in a horizontal directions or more than 9999 PEs. In 15 cases, this change allowed for the elimination or reduction of if tests that formatted output based on the size of an integer. All answers are bitwise identical but there there may be some minor formatting changes in some error messages.
This patch fixes two class of index errors in multiple functions of `MOM_vert_friction.F90`: * `j=G%isc,G%jec` had been incorrectly applied to multiple loops. This went undetected because we almost exclusively use local indexing where `G%isc == G%jsc`, but is nonetheless a serious error. Thanks to Jorge Luis Gálvez Vallejo for reporting. * One errant loop in the shelf code had `i=is,je`. This was undetected due to poor ice shelf coverage testing. Thanks to Claire Yung for reporting.
This patch moves the k-column loops inside of ji-layer loops, rather
than outer-k loops of layers.
The primary motivation is to restore performance at high-bandwidth runs,
which were insufficiently tested during development of the k-j-i form.
The inner-column loops show improved performance for both low and
high-bandwidth runs.
The high-bandwidth benchmark case: (128-core, 256x128 x 75 layer)
```
Profile Reference
(Ocean vertical viscosity): 7.158s, 15.047s (-52.4%)
```
The low bandwidth case: (1-core, 32x32 x 75 layer)
```
Profile Reference
(Ocean vertical viscosity): 3.911s, 4.788s (-18.3%)
```
For the GFDL OM5 production configuration at 503, runtimes of the slowest ranks
were reduced in proportion to the high-bandwidth case above.
For the reference dev/gfdl,
```
hits tmin tmax tavg
(Ocean vertical viscosity) 288 4.303819 21.483670 14.452196
```
After apply this patch, times reduce ~40%
```
hits tmin tmax tavg
(Ocean vertical viscosity) 288 0.976130 13.398768 8.689331
```
* Moving to columns allowed for removal of many `do_i` tests, since the test is
applied before starting the loop.
* The `touch_ij` dummy function was removed, since we're no longer trying to
force an IPO optimization.
* The shelf requires a re-calculation of the various thickness averages
(h_arith, etc). These could be saved as 1D if it becomes a problem.
* In addition to the usual regression testing, I also found no regressions in
selected ice shelf configurations.
* Linear wave drag is limited to be only applied to land points, using velocity point masks mask2dC[uv]. * Rayleigh_[uv] calculation and bt_rem_[uv] update from linear wave drag is limited for Htot>0 only. This patch eliminates potential NaN in Rayleigh_[uv] in an unusual scenario that Htot==0.0 and lin_drag_[uv]/=0. The changes do not change answers: bt_rem_[uv] is zero at land points regardless. Rayleigh_[uv] is added to [uv]_accel_bt which is masked before updating velocity.
In MOM_barotropic and non-Boussinesq mode, warning message on negative eta is now only issued at wet points, consistently with Boussinesq.
In MOM_dynamics_split_RK2, now accleration chksum is printed before velocity with debug on, so that we could know which accleration term is responsible for a NaN in velocity.
* Vertvisc: First part of loop Moving gradually * add nvtx ranges and module warpper * port loop to openmp for now -> will move to do concurrent. Focused on porting to GPU now * more gpu * port the vertical velocity to GPUs * port the following vertical velocity loops to GPUs * more loops using omp target teams, validated * fix compute sanitizer death * meridional velcoity component ported * finish porting vertvisc * offload another loop * revert the last thing * add collapse(2) to the expensive loops * fix * fix validation * updates to porting: find coupling coef * almost done, just missingt he killer loop * lable the block of death * duct taped version of all funcs ported to GPUs * back to format and delete block constructs * remove the maps ins remnant yay! * remove need for local vertvisc_u/v variables * death loop of death has been ported * vertvisc coef does not have explicit mappings anymore * the maps are gone! * limit vel using single data transfer * do concurrent * some memory cleanup and code cleanup - profiling time * update and lots of nvtx markers for opt * limit vel betterments and notes for optimization * make limit vel good * h_ml transfer * Vertfrict: Move CS allocations to init Allocations of the vert frictions and visc control structures were moved from the dycore subroutine loops to the model initialization. Redundant grid (G) transfers were removed (although they were not triggering any transfers). Most (but not all) visc arrays have also been moved into the initialization. Kv_shear and Ray_[uv] need to be added. Trailing whitespace also (inadvertently) got removed, hopefully not a distraction. * coupling coeff fixes This appears to improve the CPU/GPU repro of the latest merge of dev/gpu into dev_gpu_vertvisc. * Resolves some of the u/v/h/dz field states between the barotropic and vertvisc functions. (Development of each assumed that the other was not complete). * This seems to fix some issues with a_[uv], Kv_tot, and dz inside of find_coupling_coef and vertvisc coef * chksum transfers were added to ensure consistent * The diff has moved to frhat[uv] in btcalc. The fields are all zero on GPU, so they are probably not transferred. Work in progress... * Remove redundant BT_cont transfer The BT_cont fields are now on GPU, so they don't need to be transferred before the btcalc call. Checksum transfers for h_ml were also added. Still assuming that it's computed on the CPU. * delete nvtcx * test to see if guarding and reintroducing the check helps * an ugly fix for a simple problem: just want to be sure before I dive * spaces * Vertvisc: formatting/style cleanup This patch reduces the formatting changes to this branch and brings it closer to both dev/gfdl and the MOM6 style guide. One minor non-cosmetic change is the reversion of ADp%dv_dt_str(i,J,1) calculation into one of the main Thomas loop. This returns it to a separate loop. * Vert friction: Explicit CS alloc in dycore init This patch removes the NVIDIA compiler preprocessing and explicitly compiles the `CS` in the dynamic core initilization. `vertvisc_init` is no longer responsible for allocation. The allocation test has also been removed. We just have to trust each other now (or segfault when it refs an unallocated field). The previous error in the CI was the absense of `CS` allocation in the other timestep methods (unsplit, unsplit RK, split RK2b). * Vert friction: Data transfer reduction Several modifications to the data transfer statements * Some redundant transfers were removed * Input/output transfers were moved up to the dycore loop * Vert friction: Cosmetic cleanup Remove some redundant comments, and an unused array loop. * Vert friction: nk_in_ml bugfix --------- Co-authored-by: Marshall Ward <marshall.ward@noaa.gov>
Added the new runtime parameter RESOLN_FUNCTION_OBC_BUG that can be set to false to take open boundary conditions into account when calculating the resolution functions at u-, v- or q-points. By default the wave speeds used to calculate resolution functions do not take OBCs into account and all answers are bitwise identical.
Some odd behaviour with num_srt where if it was map alloc'ed upon entering the last big loop, the loop would segfault.
Flattening these arrays halves time when compiling for GPU. Lots of time was being spent "attaching" and "detaching" the member arrays to/from each struct on the GPU. The disadvantage is that now a bit more memory is required - in benchmark, the most elements allocated per array/struct in any timestep has increased from ~700k to ~830k (~20% increase).
NVHPC 25.11 didn't like the early exit and would give wrong answers.
Port loops in find_eta to GPUs. Thanks to Ed Yang (@edoyango) for catching some errors in the the cumulant sums. The ssh_rint loop in the MOM CS was also migrated to GPU. Several derived types inside of MOM CS were redefined as allocatable in order to properly decouple the array from changes in these types.
…tellar A100s with nvfortran 25.5
Improper managemement of domore_k was leading to anwer changes on multiple GPUs. This fixes the issue by proper specification of reductions in the do concurrents that reduce domore_k(k). To make this work, a new tmp scalar was added since do concurrent can't use array elems yet
was map(alloc: Reg%Tr(:)) when it should be map(to... Co-authored-by: Jorge Luis Gálvez Vallejo <jorgegalvez1694@gmail.com>
|
If you reference (or pull from) the latest MOM6-examples Edit: I think I misspoke. The GPU solvers required an update |
|
I think we are expecting answer changes, so it's okay the regression tests failed. I was able to reproduce but don't know if we want to keep those bugs in the code just for testing purposes... |
|
The Also, just to document the process in case we need to do something similar in the future... I did the following:
|
|
@mnlevy1981, @alperaltuntas, I took a look at the difference files that Mike created. While the number of lines is smaller, only 6K lines were added, and 4K lines were deleted, it is still a massive number of changes. I did a quick examination of the changes, and I think I understand approximately 70-80% of the changes. There are some that involve more significant loop transformations and subroutine rearrangements that I would need to spend more time with before I can say that I totally understand what is going on. Honestly, I don't really know how to review something this large. I am advocating for accepting the PR as is if we can still build turbo-stack in OpenACC mode and it passes the CI. As part of the conversion to AMReX, we will be looking at almost every loop. So we will be visiting all the changes that @marshallward and NCI Australia / ACCESS-NRI have made, just not upfront before the PR is accepted. |
johnmauff
left a comment
There was a problem hiding this comment.
This is a massive PR that involves changes that have been reviewed by colleagues at GFDL and ACCESS AI. While I fundamentally understand ~70-80% of the changes, that still leaves a rather large number of changes that would require additional time. Because I anticipate looking at every loop as part of the AMReX conversion process, I am approving this PR as is.
alperaltuntas
left a comment
There was a problem hiding this comment.
As a sanity check, I tested this with TURBO-ESM/turbo-stack#231 and confirmed it's working in GPU offload mode.
This PR brings the latest changes from @marshallward and the NCI Australia / ACCESS-NRI team to dev/turbo. We were pretty far behind
mom-ocean:main, and there are answer changes that came through that branch -- you can reproduce our previous answers by adding the following toMOM_overridebut I don't think we have any need to preserve these specific bugs (perhaps we should setENABLE_BUGS_BY_DEFAULT = Falseand not re-enable any specific bugs?).In my testing with turbo-stack, I needed to update FMS2 to @edoyango's
ompoffloadbranch, so I suspect the CI testing will fail when I first open this PR. I'll link to the necessary FMS PR once it is open (we will also need to update TIM, though that's basically cherry-picking some of Ed's FMS changes). I'll leave this as a draft until the infrastructure changes are in place and the CI testing works.