
Rewrite vertvisc() and vertvisc_remnant() loops to kji form #912

Merged
Hallberg-NOAA merged 2 commits into NOAA-GFDL:dev/gfdl from marshallward:vertvisc_kji
Jun 4, 2025

@marshallward
Member

This patch rewrites the tridiagonal solvers of vertvisc() and vertvisc_remnant() to kji-form, increasing the concurrency over j-points.
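
As a rough illustration of the reordering (not the code in this patch), the sketch below solves one tridiagonal system per (i,j) column with the Thomas algorithm in both loop orders. The module name, `solve_jki`, `solve_kji`, and the work arrays `bet` and `gam` are hypothetical; the real solver uses its own variables and coefficients. In the jki form the work arrays only span one row, so each sweep advances Ni columns together; in the kji form they are promoted by one dimension and each k level advances all Ni*Nj columns.

```fortran
module tridiag_loop_order_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  private
  public :: solve_jki, solve_kji

contains

  !> jki form: one j row at a time, so only ni columns are advanced per sweep.
  subroutine solve_jki(a, b, c, u)
    real(real64), intent(in)    :: a(:,:,:), b(:,:,:), c(:,:,:) !< Sub-, main and super-diagonals
    real(real64), intent(inout) :: u(:,:,:) !< Right-hand side on entry, solution on exit
    real(real64) :: bet(size(u,1))            ! Pivot, 1d over i
    real(real64) :: gam(size(u,1), size(u,3)) ! Elimination factors, 2d over i,k
    integer :: i, j, k, ni, nj, nz

    ni = size(u,1) ; nj = size(u,2) ; nz = size(u,3)
    do j=1,nj
      do i=1,ni
        bet(i) = b(i,j,1)
        u(i,j,1) = u(i,j,1) / bet(i)
      enddo
      ! Forward elimination, top to bottom
      do k=2,nz ; do i=1,ni
        gam(i,k) = c(i,j,k-1) / bet(i)
        bet(i) = b(i,j,k) - a(i,j,k) * gam(i,k)
        u(i,j,k) = (u(i,j,k) - a(i,j,k) * u(i,j,k-1)) / bet(i)
      enddo ; enddo
      ! Back substitution, bottom to top
      do k=nz-1,1,-1 ; do i=1,ni
        u(i,j,k) = u(i,j,k) - gam(i,k+1) * u(i,j,k+1)
      enddo ; enddo
    enddo
  end subroutine solve_jki

  !> kji form: k is outermost, so every level advances all ni*nj columns.
  subroutine solve_kji(a, b, c, u)
    real(real64), intent(in)    :: a(:,:,:), b(:,:,:), c(:,:,:) !< Sub-, main and super-diagonals
    real(real64), intent(inout) :: u(:,:,:) !< Right-hand side on entry, solution on exit
    real(real64) :: bet(size(u,1), size(u,2))            ! Pivot, promoted to 2d
    real(real64) :: gam(size(u,1), size(u,2), size(u,3)) ! Factors, promoted to 3d
    integer :: i, j, k, ni, nj, nz

    ni = size(u,1) ; nj = size(u,2) ; nz = size(u,3)
    do j=1,nj ; do i=1,ni
      bet(i,j) = b(i,j,1)
      u(i,j,1) = u(i,j,1) / bet(i,j)
    enddo ; enddo
    ! Forward elimination, top to bottom
    do k=2,nz ; do j=1,nj ; do i=1,ni
      gam(i,j,k) = c(i,j,k-1) / bet(i,j)
      bet(i,j) = b(i,j,k) - a(i,j,k) * gam(i,j,k)
      u(i,j,k) = (u(i,j,k) - a(i,j,k) * u(i,j,k-1)) / bet(i,j)
    enddo ; enddo ; enddo
    ! Back substitution, bottom to top
    do k=nz-1,1,-1 ; do j=1,nj ; do i=1,ni
      u(i,j,k) = u(i,j,k) - gam(i,j,k+1) * u(i,j,k+1)
    enddo ; enddo ; enddo
  end subroutine solve_kji

end module tridiag_loop_order_sketch
```

Both routines perform the same arithmetic for each column, so they should give matching answers; only the amount of work exposed per loop nest and the size of the work arrays differ.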

Overall runtime of vertical friction is reduced by about 5-6%.

    (Ocean vertical viscosity):   5.319s,   5.652s (-5.9%)
    (Ocean vertical viscosity):   5.416s,   5.713s (-5.2%)
    (Ocean vertical viscosity):   5.371s,   5.689s (-5.6%)
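
The percentages are consistent with a relative change of the first column against the second. That reading, with the first column as the new time $t_{\text{new}}$ and the second as the baseline $t_{\text{old}}$, is inferred from the numbers rather than stated here, and the symbols are introduced only for this check. For the first row:

$$
\frac{t_{\text{new}} - t_{\text{old}}}{t_{\text{old}}} = \frac{5.319 - 5.652}{5.652} \approx -0.059 = -5.9\%
$$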

The vertvisc() runtime is reduced by about 8%.

    mom_vert_friction_mp_vertvisc_:   0.583s,   0.629s (-7.3%)
    mom_vert_friction_mp_vertvisc_:   0.576s,   0.634s (-9.2%)
    mom_vert_friction_mp_vertvisc_:   0.583s,   0.636s (-8.3%)

The vertvisc_remnant() runtime is reduced by about 24-28%.

    mom_vert_friction_mp_vertvisc_remnant_:   0.939s,   1.241s (-24.3%)
    mom_vert_friction_mp_vertvisc_remnant_:   0.935s,   1.265s (-26.0%)
    mom_vert_friction_mp_vertvisc_remnant_:   0.910s,   1.258s (-27.7%)

Only one new 3d array was required. Several 1d arrays were promoted to 2d, or were reshaped to ij.

Some speedups were due to movement of diagnostics outside of the main tridiagonal loops, which enabled vectorization. Another speedup was due to conditionally populating the Rayleigh drag Ray.
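
The sketch below shows one plausible reading of these two changes, using a deliberately simplified implicit Rayleigh damping step rather than the real solver. The routine `damp_velocity` and its arguments `ray` and `du_dt` are hypothetical stand-ins, not MOM6 names. The diagnostic is handled in separate passes before and after the main loop, so the loop body has no branches, and the drag field is only read when it is actually provided.

```fortran
module hoisted_diag_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  private
  public :: damp_velocity

contains

  !> Implicit Rayleigh damping, u <- u / (1 + dt*ray), with an optional tendency diagnostic.
  subroutine damp_velocity(u, dt, ray, du_dt)
    real(real64), intent(inout) :: u(:,:,:) !< Velocity
    real(real64), intent(in)    :: dt       !< Time step
    real(real64), optional, intent(in)  :: ray(:,:,:)   !< Hypothetical drag rate, absent if unused
    real(real64), optional, intent(out) :: du_dt(:,:,:) !< Hypothetical tendency diagnostic
    integer :: i, j, k

    ! Diagnostic bookkeeping sits outside the main loop: stash the old velocity.
    if (present(du_dt)) du_dt(:,:,:) = u(:,:,:)

    ! The drag is only touched, and the loop only run, when a drag field exists.
    if (present(ray)) then
      do k=1,size(u,3) ; do j=1,size(u,2) ; do i=1,size(u,1)
        u(i,j,k) = u(i,j,k) / (1.0_real64 + dt * ray(i,j,k))
      enddo ; enddo ; enddo
    endif

    ! Turn the stashed value into a tendency in its own pass.
    if (present(du_dt)) du_dt(:,:,:) = (u(:,:,:) - du_dt(:,:,:)) / dt
  end subroutine damp_velocity

end module hoisted_diag_sketch
```

Keeping the `if` tests at the loop-nest level rather than inside the innermost loop is what leaves the body simple enough for a compiler to vectorize; the cost is an extra pass over the array when the diagnostic is requested.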

Speedups are much higher if the loops are changed to do concurrent (e.g. 2x speedup in vertvisc) but this will be handled in a separate PR.
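
For reference, a minimal sketch of what a do concurrent version of a kji tridiagonal solve could look like is given below. It is not the code planned for the follow-up PR; the module name, `solve_kji_dc`, and the work arrays `bet` and `gam` are hypothetical. The k loops stay sequential because each level depends on the one before it, while the (i,j) iterations at a given level are independent and can be declared as such.

```fortran
module tridiag_do_concurrent_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  private
  public :: solve_kji_dc

contains

  !> Thomas algorithm over all (i,j) columns, with do concurrent at each k level.
  subroutine solve_kji_dc(a, b, c, u)
    real(real64), intent(in)    :: a(:,:,:), b(:,:,:), c(:,:,:) !< Sub-, main and super-diagonals
    real(real64), intent(inout) :: u(:,:,:) !< Right-hand side on entry, solution on exit
    real(real64) :: bet(size(u,1), size(u,2))            ! Pivot for each column
    real(real64) :: gam(size(u,1), size(u,2), size(u,3)) ! Elimination factors
    integer :: i, j, k, ni, nj, nz

    ni = size(u,1) ; nj = size(u,2) ; nz = size(u,3)

    do concurrent (j=1:nj, i=1:ni)
      bet(i,j) = b(i,j,1)
      u(i,j,1) = u(i,j,1) / bet(i,j)
    enddo

    ! Forward elimination: sequential in k, independent across (i,j).
    do k=2,nz
      do concurrent (j=1:nj, i=1:ni)
        gam(i,j,k) = c(i,j,k-1) / bet(i,j)
        bet(i,j) = b(i,j,k) - a(i,j,k) * gam(i,j,k)
        u(i,j,k) = (u(i,j,k) - a(i,j,k) * u(i,j,k-1)) / bet(i,j)
      enddo
    enddo

    ! Back substitution: sequential in k, independent across (i,j).
    do k=nz-1,1,-1
      do concurrent (j=1:nj, i=1:ni)
        u(i,j,k) = u(i,j,k) - gam(i,j,k+1) * u(i,j,k+1)
      enddo
    enddo
  end subroutine solve_kji_dc

end module tridiag_do_concurrent_sketch
```

Each iteration only reads and writes its own (i,j) entries, which is the property do concurrent requires. Whether a compiler turns this into vector, threaded, or offloaded code depends on the compiler and its flags.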

This new loop form is favorable to GPUs, and is part of the preparation for porting MOM6 to GPU platforms.

This PR is based on an earlier draft by @edoyango developed for GPU migration.

@marshallward
Member Author

I have some additional memory timings for Intel. Four instances are shown below.

There is a slight increase in memory time, although not by much. Roughly, time in memset has displaced time in memcpy.

    __intel_avx_rep_memset:   2.078s,   1.833s (13.3%)
    __intel_avx_rep_memcpy:   1.240s,   1.390s (-10.8%)
    __intel_avx_rep_memset:   2.062s,   1.812s (13.8%)
    __intel_avx_rep_memcpy:   1.312s,   1.411s (-7.1%)
    __intel_avx_rep_memset:   2.056s,   1.844s (11.5%)
    __intel_avx_rep_memcpy:   1.278s,   1.410s (-9.4%)
    __intel_avx_rep_memset:   2.065s,   1.859s (11.1%)
    __intel_avx_rep_memcpy:   1.357s,   1.304s (4.0%)

I don't think we need to be terribly worried about this. But we should probably consider this metric in similar future PRs.

Comment thread src/parameterizations/vertical/MOM_vert_friction.F90 Outdated
Comment thread src/parameterizations/vertical/MOM_vert_friction.F90 Outdated
Comment thread src/parameterizations/vertical/MOM_vert_friction.F90 Outdated
marshallward and others added 2 commits June 3, 2025 18:17
The jki loops in vertvisc() have been reordered to kji.  The reordering
increases the number of concurrent tridiagonal solves from Ni to Ni*Nj.

Two other changes contributed to performance:

* Moving diagnostics (e.g. ADp%du_dt_str) outside of loops
* Conditional computing of Ray() when visc%Ray_[uv] is set

Not all optimizations of this sort were applied; the remaining ones
should be reviewed in relevant experiments.

This showed a modest performance improvement on CPUs.  Three instances
are shown below.

* mom_vert_friction_mp_vertvisc_:   0.583s,   0.629s (-7.3%)
* mom_vert_friction_mp_vertvisc_:   0.576s,   0.634s (-9.2%)
* mom_vert_friction_mp_vertvisc_:   0.583s,   0.636s (-8.3%)

This patch uses nested do loops since we have not yet adopted do
concurrent loop constructs.  But a future do concurrent form shows even
greater speedup, e.g.

* mom_vert_friction_mp_vertvisc_:   0.258s,   0.539s (-52.2%)

The work in this PR will prepare this module for porting to GPUs.

Co-authored-by: Edward Yang <edward_yang_125@hotmail.com>

As with vertvisc(), this patch rewrites the vertvisc_remnant()
tridiagonal solvers to run in kji order, with even greater benefits to
runtime.  Three instances are shown below.  Speedup is about 1.3-1.4x.

* mom_vert_friction_mp_vertvisc_remnant_:   0.939s,   1.241s (-24.3%)
* mom_vert_friction_mp_vertvisc_remnant_:   0.935s,   1.265s (-26.0%)
* mom_vert_friction_mp_vertvisc_remnant_:   0.910s,   1.258s (-27.7%)

As before, only the diagonal array (b1) was promoted to 3d.

As with vertvisc(), this change is expected to be highly favorable to
GPU performance.
Member

@Hallberg-NOAA left a comment


I have examined these proposed changes, and I am convinced that they are correct and improve the readability of the code, and moreover are likely to be more efficient across a range of computers. I am happy to accept this PR, pending successful results from the pipeline testing.

@Hallberg-NOAA added the enhancement (New feature or request) and refactor (Code cleanup with no changes in functionality or results) labels Jun 3, 2025
@Hallberg-NOAA
Member

This PR has passed pipeline testing at https://gitlab.gfdl.noaa.gov/ogrp/mom6ci/MOM6/-/pipelines/27665.

@Hallberg-NOAA merged commit 45699c5 into NOAA-GFDL:dev/gfdl Jun 4, 2025
52 checks passed
@marshallward deleted the vertvisc_kji branch November 18, 2025 18:09


2 participants