Rewrite vertvisc() and vertvisc_remnant() loops to kji form#912
Merged
Conversation
Member
Author
|
I have some additional memory timings for Intel. Four instances are shown below. There is a slight increase in memory time, although not by much. Roughly, time in I don't think we need to be terribly worried about this. But we should probably consider this metric in similar future PRs. |
The jki loops in vertvisc() have been reordered to kji. The solver increases the number of concurrent tridiagonal solvers from Ni to Ni*Nj. Two other changes contributed to performance * Moving diagnostics (e.g. ADp%du_dt_str) outside of loops * Conditional computing of Ray() when visc%Ray_[uv] is set Not all optimizations of this sort were applied, and should be reviwed in relevant experiments. This showed a modest performance improvement on CPUs. Three instances are shown below. * mom_vert_friction_mp_vertvisc_: 0.583s, 0.629s (-7.3%) * mom_vert_friction_mp_vertvisc_: 0.576s, 0.634s (-9.2%) * mom_vert_friction_mp_vertvisc_: 0.583s, 0.636s (-8.3%) This patch uses nested do loops since we have not yet adoped do concurrent loop constructs. But a future do concurrent form shows even greater speedup, e.g. * mom_vert_friction_mp_vertvisc_: 0.258s, 0.539s (-52.2%) The work in this PR will prepare this module for porting to GPUs. Co-authored-by: Edward Yang <edward_yang_125@hotmail.com>
As with vertvisc(), this patch rewrites the vertvisc_remnant() tridiagonal solvers to run in kji order, with even greater benefits to runtime. Three instances are shown below. Speedup is about 1.3-1.4x. * mom_vert_friction_mp_vertvisc_remnant_: 0.939s, 1.241s (-24.3%) * mom_vert_friction_mp_vertvisc_remnant_: 0.935s, 1.265s (-26.0%) * mom_vert_friction_mp_vertvisc_remnant_: 0.910s, 1.258s (-27.7%) As before, only the diagnoal array (b1) was promoted to 3d. As with vertvisc() this change is expected to be highly favorable to GPU performance.
7082094 to
fb16a2d
Compare
Hallberg-NOAA
approved these changes
Jun 3, 2025
Member
Hallberg-NOAA
left a comment
There was a problem hiding this comment.
I have examined these proposed changes, and I am convinced that they are correct and improve the readability of the code, and moreover are likely to be more efficient across a range of computers. I am happy to accept this PR, pending successful results from the pipeline testing.
Member
|
This PR has passed pipeline testing at https://gitlab.gfdl.noaa.gov/ogrp/mom6ci/MOM6/-/pipelines/27665. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This patch rewrites the tridiagonal solvers of
vertvisc()andvertvisc_remnant()to kji-form, increasing the concurrency over j-points.Overall runtime of vertical friction is reduced by about 5-6%.
The
vertvisc()runtime is reduced by about 8%.vertvisc_remnant()is reduced by about 25-30%.Only one new 3d was required. Several 1d arrays were promoted to 2d, or were reshaped to ij.
Some speedups were due to movement of diagnostics outside of the main tridiagonal loops, which enabled vectorization. Another speedup was due to conditionally populating the Rayleigh drag
Ray.Speedups are much higher if the loops are changed to
do concurrent(e.g. 2x speedup invertvisc) but this will be handled in a separate PR.This new loop form is favorable to GPUs, and is part of the preparation for porting MOM6 to GPU platforms.
This PR is based on an earlier draft by @edoyango developed for GPU migration.