Small improvements #51

Merged: marshallward merged 10 commits into marshallward:dev/gpu from edoyango:small-improvements on Nov 25, 2025

@edoyango
Collaborator

Hi @marshallward @JorgeG94

This is a pull request that fixes a few small things:

  • ports the find_ustar subroutine to GPU (where relevant to double_gyre)
    • this is called at start of vertvisc and was still on CPU
  • in btstep sends wt_* to GPU after calculating on GPU
    • avoids many scalar transfers in btstep_timeloop
  • download diagnostic variables [u,v]_old only if the diagnostics are turned on
  • fixes a segfault I was getting on V100s and A100s in MOM_PressureForce_FV.
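
The two transfer-related items in the list above can be illustrated with a toy model. This is a minimal sketch, not MOM6 code: `Device`, `upload`, `download`, and the three helper functions are hypothetical names introduced here, and the model assumes each call stands for exactly one host/device copy.

```python
class Device:
    """Hypothetical stand-in for a GPU: it only counts host/device copies."""

    def __init__(self):
        self.uploads = 0
        self.downloads = 0

    def upload(self, values):
        # One call models one host-to-device transfer, whatever the payload size.
        self.uploads += 1
        return list(values)

    def download(self, values):
        # One call models one device-to-host transfer.
        self.downloads += 1
        return list(values)


def step_scalar_transfers(dev, weights):
    """Send each weight separately, once per time-loop iteration."""
    total = 0.0
    for w in weights:
        (w_dev,) = dev.upload([w])
        total += w_dev
    return total


def step_batched_transfer(dev, weights):
    """Send the whole weight array once, before the time loop."""
    w_dev = dev.upload(weights)
    total = 0.0
    for w in w_dev:
        total += w
    return total


def finish_step(dev, u_old, diagnostics_on):
    """Copy a diagnostic field back to the host only when it is requested."""
    return dev.download(u_old) if diagnostics_on else None


weights = [0.1, 0.2, 0.3, 0.4]

dev_a = Device()
sum_a = step_scalar_transfers(dev_a, weights)

dev_b = Device()
sum_b = step_batched_transfer(dev_b, weights)
skipped = finish_step(dev_b, [1.0, 2.0], diagnostics_on=False)

print(dev_a.uploads, dev_b.uploads, dev_b.downloads)
```

In this model the per-iteration version performs one upload per weight, the batched version performs a single upload, and no download happens while diagnostics are off.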

These changes should be compatible with the other pull requests open at the moment.

Feel free to leave this until you're back, Marshall.

Comment thread src/core/MOM_PressureForce_FV.F90
Comment thread src/parameterizations/vertical/MOM_vert_friction.F90
@edoyango
Collaborator Author

rebased on top of other open pull requests

@edoyango
Collaborator Author

edoyango commented Oct 30, 2025

Made sure some of the loops in btstep are submitted as a single kernel. No differences in double_gyre, but I got some nice speedups on benchmark, probably because benchmark has 22 vertical layers versus double_gyre's 2.

# gadi v100 without fused loops
(Ocean barotropic mode stepping)       2.652537
# gadi v100 with new jki loops
(Ocean barotropic mode stepping)       2.056607

I get similar improvements on the A100s at Monash.

Pretty significant given that I only touched a handful of the loops. Not sure how well this scales if we increase domain size laterally though.

vertvisc could probably benefit from a similar treatment. Will look tomorrow.
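
The layer-count argument above can be made concrete with a toy model. This is only a sketch under stated assumptions: it is plain Python rather than the Fortran in btstep, the function names are hypothetical, and a "launch" is just a counter standing in for a GPU kernel submission. It compares a nest with k outermost (one launch per layer) against a fused j/k/i nest submitted once, for the 2 and 22 layers mentioned above.

```python
def make_field(nk, nj, ni):
    """Build a small 3-D field indexed as u[k][j][i]."""
    return [[[float(k + j + i) for i in range(ni)]
             for j in range(nj)]
            for k in range(nk)]


def layer_outer(u, scale):
    """k outermost: each layer is submitted as its own (simulated) kernel."""
    nk, nj, ni = len(u), len(u[0]), len(u[0][0])
    out = [[[0.0] * ni for _ in range(nj)] for _ in range(nk)]
    launches = 0
    for k in range(nk):
        launches += 1  # one simulated kernel launch per layer
        for j in range(nj):
            for i in range(ni):
                out[k][j][i] = scale * u[k][j][i]
    return out, launches


def fused_jki(u, scale):
    """j/k/i nest submitted once: one (simulated) kernel covers all layers."""
    nk, nj, ni = len(u), len(u[0]), len(u[0][0])
    out = [[[0.0] * ni for _ in range(nj)] for _ in range(nk)]
    launches = 1  # the whole nest counts as a single launch
    for j in range(nj):
        for k in range(nk):
            for i in range(ni):
                out[k][j][i] = scale * u[k][j][i]
    return out, launches


results = {}
for nk in (2, 22):
    u = make_field(nk, 3, 4)
    a, n_layer = layer_outer(u, 0.5)
    b, n_fused = fused_jki(u, 0.5)
    assert a == b  # both orderings compute the same values
    results[nk] = (n_layer, n_fused)
    print(nk, n_layer, n_fused)
```

In this model the launch count of the unfused version grows with the number of layers while the fused version stays at one, which is consistent with fusion mattering more for a 22-layer case than a 2-layer one. It says nothing about how the cost scales with lateral domain size.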

Comment thread src/core/MOM_dynamics_split_RK2.F90 Outdated
Comment thread src/core/MOM.F90
@edoyango
Collaborator Author

Fused a few loops in vertical viscosity (204106f).
The speedup was more modest because many expensive loops are still launching multiple kernels per k iteration. I couldn't port these loops because of the OBC loops that are mixed in, e.g. https://github.com/edoyango/MOM6/blob/204106ff305d763e5e5131b6c3bc2a717724d50c/src/parameterizations/vertical/MOM_vert_friction.F90#L2544

on gadi

# no fused loops
(Ocean vertical viscosity)             13.136691

# fused loops
(Ocean vertical viscosity)             12.531298

Not sure I want to spend much more time on fusing loops in vertvisc, since there's a columnar rewrite in dev/gfdl that we'll port eventually.
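
To put the two sets of gadi timer values side by side, here is a short sketch of the arithmetic. The timer values are copied from the comments in this thread. The helper names are hypothetical, and the units are whatever the timers report.

```python
# (before, after) timer values as reported in this thread.
timings = {
    "Ocean barotropic mode stepping": (2.652537, 2.056607),
    "Ocean vertical viscosity": (13.136691, 12.531298),
}


def speedup(before, after):
    """Ratio of the old time to the new time."""
    return before / after


def percent_saved(before, after):
    """Share of the old time that was removed, in percent."""
    return 100.0 * (before - after) / before


for name, (before, after) in timings.items():
    print(f"{name}: {speedup(before, after):.2f}x, "
          f"{percent_saved(before, after):.1f}% less time")
```

This works out to roughly 1.29x (about 22.5% less time) for barotropic stepping and roughly 1.05x (about 4.6% less time) for vertical viscosity.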

@marshallward
Owner

Are you able to rebase this? The content in #48 has been merged.

@marshallward
Owner

The switch to j/k/i could be quite significant for the CPU as we eventually port to dev/gfdl. I'm becoming convinced that it may be the best path forward, but we should probably test these in some production runs before merging.

@edoyango
Collaborator Author

> Are you able to rebase this? The content in #48 has been merged.

Rebased!

@marshallward
Owner

This has effectively become our baseline for performance, so I think it's time to merge this in.

Although the j/k/i swaps do take us away from the live code (dev/gfdl or main), we can come back and sort it out down the road.

There are a lot of commits, but there's also quite a variety of changes, so I'll merge without any squashing.

@marshallward marshallward merged commit 90b162e into marshallward:dev/gpu Nov 25, 2025
52 checks passed
@edoyango edoyango deleted the small-improvements branch March 12, 2026 19:18
4 participants