Skip to content

*Fix PE-layout reproducibility with interior OBCs#916

Merged
Hallberg-NOAA merged 2 commits into
NOAA-GFDL:dev/gfdlfrom
Hallberg-NOAA:fix_interior_OBC_layout_repro
Jun 17, 2025
Merged

*Fix PE-layout reproducibility with interior OBCs#916
Hallberg-NOAA merged 2 commits into
NOAA-GFDL:dev/gfdlfrom
Hallberg-NOAA:fix_interior_OBC_layout_repro

Conversation

@Hallberg-NOAA
Copy link
Copy Markdown
Member

@Hallberg-NOAA Hallberg-NOAA commented Jun 10, 2025

Corrected the previous PE-count and layout dependency of answers for cases with open boundaries are not at the natural edges of the domain by storing the previous barotropic velocities in ubt_prev and vbt_prev over a wider range of loop indices, and by increasing the stencil size used in btstep_timeloop() in such cases. Because the required loop range no longer matches the range over which the new velocities are set in bt_loop_update_u() and bt_loop_update_v(), the code setting these arrays has been moved out of these routines and into the main body of btstep_timeloop(). This commit will change answers (and make them invariant to the PE count and layout) in some cases, such as the ESMG-configs Kelvin_wave/rotate45_BT and Kelvin_wave/rotate_BT test cases, where there are southern open boundaries appearing in the northern halo regions on any PEs, for example. In all the Kelvin_wave test cases, the answers now agree with the previous single-PE answers. Most regional configurations with OBCs only apply them at the edges of the domain and hence will not be impacted, and no answers are changed in cases without open boundary conditions.

In response to issues identified with debugging and demonstrating the validity of this commit (see the conversation), a second commit was subsequently added to this PR. This second commit adds the new runtime parameters BT_WIDE_HALO_MIN_STENCIL and DEBUG_BT_WIDE_HALOS to control the stencil width used with wide halos and to specify whether the checksums for the entire active halo of values are written when DEBUG_BT is enabled, or just the symmetric computational domain. Setting BT_WIDE_HALO_MIN_STENCIL = 2 and DEBUG_BT_WIDE_HALOS = False provides for cleaner checksums in some cases because they are invariant to the use of wide halos, the width of those wide halos, or the relative positions of interior-domain OBCs relative to PE boundaries. This commit also changes a number of velocity point checksums in btstep_timeloop() to do symmetric output. All model solutions are bitwise identical regardless of how these new parameters are set, but they will alter some debugging output.

@Hallberg-NOAA Hallberg-NOAA added bug Something isn't working answer-changing A change in results (actual or potential) labels Jun 10, 2025
@kshedstrom
Copy link
Copy Markdown

Did you test this with different core counts on my test case? I'm still seeing different answers with say 1 vs 3 cores.

@Hallberg-NOAA
Copy link
Copy Markdown
Member Author

Hallberg-NOAA commented Jun 11, 2025

I tested this on the ESMG Kelvin_wave/rotated45_BT test case on 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 18, 20, 21, 24, 25, 27, 28, 30 and 32 PEs and obtained identical answers in each case. (This test case is too small to run with a 4-point halo on 11, 13, 17, 19, 22, 23, 26, 29 or 31 PEs.) Without this fix, the Kelvin_wave/rotated45_BT test case case with an identical set of runtime parameters was often not even running (giving NaN related errors) partway into the run on 3, 5, 7 and some other PE counts.

I also ran all of the standard MOM6-examples test cases and found that no answers changed there. Which test case are you finding is still exhibiting different answers with these code changes, @kshedstrom?

@kshedstrom
Copy link
Copy Markdown

Yes, that test. I can see that hp after the call to continuity is not the same across tiles for a three core job. One and three-core answers do not match.

@Hallberg-NOAA
Copy link
Copy Markdown
Member Author

Judging by the ocean.stats files, the Kelvin_wave/rotate45_BT test case was reproducing across PE count, but after further examination the Kelvin_wave/rotate_BT test case was not reproducing. I have some further revisions that expand the stencil size in the barotropic solver to 2 points when there are reversed OBCs in any halos (e.g., southern OBCs in northern halos). With these further revisions, which I will be adding shortly in a revision to this PR, the Kelvin_wave/rotate_BT test case is now reproducing (still judging by the ocean.stats files).

@Hallberg-NOAA Hallberg-NOAA force-pushed the fix_interior_OBC_layout_repro branch from f8c0b5d to 520f050 Compare June 11, 2025 20:07
@Hallberg-NOAA
Copy link
Copy Markdown
Member Author

I have looked into this further, and with the new code, the presence of reversed OBCs in any halo changes the stencil size and hence the pattern between iterations of the width of the valid tracer points that are being explored with the checksums in the barotropic solver. Whether this stencil size change is triggered depends on the PE count and layout, and it never occurs on 1 PE. So if we are basing the test of reproducibility on the checksums, it can appear to fail, even though the solutions themselves actually are reproducing. Based on my version of the runs, I think that this is what is happening between the 1 and 3 PE versions of the Kelvin_wave/rotate_BT test case that you were running, @kshedstrom .

So we have a few choices.

  1. We could alter the stencil based on whether the reversed OBCs could occur in halos for ANY possible PE count and layout. This would fix the debugging, but it would lead to more halo updates than is necessary. It could also only be applied if debugging is on, but then turning on the debugging would change the pattern of what the model is actually doing.
  2. We could add the option to change how much of the halos we check in the barotropic solver to only check the symmetric memory computational grid and not any of the extra wide halo points. Alternately, we could do this only if there could be reversed OBCs. The debugging messages would then pass in this mode, but they could miss the source of answer changes if they first arise in the wide halos.
  3. We could just live with false positives in the debugging messages and rely on the restarts or ocean.stats files to determine whether answers are reproducing.

@Hallberg-NOAA Hallberg-NOAA added the Parameter change Input parameter changes (addition, removal, or description) label Jun 13, 2025
@Hallberg-NOAA
Copy link
Copy Markdown
Member Author

I have updated this PR to add two new runtime parameters that implement key aspects of both solutions 1 (optionally alter the stensil) and 2 (optionally restrict the range of the checksums) in using the checksums that are triggered by the use of the DEBUG = True and DEBUG_BT = True settings. @marshallward convinced me that option 3 (ignore the problem) was a very bad idea. @kshedstrom, with this newly revised PR, if you were to do your runs with either or both of BT_WIDE_HALO_MIN_STENCIL = 2 and DEBUG_BT_WIDE_HALOS = False, I strongly suspect that you would find that the checksums for the Kelvin_wave/rotate_BT test case will reproduce between 1 and 3 PEs (or other PE counts that are small enough to work with the small size of this test case).

@Hallberg-NOAA Hallberg-NOAA force-pushed the fix_interior_OBC_layout_repro branch from 637343a to afc4256 Compare June 13, 2025 19:23
@Hallberg-NOAA
Copy link
Copy Markdown
Member Author

Hallberg-NOAA commented Jun 13, 2025

This revised version appears to have some regression failures because the velocity-point land mask is now being applied to the calculation of some accelerations (see lines 3491 and 3497 of the revised file), and even though these accelerations are multiplied by the land masks before they are used, the diagnostics of these accelerations were not being multiplied by the land masks.

@Hallberg-NOAA Hallberg-NOAA force-pushed the fix_interior_OBC_layout_repro branch from afc4256 to e0ad6a0 Compare June 14, 2025 07:44
@Hallberg-NOAA
Copy link
Copy Markdown
Member Author

I removed the masks from the calculation of accel_layer_u and accel_layer_v to preserve the previous diagnostics and pass the regression testing, even though this leads to isolated differences across PE count in checksums on these arrays in some cases, essentially because without the masks the pressure gradient accelerations at coastlines depend on values in land. I intend to address this issue more broadly in a subsequent PR.

@kshedstrom
Copy link
Copy Markdown

I can now get the same answer across 1, 3, 5 cores on the rotated Kelvin wave tests. I didn't need to change BT_WIDE_HALO_MIN_STENCIL from the default 0.

@theresa-cordero
Copy link
Copy Markdown

Is this ready to be approved Kate?

  Corrected the previous PE-count and layout dependency of answers for cases
with open boundaries are not at the natural edges of the domain by storing the
previous barotropic velocities in ubt_prev and vbt_prev over a wider range of
loop indices, and by increasing the stencil size used in btstep_timeloop in such
cases.  Because the required loop range no longer matches the range over which
the new velocities are set in bt_loop_update_u() and bt_loop_update_v(), the
code setting these arrays has been moved out of these routines and into the main
body of btstep_timeloop().  This commit will change answers (and make them
invariant to the PE count and layout) in some cases, such as the ESMG-configs
Kelvin_wave/rotate45_BT and Kelvin_wave/rotate_BT test cases, where there are
southern open boundaries appearing in the northern halo regions on any PEs, for
example.  In all the Kelvin_wave test cases, the answers now agree with the
previous single-PE answers.  Most regional configurations with OBCs only apply
them at the edges of the domain and hence will not be impacted, and no answers
are changed in cases without open boundary conditions.
  This commit adds the new runtime parameters BT_WIDE_HALO_MIN_STENCIL and
DEBUG_BT_WIDE_HALOS to control the stencil width used with wide halos and to
specify whether the checksums for the entire active halo of values are written
when DEBUG_BT is enabled, or just the symmetric computational domain.  Setting
`BT_WIDE_HALO_MIN_STENCIL = 2` and `DEBUG_BT_WIDE_HALOS = False` provides for
cleaner checksums in some cases because they are invariant to the use of wide
halos, the width of those wide halos, or the relative positions of interior-
domain OBCs relative to PE boundaries.  This commit also changes a number of
velocity point checksums in btstep_timeloop to do symmetric output.  All model
solutions are bitwise identical regardless of how these new parameters are set,
but they can alter some debugging output.
@kshedstrom
Copy link
Copy Markdown

Yes, I approve now.

@Hallberg-NOAA Hallberg-NOAA force-pushed the fix_interior_OBC_layout_repro branch from e0ad6a0 to 61a852e Compare June 16, 2025 21:54
Copy link
Copy Markdown

@theresa-cordero theresa-cordero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved on behalf of Kate.
Thanks Kate, Yi-Cheng, and Bob for all that went into this PR!

@Hallberg-NOAA
Copy link
Copy Markdown
Member Author

This PR has passed pipeline testing at https://gitlab.gfdl.noaa.gov/ogrp/mom6ci/MOM6/-/pipelines/27795 with the expected warnings about new parameters in the MOM_parameter_doc files.

@Hallberg-NOAA Hallberg-NOAA merged commit 7633554 into NOAA-GFDL:dev/gfdl Jun 17, 2025
52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

answer-changing A change in results (actual or potential) bug Something isn't working Parameter change Input parameter changes (addition, removal, or description)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants