*Fix PE-layout reproducibility with interior OBCs#916
Conversation
|
Did you test this with different core counts on my test case? I'm still seeing different answers with say 1 vs 3 cores. |
|
I tested this on the ESMG Kelvin_wave/rotated45_BT test case on 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 18, 20, 21, 24, 25, 27, 28, 30 and 32 PEs and obtained identical answers in each case. (This test case is too small to run with a 4-point halo on 11, 13, 17, 19, 22, 23, 26, 29 or 31 PEs.) Without this fix, the Kelvin_wave/rotated45_BT test case case with an identical set of runtime parameters was often not even running (giving NaN related errors) partway into the run on 3, 5, 7 and some other PE counts. I also ran all of the standard MOM6-examples test cases and found that no answers changed there. Which test case are you finding is still exhibiting different answers with these code changes, @kshedstrom? |
|
Yes, that test. I can see that hp after the call to continuity is not the same across tiles for a three core job. One and three-core answers do not match. |
|
Judging by the ocean.stats files, the Kelvin_wave/rotate45_BT test case was reproducing across PE count, but after further examination the Kelvin_wave/rotate_BT test case was not reproducing. I have some further revisions that expand the stencil size in the barotropic solver to 2 points when there are reversed OBCs in any halos (e.g., southern OBCs in northern halos). With these further revisions, which I will be adding shortly in a revision to this PR, the Kelvin_wave/rotate_BT test case is now reproducing (still judging by the ocean.stats files). |
f8c0b5d to
520f050
Compare
|
I have looked into this further, and with the new code, the presence of reversed OBCs in any halo changes the stencil size and hence the pattern between iterations of the width of the valid tracer points that are being explored with the checksums in the barotropic solver. Whether this stencil size change is triggered depends on the PE count and layout, and it never occurs on 1 PE. So if we are basing the test of reproducibility on the checksums, it can appear to fail, even though the solutions themselves actually are reproducing. Based on my version of the runs, I think that this is what is happening between the 1 and 3 PE versions of the Kelvin_wave/rotate_BT test case that you were running, @kshedstrom . So we have a few choices.
|
|
I have updated this PR to add two new runtime parameters that implement key aspects of both solutions 1 (optionally alter the stensil) and 2 (optionally restrict the range of the checksums) in using the checksums that are triggered by the use of the |
637343a to
afc4256
Compare
|
This revised version appears to have some regression failures because the velocity-point land mask is now being applied to the calculation of some accelerations (see lines 3491 and 3497 of the revised file), and even though these accelerations are multiplied by the land masks before they are used, the diagnostics of these accelerations were not being multiplied by the land masks. |
afc4256 to
e0ad6a0
Compare
|
I removed the masks from the calculation of |
|
I can now get the same answer across 1, 3, 5 cores on the rotated Kelvin wave tests. I didn't need to change BT_WIDE_HALO_MIN_STENCIL from the default 0. |
|
Is this ready to be approved Kate? |
Corrected the previous PE-count and layout dependency of answers for cases with open boundaries are not at the natural edges of the domain by storing the previous barotropic velocities in ubt_prev and vbt_prev over a wider range of loop indices, and by increasing the stencil size used in btstep_timeloop in such cases. Because the required loop range no longer matches the range over which the new velocities are set in bt_loop_update_u() and bt_loop_update_v(), the code setting these arrays has been moved out of these routines and into the main body of btstep_timeloop(). This commit will change answers (and make them invariant to the PE count and layout) in some cases, such as the ESMG-configs Kelvin_wave/rotate45_BT and Kelvin_wave/rotate_BT test cases, where there are southern open boundaries appearing in the northern halo regions on any PEs, for example. In all the Kelvin_wave test cases, the answers now agree with the previous single-PE answers. Most regional configurations with OBCs only apply them at the edges of the domain and hence will not be impacted, and no answers are changed in cases without open boundary conditions.
This commit adds the new runtime parameters BT_WIDE_HALO_MIN_STENCIL and DEBUG_BT_WIDE_HALOS to control the stencil width used with wide halos and to specify whether the checksums for the entire active halo of values are written when DEBUG_BT is enabled, or just the symmetric computational domain. Setting `BT_WIDE_HALO_MIN_STENCIL = 2` and `DEBUG_BT_WIDE_HALOS = False` provides for cleaner checksums in some cases because they are invariant to the use of wide halos, the width of those wide halos, or the relative positions of interior- domain OBCs relative to PE boundaries. This commit also changes a number of velocity point checksums in btstep_timeloop to do symmetric output. All model solutions are bitwise identical regardless of how these new parameters are set, but they can alter some debugging output.
|
Yes, I approve now. |
e0ad6a0 to
61a852e
Compare
theresa-cordero
left a comment
There was a problem hiding this comment.
Approved on behalf of Kate.
Thanks Kate, Yi-Cheng, and Bob for all that went into this PR!
|
This PR has passed pipeline testing at https://gitlab.gfdl.noaa.gov/ogrp/mom6ci/MOM6/-/pipelines/27795 with the expected warnings about new parameters in the MOM_parameter_doc files. |
Corrected the previous PE-count and layout dependency of answers for cases with open boundaries are not at the natural edges of the domain by storing the previous barotropic velocities in
ubt_prevandvbt_prevover a wider range of loop indices, and by increasing the stencil size used inbtstep_timeloop()in such cases. Because the required loop range no longer matches the range over which the new velocities are set inbt_loop_update_u()andbt_loop_update_v(), the code setting these arrays has been moved out of these routines and into the main body ofbtstep_timeloop(). This commit will change answers (and make them invariant to the PE count and layout) in some cases, such as the ESMG-configs Kelvin_wave/rotate45_BT and Kelvin_wave/rotate_BT test cases, where there are southern open boundaries appearing in the northern halo regions on any PEs, for example. In all the Kelvin_wave test cases, the answers now agree with the previous single-PE answers. Most regional configurations with OBCs only apply them at the edges of the domain and hence will not be impacted, and no answers are changed in cases without open boundary conditions.In response to issues identified with debugging and demonstrating the validity of this commit (see the conversation), a second commit was subsequently added to this PR. This second commit adds the new runtime parameters
BT_WIDE_HALO_MIN_STENCILandDEBUG_BT_WIDE_HALOSto control the stencil width used with wide halos and to specify whether the checksums for the entire active halo of values are written whenDEBUG_BTis enabled, or just the symmetric computational domain. SettingBT_WIDE_HALO_MIN_STENCIL = 2andDEBUG_BT_WIDE_HALOS = Falseprovides for cleaner checksums in some cases because they are invariant to the use of wide halos, the width of those wide halos, or the relative positions of interior-domain OBCs relative to PE boundaries. This commit also changes a number of velocity point checksums inbtstep_timeloop()to do symmetric output. All model solutions are bitwise identical regardless of how these new parameters are set, but they will alter some debugging output.