Skip to content

+Fix OBC reproducibility across restarts#907

Merged
Hallberg-NOAA merged 6 commits into
NOAA-GFDL:dev/gfdlfrom
Hallberg-NOAA:fix_OBC_restarts
Jun 6, 2025
Merged

+Fix OBC reproducibility across restarts#907
Hallberg-NOAA merged 6 commits into
NOAA-GFDL:dev/gfdlfrom
Hallberg-NOAA:fix_OBC_restarts

Conversation

@Hallberg-NOAA
Copy link
Copy Markdown
Member

@Hallberg-NOAA Hallberg-NOAA commented May 28, 2025

This pull request consists of a series of 6 commits that adds new runtime parameters (VERTEX_SHEAR_OBC_BUG,
MIXING_COEFS_OBC_BUG and OBC_RESERVOIR_INIT_BUG) that when set properly (to False) will allow some cases (and in particular the loop_current test case) with open boundary conditions to reproduce across restarts and to pass rotational consistency tests. There is also some targeted refactoring along with the code changes to 6 files and some new arguments to public interfaces to enable these changes. By default, all answers are bitwise identical.

The detailed description of the various commits are as follows:

  • Refactoring for OBC restart fix

This commit makes a set of changes to refactor code related to open boundary conditions to correct issues with restarts.

Renamed open_boundary_init() to open_boundary_halo_update() to reflect what it actually does and eliminated the 4 unused arguments to this routine.

Moved calls to fill_temp_salt_segments(), setup_OBC_tracer_reservoirs() and open_boundary_halo_update() outside of rotate_OBC_init(), and eliminated 5 unused parameters to rotate_OBC_init().

Moved halo updates for temperature and salinity and the call to setup_OBC_tracer_reservoirs() out of fill_temp_salt_segments() and made the tv argument to fill_temp_salt_segments() intent in.

Moved the call to open_boundary_halo_update() after MOM_state_initialization() for both rotated and unrotated branches.

Moved the calls to fill_temp_salt_segments() and setup_OBC_tracer_reservoirs() and the debugging calls for OBCs into the same code block as the rest of the OBC code in MOM_initialize_state().

All answers are bitwise identical, but there are changes to public interfaces and what does and does not occur in several routines.

  • +Add the runtime parameter VERTEX_SHEAR_OBC_BUG

Added the new runtime option VERTEX_SHEAR_OBC_BUG that can be set to false to use masks to set the thicknesses interpolated to velocity points when open boundary conditions might be used. This commit includes the addition of checksums of the input stratification and shear squared when DEBUG is true. It also refactors the code setting the output diffusivity in Calc_kappa_shear_vertex() to move some logical branches outside of the do loops for efficiency. A number of spelling errors were corrected, and some vertex-point indices were converted to uppercase to follow the MOM6 soft convention in MOM_kappa_shear.F90. By default, all answers are bitwise identical, but there is a new runtime parameter.

  • +Add calc_isoneutral_slopes optional argument OBC_N2

Added the new optional argument OBC_N2 to calc_isoneutral_slopes() that can be used to specify that only interior points are used to calculate the buoyancy frequencies at velocity points when open boundary conditions are in use. When the new OBC_N2 argument is absent or false, the solutions are identical to what they had been previously, but this gives a dependence on the arbitrary outer thicknesses when open boundary conditions are used. By default, all answers are bitwise identical, but there is a new optional argument to a publicly visible interface.

  • +Add the runtime parameter MIXING_COEFS_OBC_BUG

Added the new runtime option MIXING_COEFS_OBC_BUG that can be set to false to specify that only interior data should be used for thickness weighting in the lateral mixing coefficient calculations when open boundary conditions are in use, and that only interior temperatures, salinities, pressures and thicknesses should be used when calculating the buoyancy frequency at open boundary velocity faces. These changes are implemented by introducing two new 3-d arrays that set the thicknesses at velocity points and there are a few new logical tests inside of some loops, which may slow down the code slightly in routines that had never previously been notable for the amount of time spent there. They also make use of the recently added OBC_N2 argument to calc_isoneutral_slopes(). The default setting for MIXING_COEFS_OBC_BUG retains the previous (incorrect) answers, and all answers are bitwise identical in cases without open boundary conditions regardless of how it is set. There is a new entry in some MOM_parameter_doc
files for some cases with active open boundary conditions.

  • +Add OBC_RESERVOIR_INIT_BUG to fix OBC restarts

This commit revises setup_OBC_tracer_reservoirs() and other OBC routines to allow some cases using open boundary conditions and tracer reservoirs to reproduce across restarts and adds the new runtime parameter OBC_RESERVOIR_INIT_BUG that can be set to true to recover the previous answers in cases that are not initialized from a restart file. There is a new optional MOM_restart_CS argument to setup_OBC_tracer_reservoirs() that can be used to query whether the reservoirs have been read from a restart file before resetting them to values that may have been set during initialization. If the reservoirs have not been set from a restart file, there is now a preference to set them from the tracer reservoir (tres) element in the OBC tracer registry on the segments (if it has been initialized) rather than the interior tracer values that have been recorded on the segment.

Fill_temp_salt_segments() and fill_obgc_segments() have been modified to only set the tres segment data if it has not been initialized already, as indicated by the is_initialized field.

This commit adds the new public routine set_initialized_OBC_tracer_reservoirs() that can be used to record in the restart control structure that the OBC%tres_x and OBC%tres_y fields have been initialized and should not be reset to new values during initialization.

The calls to setup_OBC_tracer_reservoirs() and open_boundary_halo update() have been moved after the call to tracer_flow_control_init() so that the full arrays of tracer reservoirs do not need to be initialized in fill_obgc_segments(). Fill_obgc_segments() was refactored to separate out the steps that should occur
there from those that should not and are wrapped in a bug flag.

A debugging checksum call to chksum_OBC_segments() was added to the top of radiation_open_bdry_conds() to assist in any future debugging of the OBC code.

Doxygen comments were added describing several routines that were previously missing such a description.

By default, all answers are bitwise identical, but there is a new publicly visible routine, a new optional argument to a public interface and a new runtime parameter that can be used to correct the previous behavior that broke the
reproducibility across restarts of some cases using open boundary conditions with tracer reservoirs.

  • (*)Correct rotation with OBC tracer reservoirs

    Copy segment tracer registry data from the input segment to the target segment in rotate_OBC_segment_data(), including the tracer reservoirs on segments. With this change, configurations using tracers with open boundary conditions are now
    passing the rotational consistency tests. This will change answers with some rotated cases with active OBCs, but all answers in non-rotated cases or cases without tracers in OBCs are bitwise identical.

@Hallberg-NOAA Hallberg-NOAA added bug Something isn't working enhancement New feature or request answer-changing A change in results (actual or potential) Parameter change Input parameter changes (addition, removal, or description) labels May 28, 2025
@Hallberg-NOAA Hallberg-NOAA force-pushed the fix_OBC_restarts branch 2 times, most recently from 3a2fba7 to 0d9dd94 Compare May 29, 2025 07:00
@yichengt900
Copy link
Copy Markdown

We manually tested this PR with the CEFI-NWA12 regional physics-only config. Surprisingly, even the first commit fb7b25b changes the baseline answers and breaks restart reproducibility.

You can find the stdout on C6 at:
/gpfs/f6/ira-cefi/world-shared/error/test_phy_only_rst/NWA12.COBALT_o.208998595

@kshedstrom
Copy link
Copy Markdown

On the Kelvin_wave/rotated45_BT issue, running on 1 vs 3 cores gives differing values in CS%h in the first step at point (6,23) which is point (6,4) on the middle of the three-core job. This point is just outside of OBC segments 3 and 23. We shouldn't be setting or using h outside, right? Could this be due to changing the order of operations when using different core counts?

@kshedstrom
Copy link
Copy Markdown

In rk2, hp comes out of the second call to continuity differing at that point. On three cores inside continuity_PPM, vh differs between cores 0 and 1 at the face (6,23) aka (6,4), i.e., right on the tile boundary. This is at line L#176. hp then differs between the cores as well at all the i=6 points with values near 100.

@kshedstrom
Copy link
Copy Markdown

CS%v_accel_BT comes out of btstep with differiing values at that point, feeding into vp at line 687 of rk2 solver.

@kshedstrom
Copy link
Copy Markdown

btloop_update_v gets called 4 times in the first call to btstep. The first time, all is 0, times 2 and 3, the OBC values are 0 while the values to either side are non-zero but matching between the cores, the final time, the OBC values are 0 and the values to either side do not match - for PFv and Cor_v, hence v_accel_bt.

@kshedstrom
Copy link
Copy Markdown

Something happens to make vbt and vhbt not match across cores on the fourth call to apply_v_velocities_OBC. I have to quit for the day now.

@kshedstrom
Copy link
Copy Markdown

kshedstrom commented Jun 1, 2025

Could it be that the time-to-exchange code needs to be more conservative? The first time through apply_v_velocities_OBC, the vbt values all start as zero, then the boundary values are set based on external Flather forcing. The second time into the routine, both processes have three matching vbt values at i=6. The third time, two values match, one does not. The fourth time, only the boundary values match, while the halo values have kept their prior value. However, the vbt OBC calculation uses one of these halo points (vbt_old(i,J+1)).

Comment thread src/initialization/MOM_state_initialization.F90 Outdated
@Hallberg-NOAA
Copy link
Copy Markdown
Member Author

Thank you for the comments on the reproducibility across PE count and layout for the Kelvin wave test case, @kshedstrom. Those issues remain, but are separate from the ones that this PR addresses. I do have some fixes to the rotational symmetry problem with Kelvin waves (they work with 1 PE but not multiple PEs, due to these issues). I will put in that PR and we can use that PR as a platform to continue this conversation.

In the mean-time, I would appreciate both your opinion and @yichengt900 's on whether the revised version of the current PR works with the OBC-based cases you use for testing.

@yichengt900
Copy link
Copy Markdown

We tested this PR using the CEFI NWA and NEP regional configurations. Both cases passed the reproducibility test across restarts.

For the NWA configuration, the baseline answers remain unchanged.

However, for the NEP configuration, we observed changes in the baseline results after the Add OBC_RESERVOIR_RESTART_BUG commit. A preliminary check suggests that the differences originate at the eastern and northern tracer OBCs.
nh3_surface_diff_overlay

@yichengt900
Copy link
Copy Markdown

After further investigation, we found that the baseline answers for the NEP case can be recovered by reverting the changes in the fill_obgc_segments subroutine.

@Hallberg-NOAA
Copy link
Copy Markdown
Member Author

This PR has been updated to correct the single-character problem that was causing the NE Pacific test to fail. In addition, the runtime parameter associated with this bug has been updated to OBC_RESERVOIR_INIT_BUG to better characterize what it does.

  This commit makes a set of changes to refactor code related to open boundary
conditions to correct issues with restarts.

  Renamed open_boundary_init to open_boundary_halo_update to reflect what it
actually does and eliminated the 4 unused arguments to this routine.

  Moved halo updates for temperature and salinity and the call to
setup_OBC_tracer_reservoirs out of fill_temp_salt_segments and made the tv
argument to fill_temp_salt_segments intent in.

  Moved the call to setup_OBC_tracer_reservoirs and the debugging calls for OBCs
into the same code block as the rest of the OBC code in MOM_initialize_state.

  Moved the call to open_boundary_halo_update after MOM_state_initialization for
both rotated and unrotated branches.

  Moved the calls to fill_temp_salt_segments and setup_OBC_tracer_reservoirs and
the debugging calls for OBCs into the same code block as the rest of the OBC
code in MOM_initialize_state.

  All answers are bitwise identical, but there are changes to public interfaces
and what does and does not occur in several routines.
  Added the new runtime option VERTEX_SHEAR_OBC_BUG that can be set to false to
use masks to set the thicknesses interpolated to velocity points when open
boundary conditions might be used.  This commit includes the addition of
checksums of the input stratification and shear squared when DEBUG is true.  It
also refactors the code setting the output diffusivity in
Calc_kappa_shear_vertex to move some logical branches outside of the do loops
for efficiency.  A number of spelling errors were corrected, and some
vertex-point indices were converted to uppercase to follow the MOM6 soft
convention in MOM_kappa_shear.F90.  By default, all answers are bitwise
identical, but there is a new runtime parameter.
  Added the new optional argument OBC_N2 to calc_isoneutral_slopes that can be
use to specify that only interior points are used to calculate the buoyancy
frequencies at velocity points when open boundary conditions are in use.  When
the new OBC_N2 argument is absent or false, the solutions are identical to what
they had been previously, but this gives a dependence on the arbitrary outer
thicknesses when open boundary conditions are used.  By default, all answers are
bitwise identical, but there is a new optional argument to a publicly visible
interface.
  Added the new runtime option MIXING_COEFS_OBC_BUG that can be set to false to
specify that only interior data should be used for thickness weighting in the
lateral mixing coefficient calculations when open boundary conditions are in
use, and that only interior temperatures, salinities, pressures and thicknesses
should be used when calculating the buoyancy frequency at open boundary velocity
faces.  These changes are implemented by introducing two new 3-d arrays that set
the thicknesses at velocity points and there are a few new logical tests inside
of some loops, which may slow down the code slightly in routines that had never
previously been notable for the amount of time spent there.  They also make use
of the recently added OBC_N2 argument to calc_isoneutral_slopes().  There are
new ocean_OBC_type arguments to calc_resoln_function() and calc_sqg_struct() as
a part of these changes.  The default setting for MIXING_COEFS_OBC_BUG retains
the previous (incorrect) answers, and all answers are bitwise identical in cases
without open boundary conditions regardless of how it is set.  There is a new
entry in some MOM_parameter_doc files for some cases with active open boundary
conditions and new arguments to publicly visible interfaces.
  This commit revises setup_OBC_tracer_reservoirs and other OBC routines to allow
some cases using open boundary conditions and tracer reservoirs to reproduce
across restarts and added the new runtime parameter OBC_RESERVOIR_INIT_BUG that
can be set to true to recover the previous answers in cases that are not
initialized from a restart file.   There is a new optional MOM_restart_CS argument
to setup_OBC_tracer_reservoirs that can be used to query whether the reservoirs
have been read from a restart file before resetting them to values that may have
been set during initialization.  If the reservoirs have not been set from a
restart file, there is now a preference to set them from the tracer reservoir
(tres) element in the OBC tracer registry on the segments (if it has been
initialized) rather than the interior tracer values that have been recorded on the
segment.

  Fill_temp_salt_segments and fill_obgc_segments have been modified to only set
the tres segment data if it has not been initialized already, as indicated by
the is_initialized field.

  This commit adds the new public routine set_initialized_OBC_tracer_reservoirs
that can be used to record in the restart control structure that the OBC%tres_x
and OBC%tres_y fields have been initialized and should not be reset to new
values during initialization.

  The calls to setup_OBC_tracer_reservoirs and open_boundary_halo update have
been moved after the call to tracer_flow_control_init so that the full arrays of
tracer reservoirs do not need to be initialized in fill_obgc_segments.
Fill_obgc_segments was refactored to separate out the steps that should occur
there from those that should not and are wrapped in a bug flag.

  A debugging checksum call to chksum_OBC_segments was added to the top of
radiation_open_bdry_conds to assist in any future debugging of the OBC code.

  Added doxygen comments describing several routines that were previously
missing such a description.

  By default, all answers are bitwise identical, but there is a new publicly
visible routine, a new optional argument to a public interface and a new runtime
parameter that can be used to correct the previous behavior that broke the
reproducibility across restarts of some cases using open boundary conditions
with tracer reservoirs.
  Copy segment tracer registry data from the input segment to the target segment
in rotate_OBC_segment_data, including the tracer reservoirs on segments.  With
this change, configurations using tracers with open boundary conditions are now
passing the rotational consistency tests.  This will change answers with some
rotated cases with active OBCs, but all answers in non-rotated cases or cases
without tracers in OBCs are bitwise identical.
@yichengt900
Copy link
Copy Markdown

@Hallberg-NOAA, thanks for the updates! This PR has now passed both NWA and NEP regression tests. The answers remain unchanged and are reproducible across restarts.

@kshedstrom
Copy link
Copy Markdown

I approve of this PR. My tracer reservoir rotational tests are now passing.

That said, my restart test of the Arctic is not reproducing, even with the bug flags set to False. I will continue to investigate, but it shouldn't hold this PR up.

Copy link
Copy Markdown
Member

@marshallward marshallward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Hallberg-NOAA
Copy link
Copy Markdown
Member Author

This PR has passed pipeline testing at https://gitlab.gfdl.noaa.gov/ogrp/mom6ci/MOM6/-/pipelines/27682 with the expected warnings about new runtime parameters.

@Hallberg-NOAA Hallberg-NOAA merged commit 4c0ae97 into NOAA-GFDL:dev/gfdl Jun 6, 2025
52 checks passed
@Hallberg-NOAA Hallberg-NOAA deleted the fix_OBC_restarts branch September 18, 2025 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

answer-changing A change in results (actual or potential) bug Something isn't working enhancement New feature or request Parameter change Input parameter changes (addition, removal, or description)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants