
Bug fix to use correct rho value in cal_titau subroutines#1396

Merged
davegill merged 3 commits into wrf-model:develop from rsarthur:v422_m_opt_rho_fix
Feb 25, 2021

Conversation

@rsarthur
Contributor

@rsarthur rsarthur commented Feb 5, 2021

TYPE: bug fix

KEYWORDS: rho, sfs_opt, m_opt, cal_titau

SOURCE: Robert Arthur and Jeff Mirocha (LLNL)

DESCRIPTION OF CHANGES:
Problem:
In the cal_titau subroutines in module_diffusion_em, rho was not always interpolated to the correct location on the
staggered grid. This issue affected the subgrid stress terms when sfs_opt .gt. 1. Additionally, when sfs_opt .eq. 0 and
m_opt .eq. 1, the actual subgrid stresses applied in the code were correct, but the output was wrong. This is a
follow-on from issue #1214 by @matzegoebel, which addressed the surface stress output from subroutine
vertical_diffusion_2 when sfs_opt .gt. 0 or m_opt .eq. 1, but did not address the cal_titau subroutines.

Solution:
A rhoavg variable was added to cal_titau_12_21, cal_titau_13_31, and cal_titau_23_32 and used instead of the
cell-centered rho variable. Additionally, because cal_titau_12_21 now uses "corner" ghost points for rho, a new
halo/bc communication was added to solve_em.F to ensure that rho is specified correctly at those points. Correct
specification of the corner points also requires the updates to subroutine set_physical_bc3d in #1314, included in
the v4.2.2 release. Without the latter two updates for corner points, bit-for-bit errors occurred in regression tests for
idealized, doubly-periodic domains.
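The averaging the fix introduces can be illustrated with a toy sketch (not WRF code; array names and shapes here are made up for illustration): on a staggered grid, a stress component such as tau_12 lives at horizontal cell corners, so rho must be averaged from the four surrounding cell-center points rather than used as-is — using the cell-centered value directly is the kind of mislocation this PR fixes.

```python
import numpy as np

# Hypothetical illustration of the "rhoavg" idea (not the actual
# cal_titau_12_21 code): average cell-centered density to the
# interior corner points of a 2-D grid.
rho = np.arange(1.0, 17.0).reshape(4, 4)   # cell-centered density

# 4-point average to each interior corner (one value per corner)
rhoavg = 0.25 * (rho[:-1, :-1] + rho[1:, :-1]
                 + rho[:-1, 1:] + rho[1:, 1:])

print(rhoavg.shape)     # (3, 3)
print(rhoavg[0, 0])     # 0.25 * (1 + 5 + 2 + 6) = 3.5
```

Note that the corner average draws on neighbors in both i and j, which is why ghost (halo) points at patch corners must also hold valid rho values.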

LIST OF MODIFIED FILES:
dyn_em/module_diffusion_em.F
dyn_em/solve_em.F
Registry/Registry.EM_COMMON

TESTS CONDUCTED:

  1. Idealized, doubly-periodic tests with various processor breakdowns show bit-for-bit identical results.
  2. Jenkins is all pass.

RELEASE NOTE: Bug fix to use correct rho value in cal_titau subroutines. In the cal_titau subroutines in module_diffusion_em, rho was not always interpolated to the correct location on the staggered grid. This issue affected the subgrid stress terms when sfs_opt .gt. 1. Additionally, when sfs_opt .eq. 0 and m_opt .eq. 1, the actual subgrid stresses applied in the code were correct, but the output was wrong. A rhoavg variable was added to cal_titau_12_21, cal_titau_13_31, and cal_titau_23_32 and used instead of the cell-centered rho variable.

@davegill
Contributor

davegill commented Feb 5, 2021

@rsarthur
!?!@#$%^&*(!?
No dice

Please find result of the WRF regression test cases in the attachment. This build is for Commit ID: 2ba70f34652878d822b018c5554b06b14e106513, requested by: rsarthur for PR: https://github.com/wrf-model/WRF/pull/1396. For any query please send e-mail to David Gill.

    Test Type              | Expected | Received | Failed
    = = = = = = = = = = = = = = = = = = = = = = = = = = =
    Number of Tests        :    19        18
    Number of Builds       :    48        46
    Number of Simulations  :   163       161          0
    Number of Comparisons  :   103       102          2

    Failed Simulations are: 
    None
    Which comparisons are not bit-for-bit: 
    output_4:SUCCESS_RUN_WRF_d01_em_quarter_ss_32_em_quarter_ss_04 vs SUCCESS_RUN_WRF_d01_em_quarter_ss_33_em_quarter_ss_04 status = 1
    output_4:SUCCESS_RUN_WRF_d01_em_quarter_ss_32_em_quarter_ss_04 vs SUCCESS_RUN_WRF_d01_em_quarter_ss_34_em_quarter_ss_04 status = 1

Still those bit-for-bit errors.

@dudhia
Collaborator

dudhia commented Feb 5, 2021 via email

@rsarthur
Contributor Author

rsarthur commented Feb 5, 2021

@dudhia Interesting, thanks for the suggestion. I figured those halo/bc updates for rho were handled in module_first_rk_step_part2 before the diffusion calls... Anyway, I can try that and add a new commit. Two quick questions for you:

  1. should I add the set_physical_bc3d call to subroutine rk_phys_bc_dry_1, or just add the call separately (as for al and ph_2)?
  2. should I also add rho to PERIOD_BDY_EM_A?
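The halo/period pairing being asked about can be sketched with a toy model (not WRF's RSL_LITE communication layer; this is a simplified stand-in): each MPI patch owns an interior region, and its ghost rows and columns must be filled from neighbors. For a doubly periodic domain the fill wraps around the domain edges — and the corner ghost point is only correct if the wrap is applied in both directions.

```python
import numpy as np

# Toy stand-in for a halo exchange plus periodic boundary update:
# pad an owned patch with one ghost cell on each side, filled by
# periodic wraparound. A halo update in x alone would leave the
# corner ghost point stale -- the corners need both directions.
field = np.arange(9.0).reshape(3, 3)          # owned (interior) patch
with_ghosts = np.pad(field, 1, mode="wrap")   # periodic "halo" fill

print(with_ghosts.shape)   # (5, 5)
print(with_ghosts[0, 0])   # equals field[-1, -1], i.e. 8.0
```

This is the intuition behind davegill's rule of thumb below: a new HALO communication on a field should come with a matching PERIOD entry, or periodic corner points can be left unset.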

@davegill
Contributor

davegill commented Feb 5, 2021

@rsarthur @dudhia

should I also add rho to PERIOD_BDY_EM_A

Let me handle this one:
Great questions. Yes. When a HALO is added, an associated PERIOD should also be added.

@dudhia
Collaborator

dudhia commented Feb 5, 2021 via email

@rsarthur rsarthur requested a review from a team as a code owner February 5, 2021 18:34
@davegill
Contributor

davegill commented Feb 5, 2021

@dudhia @rsarthur
Folks,
The case that is failing bit-for-bit is an idealized case. The real-data cases do not seem to be missing HALO comms. It is the domain corner points for one of the doubly periodic BC cases (I pulled these plots and the text from the other PR).

I have a 1 time step namelist that shows diffs.

The diffs are in your mods (or are exposed by your mods, honestly). If I use the previous repo code that you branched from, I get bitwise identical results between differing number of MPI processes.

git checkout f311cd5e136631ebf3e

If I use your code, I get bit-wise diffs in the 4 corners, when comparing the differing numbers of MPI processes.

[wrfuser@07e8670e1170 em_quarter_ss]$ git branch
* m_opt_rho_fix
  master

[wrfuser@07e8670e1170 em_quarter_ss]$ mpirun -np 8 --oversubscribe wrf.exe ; mv wrfout_d01_0001-01-01_00:00:00 wrfout_d01_0001-01-01_00:00:00_MPI8
 starting wrf task            4  of            8
 starting wrf task            2  of            8
 starting wrf task            6  of            8
 starting wrf task            0  of            8
 starting wrf task            1  of            8
 starting wrf task            3  of            8
 starting wrf task            7  of            8
 starting wrf task            5  of            8

[wrfuser@07e8670e1170 em_quarter_ss]$ mpirun -np 3 --oversubscribe wrf.exe ; mv wrfout_d01_0001-01-01_00:00:00 wrfout_d01_0001-01-01_00:00:00_MPI3
 starting wrf task            1  of            3
 starting wrf task            0  of            3
 starting wrf task            2  of            3

[wrfuser@07e8670e1170 em_quarter_ss]$ ../../external/io_netcdf/diffwrf wrfout_d01_0001-01-01_00:00:00_MPI*
 Just plot  F
Diffing wrfout_d01_0001-01-01_00:00:00_MPI3 wrfout_d01_0001-01-01_00:00:00_MPI8
 Next Time 0001-01-01_00:00:00
     Field   Ndifs    Dims       RMS (1)            RMS (2)     DIGITS    RMSE     pntwise max
 Next Time 0001-01-01_00:00:12
     Field   Ndifs    Dims       RMS (1)            RMS (2)     DIGITS    RMSE     pntwise max
         U      1547    3   0.2350595367E+02   0.2350595368E+02   9   0.2985E-04   0.8158E-04
         V      2046    3   0.3400045178E+01   0.3400045176E+01   9   0.2973E-04   0.6012E-03
         W      3332    3   0.3461880543E-01   0.3461873113E-01   5   0.5524E-04   0.1863E-02
        PH      2129    3   0.5000753094E+03   0.5000753217E+03   7   0.7080E-03   0.3194E-04
         T       738    3   0.7617058079E+02   0.7617058085E+02   9   0.4959E-05   0.1059E-05
       THM       738    3   0.7617058079E+02   0.7617058085E+02   9   0.4959E-05   0.1059E-05
        MU        56    2   0.4388289142E+03   0.4388289142E+03  10   0.2492E-02   0.8765E-04
         P      1891    3   0.1184337757E+03   0.1184337754E+03   8   0.1062E-02   0.1777E-03
    QVAPOR       520    3   0.4564871386E-02   0.4564871385E-02  10   0.1237E-08   0.6452E-05

I generated the file to plot using only the fields that show a difference:

ncdiff --variable U,V,W,PH,T,THM,MU,P,QVAPOR  wrfout_d01_0001-01-01_00:00:00_MPI[38] diffmpi38.nc

The following diffs are after a single time step (the initial time is identical, as is usually the situation), with mpi tasks 3 vs 8. The other fields look similarly structured, though with ncview I had to rescale to get the values viewable for the specific level. The PH diff shows up at the first level up from the surface (ph=0 is a definition at the model surface).
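The kind of check diffwrf performs can be sketched in miniature (a simplified, hypothetical version — the real tool reads the netCDF files and reports RMS, digits of agreement, and pointwise maxima per field): count the gridpoints where two runs differ and estimate how many digits still agree.

```python
import numpy as np

# Simplified sketch of a bit-for-bit field comparison between two
# runs of the same case with different MPI decompositions.
a = np.full((5, 5), 1.0)   # stand-in field from run 1
b = a.copy()               # stand-in field from run 2
b[0, 0] += 1e-9            # a single "corner" point differs

ndifs = np.count_nonzero(a != b)
digits = -np.log10(np.max(np.abs(a - b)) / np.max(np.abs(b)))

print(ndifs)               # 1 differing point
print(round(digits))       # roughly 9 digits of agreement
```

A truly bit-for-bit pair would give `ndifs == 0` for every field; the nonzero Ndifs counts in the table above are exactly what this fix (plus the #1314 corner-point update) is meant to eliminate.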

MU
[screenshot: MU difference field after one time step, diffs at the 4 domain corners]

U
[screenshot: U difference field after one time step, diffs at the 4 domain corners]

QVAPOR
[screenshot: QVAPOR difference field after one time step, diffs at the 4 domain corners]

@dudhia
Collaborator

dudhia commented Feb 5, 2021 via email

@rsarthur
Contributor Author

rsarthur commented Feb 5, 2021

@dudhia Yes, #1314 should be included here. That is why we started this new PR (as a follow-on to #1344).

However, in this most recent push, I only included the HALO and PERIOD updates. I will try again now with the new call to set_physical_bc3d for rho included.

@rsarthur
Contributor Author

rsarthur commented Feb 5, 2021

Aha! My ongoing WRF education continues... @dudhia could you explain why you thought to add the new halo/bc calls for rho in that particular place in the code?

@davegill
Contributor

davegill commented Feb 5, 2021

@rsarthur
The Delphi Oracle speaks!

@dudhia
Collaborator

dudhia commented Feb 5, 2021 via email

@rsarthur
Contributor Author

rsarthur commented Feb 6, 2021

@dudhia @davegill thanks for your help with this

@davegill
Contributor

davegill commented Feb 9, 2021

This is an updated version of PR #1344 to version 4.2.2. The corner points fix from PR #1314 that is included in the 4.2.2 release is (fingers crossed) expected to fix the issue that was blocking #1344 from being approved. #1344 was closed

@davegill
Contributor

davegill commented Feb 9, 2021

@rsarthur

  1. There are now three files that have been modified. Please include those in the PR commit message.
  2. Provide an explanation of why the communication was required.
  3. Close the other PR, "Bug fix to use correct rho value in cal_titau subroutines" (#1344), with a pointer to this PR.

@davegill davegill changed the title Attempt 2: Bug fix to use correct rho value in cal_titau subroutines Bug fix to use correct rho value in cal_titau subroutines Feb 9, 2021
@rsarthur
Contributor Author

rsarthur commented Feb 9, 2021

@davegill Done. Let me know if I should add anything else.

@davegill
Contributor

davegill commented Feb 9, 2021

@dudhia @weiwangncar @rsarthur
Folks,
This one is ready for review.

@matzegoebel
Contributor

I looked for the last halo call before the diffusion which was in solve_em and yes rho was not in it. Then I looked for other places where rho might get changed between that halo and the diffusion code and saw the bc3d calls which seemed to be needed in a similar way to halos.

So this fixed the errors now, right? I'm still struggling to understand this. Because to me, it seems that after the HALO communication that @dudhia mentioned, there is another one shortly before the diffusion calculations in module_first_rk_step_part2.F: HALO_EM_TKE_E, PERIOD_BDY_EM_PHY_BC, and set_physical_bc3d in the routine phy_bc which all include rho. Can anybody explain why these communications do not work?

@dudhia
Collaborator

dudhia commented Feb 9, 2021 via email

@rsarthur
Contributor Author

@matzegoebel I was wondering the same thing... Perhaps it is because rho is modified in module_first_rk_step_part1 in subroutine phy_prep? But yes, the errors are fixed now.

@dudhia
Collaborator

dudhia commented Feb 12, 2021 via email

@matzegoebel
Contributor

matzegoebel commented Feb 12, 2021

It is possible that the physical_bc3d call was all that was needed

I'm still a bit confused because, as I wrote earlier, there is another physical_bc3d call for rho in the subroutine phy_bc just before the diffusion calculations. From what I see the diffusion calculations are the first spot where the HALO points of rho are needed, so halo communications right before the diffusion should be sufficient.

Perhaps it is because rho is modified in module_first_rk_step_part1 in subroutine phy_prep?

@rsarthur I don't think this makes a difference, because the halo points are not needed in phy_prep, so halo communication in module_first_rk_step_part2 should be enough.

I'm just trying to understand this issue because it might help to understand future bugs. Anyway, maybe there is just some strange compiler behavior or so.
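One candidate explanation being circled here — the stale-halo hazard — can be shown with a toy sketch (not WRF code; a made-up one-task example): if a field is modified *after* its halo exchange, the neighbor's ghost copy no longer matches the owner's interior value, even though a communication did happen.

```python
import numpy as np

# Toy illustration of the stale-halo ordering hazard: an "exchange"
# copies the edge value into a neighbor's ghost point, and a later
# modification of the field silently invalidates that copy.
interior = np.ones(4)        # this task's owned values
halo = interior[-1]          # "exchange": neighbor copies the edge

interior[-1] = 2.0           # field modified AFTER the exchange

print(halo)                  # 1.0 -- the ghost copy is now stale
print(interior[-1])          # 2.0
```

Whether rho is actually modified between the existing communications in module_first_rk_step_part2 and the cal_titau calls is exactly the open question in this thread; the sketch only shows why the ordering would matter if it is.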

@dudhia
Collaborator

dudhia commented Feb 12, 2021 via email

@dudhia
Collaborator

dudhia commented Feb 12, 2021

However, maybe not. I will look again.

@dudhia
Collaborator

dudhia commented Feb 12, 2021

I still don't know why this fix worked.

@davegill davegill merged commit 812c7f4 into wrf-model:develop Feb 25, 2021
vlakshmanan-scala pushed a commit to scala-computing/WRF that referenced this pull request Apr 4, 2024
Bug fix to use correct rho value in cal_titau subroutines (wrf-model#1396)
