Bug fix to use correct rho value in cal_titau subroutines #1396
davegill merged 3 commits into wrf-model:develop from m_opt_rho_fix
Conversation
@rsarthur Still those bit-for-bit errors.
I think rho needs to be added to HALO_EM_A in the Registry.
I also wonder if we need to call set_physical_bc3d for rho around line 655
in solve_em.F.
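For readers following along, a Registry halo entry is a one-line, colon-delimited list of fields to exchange. The fragment below is purely illustrative: the real HALO_EM_A entry in Registry/Registry.EM_COMMON carries a different (much longer) field list, and the stencil width shown is assumed, not copied from the file.

```
# Registry halo syntax (illustrative field list, NOT the actual HALO_EM_A line):
#   halo  <name>      <core>  <stencil>:<fields>
halo      HALO_EM_A   dyn_em  24:u_2,v_2,w_2,t_2,rho
```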
…On Thu, Feb 4, 2021 at 7:36 PM Dave Gill ***@***.***> wrote:
@rsarthur <https://github.com/rsarthur>
!?!@#$%^&*(!?
No dice
Please find result of the WRF regression test cases in the attachment. This build is for Commit ID: 2ba70f3, requested by: rsarthur for PR: #1396. For any query please send e-mail to David Gill.
Test Type | Expected | Received | Failed
= = = = = = = = = = = = = = = = = = = = = = = = = = = =
Number of Tests : 19 18
Number of Builds : 48 46
Number of Simulations : 163 161 0
Number of Comparisons : 103 102 2
Failed Simulations are:
None
Which comparisons are not bit-for-bit:
output_4:SUCCESS_RUN_WRF_d01_em_quarter_ss_32_em_quarter_ss_04 vs SUCCESS_RUN_WRF_d01_em_quarter_ss_33_em_quarter_ss_04 status = 1
output_4:SUCCESS_RUN_WRF_d01_em_quarter_ss_32_em_quarter_ss_04 vs SUCCESS_RUN_WRF_d01_em_quarter_ss_34_em_quarter_ss_04 status = 1
Still those bit for bit errors.
@dudhia Interesting, thanks for the suggestion. I figured those halo/bc updates for rho were handled in module_first_rk_step_part2 before the diffusion calls... Anyway, I can try that and add a new commit. Two quick questions for you:
I would suggest one at a time to see what works. Start with the HALO.
…On Fri, Feb 5, 2021 at 10:33 AM Dave Gill ***@***.***> wrote:
@rsarthur <https://github.com/rsarthur> @dudhia
<https://github.com/dudhia>
should I also add rho to PERIOD_BDY_EM_A
Let me handle this one:
Great questions. Yes. When a HALO is added, an associated PERIOD should
also be added.
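As a sketch of what pairing a HALO with a PERIOD means in practice: the Registry generates one .inc file per communication, and solve_em.F pulls both in under distributed-memory builds, so the same field list gets both its interior halo exchange and its periodic lateral-boundary wrap. The surrounding code below is a paraphrased pattern, not verbatim solve_em.F.

```fortran
! Paraphrased sketch of the pattern in solve_em.F (not verbatim code):
! each Registry-defined comm expands from its generated include file,
! and a HALO and its matching PERIOD are invoked together.
#ifdef DM_PARALLEL
#  include "HALO_EM_A.inc"
#  include "PERIOD_BDY_EM_A.inc"
#endif
```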
eadd10e to 9f87a57
@dudhia @rsarthur I have a one-time-step namelist that shows diffs. The diffs are in your mods (or are exposed by your mods, honestly). If I use the previous repo code that you branched from, I get bitwise identical results between differing numbers of MPI processes. If I use your code, I get bitwise diffs in the 4 corners when comparing differing numbers of MPI processes.

I generated the file to plot using only the fields that show a difference. The following diffs are after a single time step (the initial time is identical, as is usually the situation), with MPI tasks 3 vs 8. The other fields look similarly structured, though with ncview I had to rescale to get the values viewable for the specific level. The PH diff shows up at the first level up from the surface (ph=0 is a definition at the model surface).
We had another fix for corner points recently (#1314); I guess this regression test includes that. That was a change in set_physical_bc3d. Was this new call added for rho in your test?
…On Fri, Feb 5, 2021 at 11:47 AM Dave Gill ***@***.***> wrote:
@dudhia <https://github.com/dudhia> @rsarthur
<https://github.com/rsarthur>
Folks,
The case that is failing bit-for-bit is an idealized case. The real-data
cases do not seem to be missing HALO comms. It is the domain corner points
for one of the doubly periodic BC cases (I pulled these plots and the text
from the other PR).
I have a 1 time step namelist that shows diffs.
The diffs are in your mods (or are exposed by your mods, honestly). If I
use the previous repo code that you branched from, I get bitwise identical
results between differing number of MPI processes.
git checkout f311cd5
If I use your code, I get bit-wise diffs in the 4 corners, when comparing
the differing numbers of MPI processes.
***@***.*** em_quarter_ss]$ git branch
* m_opt_rho_fix
master
***@***.*** em_quarter_ss]$ mpirun -np 8 --oversubscribe wrf.exe ; mv wrfout_d01_0001-01-01_00:00:00 wrfout_d01_0001-01-01_00:00:00_MPI8
starting wrf task 4 of 8
starting wrf task 2 of 8
starting wrf task 6 of 8
starting wrf task 0 of 8
starting wrf task 1 of 8
starting wrf task 3 of 8
starting wrf task 7 of 8
starting wrf task 5 of 8
***@***.*** em_quarter_ss]$ mpirun -np 3 --oversubscribe wrf.exe ; mv wrfout_d01_0001-01-01_00:00:00 wrfout_d01_0001-01-01_00:00:00_MPI3
starting wrf task 1 of 3
starting wrf task 0 of 3
starting wrf task 2 of 3
***@***.*** em_quarter_ss]$ ../../external/io_netcdf/diffwrf wrfout_d01_0001-01-01_00:00:00_MPI*
Just plot F
Diffing wrfout_d01_0001-01-01_00:00:00_MPI3 wrfout_d01_0001-01-01_00:00:00_MPI8
Next Time 0001-01-01_00:00:00
Field Ndifs Dims RMS (1) RMS (2) DIGITS RMSE pntwise max
Next Time 0001-01-01_00:00:12
Field Ndifs Dims RMS (1) RMS (2) DIGITS RMSE pntwise max
U 1547 3 0.2350595367E+02 0.2350595368E+02 9 0.2985E-04 0.8158E-04
V 2046 3 0.3400045178E+01 0.3400045176E+01 9 0.2973E-04 0.6012E-03
W 3332 3 0.3461880543E-01 0.3461873113E-01 5 0.5524E-04 0.1863E-02
PH 2129 3 0.5000753094E+03 0.5000753217E+03 7 0.7080E-03 0.3194E-04
T 738 3 0.7617058079E+02 0.7617058085E+02 9 0.4959E-05 0.1059E-05
THM 738 3 0.7617058079E+02 0.7617058085E+02 9 0.4959E-05 0.1059E-05
MU 56 2 0.4388289142E+03 0.4388289142E+03 10 0.2492E-02 0.8765E-04
P 1891 3 0.1184337757E+03 0.1184337754E+03 8 0.1062E-02 0.1777E-03
QVAPOR 520 3 0.4564871386E-02 0.4564871385E-02 10 0.1237E-08 0.6452E-05
I generated the file to plot using only the fields that show a difference:
ncdiff --variable U,V,W,PH,T,THM,MU,P,QVAPOR wrfout_d01_0001-01-01_00:00:00_MPI[38] diffmpi38.nc
The following diffs are after a single time step (the initial time is
identical, as is usually the situation), with mpi tasks 3 vs 8. The other
fields look similarly structured, though with ncview I had to rescale to
get the values viewable for the specific level. The PH diff shows up at the
first level up from the surface (ph=0 is a definition at the model surface).
MU
[image: Screen Shot 2021-01-07 at 8 44 46 PM]
<https://user-images.githubusercontent.com/12666234/103971990-39ca4b80-5129-11eb-968f-17039a54a690.png>
U
[image: Screen Shot 2021-01-07 at 8 45 49 PM]
<https://user-images.githubusercontent.com/12666234/103972036-5a92a100-5129-11eb-8726-450dde071cd6.png>
QVAPOR
[image: Screen Shot 2021-01-07 at 8 46 42 PM]
<https://user-images.githubusercontent.com/12666234/103972089-78f89c80-5129-11eb-80aa-1ddf078fe2ae.png>
…OD_BDY_EM_A in solve_em.
Aha! My ongoing WRF education continues... @dudhia could you explain why you thought to add the new halo/bc calls for rho in that particular place in the code?
@rsarthur
The symptom of not passing a parallel test when using i+1 etc. points straight to a halo issue. I looked for the last halo call before the diffusion, which was in solve_em, and yes, rho was not in it. Then I looked for other places where rho might get changed between that halo and the diffusion code, and saw the bc3d calls, which seemed to be needed in a similar way to halos.
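To make the "i+1 points straight to a halo issue" logic concrete, here is a minimal stand-alone sketch (plain Python with made-up names; nothing here is WRF code) of a 1-D periodic field split across two ranks, where a staggered average that reads rho(i-1) is only correct if the halo cell has been exchanged first.

```python
# Toy 1-D periodic domain split across two "ranks", illustrating why a
# stencil that reads rho(i-1) -- like the staggered averaging in the
# cal_titau routines -- needs a halo exchange. All names are illustrative.

NX = 8
rho = [float(i + 1) for i in range(NX)]            # full periodic field

# Serial reference: rho averaged to staggered points,
# rhoavg(i) = 0.5 * (rho(i-1) + rho(i)), periodic in i.
rhoavg_serial = [0.5 * (rho[i - 1] + rho[i]) for i in range(NX)]

def patch_with_halo(lo, hi):
    """Subdomain [lo, hi) plus one halo cell filled from the neighbor."""
    return [rho[(lo - 1) % NX]] + rho[lo:hi]       # the "halo exchange"

def rhoavg_on_patch(patch):
    """Staggered average on a patch whose element 0 is the halo cell."""
    return [0.5 * (patch[i] + patch[i + 1]) for i in range(len(patch) - 1)]

# Two-rank decomposition with exchanged halos: identical to the serial run.
rhoavg_parallel = (rhoavg_on_patch(patch_with_halo(0, NX // 2))
                   + rhoavg_on_patch(patch_with_halo(NX // 2, NX)))
assert rhoavg_parallel == rhoavg_serial

# A stale (never-exchanged) halo corrupts the first point of the patch.
stale_patch = [0.0] + rho[NX // 2:NX]
assert rhoavg_on_patch(stale_patch) != rhoavg_on_patch(patch_with_halo(NX // 2, NX))
```

The last assertion is this PR's symptom in miniature: with a stale halo, the answer depends on how the domain is decomposed, which shows up as a non-bit-for-bit result across MPI task counts.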
@davegill Done. Let me know if I should add anything else.
@dudhia @weiwangncar @rsarthur
So this fixed the errors now, right? I'm still struggling to understand this. Because to me, it seems that after the HALO communication that @dudhia mentioned, there is another one shortly before the diffusion calculations in module_first_rk_step_part2.F: HALO_EM_TKE_E, PERIOD_BDY_EM_PHY_BC, and set_physical_bc3d in the routine phy_bc, which all include rho. Can anybody explain why these communications do not work?
If that is the case, then adding the set_physical_bc3d call for rho may have been the only fix needed, but the suggested HALO call was before that.
@matzegoebel I was wondering the same thing... Perhaps it is because rho is modified in module_first_rk_step_part1 in subroutine phy_prep? But yes, the errors are fixed now.
It is possible that the set_physical_bc3d call was all that was needed and not the HALO call, but there is no harm in leaving it in the HALO.
Jimy
I'm still a bit confused because, as I wrote earlier, there is another physical_bc3d call for rho in the subroutine phy_bc just before the diffusion calculations. From what I see the diffusion calculations are the first spot where the HALO points of rho are needed, so halo communications right before the diffusion should be sufficient.
@rsarthur I don't think this makes a difference, because the halo points are not needed in phy_prep, so halo communication in module_first_rk_step_part2 should be enough. I'm just trying to understand this issue because it might help to understand future bugs. Anyway, maybe it is just some strange compiler behavior.
The modifications using rho were in routines called from cal_deform_and_div, which is towards the top of the part2 module. So this is before the code you were looking at.
However, maybe not. I will look again.
I still don't know why this fix worked.
Bug fix to use correct rho value in cal_titau subroutines (#1396)

TYPE: bug fix

KEYWORDS: rho, sfs_opt, m_opt, cal_titau

SOURCE: Robert Arthur and Jeff Mirocha (LLNL)

DESCRIPTION OF CHANGES:
Problem:
In the cal_titau subroutines in module_diffusion_em, rho was not always interpolated to the correct location on the staggered grid. This issue affected the subgrid stress terms when sfs_opt .gt. 1. Additionally, when sfs_opt .eq. 0 and m_opt .eq. 1, the actual subgrid stresses applied in the code were correct, but the output was wrong. This is a follow-on from issue #1214 by @matzegoebel, which addressed the surface stress output from subroutine vertical_diffusion_2 when sfs_opt .gt. 0 or m_opt .eq. 1, but did not address the cal_titau subroutines.

Solution:
A rhoavg variable was added to cal_titau_12_21, cal_titau_13_31, and cal_titau_23_32 and used instead of the cell-centered rho variable. Additionally, because cal_titau_12_21 now uses "corner" ghost points for rho, a new halo/bc communication was added to solve_em.F to ensure that rho is specified correctly at those points. Correct specification of the corner points also requires the updates to subroutine set_physical_bc3d in #1314, included in the v4.2.2 release. Without the latter two updates for corner points, bit-for-bit errors occurred in regression tests for idealized, doubly-periodic domains.

LIST OF MODIFIED FILES:
dyn_em/module_diffusion_em.F
dyn_em/solve_em.F
Registry/Registry.EM_COMMON

TESTS CONDUCTED:
1. Idealized, doubly-periodic tests with various processor breakdowns show bit-for-bit identical results.
2. Jenkins is all pass.

RELEASE NOTE: Bug fix to use correct rho value in cal_titau subroutines. In the cal_titau subroutines in module_diffusion_em, rho was not always interpolated to the correct location on the staggered grid. This issue affected the subgrid stress terms when sfs_opt .gt. 1. Additionally, when sfs_opt .eq. 0 and m_opt .eq. 1, the actual subgrid stresses applied in the code were correct, but the output was wrong. A rhoavg variable was added to cal_titau_12_21, cal_titau_13_31, and cal_titau_23_32 and used instead of the cell-centered rho variable.
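As a closing illustration of why cal_titau_12_21 specifically needs "corner" ghost points: the tau_12/tau_21 stress lives at a horizontal cell corner, so averaging rho there plausibly touches four cells, one of which belongs to the diagonal neighbor. The 4-point stencil and every name below are a hypothetical sketch, not the actual cal_titau_12_21 code.

```python
# Sketch of why averaging rho to a horizontal cell corner (the tau_12/tau_21
# location) needs *corner* ghost points, not just edge halos. The 4-point
# stencil is a plausible stand-in for WRF's rhoavg, not copied from the code.

def rho_field(i, j):
    """Some smooth 'global' density field; toy values only."""
    return 1.0 + 0.1 * i + 0.01 * j

def rhoavg_corner(rho, i, j):
    """rho averaged to the corner shared by cells (i-1,j-1)..(i,j)."""
    return 0.25 * (rho(i - 1, j - 1) + rho(i, j - 1)
                   + rho(i - 1, j) + rho(i, j))

# For the corner at the lower-left of patch-owned cell (0, 0), the stencil
# touches (i-1, j-1): a cell owned by the *diagonal* neighbor.
needed = [(-1, -1), (0, -1), (-1, 0), (0, 0)]
diagonal_neighbor_cells = [(i, j) for (i, j) in needed if i < 0 and j < 0]
assert diagonal_neighbor_cells == [(-1, -1)]

# If the edge halos are filled but the corner ghost cell is stale (here 0.0),
# the corner average is wrong even though every edge-halo value is right.
def rho_stale_corner(i, j):
    return 0.0 if (i < 0 and j < 0) else rho_field(i, j)

good = rhoavg_corner(rho_field, 0, 0)
bad = rhoavg_corner(rho_stale_corner, 0, 0)
assert good != bad
```

Since only corner-touching points go wrong, this kind of bug surfaces exactly as the 4-corner, doubly-periodic, MPI-count-dependent diffs reported in this thread.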