Actually fix MPI synchronization in real.exe #1267 #1526
honnorat wants to merge 6 commits into wrf-model:release-v4.3.3
@honnorat
To be honest, I do not fully understand how MPI communicators are managed in WRF. However, the bug #1267 that I encountered happened with https://github.com/wrf-model/WRF/blob/master/main/module_wrf_top.F#L206 The root cause seems to be that the communicator [...] Except for one: a cluster of Intel Xeon E5-2650 v4 running CentOS Linux release 7.6.1810, with Intel Parallel Studio XE (various versions, including 2018u3 and 2020u4) and Intel MPI Library (same version). On this setup, I get a random crash due to a bad broadcast of the namelist configuration (put in [...]
Closed in favor of #1600
TYPE: bug fix
KEYWORDS: real.exe,MPI,bug fix
SOURCE: Marc Honnorat (EXWEXs)
DESCRIPTION OF CHANGES:
Problem:
Issue #1267 explains the problem. It was supposedly fixed by #1299, but only the first version of its predecessor
#1268 contained a complete solution, which was unfortunately removed during the review process.
In some obscure circumstances, the MPI broadcast of the initial configuration in real.exe was not fully performed.
Solution:
The same piece of code in wrf.exe never triggered this kind of bug. Following that model, we temporarily switch the MPI context to mpi_comm_allcompute.
ISSUE:
Fixes #1267
LIST OF MODIFIED FILES:
M main/real_em.F
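The change to main/real_em.F amounts to a save, switch, restore pattern around the configuration broadcast. The sketch below is a toy Python model of that pattern only, not WRF code and not real MPI: `ToyRuntime`, `use_communicator`, `broadcast_config`, and the communicator names are hypothetical, with `"all_compute"` standing in for mpi_comm_allcompute. It assumes, purely for illustration, that the communicator active at broadcast time covers fewer ranks than intended; the PR itself does not fully characterize the failure.

```python
from contextlib import contextmanager


class ToyRuntime:
    """Toy stand-in for a runtime with one 'current' communicator (no real MPI)."""

    def __init__(self, communicators, current):
        # communicators: name -> list of rank ids belonging to that communicator
        self.communicators = communicators
        self.current = current
        self.received = {}  # rank id -> last payload delivered by a broadcast

    def bcast(self, payload):
        # Deliver the payload to every rank of the *current* communicator only.
        for rank in self.communicators[self.current]:
            self.received[rank] = payload


@contextmanager
def use_communicator(runtime, name):
    saved = runtime.current       # save the active communicator
    runtime.current = name        # switch for the duration of the block
    try:
        yield runtime
    finally:
        runtime.current = saved   # always restore, even if the block raises


def broadcast_config(runtime, config):
    # Broadcast under the wide communicator, then fall back to the saved one.
    with use_communicator(runtime, "all_compute"):
        runtime.bcast(config)


if __name__ == "__main__":
    rt = ToyRuntime({"all_compute": [0, 1, 2, 3], "local": [0, 1]}, current="local")
    broadcast_config(rt, b"namelist")
    print(sorted(rt.received))  # [0, 1, 2, 3]
    print(rt.current)           # local
```

Restoring in a `finally` clause keeps the caller's context intact even when the broadcast fails, which is the property the temporary switch relies on.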
TESTS CONDUCTED:
On the only machine where I have seen the bug occur, this change fixes the problem. No other test was conducted, since I could not reproduce the bug on another setup.
RELEASE NOTE: Fixes MPI synchronization bug in real.exe #1267