Fix MPI synchronization in real #1299
Simplification of PR wrf-model#1268. Fixes wrf-model#1267.
@honnorat #if ( defined( DM_PARALLEL ) && ( ! defined( STUBMPI ) ) )
Done. But this is not necessary since every call to
@honnorat I wanted a modification to this PR to see whether the Jenkins testing failed. This seemed like a safe modification to try. The Jenkins tests passed. Did you receive an email from Jenkins?
After the first commit, yes. Everything seemed OK.
From the last commit: |
TYPE: bug fix
KEYWORDS: mpi, real, bug
SOURCE: Marc Honnorat (EXWEXs)
DESCRIPTION OF CHANGES:
Problem:
When running real.exe on multiple processes with MPI, one or more processes occasionally crash in setup_physics_suite (in share/module_check_a_mundo.F). This has been linked to wrf_dm_initialize being non-blocking from an MPI point of view.
The crash occurs in setup_physics_suite (share/module_check_a_mundo.F#L2640) because that routine uses model_config_rec % physics_suite, which on some machines is not yet initialized. The behavior is as if the broadcast of model_config_rec performed just before in main/real_em.F#L124 had not been received by all processes.
I had never seen this bug before; it has only happened on one machine (an Intel-based cluster using Intel-MPI and ifort). The current fix makes sure that all processes are synchronized before proceeding with setup_physics_suite. It solves the issue on my machine. Because the barrier comes immediately after the namelist is read, and the namelist is read and broadcast only once per real / WRF / ndown run, no performance impact is expected.
Solution:
An MPI barrier is added at the end of wrf_dm_initialize to force all processes to be synchronized before checking namelist consistency.
This is a simplification of PR wrf-model#1268, which had extra white space.
ISSUE:
Fixes wrf-model#1267
LIST OF MODIFIED FILES:
M external/RSL_LITE/module_dm.F
TESTS CONDUCTED:
1. On the only machine where I have seen the bug occur, this change fixes the problem. No other test was conducted since I couldn't reproduce the bug on another setup.
2. Jenkins testing is all PASS.
RELEASE NOTES: When running real.exe on multiple processes with MPI, one or more processes occasionally crash in setup_physics_suite (in share/module_check_a_mundo.F). This has been traced to the fact that wrf_dm_initialize is non-blocking from an MPI point of view. The problem is intermittent and has only happened on one machine (an Intel-based cluster using Intel-MPI and ifort). An MPI barrier has been added at the end of wrf_dm_initialize to force all processes to be synchronized before checking namelist consistency.
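For readers unfamiliar with the pattern, the following is a minimal standalone sketch of "broadcast the configuration, then synchronize before anyone inspects it". It is not the code from external/RSL_LITE/module_dm.F: the program name, the use of MPI_COMM_WORLD, the physics_suite variable (a stand-in for model_config_rec % physics_suite), and the value 'conus' are all assumptions made for illustration. The actual change may use a different communicator and is presumably wrapped in the DM_PARALLEL / STUBMPI guard discussed above.

```fortran
program barrier_after_bcast
  use mpi
  implicit none
  integer :: ierr, rank
  ! Hypothetical stand-in for model_config_rec % physics_suite
  character(len=32) :: physics_suite

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Only rank 0 "reads the namelist"; every other rank starts blank.
  physics_suite = ' '
  if (rank == 0) physics_suite = 'conus'

  ! Distribute the configuration from rank 0 to all ranks.
  call MPI_Bcast(physics_suite, len(physics_suite), MPI_CHARACTER, &
                 0, MPI_COMM_WORLD, ierr)

  ! Make every rank wait here until all ranks have reached this point,
  ! so that no rank runs the consistency check ahead of the others.
  call MPI_Barrier(MPI_COMM_WORLD, ierr)

  ! Stand-in for the namelist consistency check.
  if (len_trim(physics_suite) == 0) then
     print *, 'rank', rank, ': physics_suite is not set'
  else
     print *, 'rank', rank, ': physics_suite = ', trim(physics_suite)
  end if

  call MPI_Finalize(ierr)
end program barrier_after_bcast
```

With a typical MPI installation this would be built and run with something like `mpif90 barrier_after_bcast.f90` followed by `mpirun -np 4 ./a.out`, although wrapper names vary between MPI distributions. The barrier costs one collective call per run, which is consistent with the expectation above that the fix has no measurable performance impact.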