Fix MPI synchronization in real #1299

Merged
davegill merged 2 commits into wrf-model:release-v4.2.2 from EXWEXs:fix-real-mpi
Oct 23, 2020
Conversation

@honnorat
Copy link
Contributor

@honnorat honnorat commented Oct 12, 2020

TYPE: bug fix

KEYWORDS: mpi, real, bug

SOURCE: Marc Honnorat (EXWEXs)

DESCRIPTION OF CHANGES:
Problem:
When running real.exe on multiple processes with MPI, one or more processes occasionally crash in setup_physics_suite
(in share/module_check_a_mundo.F). This has been linked to the fact that wrf_dm_initialize is non-blocking from an MPI point of view.

The real.exe occasionally crashes in setup_physics_suite (in share/module_check_a_mundo.F#L2640) because the latter
uses model_config_rec % physics_suite, which on some machines is not initialized. The behavior is as if the broadcast of
model_config_rec performed just before in main/real_em.F#L124 had not been received by all processes.
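For context, the read-then-broadcast pattern at issue looks roughly like the following (a simplified Fortran sketch, not the actual WRF code; the routine and variable names here are illustrative stand-ins for the model_config_rec handling in main/real_em.F):

```fortran
! Simplified sketch: rank 0 reads the namelist, then broadcasts the
! configuration to every other rank. In WRF the real record is packed
! and broadcast as bytes; a single character field stands in for it here.
subroutine read_and_bcast_namelist(comm)
  use mpi
  implicit none
  integer, intent(in) :: comm
  integer :: myrank, ierr
  character(len=256) :: physics_suite   ! stand-in for model_config_rec % physics_suite

  call MPI_Comm_rank(comm, myrank, ierr)
  if (myrank == 0) then
     ! only rank 0 touches the namelist file
     physics_suite = 'CONUS'
  end if
  ! every rank must hold the same configuration before the
  ! namelist consistency checks in setup_physics_suite run
  call MPI_Bcast(physics_suite, len(physics_suite), MPI_CHARACTER, 0, comm, ierr)
end subroutine read_and_bcast_namelist
```

MPI_Bcast is itself blocking on each rank, which is what makes the observed behavior surprising; the barrier below acts as a belt-and-braces synchronization point.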

I had never seen this bug before; it has only happened on one machine (an Intel-based cluster using Intel MPI and ifort).

The current fix makes sure that all processes are well synced before proceeding with setup_physics_suite. It solves the
issue on my machine. Since this is immediately after reading in the namelist, no performance issues are expected as this
read and broadcast of the namelist occurs only once per real / WRF / ndown run.

Solution:
An MPI barrier is added at the end of wrf_dm_initialize to force all of the processes to be synchronized before checking
the namelist consistency.
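As a sketch, the change amounts to something like the following at the end of wrf_dm_initialize (simplified, not a verbatim diff; the names local_communicator and ierr are assumptions based on typical module_dm.F usage):

```fortran
! Sketch of the fix: a barrier at the end of wrf_dm_initialize so that
! no task proceeds to the namelist consistency checks before every task
! has completed initialization and received the broadcast configuration.
! The preprocessor guard keeps serial (STUBMPI) builds compiling.
#if ( defined( DM_PARALLEL ) && ( ! defined( STUBMPI ) ) )
      CALL mpi_barrier( local_communicator, ierr )
#endif
```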

This is a simplification of PR #1268, which had extra white space.

ISSUE:
Fixes #1267

LIST OF MODIFIED FILES:
M external/RSL_LITE/module_dm.F

TESTS CONDUCTED:

  1. On the only machine where I have seen the bug occur, this change fixes the problem. No other test was conducted since I couldn't reproduce the bug on another setup.
  2. Jenkins testing is all PASS.

RELEASE NOTES: When running real.exe on multiple processes with MPI, one or more processes occasionally crash in setup_physics_suite (in share/module_check_a_mundo.F). This has been traced to the fact that wrf_dm_initialize is non-blocking from an MPI point of view. The problem is intermittent and has only happened on one machine (an Intel-based cluster using Intel-MPI and ifort). An MPI barrier has been added at the end of wrf_dm_initialize to force all processes to be synchronized before checking namelist consistency.

@davegill
Copy link
Contributor

@honnorat
Marc,
I don't have permission to try a mod:
Use this instead:

#if ( defined( DM_PARALLEL ) && ( ! defined( STUBMPI ) ) )

@honnorat
Copy link
Contributor Author

Done. But this is not necessary, since every call to wrf_dm_initialize is already fenced this way. If we had to take this path, we would have to do the same for each of the 55 other occurrences of #ifndef STUBMPI in module_dm.F.

@davegill
Copy link
Contributor

@honnorat
Marc,
I completely agree with you, this is not necessary.

I wanted a modification to this PR to see if the jenkins testing failed. This seemed like a safe modification to try. The jenkins testing worked. Did you receive an email from jenkins?

@honnorat
Copy link
Contributor Author

Did you receive an email from jenkins?

After the first commit, yes. Everything seemed ok.

@davegill
Copy link
Contributor

@honnorat
Marc,
Could you put that email text in one of these comments? Thanks

@davegill
Copy link
Contributor

Please find result of the WRF regression test cases in the attachment. This build is for Commit ID: aef012df1aab738aeb9805dd1c6afaf305e307f9, requested by: honnorat for PR: https://github.com/wrf-model/WRF/pull/1299. For any query please send e-mail to David Gill.

    Test Type              | Expected | Received | Failed
    =======================|==========|==========|=======
    Number of Tests        |       19 |       18 |
    Number of Builds       |       48 |       46 |
    Number of Simulations  |      166 |      164 |      0
    Number of Comparisons  |      105 |      104 |      0

    Failed Simulations are: 
    None
    Which comparisons are not bit-for-bit: 
    None

@honnorat
Copy link
Contributor Author

From the last commit:
wrf_output.zip

Please find result of the WRF regression test cases in the attachment.
This build is for Commit ID: aef012df1aab738aeb9805dd1c6afaf305e307f9,
requested by: honnorat for PR: https://github.com/wrf-model/WRF/pull/1299.
For any query please send e-mail to David Gill.

    Test Type              | Expected | Received | Failed
    =======================|==========|==========|=======
    Number of Tests        |       19 |       18 |
    Number of Builds       |       48 |       46 |
    Number of Simulations  |      166 |      164 |      0
    Number of Comparisons  |      105 |      104 |      0

    Failed Simulations are: 
    None
    Which comparisons are not bit-for-bit: 
    None

@davegill
Copy link
Contributor

@honnorat
Marc,
Thanks for this contribution! Welcome aboard the WRF developer's train.

@davegill davegill merged commit ac76162 into wrf-model:release-v4.2.2 Oct 23, 2020
vlakshmanan-scala pushed a commit to scala-computing/WRF that referenced this pull request Apr 4, 2024