Actually fix MPI synchronization in real.exe #1267 #1526

Closed
honnorat wants to merge 6 commits into wrf-model:release-v4.3.3 from EXWEXs:fix-real-1268

@honnorat honnorat commented Jun 17, 2021

TYPE: bug fix

KEYWORDS: real.exe,MPI,bug fix

SOURCE: Marc Honnorat (EXWEXs)

DESCRIPTION OF CHANGES:
Problem:
Issue #1267 explains the problem. It was supposedly fixed by #1299, but only the first version of its predecessor, #1268, contained a complete solution, which was unfortunately removed during the review process.

In some obscure circumstances, the MPI broadcast of the initial configuration in real.exe was not fully performed:

   CALL get_config_as_buffer( configbuf, configbuflen, nbytes )
   CALL wrf_dm_bcast_bytes( configbuf, nbytes )
   CALL set_config_as_buffer( configbuf, configbuflen )

Solution:
The same piece of code in wrf.exe never triggered this kind of bug. Following that model, we temporarily switch the MPI context to mpi_comm_allcompute.

ISSUE:
Fixes #1267

LIST OF MODIFIED FILES:
M main/real_em.F

TESTS CONDUCTED:
On the only machine where I have seen the bug occur, this change fixes the problem. No other test was conducted since I couldn't reproduce the bug on another setup.

RELEASE NOTE: Fixes MPI synchronization bug in real.exe #1267

@honnorat honnorat requested review from a team as code owners June 17, 2021 16:15
@davegill

@honnorat
Marc,

  1. Would you explain the logic for saving a communicator, setting it to a different value, doing the initialization, and then reverting to the saved communicator?
  2. Would you describe the system that is causing the trouble, such as OS, compiler, architecture, etc.?

@honnorat

To be honest, I do not fully understand the management of MPI communicators in WRF. However, the bug #1267 I encountered happened with real.exe but never with wrf.exe, despite their similar initialization procedures. Therefore the present fix mimics the behaviour of wrf.exe:

https://github.com/wrf-model/WRF/blob/master/main/module_wrf_top.F#L206

The root cause seems to be that the communicator mpi_comm_allcompute, created by subroutine split_communicator (called by init_modules(1)), was not explicitly activated for the call to wrf_dm_bcast_bytes( configbuf, nbytes ) in real.exe. On most of the platforms I use, this does not seem to be a problem.

Except for one: a cluster of Intel Xeon E5-2650 v4 running CentOS Linux release 7.6.1810, with Intel Parallel Studio XE (various versions, including 2018u3 and 2020u4) and Intel MPI Library (same versions). On this setup, I get random crashes due to a bad broadcast of the namelist configuration (put in configbuf by the call to get_config_as_buffer()) across the MPI processes.
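
To make the save / switch / broadcast / restore sequence concrete, here is a minimal standalone sketch. It is not WRF code and it does not reproduce the bug: the module variable active_comm and the routine bcast_bytes are hypothetical stand-ins for the communicator that WRF's broadcast helper uses implicitly, and the MPI_Comm_split call is only a simplified stand-in for what split_communicator does. It assumes an MPI installation that provides the Fortran `mpi` module.

```fortran
! Sketch only: standalone MPI program, not WRF code. All names are hypothetical.
module active_comm_mod
  use mpi
  implicit none
  ! Stand-in for the communicator that the broadcast helper uses implicitly.
  integer :: active_comm
contains
  subroutine bcast_bytes(buf, nbytes)
    character(len=*), intent(inout) :: buf
    integer, intent(in) :: nbytes
    integer :: ierr
    ! Broadcast from rank 0 of whichever communicator is active right now.
    call MPI_Bcast(buf, nbytes, MPI_CHARACTER, 0, active_comm, ierr)
  end subroutine bcast_bytes
end module active_comm_mod

program bcast_on_compute_comm
  use active_comm_mod
  implicit none
  integer :: ierr, world_rank, compute_comm, saved_comm
  character(len=32) :: configbuf

  call MPI_Init(ierr)
  active_comm = MPI_COMM_WORLD
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

  ! Simplified stand-in for split_communicator: every rank gets color 0, so
  ! compute_comm is a distinct communicator covering the same processes.
  call MPI_Comm_split(MPI_COMM_WORLD, 0, world_rank, compute_comm, ierr)

  ! Only rank 0 holds the configuration before the broadcast.
  configbuf = ''
  if (world_rank == 0) configbuf = 'namelist contents'

  saved_comm  = active_comm     ! save the communicator currently in use
  active_comm = compute_comm    ! switch explicitly to the compute communicator
  call bcast_bytes(configbuf, len(configbuf))
  active_comm = saved_comm      ! restore the previous communicator

  print '(a,i0,a,a)', 'rank ', world_rank, ' received: ', trim(configbuf)

  call MPI_Comm_free(compute_comm, ierr)
  call MPI_Finalize(ierr)
end program bcast_on_compute_comm
```

The point of the pattern is that the broadcast never depends on whatever communicator happened to be active beforehand, and the rest of the initialization sees the same communicator as before once the restore has run. It can be built with an MPI compiler wrapper (for example mpif90, if your installation provides it) and launched on a few processes.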

@honnorat honnorat changed the base branch from release-v4.3.1 to release-v4.3.2 October 29, 2021 12:34
@honnorat honnorat changed the base branch from release-v4.3.2 to release-v4.3.3 December 15, 2021 07:43
@honnorat

Closing in favor of #1600.

@honnorat honnorat closed this Dec 15, 2021