Remove hard-coded max number of tasks for obs nudging #698

Merged
smileMchen merged 3 commits into wrf-model:release-v4.0.3 from smileMchen:OBS
Dec 11, 2018

@smileMchen
Collaborator

@smileMchen smileMchen commented Nov 16, 2018

TYPE: bug fix

KEYWORDS: obs nudging, max number of tasks

SOURCE: internal

DESCRIPTION OF CHANGES:
Problem:
The maximum number of processors, 1024, is hard-coded in module_dm.F for observation nudging.
If a user requests more MPI tasks than this maximum, the code fails with a segmentation fault.

Solution:
In the routine where two arrays were dimensioned by the hard-coded maximum number of MPI
tasks, those two arrays are now declared as ALLOCATABLE and are allocated based on
the total number of MPI ranks.
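
The idea behind the fix can be illustrated with a small, language-neutral sketch. It is written in Python rather than Fortran and is not the WRF code. It assumes that the array named `idisplacement` in the error message below holds per-rank offsets into a gathered vector, which this PR does not state; the function names and the `OLD_MAX_TASKS` constant are hypothetical.

```python
OLD_MAX_TASKS = 1024  # the hard-coded limit described above (name is hypothetical)


def displacements(counts):
    """Return the starting offset of each rank's block in a gathered vector.

    The result is sized from len(counts), i.e. from the number of ranks,
    instead of from a fixed maximum.
    """
    offsets = []
    total = 0
    for n in counts:
        offsets.append(total)  # rank r starts after the blocks of ranks 0..r-1
        total += n
    return offsets


def displacements_fixed(counts, max_tasks=OLD_MAX_TASKS):
    """Mimic the old behaviour: a fixed-size table that overflows past max_tasks."""
    table = [0] * max_tasks
    total = 0
    for rank, n in enumerate(counts):
        table[rank] = total  # raises IndexError once rank reaches max_tasks
        total += n
    return table[:len(counts)]
```

With 1044 ranks, `displacements` returns 1044 offsets, while `displacements_fixed` raises `IndexError` at the 1025th rank. Python reports the out-of-range access directly; a Fortran build without bounds checking would instead write past the end of the array.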

LIST OF MODIFIED FILES:
M external/RSL_LITE/module_dm.F

TESTS CONDUCTED:

  1. Applied new code to a user's case, which shows the code works as expected.
  2. No bit-wise diffs with a smaller test case, before vs. after mods: I built the code with the ./configure -d option and ran a small test case with 1 processor and with 36 processors. OBS nudging is turned on. Both runs cover a 3-hour period. Results are identical.
  3. Test case with > 1024 MPI tasks: A large case (derived from a user's case) was also tested. In this case, the code was built with the ./configure -D option. Without the change, the case crashed immediately. The error message is:
OBS NUDGING is requested on a total of  2 domain(s).
++++++CALL ERROB AT KTAU =     0 AND INEST =  1:  NSTA =     0 ++++++
At line 5741 of file module_dm.f90
Fortran runtime error: Index '1025' of dimension 1 of array 'idisplacement' above upper bound of 1024
Error termination. Backtrace:
#0  0x782093 in __module_dm_MOD_get_full_obs_vector
	at /glade/scratch/chenming/WRFHELP/WRFV3.9.1.1_intel_dmpar_large-file/frame/module_dm.f90:5741
#1  0xffffffffffffffff in ???

With the code change, the case can run successfully for 6 hours.
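
The bit-for-bit check in test 2 can be sketched generically as a byte-level comparison of two output files. This is only an illustration: the PR does not say which tool was used for the comparison, and the function names here are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def bitwise_identical(path_a, path_b):
    """True if the two files have identical contents, byte for byte."""
    return file_digest(path_a) == file_digest(path_b)
```

A call such as `bitwise_identical(before_path, after_path)` would be made once per pair of output files, where both paths are placeholders for the outputs of the runs being compared.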

RELEASE NOTE: After removing a hard-coded limit for an assumed maximum number of MPI tasks, the observation nudging code for WRF now supports more than 1024 MPI tasks. If users previously ran the obs nudging code with 1024 or fewer MPI tasks, the original code is OK. However, if users tried to run obs nudging with > 1024 MPI tasks, the code likely died from a segmentation fault while trying to access an array index beyond the declared bound.

@jonggwan

It has a small memory footprint, but it needs a DEALLOCATE for the allocated arrays before the subroutine exits.

@smileMchen
Collaborator Author

@jonggwan
Thanks for the suggestion. The code has been modified.

@davegill davegill changed the title from "Bug fix for hard coded number of processors" to "Remove hard-coded max number of tasks for obs nudging" Nov 19, 2018
@davegill
Contributor

@jonggwan
That is a good catch. We might have come back into the routine and tried to allocate arrays that were already allocated.

@davegill
Contributor

@smileMchen
Ming,

  1. You should run a test with > 1024 MPI tasks (29*36=1044). Run with "configure -D". The results with and without your mods for this "big" job should be explicitly reported in your testing.
  2. Run a test that shows no bit-for-bit differences (this one should NOT use "configure -D", since it will be expensive). Since this is an engineering / framework change, we should be able to replicate previous results with obs nudging for a small job.
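
The task count in item 1 can be checked with a line of arithmetic. The sketch below assumes that "29*36" means 29 nodes with 36 MPI tasks each, which the comment does not spell out; the function name is hypothetical.

```python
OLD_MAX_TASKS = 1024  # the hard-coded limit this PR removes


def min_nodes_to_exceed(limit, tasks_per_node):
    """Smallest node count whose total task count is strictly above limit."""
    return limit // tasks_per_node + 1


nodes = min_nodes_to_exceed(OLD_MAX_TASKS, 36)
total_tasks = nodes * 36
```

Under that assumption, 28 nodes give 1008 tasks, which is still within the old limit, so 29 nodes (1044 tasks) is the smallest full-node job that exercises the fix.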

@davegill
Contributor

davegill commented Dec 5, 2018

@smileMchen
Ming,
I am waiting on two items to approve this PR:

  1. You should run a test with > 1024 MPI tasks (29*36=1044). Run with "configure -D". The results with and without your mods for this "big" job should be explicitly reported in your testing.
  2. Run a test that shows no bit-for-bit differences (this one should NOT use "configure -D", since it will be expensive). Since this is an engineering / framework change, we should be able to replicate previous results with obs nudging for a small job.

@smileMchen
Collaborator Author

smileMchen commented Dec 11, 2018

@davegill
Dave,
(1) I built the code with the ./configure -d option and ran a small test case with 1 processor and with 36 processors. OBS nudging is turned on. Both runs cover a 3-hour period. Results are identical. (Results are saved at /glade/scratch/chenming/WRFHELP/WRF_OBS/test/em_real)

(2) A large case (derived from a user's case) was also tested. In this case, the code was built with the ./configure -D option. Without the change, the case crashed immediately. The error message is:

OBS NUDGING is requested on a total of  2 domain(s).
++++++CALL ERROB AT KTAU =     0 AND INEST =  1:  NSTA =     0 ++++++
At line 5741 of file module_dm.f90
Fortran runtime error: Index '1025' of dimension 1 of array 'idisplacement' above upper bound of 1024
Error termination. Backtrace:
#0  0x782093 in __module_dm_MOD_get_full_obs_vector
	at /glade/scratch/chenming/WRFHELP/WRFV3.9.1.1_intel_dmpar_large-file/frame/module_dm.f90:5741
#1  0xffffffffffffffff in ???

With the code change, the case can run successfully for 6 hours.

Please let me know if you need more information.

Contributor

@davegill davegill left a comment

Approved

@smileMchen smileMchen merged commit cc5e11e into wrf-model:release-v4.0.3 Dec 11, 2018