Add p7.2 tests and updates#969
Conversation
and update CDMBWD_c384 values to those used in C384
which includes prognostic aerosols.
|
I'm assuming other than moving staged data to proper locations that there might be some requests for changes and I will work to get those done as soon as possible. I will be on leave Dec 23-38th and Raffaele is out through Jan 10th so if anything requires his help for updates "as soon as possible" might be January. |
| # use shared memory and OpenFabrics Alliance (OFA) fabric with Intel MPI to circumvent RDMA-related bug in DAPL. | ||
| setenv I_MPI_FABRICS shm:ofa | ||
| # Intel MPI setting to circumvent RDMA-related bug in DAPL. | ||
| setenv I_MPI_DAPL_UD 1 |
There was a problem hiding this comment.
Is this variable available on other platforms too? Or is there change/updates required to use set this option on other platforms?
There was a problem hiding this comment.
@rmontuoro made this change, I'm unsure if this is needed elsewhere but I'll see what I can figure out
There was a problem hiding this comment.
Please remove this line too as it is deprecated as Intel 17. Hera admin suggested us not using it.
There was a problem hiding this comment.
Setting I_MPI_DAPL_UD to 1 is required when running coupled GOCART with Intel MPI 2018.0.4. The model will crash otherwise.
There was a problem hiding this comment.
@rmontuoro @JessicaMeixner-NOAA I saw 25 tracers are added in the cpld_bmark_aero test (compared to bmark test) and the number of nodes does not increase. and it takes 35 mins to finish. Since all the RT tests are required to finish within 30mins. I ran the test with 20 tasks/node (TPN_cpl_bmrk=20) and without setting I_MPI_DAPL_UD, the test finished in 24 mins. Since the ufs-weather-model supported platforms have different MPI versions, I think this might be helpful to port the aerosol code on other platforms without the MPI specific I_MPI setting.
There was a problem hiding this comment.
@junwang-noaa - Thank you for the suggestion. I've rerun the regression test using 20 tasks/node within 27 minutes of wall clock time on Hera, so I agree to setting TPN_cpl_bmrk=20.
My test also did not set I_MPI_DAPL_UD. Note that, as mentioned in previous discussions, setting I_MPI_DAPL_UD to 1 enables the connectionless DAPL UD transport, which is crucial when running the coupled model (w/ aerosols) on thousands of cores since the default transport would not scale sufficiently, causing timeout errors in the GOCART History component. This regression test uses only 560 MPI tasks and the default DAPL transport may still be adequate. Prototype runs, however, use well above 1k MPI tasks. In such cases, using the connectionless DAPL transport is highly recommended if not required to prevent communication failures.
Note also that the Intel MPI library "switched from the Open Fabrics Alliance* (OFA) framework to the Open Fabrics Interfaces* (OFI) framework" with release 2019. Therefore, the DAPL fabric is deprecated starting with the 2019 release (see Intel documentation). The UFS weather model is using Intel MPI 2018 on Hera and Orion.
There was a problem hiding this comment.
I think it's OK to use DAPL, but the variable I_MPI_DAPL_UD is a depreciated option since Intel 17. Because of this, the impact on srun/slurm is unknown. Some of opn runs are using a couple of thousands tasks without using this setting. Do you have any test case with thousand tasks we can take look the performance of not using this setting? I am not sure what is the problem with those runs with thousand tasks
There was a problem hiding this comment.
@junwang-noaa - DAPL and I_MPI_DAPL_UD are deprecated since release 2019 of Intel MPI. We are still using Intel MPI 2018 on some platforms.
Setting I_MPI_DAPL_UD to 1 (as recommended by NASA) is necessary when running GOCART on a large number of MPI tasks, since the MAPL I/O implementation is based on MPI one-sided communication using Remote Memory Access (RMA) for higher performance. This requires a RDMA-capable communication fabric such as DAPL. Enabling DAPL connectionless transport (I_MPI_DAPL_UD=1) addresses scalability issues/failures at higher core counts and reduces the overall memory footprint. This setting is only needed when running the UFS weather model with GOCART.
|
|
||
| # Add aerosols (temporary) | ||
| COMPILE | -DAPP=S2SW -DUFS_GOCART=ON -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1 | + hera.intel | fv3 | | ||
| RUN | cpld_bmark_p7_aero | + hera.intel | fv3 | |
There was a problem hiding this comment.
I suggest to change the test name to cpld_bmark_p7.2
There was a problem hiding this comment.
Since we are moving forward to P8 should this be cpld_bmkark_p8_aero instead or do you want this to strictly follow p7.2?
There was a problem hiding this comment.
Unless the entire feature test suite for P8 gets updated in this PR (meaning all resolutions), I consider this a feature test for s2sw+aerosols, outside of the Prototype feature set. So my suggestion is that this test is cpld_bmark_aero or something similar.
|
Several comments:
|
|
In response to your questions @junwang-noaa
|
|
@junwang-noaa the new input data has been staged in the nemspara area. The PR description was updated to specifically name the tests that changed answers. I'm still working on the oRT tests as I need to make a non-wave aerosol coupled test and then run the oRT test. |
|
@rmontuoro Please see the https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top.html, click on "download "PDF", search for DAPL, it clearly states on page 16: The DAPL, tcp. tmi and omi fabris are depreciated since Intel MPI Library 2017 update 1. |
@junwang-noaa - This statement contradicts Intel's own documentation (perhaps a typo?). Please see Intel MPI 2018 developer reference guide. Page 81 lists all the supported fabrics and page 98 documents |
|
Closing as it's replaced by #1071 which moves tests to p8b settings. |
PR Checklist
Ths PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.
This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR
An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes in contained in the PR. The Issue and the dependent sub-component PR
are specified below.
Results for one or more of the regression tests change and the reasons for the changes are understood and explained below.
New or updated input data is required by this PR. If checked, please work with the code managers to update input data sets on all platforms.
Instructions: All subsequent sections of text should be filled in as appropriate.
The information provided below allows the code managers to understand the changes relevant to this PR, whether those changes are in the ufs-weather-model repository or in a subcomponent repository. Ufs-weather-model code managers will use the information provided to add any applicable labels, assign reviewers and place it in the Commit Queue. Once the PR is in the Commit Queue, it is the PR owner's responsiblity to keep the PR up-to-date with the develop branch of ufs-weather-model.
Description
This updates the benchmark tests to use p7.2 features which include new ICs and the orographic gravity wave drag parameter setting (cdmbgwd) is updated for C384 from “1.1,0.72,1.0,1.0” to "1.0,2.2,1.0,1.0" to enhance orographic gravity-wave drag and improve 500-hPa ACC. This parameter is changed for all C384 tests based on feedback from @yangfanglin
This also adds a test cpld_bmark_p7_aero which includes aerosols in a p7 configuration.
Needed input updates:
New ICs for benchmark tests:
Currently staged on hera:
/scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/BM_IC-20211021
Additional input for aerosols:
Currently staged on hera:
/scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/input-data-20211210/GOCART/p7/
Co-author: @rmontuoro
Issue(s) addressed
Link the issues to be closed with this PR, whether in this repository, or in another repository.
(Remember, issues must always be created before starting work on a PR branch!)
Testing
How were these changes tested? What compilers / HPCs was it tested with? Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Have regression tests and unit tests (utests) been run? On which platforms and with which compilers? (Note that unit tests can only be run on tier-1 platforms)
Current testing has been with hera.intel. New baselines will be required for any C384 test that uses the new default uGWD setting in addition to the benchmark test changing due to this and different ICs. This means the following existing tests change answers:
cpld_bmark_p7 cpld_bmark_mpi_p7 and cpld_control_c384_p7 all change answers
control_c384 changes answers because of the CDMBWD_c384 parameter change.
Note, the gdas tests do not change because the cdmbwd parameter is hard-coded.
Dependencies
If testing this branch requires non-default branches in other repositories, list them. Those branches should have matching names (ideally).
Do PRs in upstream repositories need to be merged first? No