Skip to content

Add p7.2 tests and updates#969

Closed
JessicaMeixner-NOAA wants to merge 12 commits into
ufs-community:developfrom
JessicaMeixner-NOAA:feature/p7.2
Closed

Add p7.2 tests and updates#969
JessicaMeixner-NOAA wants to merge 12 commits into
ufs-community:developfrom
JessicaMeixner-NOAA:feature/p7.2

Conversation

@JessicaMeixner-NOAA
Copy link
Copy Markdown
Collaborator

@JessicaMeixner-NOAA JessicaMeixner-NOAA commented Dec 20, 2021

PR Checklist

  • Ths PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes in contained in the PR. The Issue and the dependent sub-component PR
    are specified below.

  • Results for one or more of the regression tests change and the reasons for the changes are understood and explained below.

  • New or updated input data is required by this PR. If checked, please work with the code managers to update input data sets on all platforms.

Instructions: All subsequent sections of text should be filled in as appropriate.

The information provided below allows the code managers to understand the changes relevant to this PR, whether those changes are in the ufs-weather-model repository or in a subcomponent repository. Ufs-weather-model code managers will use the information provided to add any applicable labels, assign reviewers and place it in the Commit Queue. Once the PR is in the Commit Queue, it is the PR owner's responsiblity to keep the PR up-to-date with the develop branch of ufs-weather-model.

Description

This updates the benchmark tests to use p7.2 features which include new ICs and the orographic gravity wave drag parameter setting (cdmbgwd) is updated for C384 from “1.1,0.72,1.0,1.0” to "1.0,2.2,1.0,1.0" to enhance orographic gravity-wave drag and improve 500-hPa ACC. This parameter is changed for all C384 tests based on feedback from @yangfanglin
This also adds a test cpld_bmark_p7_aero which includes aerosols in a p7 configuration.

Needed input updates:
New ICs for benchmark tests:
Currently staged on hera:
/scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/BM_IC-20211021
Additional input for aerosols:
Currently staged on hera:
/scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/input-data-20211210/GOCART/p7/

Co-author: @rmontuoro

Issue(s) addressed

Link the issues to be closed with this PR, whether in this repository, or in another repository.
(Remember, issues must always be created before starting work on a PR branch!)

Testing

How were these changes tested? What compilers / HPCs was it tested with? Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Have regression tests and unit tests (utests) been run? On which platforms and with which compilers? (Note that unit tests can only be run on tier-1 platforms)

Current testing has been with hera.intel. New baselines will be required for any C384 test that uses the new default uGWD setting in addition to the benchmark test changing due to this and different ICs. This means the following existing tests change answers:
cpld_bmark_p7 cpld_bmark_mpi_p7 and cpld_control_c384_p7 all change answers
control_c384 changes answers because of the CDMBWD_c384 parameter change.
Note, the gdas tests do not change because the cdmbwd parameter is hard-coded.

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss_cray
  • wcoss_dell_p3
  • opnReqTest for newly added/changed feature
  • CI

Dependencies

If testing this branch requires non-default branches in other repositories, list them. Those branches should have matching names (ideally).

Do PRs in upstream repositories need to be merged first? No

@JessicaMeixner-NOAA
Copy link
Copy Markdown
Collaborator Author

I'm assuming other than moving staged data to proper locations that there might be some requests for changes and I will work to get those done as soon as possible. I will be on leave Dec 23-38th and Raffaele is out through Jan 10th so if anything requires his help for updates "as soon as possible" might be January.

# use shared memory and OpenFabrics Alliance (OFA) fabric with Intel MPI to circumvent RDMA-related bug in DAPL.
setenv I_MPI_FABRICS shm:ofa
# Intel MPI setting to circumvent RDMA-related bug in DAPL.
setenv I_MPI_DAPL_UD 1
Copy link
Copy Markdown
Collaborator

@junwang-noaa junwang-noaa Dec 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this variable available on other platforms too? Or is there change/updates required to use set this option on other platforms?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmontuoro made this change, I'm unsure if this is needed elsewhere but I'll see what I can figure out

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this line too as it is deprecated as Intel 17. Hera admin suggested us not using it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmontuoro you okay with this?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting I_MPI_DAPL_UD to 1 is required when running coupled GOCART with Intel MPI 2018.0.4. The model will crash otherwise.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmontuoro @JessicaMeixner-NOAA I saw 25 tracers are added in the cpld_bmark_aero test (compared to bmark test) and the number of nodes does not increase. and it takes 35 mins to finish. Since all the RT tests are required to finish within 30mins. I ran the test with 20 tasks/node (TPN_cpl_bmrk=20) and without setting I_MPI_DAPL_UD, the test finished in 24 mins. Since the ufs-weather-model supported platforms have different MPI versions, I think this might be helpful to port the aerosol code on other platforms without the MPI specific I_MPI setting.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junwang-noaa - Thank you for the suggestion. I've rerun the regression test using 20 tasks/node within 27 minutes of wall clock time on Hera, so I agree to setting TPN_cpl_bmrk=20.

My test also did not set I_MPI_DAPL_UD. Note that, as mentioned in previous discussions, setting I_MPI_DAPL_UD to 1 enables the connectionless DAPL UD transport, which is crucial when running the coupled model (w/ aerosols) on thousands of cores since the default transport would not scale sufficiently, causing timeout errors in the GOCART History component. This regression test uses only 560 MPI tasks and the default DAPL transport may still be adequate. Prototype runs, however, use well above 1k MPI tasks. In such cases, using the connectionless DAPL transport is highly recommended if not required to prevent communication failures.

Note also that the Intel MPI library "switched from the Open Fabrics Alliance* (OFA) framework to the Open Fabrics Interfaces* (OFI) framework" with release 2019. Therefore, the DAPL fabric is deprecated starting with the 2019 release (see Intel documentation). The UFS weather model is using Intel MPI 2018 on Hera and Orion.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK to use DAPL, but the variable I_MPI_DAPL_UD is a depreciated option since Intel 17. Because of this, the impact on srun/slurm is unknown. Some of opn runs are using a couple of thousands tasks without using this setting. Do you have any test case with thousand tasks we can take look the performance of not using this setting? I am not sure what is the problem with those runs with thousand tasks

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junwang-noaa - DAPL and I_MPI_DAPL_UD are deprecated since release 2019 of Intel MPI. We are still using Intel MPI 2018 on some platforms.

Setting I_MPI_DAPL_UD to 1 (as recommended by NASA) is necessary when running GOCART on a large number of MPI tasks, since the MAPL I/O implementation is based on MPI one-sided communication using Remote Memory Access (RMA) for higher performance. This requires a RDMA-capable communication fabric such as DAPL. Enabling DAPL connectionless transport (I_MPI_DAPL_UD=1) addresses scalability issues/failures at higher core counts and reduces the overall memory footprint. This setting is only needed when running the UFS weather model with GOCART.

Comment thread tests/rt.conf

# Add aerosols (temporary)
COMPILE | -DAPP=S2SW -DUFS_GOCART=ON -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1 | + hera.intel | fv3 |
RUN | cpld_bmark_p7_aero | + hera.intel | fv3 |
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to change the test name to cpld_bmark_p7.2

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are moving forward to P8 should this be cpld_bmkark_p8_aero instead or do you want this to strictly follow p7.2?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless the entire feature test suite for P8 gets updated in this PR (meaning all resolutions), I consider this a feature test for s2sw+aerosols, outside of the Prototype feature set. So my suggestion is that this test is cpld_bmark_aero or something similar.

@junwang-noaa
Copy link
Copy Markdown
Collaborator

Several comments:

  1. I think both Raffaele and you have the write permission to nemspara RT location. Please stage the new BM_IC to RT/nemsfv3gfs directory and to the platform you have access to, and let the code managers know which platforms you don't have access to, they can help populate the data.
  2. Can you confirm that only C384 tests changes results? If yes, please add a note in the description.
  3. Does the cpld_bmark_p7_aero pass ORT test (restart without wave)? If yes, please add a note.
    Also the ufs commit queue now goes to 12/30. This PR will probably get committed after those PRs unless there is issue with any of the PRs. Let us know if this PR has high priority. Thanks

@JessicaMeixner-NOAA
Copy link
Copy Markdown
Collaborator Author

In response to your questions @junwang-noaa

  1. Yes I have access, I tend to wait until the new inputs are finalized before wanting to make changes in the official area. Since these look okay at this point I'll make the updates on hera and then work on the other machines I have access to.
  2. Will add information when confirmed.
  3. I don't know if @rmontuoro did this, I will try and report back.

@JessicaMeixner-NOAA
Copy link
Copy Markdown
Collaborator Author

@junwang-noaa the new input data has been staged in the nemspara area. The PR description was updated to specifically name the tests that changed answers. I'm still working on the oRT tests as I need to make a non-wave aerosol coupled test and then run the oRT test.

@junwang-noaa junwang-noaa added the Baseline Updates Current baselines will be updated. label Dec 22, 2021
@junwang-noaa
Copy link
Copy Markdown
Collaborator

@rmontuoro Please see the https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top.html, click on "download "PDF", search for DAPL, it clearly states on page 16:

The DAPL, tcp. tmi and omi fabris are depreciated since Intel MPI Library 2017 update 1.

@rmontuoro
Copy link
Copy Markdown
Collaborator

@rmontuoro Please see the https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top.html, click on "download "PDF", search for DAPL, it clearly states on page 16:

The DAPL, tcp. tmi and omi fabris are depreciated since Intel MPI Library 2017 update 1.

@junwang-noaa - This statement contradicts Intel's own documentation (perhaps a typo?). Please see Intel MPI 2018 developer reference guide. Page 81 lists all the supported fabrics and page 98 documents I_MPI_DAPL_UD. This option indeed fixed RDMA-related I/O failures with Intel MPI 2018.

@JessicaMeixner-NOAA
Copy link
Copy Markdown
Collaborator Author

Closing as it's replaced by #1071 which moves tests to p8b settings.

@JessicaMeixner-NOAA JessicaMeixner-NOAA deleted the feature/p7.2 branch July 13, 2022 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Baseline Updates Current baselines will be updated.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create a regression test for ufsatm-mom6-cice6-wave-aerosol

4 participants