Add the RRFS_NA_3km pre-defined domain to the SRW App #480

Closed
JeffBeck-NOAA wants to merge 13 commits into ufs-community:develop from JeffBeck-NOAA:feature/RRFS_NA_3km

Collaborator

JeffBeck-NOAA commented Apr 27, 2021

DESCRIPTION OF CHANGES:

This PR adds an option to run the pre-defined RRFS_NA_3km domain in the SRW App. Note that in order to run longer simulations, the model code should be compiled in 32-bit mode, not 64-bit. Changes have also been made to the k/n_split values in the namelist template to optimize run time. Nodes/core settings must be modified for chgres_cube and post due to the size of this domain. A WE2E test was added, and more information on all of these settings can be found in the related config.sh script.

Note that CPUS_PER_TASK_RUN_FCST is changed from "4" to "2" in this PR. Setting this field to "4" was doubling the requested nodes for the run_fcst task. For example, a 3-km CONUS run that normally requests 25 nodes (based on predefined layout_x/y values) was asking for 50, simply because CPUS_PER_TASK_RUN_FCST=4, which was unacceptable. When it is set to "2", the number of nodes remains unchanged, in line with the layout_x/y values. EMC is using CPUS_PER_TASK_RUN_FCST=2 for their runs.
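
To make the node arithmetic above concrete, here is a minimal sketch of how a node request can scale with the CPUs reserved per MPI task. The helper name `nodes_needed`, the task count of 500, and the 40 cores per node are all illustrative assumptions chosen to reproduce the 25 vs. 50 node figures quoted above; this is not the workflow's actual formula.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the workflow): number of nodes needed
# when every MPI task reserves a fixed number of CPUs.
# Uses integer ceiling division: ceil(tasks * cpus_per_task / cores_per_node).
nodes_needed() {
  local ntasks=$1 cpus_per_task=$2 cores_per_node=$3
  local total_cpus=$(( ntasks * cpus_per_task ))
  echo $(( (total_cpus + cores_per_node - 1) / cores_per_node ))
}

NTASKS=500          # assumed MPI task count, for illustration only
CORES_PER_NODE=40   # assumed cores per node, for illustration only

nodes_needed "$NTASKS" 2 "$CORES_PER_NODE"   # prints 25
nodes_needed "$NTASKS" 4 "$CORES_PER_NODE"   # prints 50
```

Under these assumptions, doubling the CPUs per task doubles the node request even though the MPI layout itself is unchanged.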

TESTS CONDUCTED:

Multiple tests on Hera were run, including a full 36-hr forecast here: /scratch2/BMC/det/beck/FV3-LAM/expt_dirs/test_RRFS_NA_3km_36hr

CONTRIBUTORS (optional):

Thanks are due to @JamesAbeles-NOAA for his recommendations for build/namelist changes and help troubleshooting run times. Thanks to @BenjaminBlake-NOAA and @JacobCarley-NOAA for their help with the domain configuration.

Comment thread ush/config_defaults.sh
OMP_STACKSIZE_RUN_FCST="1024m"

-CPUS_PER_TASK_RUN_FCST="4"
+CPUS_PER_TASK_RUN_FCST="2"
Collaborator

@JeffBeck-NOAA I'm testing this PR on Cheyenne right now. Aside from that, my only comment is about this line: Can you provide a brief justification for this change to the defaults in your PR message? This (along with the other changes you already mention) has the potential to impact others significantly.

Collaborator Author

Thanks @mkavulich. Good catch on this. Setting this field to "4" was doubling the requested nodes for the run_fcst task. For example, a 3-km CONUS run that normally requests 25 nodes (based on predefined layout_x/y values) was asking for 50, simply because CPUS_PER_TASK_RUN_FCST=4, which was unacceptable. When it is set to "2", the number of nodes remains unchanged, in line with the layout_x/y values. EMC is using CPUS_PER_TASK_RUN_FCST=2 for their runs. I can add this to the PR message.

PREDEF_GRID_NAME="RRFS_NA_3km"
QUILTING="TRUE"

CCPP_PHYS_SUITE="FV3_GSD_SAR"
Collaborator

One more comment: Should this use the RRFS_v1alpha suite instead?

Collaborator Author

Yes, it probably should be. I can change it in the PR, but with the problems we're having with RRFS_v1alpha at the moment, we can't test with that suite (at least not on Hera).

@JeffBeck-NOAA
Collaborator Author

The pre-defined RRFS_NA_3km domain has been incorporated into PR #492. Closing this PR.

mkavulich added a commit that referenced this pull request May 25, 2021
…A_3km pre-defined domain, update timestep and MPI settings (#492)

## DESCRIPTION OF CHANGES: 
This PR accomplishes three things: 

1. A new pre-defined domain (RRFS_NA_3km) has been added to the SRW App. Nodes/core settings must be modified for chgres_cube and post due to the size of this domain. A WE2E test was added, and more information on all of these settings can be found in the related config.sh script (tests/baseline_configs/config.grid_RRFS_NA_3km.sh).
2. The default k_split value is updated for faster model integration. With k_split=2, model integration is ~30% faster than with the previous settings for the same weather model hash. **This will not affect physics suites that have specified other k_split values.**
3. In order to properly run the above domain with the intended FV3_RRFS_v1alpha physics suite, the weather model needed to be updated to a more recent hash. This more up-to-date weather model version has also renamed the FV3_GFS_v16beta suite to FV3_GFS_v16, which required a number of changes to the workflow and end-to-end tests. In addition, several default settings change in this PR. Changes have also been made to the k/n_split values in the namelist template to optimize run time. CPUS_PER_TASK_RUN_FCST is changed from "4" to "2". Setting this field to "4" was doubling the requested nodes for the run_fcst task. For example, a 3-km CONUS run that normally requests 25 nodes (based on predefined layout_x/y values) was asking for 50, simply because CPUS_PER_TASK_RUN_FCST=4, which was unacceptable. When it is set to "2", the number of nodes remains unchanged, in line with the layout_x/y values. EMC is using CPUS_PER_TASK_RUN_FCST=2 for their runs, so this should be uncontroversial.
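
As a rough illustration of how the settings named in this PR fit together, a user-level config.sh override might look like the sketch below. The variable names are taken from the PR description and review thread, but this particular combination is an assumption, and tests/baseline_configs/config.grid_RRFS_NA_3km.sh remains the authoritative reference for the real values.

```shell
# Illustrative sketch only; not a copy of config.grid_RRFS_NA_3km.sh.
PREDEF_GRID_NAME="RRFS_NA_3km"       # the new pre-defined domain
QUILTING="TRUE"
CCPP_PHYS_SUITE="FV3_RRFS_v1alpha"   # suite this domain is intended to run with
CPUS_PER_TASK_RUN_FCST="2"           # matches the new default; set explicitly here for clarity
```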

This PR will need to be accompanied by changes in the ufs-srweather-app for updating the weather model hash and incorporating some necessary build changes (including compiling with 32-bit reals by default); that PR has been created (ufs-community/ufs-srweather-app#140), but it is still a draft pending some platform-specific fixes and the merger of this PR.

## TESTS CONDUCTED: 
For the initial changes for PR #480, multiple tests on Hera were run, including a full 36-hr forecast here: /scratch2/BMC/det/beck/FV3-LAM/expt_dirs/test_RRFS_NA_3km_36hr

With the additional changes and updates to the weather model, and updates to the Hera environment file, all end-to-end tests (aside from nco tests) were run on Hera (intel). There were a few pre-existing failures, and aside from an occasional GST failure due to wallclock time issues (see #490), the only new failures were for grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GSD_SAR, grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_HRRR, and grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta, which all had a new failure in make_ics and make_lbcs. This issue is currently being investigated, though it is almost certainly related to the build environments, which need to be addressed in ufs-community/ufs-srweather-app#140.

## CONTRIBUTORS: 
@JeffBeck-NOAA authored the portion of these changes originating from #480, and offered the following credits on his original PR:

Thanks are due to @JamesAbeles-NOAA for his recommendations for build/namelist changes and help troubleshooting run times.  Thanks to @BenjaminBlake-NOAA and @JacobCarley-NOAA for their help with the domain configuration.

shoyokota pushed a commit to shoyokota/regional_workflow that referenced this pull request Feb 7, 2023
ufs-community#480)

* Replace save_input with save_da_output and use save_input with save_restart task

* Update config.sh.RRFS_CONUS_13km_ens

* Update config.sh.RRFS_CONUS_3km_ens

* Update config_defaults.sh

* Update config_defaults.sh

* Include saving cold start IC files in the DO_SAVE_INPUT loop