Add the RRFS_NA_3km pre-defined domain to the SRW App #480
JeffBeck-NOAA wants to merge 13 commits into
Conversation
…d nodes from exploding.
    OMP_STACKSIZE_RUN_FCST="1024m"
    -CPUS_PER_TASK_RUN_FCST="4"
    +CPUS_PER_TASK_RUN_FCST="2"
@JeffBeck-NOAA I'm testing this PR on Cheyenne right now. Aside from that, my only comment is about this line: Can you provide a brief justification for this change to the defaults in your PR message? This (along with the other changes you already mention) has the potential to impact others significantly.
Thanks @mkavulich. Good catch on this. Setting this field to "4" was doubling the requested nodes for the run_fcst task. For example, a 3-km CONUS run that normally requests 25 nodes (based on predefined layout_x/y values) was asking for 50, simply because CPUS_PER_TASK_RUN_FCST=4, which was unacceptable. When it is set to "2", the number of nodes remains unchanged, in line with the layout_x/y values. EMC is using CPUS_PER_TASK_RUN_FCST=2 for their runs. I can add this to the PR message.
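The doubling described above can be illustrated with a little arithmetic. This is only a sketch under stated assumptions: the `nodes_needed` helper, the ceiling-division formula, the 500 MPI tasks, and the 40 cores per node are all hypothetical values chosen to reproduce the 25-node and 50-node figures quoted in this thread; they are not taken from the workflow scripts.

```shell
# Hypothetical helper (not part of the SRW App workflow):
# nodes = ceil(mpi_tasks * cpus_per_task / cores_per_node)
# $1 = MPI tasks, $2 = CPUs (threads) per task, $3 = cores per node
nodes_needed() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# Assumed values, for illustration only.
MPI_TASKS=500        # hypothetical task count for a 3-km CONUS run
CORES_PER_NODE=40    # assumed cores per node

for CPUS_PER_TASK_RUN_FCST in 2 4; do
  n=$(nodes_needed "$MPI_TASKS" "$CPUS_PER_TASK_RUN_FCST" "$CORES_PER_NODE")
  echo "CPUS_PER_TASK_RUN_FCST=${CPUS_PER_TASK_RUN_FCST}: ${n} nodes"
done
```

Under these assumptions the request is 25 nodes with a value of "2" and 50 nodes with a value of "4", since the number of cores requested scales linearly with the CPUs reserved per MPI task.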
    PREDEF_GRID_NAME="RRFS_NA_3km"
    QUILTING="TRUE"
    CCPP_PHYS_SUITE="FV3_GSD_SAR"
One more comment: Should this use the RRFS_v1alpha suite instead?
Yes, it probably should be. I can change it in the PR, but with the problems we're having with RRFS_v1alpha at the moment, we can't test with that suite (at least not on Hera).
The pre-defined RRFS_NA_3km domain has been incorporated into PR #492. Closing this PR.
…A_3km pre-defined domain, update timestep and MPI settings (#492)

## DESCRIPTION OF CHANGES:
This PR accomplishes three things:
1. A new pre-defined domain (RRFS_NA_3km) has been added to the SRW App. Nodes/core settings must be modified for chgres_cube and post due to the size of this domain. A WE2E test was added, and more information on all of these settings can be found within the related config.sh script (tests/baseline_configs/config.grid_RRFS_NA_3km.sh).
2. The default k_split value is updated for a faster model integration. With k_split=2, we see model integration ~30% faster than the previous settings for the same weather model hash. **This will not affect physics suites that have specified other k_split values.**
3. In order to properly run the above domain with the intended FV3_RRFS_v1alpha physics suite, the weather model needed to be updated to a more recent hash. This more up-to-date weather model version has also renamed the FV3_GFS_v16beta suite to FV3_GFS_v16; this required a number of changes to the workflow and end-to-end tests.

In addition, several changes to default settings are occurring in this PR. Changes have also been made to k/n_split values in the namelist template to optimize run time.

CPUS_PER_TASK_RUN_FCST is changed from "4" to "2" in this PR. Setting this field to "4" was doubling the requested nodes for the run_fcst task. For example, a 3-km CONUS run that normally requests 25 nodes (based on predefined layout_x/y values) was asking for 50, simply because CPUS_PER_TASK_RUN_FCST=4, which was unacceptable. When it is set to "2", the number of nodes remains unchanged, in line with the layout_x/y values. EMC is using CPUS_PER_TASK_RUN_FCST=2 for their runs, so this should be uncontroversial.

This PR will need to be accompanied by changes in the ufs-srweather-app for updating the weather model hash and incorporating some necessary build changes (including compiling with 32-bit reals by default); this PR has been created (ufs-community/ufs-srweather-app#140) but it is still a draft pending some platform-specific fixes and the merger of this PR.

## TESTS CONDUCTED:
For the initial changes for PR #480, multiple tests on Hera were run, including a full 36-hr forecast here: /scratch2/BMC/det/beck/FV3-LAM/expt_dirs/test_RRFS_NA_3km_36hr

With the additional changes and updates to the weather model, and updates to the Hera environment file, all end-to-end tests (aside from nco tests) were run on Hera (intel). There were a few pre-existing failures, and aside from an occasional GST failure due to wallclock time issues (see #490), the only new failures were for grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GSD_SAR, grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_HRRR, and grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta, which all had a new failure in make_ics and make_lbcs. This issue is currently being investigated, though it is almost certainly related to the build environments, which need to be addressed in ufs-community/ufs-srweather-app#140.

## CONTRIBUTORS:
@JeffBeck-NOAA authored the half of these changes originating from #480, and offered the following credits on his original PR: Thanks are due to @JamesAbeles-NOAA for his recommendations for build/namelist changes and help troubleshooting run times. Thanks to @BenjaminBlake-NOAA and @JacobCarley-NOAA for their help with the domain configuration.
ufs-community#480)
* Replace save_input with save_da_output and use save_input with save_restart task
* Update config.sh.RRFS_CONUS_13km_ens
* Update config.sh.RRFS_CONUS_3km_ens
* Update config_defaults.sh
* Update config_defaults.sh
* Include saving cold start IC files in the DO_SAVE_INPUT loop
DESCRIPTION OF CHANGES:
This PR adds the pre-defined RRFS_NA_3km domain as an option in the SRW App. Note that in order to run longer simulations, the model code should be compiled in 32-bit mode, not 64-bit. Changes have also been made to the k/n_split values in the namelist template to optimize run time. Nodes/core settings must be modified for chgres_cube and post due to the size of this domain. A WE2E test was added, and more information on all of these settings can be found within the related config.sh script.
Note that CPUS_PER_TASK_RUN_FCST is changed from "4" to "2" in this PR. Setting this field to "4" was doubling the requested nodes for the run_fcst task. For example, a 3-km CONUS run that normally requests 25 nodes (based on predefined layout_x/y values) was asking for 50, simply because CPUS_PER_TASK_RUN_FCST=4, which was unacceptable. When it is set to "2", the number of nodes remains unchanged, in line with the layout_x/y values. EMC is using CPUS_PER_TASK_RUN_FCST=2 for their runs.
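For orientation, the settings touched in this PR could appear together in a user config roughly as follows. This is a minimal sketch that only collects the variable names and values visible in the diff snippets of this conversation; a real config.sh needs many more settings (machine, account, dates, and so on), and the physics suite shown is the one from the diff, not necessarily the one that should be used.

```shell
# Sketch of a config.sh fragment; values copied from the diff snippets in this PR.
# All other required workflow settings are omitted here.

PREDEF_GRID_NAME="RRFS_NA_3km"    # the new pre-defined domain
QUILTING="TRUE"

# Suite as it appears in the diff; a reviewer suggested RRFS_v1alpha instead.
CCPP_PHYS_SUITE="FV3_GSD_SAR"

# New default from this PR (previously "4").
CPUS_PER_TASK_RUN_FCST="2"
OMP_STACKSIZE_RUN_FCST="1024m"
```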
TESTS CONDUCTED:
Multiple tests on Hera were run, including a full 36-hr forecast here: /scratch2/BMC/det/beck/FV3-LAM/expt_dirs/test_RRFS_NA_3km_36hr
CONTRIBUTORS (optional):
Thanks are due to @JamesAbeles-NOAA for his recommendations for build/namelist changes and help troubleshooting run times. Thanks to @BenjaminBlake-NOAA and @JacobCarley-NOAA for their help with the domain configuration.