[develop] Add checks for correct SPP settings for the selected CCPP suite, resolve four more issues#1259

Merged
MichaelLueken merged 22 commits into ufs-community:develop from mkavulich:feature/spp_settings_check on Apr 29, 2025

Conversation

@mkavulich commented Apr 1, 2025

Collaborator

DESCRIPTION OF CHANGES:

In discussions with @gspetro and @JeffBeck-NOAA about documentation for SPP settings, we determined that the current documentation for SPP-related settings is insufficient, and that since the SPP capabilities for specific variables are tied to specific CCPP schemes, those checks should be added to setup.py and user settings adjusted accordingly.

This PR adds a check in setup.py that ensures the appropriate physics schemes are being used for the selected SPP settings and, if not, removes the SPP settings for that scheme. If no SPP settings are left after this pruning, an exception is raised. Note that many of the changes in setup.py just rearrange existing code; the logic removed from lines 1029 to 1141 is moved downwards so that all of these CCPP checks are done only when a run_fcst task is specified in the workflow.
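
The pruning described above can be sketched roughly as follows. This is a minimal illustration, not the code in setup.py: the function name `prune_spp_settings`, the `SPP_REQUIRED_SCHEME` table, and the placeholder scheme names in it are all assumptions made for the example. Only the layout of parallel lists keyed like `spp_var_list` mirrors the namelist output shown under TESTS CONDUCTED.

```python
# Hypothetical table: which CCPP scheme each SPP variable needs.
# The scheme names are placeholders, not taken from setup.py.
SPP_REQUIRED_SCHEME = {
    "pbl": "example_pbl_scheme",
    "sfc": "example_sfc_scheme",
    "mp": "example_mp_scheme",
    "rad": "example_rad_scheme",
    "gwd": "example_gwd_scheme",
}


def prune_spp_settings(spp_settings, suite_schemes):
    """Return a copy of spp_settings without variables the suite cannot perturb.

    spp_settings maps each namelist key to a list holding one value per SPP
    variable; suite_schemes is the set of scheme names in the selected suite.
    """
    keep = [
        i
        for i, var in enumerate(spp_settings["spp_var_list"])
        if SPP_REQUIRED_SCHEME.get(var) in suite_schemes
    ]
    if not keep:
        # Nothing survived the pruning, so SPP cannot be used with this suite.
        raise ValueError("No SPP settings are valid for the selected suite")
    # Keep the same positions in every parallel list so they stay aligned.
    return {key: [values[i] for i in keep] for key, values in spp_settings.items()}


# Example: a suite that only contains the (placeholder) radiation scheme.
settings = {
    "spp_var_list": ["pbl", "sfc", "rad"],
    "spp_tau": [21600.0, 21600.0, 21600.0],
}
pruned = prune_spp_settings(settings, {"example_rad_scheme"})
```

Returning a new dictionary rather than editing the input in place keeps the user's original settings available for logging what was removed.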

While making these changes to setup.py, I also added this file to the list of pytest-ed files. This required updating pylint to a newer version, since version 2.17 results in an error (and that older version is no longer supported anyway).

Additionally, this PR resolves four more issues:

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

TESTS CONDUCTED:

Ran a number of tests on Hera (Intel). Ran the test grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR, modified to use each supported suite in turn, and observed that the appropriate SPP variables were deactivated in input.nml for each suite that did not contain all of the needed schemes (FV3_GFS_v16 and FV3_WoFS_v0):

FV3_GFS_v16

&nam_sppperts
    iseed_spp = 2020081000017
    spp_lscale = 150000.0
    spp_prt_list = 0.2
    spp_sigtop1 = 0.1
    spp_sigtop2 = 0.025
    spp_stddev_cutoff = 1.5
    spp_tau = 21600.0
    spp_var_list = 'rad'
/

FV3_WoFS_v0

&nam_sppperts
    iseed_spp = 2020081000014, 2020081000015, 2020081000017
    spp_lscale = 150000.0, 150000.0, 150000.0
    spp_prt_list = 0.2, 0.2, 0.2
    spp_sigtop1 = 0.1, 0.1, 0.1
    spp_sigtop2 = 0.025, 0.025, 0.025
    spp_stddev_cutoff = 1.5, 1.5, 1.5
    spp_tau = 21600.0, 21600.0, 21600.0
    spp_var_list = 'pbl', 'sfc', 'rad'
/
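
In both blocks, every entry keeps exactly one value per remaining SPP variable. A small sketch of checking that property on a namelist already loaded into a dictionary follows. The helper name `check_spp_lengths` and the dictionary representation are assumptions for illustration; the values are copied from the FV3_WoFS_v0 block above.

```python
def check_spp_lengths(nam_sppperts):
    """Return the names of entries whose length differs from spp_var_list."""
    n_var = len(nam_sppperts["spp_var_list"])
    return [key for key, values in nam_sppperts.items() if len(values) != n_var]


# Values copied from the FV3_WoFS_v0 namelist block above.
wofs_sppperts = {
    "iseed_spp": [2020081000014, 2020081000015, 2020081000017],
    "spp_lscale": [150000.0, 150000.0, 150000.0],
    "spp_prt_list": [0.2, 0.2, 0.2],
    "spp_sigtop1": [0.1, 0.1, 0.1],
    "spp_sigtop2": [0.025, 0.025, 0.025],
    "spp_stddev_cutoff": [1.5, 1.5, 1.5],
    "spp_tau": [21600.0, 21600.0, 21600.0],
    "spp_var_list": ["pbl", "sfc", "rad"],
}
mismatched = check_spp_lengths(wofs_sppperts)
```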

Also ran smoke (old and new) and fundamental WE2E tests; all passed.

  • derecho.intel
  • gaea.intel
  • gaea-c6.intel
  • hera.gnu
  • hera.intel
  • hercules.intel
  • jet.intel
  • orion.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

None

DOCUMENTATION:

Documentation is updated accordingly for latest SPP options.

ISSUE:

Resolves #1228, #1258, #1275, #1276

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

@mkavulich mkavulich changed the title Add checks for correct SPP settings for the selected CCPP suite, resolve two more issues [develop] Add checks for correct SPP settings for the selected CCPP suite, remove duplicate smoke/dust field table entries for RRFS-sas Apr 1, 2025
@mkavulich mkavulich changed the title [develop] Add checks for correct SPP settings for the selected CCPP suite, remove duplicate smoke/dust field table entries for RRFS-sas [develop] Add checks for correct SPP settings for the selected CCPP suite, resolve two more issues Apr 1, 2025
Comment thread tests/WE2E/machine_suites/comprehensive.derecho
@MichaelLueken

Collaborator

@mkavulich -

Following the merging of PR #1204, there are two conflicts in ush/setup.py. Please merge the HEAD of develop into your feature/spp_settings_check branch at your earliest convenience. Thanks!

Comment thread ush/setup.py Outdated
Comment thread ush/setup.py Outdated

@benkozi benkozi left a comment

Collaborator

I made a few minor suggestions, but otherwise this looks good to me!

@MichaelLueken

Collaborator

@mkavulich -

I have opened PR #4 into your feature/spp_settings_check branch to update @benkozi's GitHub username in the .github/CODEOWNERS file. At your earliest opportunity, please approve and merge this PR into your branch. Thank you very much!

[feature/spp_settings_check] Update Ben's GitHub username in CODEOWNERS file
@MichaelLueken added the run_we2e_coverage_tests label (Run the coverage set of SRW end-to-end tests) Apr 25, 2025
@mkavulich

Collaborator Author

@benkozi I think I addressed your comments, let me know what you think.

@benkozi

benkozi commented Apr 25, 2025

Collaborator

> @benkozi I think I addressed your comments, let me know what you think.

@mkavulich - Looks good to me. Thanks for making the changes.

@MichaelLueken

Collaborator

The UFS_Fire WE2E tests have successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
UFS_FIRE_multifire_two-way-coupled_20250425085208                  COMPLETE              46.01
UFS_FIRE_one-way-coupled_20250425085238                            COMPLETE              47.91
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              93.92

The AQM WE2E test has successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20250425102112                   COMPLETE            6250.51
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            6250.51

@MichaelLueken

Collaborator

The vx-det_multicyc_first-obs-00z_ncep-hrrr WE2E test failed on Hera Intel. The test failed in the get_obs_mrms task. A rocotoboot allowed the task to properly run and the test subsequently passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
2019_memorial_day_heat_wave_20250425144058                         COMPLETE              67.50
custom_ESGgrid_Peru_12km_20250425144118                            COMPLETE              39.00
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2025042  COMPLETE               9.51
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             146.25
get_from_HPSS_ics_HRRR_lbcs_RAP_20250425144233                     COMPLETE              19.73
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              19.77
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20250425144335  COMPLETE              13.85
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202504251  COMPLETE             741.47
MET_ensemble_verification_only_vx_time_lag_20250425144454          COMPLETE               4.68
pregen_grid_orog_sfc_climo_20250425144545                          COMPLETE              12.29
smoke_dust_grid_RRFS_CONUS_25km_suite_RRFS_sas_20250425144616      COMPLETE              65.02
vx-det_long-fcst_custom-vx-config_aiml-graphcast_20250425144650    COMPLETE               0.72
vx-det_multicyc_long-fcst-overlap_nssl-mpas_20250425144724         COMPLETE              22.40
vx-det_multicyc_long-fcst-no-overlap_nssl-mpas_20250425144802      COMPLETE              31.91
vx-det_multicyc_first-obs-00z_ncep-hrrr_20250425144843             COMPLETE               2.77
vx-det_multicyc_no-00z-obs_nssl-mpas_20250425144922                COMPLETE               3.54
vx-det_multicyc_no-fcst-overlap_ncep-hrrr_20250425145002           COMPLETE               6.58
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202504  COMPLETE              44.62
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1251.61

The Jenkins tests have passed for Azure, Derecho, Gaea C5, Gaea C6, Google, Hera GNU, Hercules, and AWS PW. Two tests are still sitting in queue on Orion. I will merge this PR once those two tests have completed.

@MichaelLueken

Collaborator

@mkavulich -

Following the updates to ush/setup.py earlier today, I'm now encountering issues while attempting to launch the WE2E tests. I'm receiving the following error message on Orion:

*********************************************************************
FATAL ERROR:
Experiment generation failed. See the error message(s) printed below.
For more detailed information, check the log file from the workflow
generation script: log.run_WE2E_tests
*********************************************************************

Traceback (most recent call last):
  File "/work/noaa/epic/mlueken/ufs-srweather-app/orion/tests/WE2E/./run_we2e_tests.py", line 723, in <module>
    run_we2e_tests(srw_dir, user_args)
  File "/work/noaa/epic/mlueken/ufs-srweather-app/orion/tests/WE2E/./run_we2e_tests.py", line 268, in run_we2e_tests
    expt_dir = generate_FV3LAM_wflow(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/work/noaa/epic/mlueken/ufs-srweather-app/orion/tests/WE2E/../../ush/generate_FV3LAM_wflow.py", line 305, in generate_FV3LAM_wflow
    setup_fv3_namelist(expt_config,debug)
  File "/work/noaa/epic/mlueken/ufs-srweather-app/orion/tests/WE2E/../../ush/generate_FV3LAM_wflow.py", line 611, in setup_fv3_namelist
    if sdf_uses_ruc_lsm := workflow_config["SDF_USES_RUC_LSM"]:
                           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
KeyError: 'SDF_USES_RUC_LSM'

I'll check and see if this issue is appearing on other platforms as well.

@mkavulich

Collaborator Author

@MichaelLueken I assume this is related not to the updates I made today but to the original changes. I moved some logic that is only necessary when running a run_fcst task so that it executes only in that case, but now other logic in generate_workflow (specifically, the logic for generating the run_fcst namelist) references unset variables. That logic should also be skipped when the run_fcst task is not executed. I'm running a test now, but I think the fix is simple.
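
A rough sketch of the kind of guard described here, assuming a hypothetical helper `fcst_namelist_inputs` and a plain list of task names. The only names taken from the traceback are `SDF_USES_RUC_LSM` and `run_fcst`; everything else is a stand-in for the real experiment configuration.

```python
def fcst_namelist_inputs(workflow_config, task_names):
    """Collect forecast-only settings, or None when no forecast is run.

    workflow_config is a plain dict and task_names a list of task names;
    both are simplified stand-ins for the real experiment configuration.
    """
    if "run_fcst" not in task_names:
        # The CCPP checks were skipped, so keys such as SDF_USES_RUC_LSM
        # were never set; avoid reading them at all.
        return None
    return {"sdf_uses_ruc_lsm": workflow_config["SDF_USES_RUC_LSM"]}


# Without run_fcst the missing key is never accessed, so no KeyError occurs.
no_fcst = fcst_namelist_inputs({}, ["make_grid"])
with_fcst = fcst_namelist_inputs({"SDF_USES_RUC_LSM": True}, ["run_fcst"])
```

Checking for the task explicitly, instead of falling back to a default with `dict.get`, keeps a genuinely missing key visible as an error whenever a forecast is actually requested.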

@MichaelLueken

Collaborator

Thanks for the quick fix, @mkavulich! After pulling your modification, I was able to generate all experiments on Orion:

Continuous mode: will monitor jobs until all are complete
Setup complete; monitoring 10 experiments
Use ctrl-c to pause job submission/monitoring
Reading database for experiment 2020_CAD_20250425152513, updating experiment dictionary
Reading database for experiment custom_ESGgrid_SF_1p1km_20250425152550, updating experiment dictionary
Reading database for experiment deactivate_tasks_20250425152625, updating experiment dictionary
Reading database for experiment get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2mems_20250425152648, updating experiment dictionary
Reading database for experiment grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20250425152726, updating experiment dictionary
Reading database for experiment grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20250425152814, updating experiment dictionary
Reading database for experiment grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20250425152906, updating experiment dictionary
Reading database for experiment grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20250425152959, updating experiment dictionary
Reading database for experiment grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_20250425153102, updating experiment dictionary
Reading database for experiment MET_verification_smoke_only_vx_20250425153226, updating experiment dictionary

@MichaelLueken

Collaborator

@mkavulich -

I think I realize now what is happening - the Jenkins tests automatically merge the latest develop into a branch before building and running the coverage WE2E tests. The Jenkins tests were able to run without issue for all platforms this morning because changes that have been merged to develop since Christina's PR #1204 allowed the experiments to generate properly. While in a standalone state, as noted above, your branch is now running smoothly, merging the latest changes in develop causes experiments to fail to generate:

  Creating rocoto workflow XML file (WFLOW_XML_FP):
    WFLOW_XML_FP = '/scratch1/NCEPDEV/stmp2/Michael.Lueken/ufs-srweather-app/expt_dirs/deactivate_tasks/FV3LAM_wflow.xml'
0 schema-validation errors found in Rocoto config
0 Rocoto XML validation errors found

*********************************************************************
FATAL ERROR:
Experiment generation failed. See the error message(s) printed below.
For more detailed information, check the log file from the workflow
generation script: log.run_WE2E_tests
*********************************************************************

Traceback (most recent call last):
  File "/scratch1/NCEPDEV/stmp2/Michael.Lueken/ufs-srweather-app/hera/tests/WE2E/./run_we2e_tests.py", line 723, in <module>
    run_we2e_tests(srw_dir, user_args)
  File "/scratch1/NCEPDEV/stmp2/Michael.Lueken/ufs-srweather-app/hera/tests/WE2E/./run_we2e_tests.py", line 268, in run_we2e_tests
    expt_dir = generate_FV3LAM_wflow(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch1/NCEPDEV/stmp2/Michael.Lueken/ufs-srweather-app/hera/tests/WE2E/../../ush/generate_FV3LAM_wflow.py", line 329, in generate_FV3LAM_wflow
    "n_var_spp": N_VAR_SPP,
                 ^^^^^^^^^
NameError: name 'N_VAR_SPP' is not defined

I was testing the deactivate_tasks WE2E test because this was the test that was causing issues on Orion. Please merge the latest develop into your feature/spp_settings_check branch and ensure that the experiment properly generates. Thanks!

@MichaelLueken

Collaborator

@mkavulich -

I kicked off a quick test (deactivate_tasks) on Gaea C6 via Jenkins this morning and the test failed to generate the experiment:

*********************************************************************
FATAL ERROR:
Experiment generation failed. See the error message(s) printed below.
For more detailed information, check the log file from the workflow
generation script: log.run_WE2E_tests
*********************************************************************
  
Traceback (most recent call last):
  File "/gpfs/f6/bil-fire8/scratch/role.epic/jenkins/workspace/s-srweather-app_pipeline_PR-1259/gaeac6/tests/WE2E/./run_we2e_tests.py", line 723, in <module>
    run_we2e_tests(srw_dir, user_args)
  File "/gpfs/f6/bil-fire8/scratch/role.epic/jenkins/workspace/s-srweather-app_pipeline_PR-1259/gaeac6/tests/WE2E/./run_we2e_tests.py", line 268, in run_we2e_tests
    expt_dir = generate_FV3LAM_wflow(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/f6/bil-fire8/scratch/role.epic/jenkins/workspace/s-srweather-app_pipeline_PR-1259/gaeac6/tests/WE2E/../../ush/generate_FV3LAM_wflow.py", line 329, in generate_FV3LAM_wflow
    "n_var_spp": N_VAR_SPP,
                 ^^^^^^^^^
NameError: name 'N_VAR_SPP' is not defined

It looks like the fix you applied to get experiment generation to work in your branch is causing issues when the HEAD of develop is merged to your branch in the Jenkins testing process. Please merge the latest HEAD of develop into your branch at your earliest convenience and try to correct this final issue. Thank you very much for your time!

@mkavulich

Collaborator Author

@MichaelLueken turns out this is another issue with my original changes: I did not realize that in cases where make_grid is run but run_fcst is not, make_grid still needs access to the fv3 namelist file. This really shouldn't be the case, but rearranging all this logic is beyond the scope of my PR, so I just added the correct logic to ensure that this case is covered, and now the deactivate_tasks experiment completes successfully. I also ran a fundamental suite as a sanity check to hopefully check that I'm not missing any more edge cases, but I think my changes are correct/sane.
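
The condition described above could be expressed along these lines. This is only a sketch: `NAMELIST_TASKS` and `needs_fv3_namelist` are hypothetical names, and the task list is assumed to be a simple collection of strings rather than the real workflow configuration.

```python
# Tasks that, per the comment above, need access to the FV3 namelist file.
NAMELIST_TASKS = frozenset({"make_grid", "run_fcst"})


def needs_fv3_namelist(task_names):
    """Return True when any task in the workflow needs the FV3 namelist."""
    return not NAMELIST_TASKS.isdisjoint(task_names)


# make_grid without run_fcst still requires the namelist.
grid_only = needs_fv3_namelist(["make_grid", "make_orog"])
neither = needs_fv3_namelist(["get_extrn_ics"])
```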

@MichaelLueken

Collaborator

Thanks, @mkavulich, for finding and addressing this last issue! I have kicked off the Jenkins test for Gaea C6 and am rerunning the tests on Orion as well. Once these tests pass, I will move forward with merging this PR.

@MichaelLueken

Collaborator

The manual run of the WE2E coverage tests on Orion has successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
2020_CAD_20250428135117                                            COMPLETE              97.08
custom_ESGgrid_SF_1p1km_20250428135150                             COMPLETE             528.45
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE            2138.67
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20250428135  COMPLETE            1073.72
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              87.07
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2  COMPLETE             812.27
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202504  COMPLETE              57.18
MET_verification_smoke_only_vx_20250428135558                      COMPLETE               0.86
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            4795.30

@MichaelLueken

Collaborator

The Gaea C6 Jenkins rerun has successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_NewZealand_3km_20250428152222                       COMPLETE              76.03
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_sas_2025  COMPLETE              37.08
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2025042815  COMPLETE              40.16
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             424.32
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2025042  COMPLETE              35.06
smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf_20250428152332        COMPLETE            1172.44
2020_CAPE_20250428152351                                           COMPLETE              43.09
2020_easter_storm_20250428152408                                   COMPLETE              40.50
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202504  COMPLETE              28.21
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1896.89

Tests were also run on Azure (the coverage suite that contains the deactivate_tasks WE2E test):

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_RRFS_lbcs_RRFS_suite_RRFS_sas_202  COMPLETE              17.99
deactivate_tasks_20250428211734                                    COMPLETE               2.01
specify_template_filenames_20250428211749                          COMPLETE              12.90
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20250428211  DEAD                  11.38
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20250428211842  COMPLETE             176.18
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202504  COMPLETE              31.44
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                 251.90

The rerun of the grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP WE2E test successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20250429132  COMPLETE              30.67
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              30.67

@MichaelLueken

Collaborator

All WE2E coverage tests are now passing for this PR. The previous issues with the deactivate_tasks WE2E test have been addressed.

Moving forward with merging this PR now.

@MichaelLueken MichaelLueken merged commit b4f4bf6 into ufs-community:develop Apr 29, 2025
6 of 7 checks passed
@mkavulich mkavulich deleted the feature/spp_settings_check branch April 21, 2026 19:23

Labels

run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests

Projects

None yet

3 participants