[develop] Improvements for WE2E tests: script features, additional tests, remove unsupported domains by mkavulich · Pull Request #871 · ufs-community/ufs-srweather-app

mkavulich · 2023-07-28T03:52:52Z

DESCRIPTION OF CHANGES:

This represents the final round of WE2E test improvements related to Issue #587, including new tests, additional features, script improvements, and removal of unsupported domains.

Script improvements

`run_WE2E_tests.py`

- Adds the ability to run all tests in a subdirectory of test_configs as a group
- Replaces --use_cron_to_relaunch with a --launch argument that can take the values "python", "cron", or "none" (the last of which creates the experiments but does not run them)
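
As a rough illustration of the argument change above, the sketch below declares a `--launch` option with `argparse` and keeps the retired `--use_cron_to_relaunch` flag only so that old command lines get a pointer to the replacement. The function names, the default value, the help text, and the error wording are assumptions made for this example; they are not taken from `run_WE2E_tests.py`.

```python
import argparse


def build_parser():
    """Build a parser with a --launch option (illustrative sketch only)."""
    ap = argparse.ArgumentParser(description="Sketch of WE2E launch options")
    ap.add_argument(
        "--launch", type=str, choices=["python", "cron", "none"],
        default="python",  # assumed default, not confirmed by the PR text
        help="How to launch experiments; 'none' creates them without running",
    )
    # Retired flag is still accepted so that it can be reported clearly
    ap.add_argument("--use_cron_to_relaunch", action="store_true",
                    help=argparse.SUPPRESS)
    return ap


def parse_launch(argv):
    """Return the chosen launch mode, rejecting the retired flag."""
    ap = build_parser()
    args = ap.parse_args(argv)
    if args.use_cron_to_relaunch:
        # ap.error prints a usage message and exits with status 2
        ap.error("--use_cron_to_relaunch has been replaced; use --launch=cron")
    return args.launch
```

With this layout, `parse_launch(["--launch", "none"])` returns `"none"`, while passing the old flag exits with a message naming the new one.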

`monitor_jobs.py`

- Adds a --mode flag; if specified as "advance", the script runs rocotorun only once for each experiment, then quits.
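
The difference between the two modes can be sketched as a loop that either polls until every experiment is finished or makes a single pass. Everything named here (`monitor`, `run_rocotorun`, `is_done`, `poll_seconds`, and the workflow file names) is a hypothetical stand-in for illustration, not the structure of `monitor_jobs.py`.

```python
import subprocess
import time


def run_rocotorun(expt_dir):
    """Advance one experiment's workflow with a single rocotorun call."""
    # The XML and database file names are assumptions for this sketch
    subprocess.run(
        ["rocotorun", "-w", "FV3LAM_wflow.xml", "-d", "FV3LAM_wflow.db"],
        cwd=expt_dir, check=False,
    )


def monitor(expt_dirs, mode="continuous", advance_one=run_rocotorun,
            is_done=lambda expt_dir: False, poll_seconds=60):
    """Advance experiments; in "advance" mode make one pass and return."""
    remaining = list(expt_dirs)
    passes = 0
    while remaining:
        for expt_dir in remaining:
            advance_one(expt_dir)
        passes += 1
        if mode == "advance":
            break  # one rocotorun per experiment, then quit
        # Continuous mode: drop finished experiments and keep polling
        remaining = [d for d in remaining if not is_done(d)]
        if remaining:
            time.sleep(poll_seconds)
    return passes
```

Passing `advance_one` and `is_done` as parameters keeps the loop testable without Rocoto installed.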

`generate_FV3LAM_wflow.py` and `set_FV3nml_sfc_climo_filenames.py`

- Add a --debug flag that gives more verbose output. The "VERBOSE" variable from config.yaml is no longer read for the generate script.
- Make most prints "debug only"
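
One common way to get "debug only" messages is to route them through Python's `logging` module and raise or lower the level from a command-line switch. The sketch below assumes that approach; the logger name and function names are invented for the example and may not match how these scripts are organized.

```python
import argparse
import logging


def parse_args(argv):
    """Parse a --debug switch that defaults to False."""
    ap = argparse.ArgumentParser(description="Sketch of a --debug switch")
    ap.add_argument("--debug", action="store_true", default=False,
                    help="Print more verbose output")
    return ap.parse_args(argv)


def setup_logging(debug=False):
    """Return a logger that emits debug messages only when requested."""
    logger = logging.getLogger("wflow_sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    if not logger.handlers:
        logger.addHandler(logging.StreamHandler())
    return logger


if __name__ == "__main__":
    log = setup_logging(parse_args(None).debug)
    log.info("Always shown")
    log.debug("Shown only with --debug")
```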

`get_crontab_contents.py`

- Overhaul the script to remove global variables and add arguments as needed
- Fix a bug where submitting jobs from cron does not work if your crontab is empty (issue #876, "A cron job can only be added if a crontab already exists")
- Change functionality so that the script removes the specified line from the crontab if the "--remove" flag is provided; otherwise it prints the crontab contents
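
The empty-crontab case and the remove-or-print behavior can be sketched with two small functions. This is an illustration under assumptions: `read_crontab` and `remove_line` are hypothetical names, and the only external behavior relied on is that `crontab -l` exits with a non-zero status when the user has no crontab.

```python
import subprocess


def read_crontab(run=subprocess.run):
    """Return the crontab as text, treating a missing crontab as empty."""
    proc = run(["crontab", "-l"], capture_output=True, text=True, check=False)
    # `crontab -l` exits non-zero when the user has no crontab yet
    return proc.stdout if proc.returncode == 0 else ""


def remove_line(contents, line):
    """Return the crontab text without entries that exactly match `line`."""
    target = line.strip()
    kept = [entry for entry in contents.splitlines() if entry.strip() != target]
    return "\n".join(kept) + ("\n" if kept else "")
```

A caller would print `read_crontab()` by default, or write back `remove_line(read_crontab(), line)` when removal is requested.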

New tests

- Several new custom domain tests are added in a new custom_grids directory. The new custom domains were chosen to span a variety of locations, terrain types, and dates. Aside from custom_ESGgrid_Great_Lakes_snow_8km, these are basic tests that don't use many non-default settings, and so are good candidates for testing new SRW capabilities in the future. See the "Documentation" section for more details.
- A new "long forecast" test (108 hours) starting at 2023060112 that retrieves FV3GFS grib2 input data from AWS (and so cannot be run on Cheyenne).

Additional changes

Removes unsupported domains:
- EMC_AK
- EMC_HI
- EMC_PR
- EMC_GU
- GSL_HAFSV0.A_25km
- GSL_HAFSV0.A_13km
- GSL_HAFSV0.A_3km
- GSD_HRRR_AK_50km
For HPSS tests, ensure they all are explicitly set to look for HPSS data only
Test custom_ESGgrid_Great_Lakes_snow_8km (introduced in PR #864, "[develop] Consolidate verification tasks using retrieve_data.py") moved from verification to custom_grids, with a symbolic link (test_configs/verification/config.MET_verification_winter_wx.yaml) left behind. The ICs/LBCs are now also taken from RAP output retrieved from HPSS (this is fine, since the observation data requires HPSS access anyway).
Make new comprehensive test file for Gaea (symlink to Orion's file) since that machine also does not have HPSS access
Various comment fixes

Type of change

New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
- Removes old non-supported domains.
- Deprecates --use_cron_to_relaunch argument to run_WE2E_tests.py (gives user a verbose message about replacement flag)
This change requires a documentation update
- Need to document new tests, change of script behavior, removed domains

TESTS CONDUCTED:

The new script behavior was tested for each of the new flags to ensure that they behaved as expected.

DEPENDENCIES:

~~This branch currently contains changes from #859 and #864, and so should not be merged until those PRs are approved.~~ Merged!

DOCUMENTATION:

Updated command-line arguments were documented within the scripts. Additional documentation for the updates was provided as part of PR #880.

Here are some details/sample plots for the new custom domains:

New tests (the two sample plots for each test are not reproduced):
`custom_ESGgrid_Central_Asia_3km`: A 6-hour forecast starting 2019070100 over central Asia, including much of the Tibetan Plateau and Tarim Basin; this area was chosen for its extreme topography and climate range, with no ocean points. Uses staged FV3GFS nemsio data.
`custom_ESGgrid_Great_Lakes_snow_8km`: A 6-hour forecast starting 2023021700 over the North American Great Lakes region. Uses RAP netCDF data retrieved from HPSS, and tests a non-standard resolution (8 km). Tests deterministic verification for snow accumulation (as well as other variables), retrieving observations from HPSS (therefore can only be run on Jet and Hera).
`custom_ESGgrid_IndianOcean_6km`: A 12-hour forecast starting 2019061518 over the central Indian Ocean; this area was chosen for its lack of land points and as a southern hemisphere test. Uses staged FV3GFS grib2 data, and tests a non-standard resolution (6 km).
`custom_ESGgrid_NewZealand_3km`: A 6-hour forecast starting 2019061518 over New Zealand; this area was chosen for its extreme topography and mix of land/ocean points, as well as spanning the international date line. Uses staged FV3GFS grib2 data.
`custom_ESGgrid_Peru_12km`: A 12-hour forecast starting 2019061500 over western South America, centered over Peru; this area was chosen for its extreme topography and mix of land/ocean/lake points, as well as spanning the equator. Uses staged FV3GFS grib2 data, and tests a non-standard resolution (12 km).
`custom_ESGgrid_SF_1p1km`: A 6-hour forecast starting 2019061500 over central coastal California, centered over the San Francisco Bay area. Uses staged FV3GFS grib2 data, and tests a non-standard resolution (1.1 km).

ISSUE:

Resolves #587, "Overhaul and consolidate WE2E tests, identify needed additional tests" (finally!)
Resolves #876, "A cron job can only be added if a crontab already exists"

CHECKLIST

My code follows the style guidelines in the Contributor's Guide
I have performed a self-review of my own code using the Code Reviewer's Guide
I have commented my code, particularly in hard-to-understand areas
My changes need updates to the documentation. I have made corresponding changes to the documentation
My changes generate no new warnings
New and existing tests pass with my changes
Any dependent changes have been merged and published

CONTRIBUTORS:

Thanks to @gsketefian for providing some custom grids for new tests

for more verbose output - Also add debug flag for set_FV3nml_sfc_climo_filenames.py - Convert most print messages to debug-only for cleaner output - Do not read VERBOSE flag for generate script; this flag should only affect experiments, not initial scripts

gsketefian · 2023-08-28T16:20:36Z

+                    help='Explicitly set EXPT_BASEDIR for all experiments')
+    ap.add_argument('--exec_subdir', type=str,
+                    help='Explicitly set EXEC_SUBDIR for all experiments')
+    ap.add_argument('--use_cron_to_relaunch', action='store_true',


@mkavulich Seeing the cron option reminded me of a bug that I've run into with the run_WE2E_tests.py script, which is that when the user's cron table contains a commented line (job) that is exactly identical to the line that this script would otherwise insert into the table, the script doesn't insert the line (so the experiment doesn't get (re)launched). I guess it's because the script doesn't check that the existing line is commented out and so thinks the line is already in the table. Can you run a quick test to see if that is still happening? Thanks.

@mkavulich Just realized the bug (if it still exists) might be in the script get_crontab_contents.py.

@gsketefian Just included a fix for that bug as well, feel free to test it out.

@mkavulich Ok, thanks. I'll trust you that it works!
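
For readers following this thread, the behavior discussed above (a commented-out copy of the job must not block insertion) can be sketched as follows. The function names and the sample job line are hypothetical, and this is not the code that was committed to `get_crontab_contents.py`.

```python
def has_active_line(contents, line):
    """True if `line` appears in the crontab text and is not commented out."""
    target = line.strip()
    for entry in contents.splitlines():
        entry = entry.strip()
        if entry.startswith("#"):
            continue  # commented entries never count as a match
        if entry == target:
            return True
    return False


def add_line(contents, line):
    """Append `line` unless an active (uncommented) copy already exists."""
    if has_active_line(contents, line):
        return contents
    if contents and not contents.endswith("\n"):
        contents += "\n"
    return contents + line.strip() + "\n"


if __name__ == "__main__":
    job = "*/3 * * * * cd /path/to/expt && ./launch_FV3LAM_wflow.sh"
    # A commented copy exists, so the active line is still appended
    print(add_line("# " + job + "\n", job))
```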

gsketefian · 2023-08-28T16:23:17Z

@@ -0,0 +1,26 @@
+metadata:
+  description: |-
+    This test checks the capability of the workflow to retrieve from NOAA


@mkavulich Please add in the description that this test also checks that the weather model can perform a relatively long forecast.

MichaelLueken · 2023-09-05T14:42:08Z

@mkavulich - The Jenkins tests failed on August 26 due to an issue with Orion and the custom_ESGgrid_Central_Asia_3km test on Hera Intel. This test failed in the run_fcst task with the following:

FATAL from PE 126: NetCDF: Index exceeds dimension bound: netcdf_read_data_2d: file:INPUT/gfs_data.nc- variable:ps

I'm fairly certain that a rerun would allow the test to successfully run without issue. Before rerunning, I've been told that the EPIC role account now has rstprod permission. Would you be okay with uncommenting the MET_ensemble_verification_only_vx_time_lag test in coverage.hera.gnu.com and the custom_ESGgrid_Great_Lakes_snow_8km test in coverage.jet? I'd like to ensure that the EPIC account is able to pull and process the necessary files now.

@mkavulich and @gsketefian - I agree with @mkavulich, it is certainly a good idea to trim/remove the NCO WE2E tests. However, I don't feel that this should be done in this PR.

MichaelLueken · 2023-09-05T15:34:18Z

@mkavulich - It also sounds like the various RDHPCS machines are all using the EPIC role accounts. Would you be able to change the .cicd/Jenkinsfile, replacing jet-epic with jet? If you would rather, I can make the changes to your branch and either open a PR to your branch or push it directly if you make me a temporary collaborator.

mkavulich · 2023-09-05T16:34:15Z

@MichaelLueken I have made those suggested changes, but they are not appearing here. According to https://www.githubstatus.com/, PRs are currently "degraded" so it seems like there's a problem on Github's end that's preventing the PR from being updated. Hopefully it's resolved soon and you can re-run the tests.

MichaelLueken · 2023-09-05T16:53:19Z

@mkavulich - The changes look good and I have resubmitted the Jenkins tests. Once the retests are complete, this work should be ready to be merged. Thanks!

MichaelLueken · 2023-09-06T14:23:13Z

@mkavulich - The reruns of the Hera Intel tests are still seeing failures in the new custom_ESGgrid_Central_Asia_3km test (the last run overnight actually failed in the Functional Unit Tests stage of the pipeline, so it never made it to the actual tests). During yesterday's testing, the MET_ensemble_verification_only_vx_time_lag and custom_ESGgrid_Great_Lakes_snow_8km tests were still failing due to permission issues trying to pull the necessary files from HPSS. I've resubmitted the Jenkins tests on Jet to see if the issue has been corrected. I'll also reach out to the platform tools team on Slack and see when the role.epic account truly acquired rstprod project access.

The good news is that the tests successfully passed on Gaea and Orion. Additionally, outside of the tests noted above, all other tests have successfully passed.

mkavulich · 2023-09-06T16:27:50Z

@MichaelLueken That is so strange; I've run the custom_ESGgrid_Central_Asia_3km test probably a dozen times on Hera/Intel and not seen a failure. Hera is down for maintenance today; I don't suppose you know the details of the failure?

MichaelLueken · 2023-09-06T16:42:18Z

@mkavulich - Unfortunately, I'm not aware of the reason for the last failures of the test (though, it is likely due once again to the weird FATAL from PE 126: NetCDF: Index exceeds dimension bound: netcdf_read_data_2d: file:INPUT/gfs_data.nc- variable:ps failure). I've been able to run the custom_ESGgrid_Central_Asia_3km test without issue as well. However, having said that, I've only ever run the test in community mode. The automated Jenkins tests will run this test in NCO mode. The weird NetCDF failures seem to occur when several NCO tests are run simultaneously.

I've also cancelled the Jet testing, since HPSS is down for maintenance as well and the MET_ensemble_verification_only_vx_time_lag would fail due to this.

mkavulich · 2023-09-07T03:26:02Z

@MichaelLueken Ahh, I somehow forgot that the Intel tests are the ones run in NCO mode. So then this is another fleeting symptom of #652?

MichaelLueken · 2023-09-07T13:56:07Z

@mkavulich - Yes, I suspect that this is a continuation of what was being encountered with Issue #652. However, instead of mislinking COM directories, it is now leading to these weird NetCDF failures.

MichaelLueken · 2023-09-07T15:36:51Z

I was able to successfully run the Hera Intel tests manually this morning:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km                                    COMPLETE              23.29
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               6.69
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             764.38
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              14.25
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               6.34
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              15.10
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE               9.98
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               7.36
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             230.67
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             297.56
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             328.78
pregen_grid_orog_sfc_climo                                         COMPLETE               8.80
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1713.20

I had to use rocotorewind/boot on the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot test, but the rest successfully passed without needing to be rewound and booted.

I'm now waiting to see if the custom_ESGgrid_Great_Lakes_snow_8km test will successfully run on Jet. If it passes, then we can move forward with this work. If it fails, then the custom_ESGgrid_Great_Lakes_snow_8km and MET_ensemble_verification_only_vx_time_lag tests will once again need to be commented out until the Platform team can see what is happening, since the EPIC role account now has rstprod permission. @mkavulich, I will let you know whether the test passes or fails so that we can quickly make the necessary changes and merge.

MichaelLueken · 2023-09-07T16:00:04Z

@mkavulich -

The custom_ESGgrid_Great_Lakes_snow_8km test is continuing to fail on Jet:

202302170000          get_obs_nohrsc                    56024334                DEAD                 256         1          17.0
202302170000            get_obs_mrms                    56024335                DEAD                 256         1          28.0
202302170000            get_obs_ndas                    56024336                DEAD                 256         1          28.0

The error is:
ERROR: [FATAL] no permission to open HPSS archive file: /NCEPPROD/hpssprod/runhistory/rh2023/202302/20230217/dcom_20230217.tar

Please comment out the two verification tests until this issue can be sorted. Thanks!

MichaelLueken · 2023-09-07T17:33:39Z

@mkavulich - I have sent a request to the RDHPCS HPSS Helpdesk to ask that the EPIC role account be granted rstprod permissions on HPSS. Hopefully they are able to add this permission quickly, then I can submit the Hera GNU tests to make sure that the tests that require HPSS data are able to successfully pull the data from HPSS.

MichaelLueken · 2023-09-07T20:42:22Z

@mkavulich - Good news! Skylar has added the EPIC role account to the rstprod group on HPSS. Once the current Jet tests finish, the Hera GNU tests will run. If the MET_ensemble_verification_only_vx_time_lag test passes, then I should be able to merge this PR in the morning.

natalie-perlin · 2023-09-08T15:57:47Z

Why does the file name have a .com extension? Was the intent to have a *.nco file with the list of tests?

@natalie-perlin - The .com extension is to ensure that all tests in the file run in the community environment. Within this file, there is an nco test, nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16. When this test is launched, the nco environment is overwritten so that it will run in community mode.

This is similar to the coverage.hera.intel.nco file. All of the tests in this file are normally run in community mode, but the .nco extension ensures that the tests use the nco environment instead.

There are no plans to include either a coverage.hera.intel.com or coverage.hera.gnu.nco file.
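
The naming convention described above can be summarized in a few lines. This is only an illustration of the rule as stated in this thread: the function name is hypothetical, and treating any other suffix as "no override" is an assumption.

```python
def forced_run_envir(test_list_filename):
    """Return the mode a test-list file's suffix forces, or None if it forces none."""
    suffix = test_list_filename.rsplit(".", 1)[-1]
    # .com forces community mode, .nco forces NCO mode (per the comment above)
    return {"com": "community", "nco": "nco"}.get(suffix)


if __name__ == "__main__":
    for name in ("coverage.hera.gnu.com", "coverage.hera.intel.nco", "coverage.jet"):
        print(name, "->", forced_run_envir(name))
```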

MichaelLueken · 2023-09-08T16:42:23Z

@mkavulich -

The Jenkins automated MET_ensemble_verification_only_vx_time_lag test passed:

MET_ensemble_verification_only_vx_time_lag COMPLETE 7.58

Now moving forward with merging this work.

mkavulich changed the title ~~[develop] Improvements for WE2E tests: script features, additional tests~~ [develop] Improvements for WE2E tests: script features, additional tests, remove unsupported domains Jul 28, 2023

mkavulich mentioned this pull request Aug 4, 2023

A cron job can only be added if a crontab already exists. #876

Closed

mkavulich added 21 commits August 15, 2023 21:21

Add functions to setup.py that remove metatask dependencies for any m…

160220c

…etatasks that do not exist or have been removed by other checks.

Improving testing scripts

d1c5054

- Remove extraneous empty metatask dependencies from MET_verification_only_vx test now that these are handled in setup.py - Add ability to specify a subdirectory of test_configs to run all tests in it - Move custom grid yamls to separate directory

- Lint run_WE2E_tests.py (not 10/10 yet), add checks to ensure that

f5ddd17

non-config files are skipped in 'all' and directory tests

Add new custom domains of varying resolutions and locations. NZ grid …

cdda9b3

…is most expensive at ~60 core hours, rest are 30 or less.

- Replace "use_cron_to_relaunch" argument with "launch" argument tha…

9af5c69

…t can take the values "python", "cron", or "none" - Change names in argparse section to shorten lines

- Add new hi-res (1.1km) SF area case

6d068d2

- Update custom grids for more appropriate physics suites - Speed up Peru 12km case with more nodes and shorter forecast time - Distribute new custom domains to comprehensive and coverage tests

Explicitly look only on HPSS for HPSS tests

be1648b

Revert vx_only changes, this isn't a viable route to simplification

51dbda0

Swap tests to get successes...UPP on gnu still seems buggy

c1c7720

Add long forecast (over 100 hours) case

57efa83

Add long_fcst to machine test files

99fa01c

- Move new snow verification test to "custom_grids" directory, leave

0facdd5

symlink - Unless running in debug mode, call "set_template" with "quiet" flag

Remove unused predefined domains

4b69dd9

Gaea needs to use the same comprehensive suite as Orion (which also d…

0c92b5e

…oes not have HPSS access)

long_fcst test uses AWS data, so can not be run on Cheyenne

aee5c95

Fix write component grids for new custom domains

e4a510b

Add new '--mode' flag to monitor_jobs.py to allow for "advance" mode …

a81d2a1

…that will only run rocotorun once for each experiment rather than continuously monitoring

Fix SF test write componant grid

2f0d123

Add default (False) for new debug argument, should fix failing test

f646f3b

Fix linting for generate_FV3LAM_wflow.py

2b9ac58

mkavulich force-pushed the feature/WE2E_test_improvement_round_3_rebased branch from 20fdb32 to 2b9ac58 Compare August 16, 2023 00:18

mkavulich added 5 commits August 16, 2023 00:28

Update new nco variable names in comments where they were missed

8311a06

Jenkins tests do not use cron, so this option is unnecessary

4692d31

Updates to fix empty crontab issue

4e7829b

Change new custom domain layouts from get_layout.sh script

e0f9f96

Fix unit test for ush/get_crontab_contents.py

7567188

mkavulich force-pushed the feature/WE2E_test_improvement_round_3_rebased branch from 567c5ca to 7567188 Compare August 16, 2023 02:25

gsketefian reviewed Aug 28, 2023

mkavulich added 4 commits August 29, 2023 03:36

Update description for long_fcst test to be more accurate to what is …

4c512e2

…being tested

Add logic to not check commented crontab lines for matching command; …

297008b

…if the command already exists but is commented, we still want to add it

Better clarifying comments

ecfc8a4

Simplify logic to an "else" statement

c647a4a

gsketefian approved these changes Aug 29, 2023

mkavulich added 2 commits September 5, 2023 16:29

Restore temporarily removed tests now that EPIC accounts can access H…

8316c7d

…PSS data

Replace "jet-epic" --> "jet" per Michael Leuken

7b8b883

MichaelLueken added the run_we2e_jenkins_coverage_tests SRW App automated CI testing with modified Jenkinsfile label Sep 5, 2023

mkavulich mentioned this pull request Sep 7, 2023

[develop] Component hash updates, FV3_HRRR namelist changes #906

Merged

25 tasks

natalie-perlin reviewed Sep 8, 2023

MichaelLueken merged commit 98c4106 into ufs-community:develop Sep 8, 2023

MichaelLueken mentioned this pull request Sep 8, 2023

[develop] Changes for Derecho, a new platform #894

Merged

37 tasks

MichaelLueken mentioned this pull request Sep 21, 2023

[develop]: Update ConfigWorkflow chapter to HEAD of develop #915

Merged

9 tasks

mkavulich deleted the feature/WE2E_test_improvement_round_3_rebased branch April 21, 2026 19:20
