[develop] Fix for error at the end of monitor_jobs.py, other minor improvements #600

Merged
MichaelLueken merged 5 commits into ufs-community:develop from mkavulich:feature/python_WE2E_script on Feb 14, 2023

@mkavulich
Collaborator

@mkavulich mkavulich commented Feb 9, 2023

DESCRIPTION OF CHANGES:

This is another round of improvements for the pythonized WE2E test scripts. I was originally going to wait until the script was ready to replace, but I accidentally forgot to make a change to a final log message in monitor_jobs.py; this results in a scary-looking error message even though all experiments may have completed successfully.

calling function that monitors jobs, prints summary
Writing information for all experiments to monitor_jobs_20230208170852.yaml
Checking tests available for monitoring...
Starting experiment get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2mems running
Setup complete; monitoring 1 experiments
Experiment get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2mems is COMPLETE; will no longer monitor.

*********************************************************************
FATAL ERROR:
Experiment generation failed. See the error message(s) printed below.
For more detailed information, check the log file from the workflow
generation script: log.run_WE2E_tests
*********************************************************************

Traceback (most recent call last):
  File "/mnt/lfs4/HFIP/hfv3gfs/Michael.Lueken/new_tests/tests/WE2E/./run_WE2E_tests.py", line 461, in <module>
    run_we2e_tests(homedir,args)
  File "/mnt/lfs4/HFIP/hfv3gfs/Michael.Lueken/new_tests/tests/WE2E/./run_WE2E_tests.py", line 195, in run_we2e_tests
    monitor_file = monitor_jobs(monitor_yaml, debug=args.debug)
  File "/mnt/lfs4/HFIP/hfv3gfs/Michael.Lueken/new_tests/tests/WE2E/monitor_jobs.py", line 84, in monitor_jobs
    logging.info(f'All {num_expts} experiments finished in {str(total_walltime)}')
NameError: name 'num_expts' is not defined

The solution was simply to replace the undefined variable with a correct one.
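
As a minimal sketch of this kind of fix (not the actual patch): the traceback shows the final log message referencing `num_expts`, which was never assigned in `monitor_jobs`. The function name `log_summary` and the dictionary `expts_dict` below are hypothetical stand-ins, since the PR text does not show which variable replaced the undefined one.

```python
import logging
from datetime import datetime


def log_summary(expts_dict: dict, start: datetime, end: datetime) -> str:
    """Log the closing summary message using only names defined in scope.

    `expts_dict` is a hypothetical mapping of experiment name -> metadata.
    """
    total_walltime = end - start
    # The failing line used an unassigned name (num_expts), which raised
    # NameError. Deriving the count from data already in scope avoids that.
    num_expts = len(expts_dict)
    message = f"All {num_expts} experiments finished in {str(total_walltime)}"
    logging.info(message)
    return message
```

Because the message is the last statement executed after all monitoring is done, an error here does not affect the experiments themselves, which matches the "COMPLETE" line printed just before the traceback.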

This PR comes from an in-progress branch for improvements to the python scripts, so a few other improvements are coming along with this bug fix:

  • When monitor_jobs.py is called for the first time, it will check the status of all jobs, not just those that are incomplete. This will allow users to use this script to monitor a previously failed job that has been fixed.
  • Add checks for proper run_envir, and add all needed variables for run_envir=nco mode.
  • Fix another incorrect error message in setup.py
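
The first two items above could look roughly like the following sketch. Everything here is illustrative rather than taken from the scripts: the function names, the `status` key, and the data layout are assumptions, and the value `community` as the alternative to `nco` is also an assumption not stated in this PR. The status names COMPLETE, DEAD, and ERROR come from the commit message on this PR.

```python
# Statuses that would normally stop further monitoring of an experiment.
FINISHED_STATUSES = {"COMPLETE", "DEAD", "ERROR"}

# Assumed set of accepted values; only "nco" is named in this PR.
VALID_RUN_ENVIR = {"nco", "community"}


def select_expts_to_check(expts: dict, first_pass: bool) -> list:
    """Return the experiment names whose status should be refreshed.

    On the first pass every experiment is checked, so a previously failed
    experiment that has since been fixed and rewound is picked up again.
    On later passes, experiments in a finished state are skipped.
    """
    if first_pass:
        return list(expts)
    return [
        name
        for name, info in expts.items()
        if info.get("status") not in FINISHED_STATUSES
    ]


def check_run_envir(run_envir: str) -> str:
    """Reject unexpected run_envir values early, with a clear message."""
    if run_envir not in VALID_RUN_ENVIR:
        raise ValueError(
            f"run_envir={run_envir!r} is not valid; "
            f"expected one of {sorted(VALID_RUN_ENVIR)}"
        )
    return run_envir
```

For example, with one COMPLETE, one RUNNING, and one DEAD experiment, the first pass would check all three and later passes would check only the RUNNING one.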

Type of change

  • Bug fix (non-breaking change which fixes an issue)

TESTS CONDUCTED:

This change mostly impacts the new Python-based tests, which are not yet used for official testing. Ran some tests on Hera to ensure the problem with the WE2E testing scripts was fixed. Also running fundamental tests on Hera and Jet due to the change in setup.py.

  • hera.intel
  • jet.intel

DEPENDENCIES:

None

DOCUMENTATION:

None

ISSUE:

Related to #586, more work needed.

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes

CONTRIBUTORS (optional):

Thanks to @MichaelLueken for pointing out the error.

…t skip checking DEAD, ERROR, or COMPLETE status jobs. This will allow the use of this script to, for example, make changes to a failed experiment and re-run after the appropriate rocotorewind command(s) have been run
@mkavulich mkavulich added the ci-hera-intel-WE (kicks off automated workflow test on hera with intel) and ci-jet-intel-WE (kicks off automated workflow test on jet with intel) labels Feb 9, 2023
@venitahagerty venitahagerty removed the ci-hera-intel-WE and ci-jet-intel-WE labels Feb 9, 2023
@venitahagerty
Collaborator

venitahagerty commented Feb 9, 2023

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1234339030/20230209003509/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 10 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: community_ensemble_2mems_stoch
Experiment Succeeded on hera: pregen_grid_orog_sfc_climo
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: MET_ensemble_verification
All experiments completed

@venitahagerty
Collaborator

venitahagerty commented Feb 9, 2023

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1234339030/20230209003513/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 10 experiments
If test failed, please make changes and add the following label back:
ci-jet-intel-WE
Experiment Succeeded on jet: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on jet: specify_DOT_OR_USCORE
Experiment Succeeded on jet: specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on jet: custom_GFDLgrid
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: custom_ESGgrid
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on jet: specify_RESTART_INTERVAL
All experiments completed

Collaborator

@MichaelLueken MichaelLueken left a comment

@mkavulich Thank you for quickly addressing the issue encountered at the end of monitor_jobs.py! I was able to successfully test this on Jet as well. Approving this work.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests (run the coverage set of SRW end-to-end tests) label Feb 9, 2023
@mkavulich
Collaborator Author

@MichaelLueken Do you know what it means when it says the CI build was "aborted"? I didn't make any changes to the build system so I'm assuming it's not related to this change?

@MichaelLueken
Collaborator

> @MichaelLueken Do you know what it means when it says the CI build was "aborted"? I didn't make any changes to the build system so I'm assuming it's not related to this change?

@mkavulich Even though the Jenkinsfile has been updated to no longer run on Hera (EPIC no longer has access to the nems account on the machine and an EPIC allocation hasn't been given as of this time), I had to go in and manually abort the Hera tests in Jenkins. The rest of the tests successfully passed, so once a second approval is given, this work can get merged in. Sorry for the confusion this caused.

@MichaelLueken
Collaborator

If you have the time, please review this PR and @danielabdi-noaa's PR #612. These two bug fixes are important to get into develop as soon as possible. Thank you very much for your time.

@MichaelLueken MichaelLueken merged commit a36f88d into ufs-community:develop Feb 14, 2023
@mkavulich mkavulich deleted the feature/python_WE2E_script branch April 21, 2026 19:17
Labels

run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests

5 participants