Skip to content

[develop] Python-based crontab replacement#466

Merged
MichaelLueken merged 6 commits into
ufs-community:developfrom
NOAA-EPIC:feature/crontab-replacement
Nov 28, 2022
Merged

[develop] Python-based crontab replacement#466
MichaelLueken merged 6 commits into
ufs-community:developfrom
NOAA-EPIC:feature/crontab-replacement

Conversation

@mark-a-potts
Copy link
Copy Markdown
Contributor

@mark-a-potts mark-a-potts commented Nov 11, 2022

DESCRIPTION OF CHANGES:

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DOCUMENTATION:

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@mkavulich
Copy link
Copy Markdown
Collaborator

@mark-a-potts This looks like a good short-term solution to me. I wonder if we should include a brief README file in this directory letting users know which flags they need in order to bypass crontab; it looks like the run_WE2E_tests.sh script already has the use_cron_to_relaunch option but the default is "true"

@danielabdi-noaa
Copy link
Copy Markdown
Collaborator

@mark-a-potts Would it be possible to make it work in batches, say 8 tests simultaneously? I am afraid sequential execution will take forever.

@mkavulich
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa I don't think that's necessary for a quick interim solution as I work on a more complete end-to-end python rewrite. It takes on the order of 10 seconds to process each directory on Hera. While it will be strictly slower (due to rocotorun only being parsed every 10*numtests seconds vs every 2 minutes with current crontab solution) it shouldn't severely limit the speed of tests, since the batch jobs run independently of this script once submitted. I haven't done a direct comparison but I am currently running a full comprehensive suite of tests on Hera and can report back on the full time it takes for the whole suite of tests to complete.

@danielabdi-noaa
Copy link
Copy Markdown
Collaborator

danielabdi-noaa commented Nov 11, 2022

@mkavulich Ok I understand. I misunderstood the comments to mean each test case run sequentially, when it is really lauching them sequentially with some delay introduced. Hopefully that would be enough to solve issue on Cheyenne.

@mkavulich
Copy link
Copy Markdown
Collaborator

For the record, here is the timing information from my Hera comprehensive test:

Tests have all completed

real	67m11.388s
user	11m32.652s
sys	6m30.883s

@mkavulich
Copy link
Copy Markdown
Collaborator

@mark-a-potts Is this still a work-in-progress? It looks to me like it is ready aside from documentation (I'm willing to help with that if you're too busy).

@mark-a-potts
Copy link
Copy Markdown
Contributor Author

I think it could be ready to merge if the documentation is updated. It shouldn't run unless somebody explicitly requests it when launching the run_WE2E_test.sh script.

 - When on Cheyenne, set use_cron_to_relaunch=false
 - When use_cron_to_relaunch=false, output a message at the end of the script
   describing how to run the new run_srw_tests.py script to manage the tests
 - Regardless of other variables, print a message at the end of the script
   showing the user where the test directory is
@mkavulich
Copy link
Copy Markdown
Collaborator

@mark-a-potts I don't have permissions to push to that branch directly, so I opened a PR here with some end-to-end test script updates to go along with the new script: NOAA-EPIC#4

@MichaelLueken
Copy link
Copy Markdown
Collaborator

Since the documentation has been brought into this PR, I will remove the Work in Progress label.

Copy link
Copy Markdown
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the use_cron_to_relaunch is set to FALSE for Cheyenne, should we go ahead and remove Cheyenne from .cicd/Jenkinsfile temporarily? Or should logic be added to the .cicd/scripts/srw_test.sh script so that Cheyenne will call the new ush/run_srw_tests.py script instead of using cron and not use the get_expts_status.sh?

Copy link
Copy Markdown
Collaborator

@mkavulich mkavulich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested this latest version of the WE2E tests (fundamental) on Cheyenne and Hera (with and without the use_cron_to_relaunch option) and got the expected output and results.

@mkavulich
Copy link
Copy Markdown
Collaborator

For those interested, this is what the ./run_WE2E_tests.sh script now outputs after creating all experiment directories on Cheyenne (or any other platform if you manually set use_cron_to_relaunch = false):


  ========================================================================
  ========================================================================

  All experiments have been generated in the directory
  /glade/scratch/kavulich/workdir/UFS/PR_466/expt_dirs

  ========================================================================
  ========================================================================



The variable 'use_cron_to_relaunch' has been set to FALSE. Jobs will not be automatically run via crontab.

You can run each task manually in the experiment directory:
(/glade/scratch/kavulich/workdir/UFS/PR_466/expt_dirs)

Or you can use the 'run_srw_tests.py' script in the ush/ directory:

  cd /glade/scratch/kavulich/workdir/UFS/PR_466/ufs-srweather-app/ush
  ./run_srw_tests.py -e=/glade/scratch/kavulich/workdir/UFS/PR_466/expt_dirs

Copy link
Copy Markdown
Collaborator

@danielabdi-noaa danielabdi-noaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@MichaelLueken
Copy link
Copy Markdown
Collaborator

@mark-a-potts @mkavulich Since the Jenkins tests use the crontab to run, should Cheyenne temporarily be removed from the .cicd/Jenkinsfile, or should logic be added to .cicd/scripts/srw_test.sh to use the new ush/run_srw_tests.py script?

@mark-a-potts
Copy link
Copy Markdown
Contributor Author

I think we should add logic to use the new script with Jenkins.

@mkavulich
Copy link
Copy Markdown
Collaborator

@mark-a-potts Are you able to do that? I am not familiar enough with the Jenkins testing to feel comfortable making that addition (plus I do not think I have permissions to kick off that testing).

@MichaelLueken
Copy link
Copy Markdown
Collaborator

@mark-a-potts @mkavulich I think I might be able to do this. However, the new run_srw_tests.py script will always pass a return code of zero, even if a test fails. This will cause the Jenkins tests to always report that the Cheyenne tests are passing. Would it be possible, during the grep -L 'wflow_status =' */log.launch_FV3LAM_wflow check section, to either return a non-zero return code to .cicd/scripts/srw_test.sh if the status is FAILURE or write the wflow_status to a file that can be queried once the script is done running?

@MichaelLueken
Copy link
Copy Markdown
Collaborator

@mark-a-potts I have created NOAA-EPIC#5 to allow the .cicd/scripts/srw_test.sh script to work with ush/run_srw_tests.py on Cheyenne. This will allow the Jenkins tests to run on Cheyenne. Please let me know if there are any issues and I will correct them. Thanks!

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Nov 22, 2022
Copy link
Copy Markdown
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Jenkins tests successfully passed with these updates. Everything looks good to me and I am now approving this work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants