Skip to content

Avoid the effect of external environment on tasks#436

Merged
MichaelLueken merged 4 commits into
ufs-community:developfrom
danielabdi-noaa:feature/no_export
Oct 28, 2022
Merged

Avoid the effect of external environment on tasks#436
MichaelLueken merged 4 commits into
ufs-community:developfrom
danielabdi-noaa:feature/no_export

Conversation

@danielabdi-noaa
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa danielabdi-noaa commented Oct 27, 2022

DESCRIPTION OF CHANGES:

This PR mainly addresses issue

Detailed list of changes

  • Make SCHED_NATIVE_CMD be used on all systems not only for gaea
  • Make sure rocoto does not export environment variables other than SLURM specific ones with --export=NONE
  • Change srun commands to have --export=ALL
  • Change conda activation logic in load_modules_run_task.sh to work first time (yet to be tested)
  • Add some missing task specific modulefiles on some systems. This should be uniform across all platforms

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

To be conducted ...

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

None

DOCUMENTATION:

None

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • Needs Gaea test
  • Needs noaacloud test
  • help wanted

CONTRIBUTORS (optional):

@danielabdi-noaa danielabdi-noaa added ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Oct 27, 2022
@venitahagerty venitahagerty removed ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Oct 27, 2022
@venitahagerty
Copy link
Copy Markdown
Collaborator

venitahagerty commented Oct 27, 2022

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1101855546/20221027152014/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
All experiments completed

@venitahagerty
Copy link
Copy Markdown
Collaborator

venitahagerty commented Oct 27, 2022

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1101855546/20221027152020/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-jet-intel-WE
Experiment Failed on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
2022-10-27 16:00:13 +0000 :: fe6 :: Task get_extrn_lbcs, jobid=14825261, in state DEAD (OUT_OF_MEMORY), ran for 202.0 seconds, exit status=253, try=1 (of 1)
Experiment Succeeded on jet: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
All experiments completed

Copy link
Copy Markdown
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good to me!

It would probably be a good idea to keep @mark-a-potts, @natalie-perlin, and @ulmononian in the loop. Will this method be the one used to move forward?

I'll kick off the Jenkins tests to see if they encounter any issues, in the meantime.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Oct 27, 2022
Copy link
Copy Markdown
Contributor

@mark-a-potts mark-a-potts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good to me, and when combined with Natalie's PR, they seem to fix some problems I was seeing on Jet. Assuming they pass the Jenkins tests, I propose we merge these and then Natalie and I can pull these changes into hers to get the miniconda updated as well.

@danielabdi-noaa
Copy link
Copy Markdown
Collaborator Author

danielabdi-noaa commented Oct 28, 2022

@MichaelLueken There is an issue with this PR and cheyenne now that SCHED_NATIVE_CMD is activated on all systems so you may want to stop the Jenkins run. Look below for the #PBS None line because SCHED_NATIVE_CMD is not set to anything. Cheyenne and other PBSPRO scheduling systems don't need it --export=NONE equivalent because I believe that is their default behaviour. To solve this, I can simply disable it for these systems but am looking for an alternative that could make the PBS batch script happy, and keep setting of NATIVE command for all systems.

10/27/22 14:04:16 MDT :: FV3LAM_wflow.xml :: qsub: directive error: None 
10/27/22 14:04:16 MDT :: FV3LAM_wflow.xml :: Submitting get_extrn_ics using qsub < /glade/scratch/epicufsrt/.tmp/qsub.in20221027-26845-1d5v79v with input {{#! /bin/sh
#PBS -A NRAL0032
#PBS -q regular
#PBS -l select=1:mpiprocs=1:ncpus=1
#PBS -l walltime=00:45:00
#PBS -N get_extrn_ics
#PBS -j oe -o /glade/scratch/epicufsrt/jenkins/workspace/fs-srweather-app_pipeline_PR-436/expt_dirs/nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR/log/get_extrn_ics_2020081000.log
#PBS None
export GLOBAL_VAR_DEFNS_FP='/glade/scratch/epicufsrt/jenkins/workspace/fs-srweather-app_pipeline_PR-436/expt_dirs/nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR/var_defns.sh'
export PDY='20200810'
export cyc='00'
export subcyc='00'
export LOGDIR='/glade/scratch/epicufsrt/jenkins/workspace/fs-srweather-app_pipeline_PR-436/expt_dirs/nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR/log'
export ICS_OR_LBCS='ICS'
set -x
/glade/scratch/epicufsrt/jenkins/workspace/fs-srweather-app_pipeline_PR-436/ush/load_modules_run_task.sh "get_extrn_ics" "/glade/scratch/epicufsrt/jenkins/workspace/fs-srweather-app_pipeline_PR-436/jobs/JREGIONAL_GET_EXTRN_MDL_FILES"}}

@MichaelLueken
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa I have killed the Jenkins CI testing for Cheyenne. I'll resubmit once you have implemented a fix.

@danielabdi-noaa
Copy link
Copy Markdown
Collaborator Author

@MichaelLueken I have made a fix that will hopefully make it work, so you can re-run jenkins. I am not sure if an empty pbspro line liek this #PBS is acceptable, but slurm is ok with an empty #SBATCH line though. If this doesn't work, I will resort back to activating native command for a list of platforms.

@MichaelLueken
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa It looks like your modification has corrected the issue on Cheyenne (the workflow is actually running with PBS job submission failures). Once the tests wrap up, I'll move forward with merging this work.

@MichaelLueken
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa and @mark-a-potts Bad news. It looks like the Cheyenne Intel run was successful, but the Cheyenne GNU run is still encountering the same issue as previously:

10/28/22 14:08:19 MDT :: FV3LAM_wflow.xml :: Submission of make_grid failed! qsub: directive error: None
10/28/22 14:08:19 MDT :: FV3LAM_wflow.xml :: Submission of get_extrn_ics failed! qsub: directive error: None

Looking in the FV3LAM_wflow.xml files for the Intel tests, the SCHED_NATIVE_CMD is being set to "", but for GNU, it is still being set to "None". Any ideas as to why it would work for Intel, but continue to encounter issues with GNU?

Stopping the Jenkins pipeline now.

@danielabdi-noaa
Copy link
Copy Markdown
Collaborator Author

@MichaelLueken I feel like there is still some kind of clash between simultaneous runs on the same machine on Jenkins. In the past there were clashes between GNU and INTEL runs on Cheyenne because they were being assigned the same experiment directory. The other PR that fixes a bug on hera failed on Cheyenne too, but it really shouldn't since nothing changed there. It was not able to access the directory so maybe the other PR deleted it because the intel run was successful there. Maybe @jessemcfarland can take a look?

@danielabdi-noaa
Copy link
Copy Markdown
Collaborator Author

danielabdi-noaa commented Oct 28, 2022

@MichaelLueken More on this. There can't be conflict between different PRs but it looks there is still the issue with GNU and INTEL using same directories.

For PR-438 run-1

For Build both have same directories, which is not right, but the build will work fine because the build directories are named build_{compiler}.

Intel: /glade/scratch/epicufsrt/jenkins/workspace/fs-srweather-app_pipeline_PR-438
GNU:/glade/scratch/epicufsrt/jenkins/workspace/fs-srweather-app_pipeline_PR-438

For Test

Intel: /glade/scratch/epicufsrt/jenkins/workspace/fs-srweather-app_pipeline_PR-438__2
GNU: /glade/scratch/epicufsrt/jenkins/workspace/fs-srweather-app_pipeline_PR-438

Intel failed because there is no __2 directory, while GNU is still running.

@MichaelLueken
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa Thanks for looking into this more. I saw that the PR #436 Jenkins working directory also had both build_intel and build_gnu in the same directory. @jessemcfarland has opened PR #441 to try and correct the behavior so that doesn't continue happening.

@MichaelLueken
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa As a follow-up, both a manual run of your branch for on Cheyenne for GNU and the Jenkins CI with GNU only are both submitting jobs properly. Once either the manual or Jenkins test is complete, I will go ahead and merge this work.

@MichaelLueken
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa @mark-a-potts Both the manual and Jenkins CI tests for GNU have successfully passed. This PR is ready to be merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants