Enable wcoss2 ufsda build and module load #2620

Merged
WalterKolczynski-NOAA merged 17 commits into NOAA-EMC:develop from RussTreadon-NOAA:feature/wcoss2_ufsda on Jun 5, 2024

Conversation

@RussTreadon-NOAA
Contributor

@RussTreadon-NOAA RussTreadon-NOAA commented May 22, 2024

Description

This PR enables ufsda (sorc/gdas.cd) to be built and run on WCOSS2.

Resolves #2602
Resolves #2579

Type of change

  • New feature (adds functionality)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO

How has this been tested?

Clone, build, and run C96C48_ufs_hybatmDA CI on WCOSS2 (Cactus)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code

@RussTreadon-NOAA
Contributor Author

RussTreadon-NOAA commented May 22, 2024

NOTE: This PR requires

  • GDASApp PR #1122 to be merged into GDASApp develop
  • update sorc/gdas.cd submodule hash

This PR will be marked Ready for review once these tasks are completed.

@WalterKolczynski-NOAA
Contributor

@RussTreadon-NOAA As part of this PR, please also enable the CI tests that use the new JEDI-based GDAS and are currently disabled on WCOSS2. You can do this by removing wcoss2 from the skip_ci_on_hosts section in the ci/cases/*/*.yaml case files.
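
For illustration only, the requested edit could be scripted along these lines. This is a minimal sketch, not part of the PR: the function name `remove_host_from_skip_list` is hypothetical, and it assumes each case file lists hosts as a plain block list (`  - wcoss2`) directly under a `skip_ci_on_hosts:` key, which is the layout suggested by the `- wcoss2` entries quoted later in this thread. It works on text rather than a YAML parser so comments and ordering in the file are left untouched.

```python
import re


def remove_host_from_skip_list(yaml_text: str, host: str = "wcoss2",
                               key: str = "skip_ci_on_hosts") -> str:
    """Drop the `- <host>` item from the block list under `key`.

    Assumption: the list is written one item per line under the key.
    All other lines are returned unchanged.
    """
    key_line = re.compile(rf"^\s*{re.escape(key)}\s*:\s*$")
    kept = []
    in_block = False
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if key_line.match(line):
            in_block = True          # list items for `key` start on the next line
            kept.append(line)
            continue
        if in_block and stripped.startswith("- "):
            if stripped[2:].strip() == host:
                continue             # this is the entry to remove
        elif in_block and stripped and not stripped.startswith("#"):
            in_block = False         # any other content ends the list
        kept.append(line)
    trailing_newline = "\n" if yaml_text.endswith("\n") else ""
    return "\n".join(kept) + trailing_newline
```

If wcoss2 is the only host listed, this leaves an empty `skip_ci_on_hosts:` key behind, so a file in that state would still need a manual check.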

@RussTreadon-NOAA
Contributor Author

@WalterKolczynski-NOAA , I can remove wcoss2 from ci/cases/pr/C96C48_ufs_hybatmDA.yaml since I tested this on Cactus. It works.

Are you asking that this PR remove the following occurrences of `wcoss2` in CI yamls?

ci/cases/pr/C48mx500_3DVarAOWCDA.yaml:  - wcoss2
ci/cases/pr/C96C48_ufs_hybatmDA.yaml:  - wcoss2
ci/cases/pr/C96_atmaerosnowDA.yaml:  - wcoss2
ci/cases/pr/C96_atm3DVar.yaml:  - wcoss2
ci/cases/pr/C48_S2SWA_gefs.yaml:  - wcoss2

I do not plan on testing anything other than C96C48_ufs_hybatmDA.yaml

Added note: This PR will remain in draft mode until NCO installs bufr/12.0.1 in production. Once this is done, wcoss2.intel.lua in GDASApp PR #1122 will be updated to use the official production installation of bufr/12.0.1. After GDASApp PR #1122 is closed, the sorc/gdas.cd hash in this PR will be updated.

@WalterKolczynski-NOAA
Contributor

Not all of them. The GEFS test has to remain off until the bash CI system supports dual build (GFS and GEFS use different UFS executables because of the wave grid option). I'm also not sure why the C96_atm3DVar test isn't on already, will check.

The other three should work as soon as gdas.cd can be built, AFAIK. If they don't work out of the box, we can get you help or defer those.

@RussTreadon-NOAA
Contributor Author

I'll keep it simple at first and only activate C96C48_ufs_hybatmDA on wcoss2.

@WalterKolczynski-NOAA
Contributor

Oh, the C96_atm3DVar test is disabled because we run the extended version instead.

@RussTreadon-NOAA RussTreadon-NOAA marked this pull request as ready for review May 28, 2024 18:02
@RussTreadon-NOAA
Contributor Author

Build RussTreadon-NOAA:feature/wcoss2_ufsda at 10a2bc5 on Cactus. Run JEDI ATM CI. 20240224/00 gfs and gdas cycles run to completion. 20240224/00 enkf cycle fails in the final job because member analysis increment files are not found in the expected format.

gdas.cd @ 95218e7 includes changes related to g-w PR #2592. This g-w PR adds a new enkf analysis increment job. gdas.cd @ 95218e7 assumes member increments are created by this new g-w job.

g-w PR #2592 must be merged into develop and RussTreadon-NOAA:feature/wcoss2_ufsda updated in order for the enkf cycle to successfully complete.

@RussTreadon-NOAA RussTreadon-NOAA mentioned this pull request May 28, 2024
7 tasks
@RussTreadon-NOAA
Contributor Author

Install RussTreadon-NOAA:feature/wcoss2_ufsda at 7f7093f on Cactus. Run JEDI ATM CI (C96C48_ufs_hybatmDA). All jobs from the gfs, gdas, and enkfgdas cycles ran to completion successfully.

russ.treadon@clogin04:/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prtest> rocotostat -d prtest.db -w prtest.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202402231800        Done    May 29 2024 00:25:41    May 29 2024 00:40:16
202402240000        Done    May 29 2024 00:25:41    May 29 2024 02:45:11
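
A cycle summary like the one above can also be checked by a script instead of by eye. The following is a minimal sketch under stated assumptions: the helper names `parse_cycle_summary` and `all_cycles_done` are hypothetical, and the parsing relies only on the column layout visible in this output (a 12-digit cycle followed by a state), not on any documented rocotostat format guarantee.

```python
from typing import Dict


def parse_cycle_summary(text: str) -> Dict[str, str]:
    """Map each cycle (YYYYMMDDHHMM) to the value in its STATE column."""
    states: Dict[str, str] = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a 12-digit cycle; the header and prompt lines do not.
        if len(fields) >= 2 and fields[0].isdigit() and len(fields[0]) == 12:
            states[fields[0]] = fields[1]
    return states


def all_cycles_done(text: str) -> bool:
    """True only if at least one cycle was found and every cycle reads Done."""
    states = parse_cycle_summary(text)
    return bool(states) and all(state == "Done" for state in states.values())
```

Passing the captured output of the command shown above to `all_cycles_done` gives a single pass/fail value that a wrapper script could act on.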

@RussTreadon-NOAA RussTreadon-NOAA self-assigned this May 29, 2024
@RussTreadon-NOAA
Contributor Author

@DavidHuber-NOAA and @CatherineThomas-NOAA , if either of you have time would you review the changes in this PR?

This PR allows GDASApp to be built and run on WCOSS2. This capability is required for GFS v17.

Comment thread ush/load_ufsda_modules.sh Outdated
Contributor

@DavidHuber-NOAA DavidHuber-NOAA left a comment

Looks good! Thanks!

@RussTreadon-NOAA
Contributor Author

Although it is not impacted by this PR, I also ran the GSI-based ATM CI (C96C48_hybatmDA) on Cactus. All jobs ran to completion successfully.

russ.treadon@clogin04:/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prtest_gsi> rocotostat -d prtest_gsi.db -w prtest_gsi.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202112201800        Done    May 29 2024 09:35:34    May 29 2024 09:50:13
202112210000        Done    May 29 2024 09:35:34    May 29 2024 11:30:16
202112210600        Done    May 29 2024 09:35:34    May 29 2024 11:30:16

Comment thread ush/load_ufsda_modules.sh Outdated
Comment thread ush/load_ufsda_modules.sh
Contributor

@WalterKolczynski-NOAA WalterKolczynski-NOAA left a comment

Conditionally approved pending successful completion of CI tests.

@emcbot

emcbot commented Jun 4, 2024

Automated global-workflow Testing Results:

Machine: Wcoss2
Start: Tue Jun  4 17:47:21 UTC 2024 on clogin05
---------------------------------------------------
Build: Completed at 06/04/24 06:21:53 PM
Case setup: Completed for experiment C48_ATM_ca024035
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_ca024035
Case setup: Skipped for experiment C48_S2SWA_gefs_ca024035
Case setup: Completed for experiment C48_S2SW_ca024035
Case setup: Completed for experiment C96_atm3DVar_extended_ca024035
Case setup: Skipped for experiment C96_atm3DVar_ca024035
Case setup: Skipped for experiment C96_atmaerosnowDA_ca024035
Case setup: Completed for experiment C96C48_hybatmDA_ca024035
Case setup: Completed for experiment C96C48_ufs_hybatmDA_ca024035

@emcbot emcbot added CI-Wcoss2-Failed CI testing on WCOSS for this PR has failed and removed CI-Wcoss2-Running CI testing on WCOSS for this PR is in-progress labels Jun 4, 2024
@emcbot

emcbot commented Jun 4, 2024

Experiment C48_ATM_ca024035 **** on Wcoss2 at 06/04/24 09:42:14 PM

Error logs:


Follow link here to view the contents of the above file(s): (link)

@WalterKolczynski-NOAA WalterKolczynski-NOAA added CI-Wcoss2-Ready PR is ready for CI testing on WCOSS2. and removed CI-Wcoss2-Failed CI testing on WCOSS for this PR has failed labels Jun 4, 2024
@emcbot emcbot added CI-Wcoss2-Building CI testing is cloning/building on WCOSS2 and removed CI-Wcoss2-Ready PR is ready for CI testing on WCOSS2. labels Jun 4, 2024
@emcbot

emcbot commented Jun 4, 2024

CI Update on Wcoss2 at 06/04/24 09:48:54 PM
============================================
Cloning and Building global-workflow PR: 2620
with PID: 72270 on host: clogin05

@emcbot emcbot added the CI-Wcoss2-Running CI testing on WCOSS for this PR is in-progress label Jun 4, 2024
@emcbot

emcbot commented Jun 4, 2024

Automated global-workflow Testing Results:

Machine: Wcoss2
Start: Tue Jun  4 21:57:42 UTC 2024 on clogin05
---------------------------------------------------
Build: Completed at 06/04/24 10:33:39 PM
Case setup: Completed for experiment C48_ATM_ca024035
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_ca024035
Case setup: Skipped for experiment C48_S2SWA_gefs_ca024035
Case setup: Completed for experiment C48_S2SW_ca024035
Case setup: Completed for experiment C96_atm3DVar_extended_ca024035
Case setup: Skipped for experiment C96_atm3DVar_ca024035
Case setup: Skipped for experiment C96_atmaerosnowDA_ca024035
Case setup: Completed for experiment C96C48_hybatmDA_ca024035
Case setup: Completed for experiment C96C48_ufs_hybatmDA_ca024035

@emcbot

emcbot commented Jun 4, 2024

Experiment C48_ATM_ca024035 **** on Wcoss2 at 06/04/24 10:36:11 PM

Error logs:


Follow link here to view the contents of the above file(s): (link)

@emcbot

emcbot commented Jun 4, 2024

Experiment C48_S2SW_ca024035 **** on Wcoss2 at 06/04/24 11:42:17 PM

Error logs:


Follow link here to view the contents of the above file(s): (link)

@RussTreadon-NOAA
Contributor Author

The (link) referenced above just takes us to this PR. A check of the logs for C48_S2SW suggests that the stage and init jobs wound up in a strange state on Cactus

russ.treadon@clogin09:/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/RUNTESTS/EXPDIR/C48_S2SW_ca024035/logs> cat 2021032312.log 
2024-06-04 22:35:09 +0000 :: clogin09 :: Submitting gfsstage_ic
2024-06-04 22:35:09 +0000 :: clogin09 :: Submitting gfswaveinit
2024-06-04 22:35:09 +0000 :: clogin09 :: Submission status of gfsstage_ic is pending at druby://clogin09.cactus.wcoss2.ncep.noaa.gov:44675
2024-06-04 22:35:09 +0000 :: clogin09 :: Submission status of gfswaveinit is pending at druby://clogin09.cactus.wcoss2.ncep.noaa.gov:44675
2024-06-04 23:40:11 +0000 :: clogin09 :: Submission status of previously pending gfsstage_ic is success, jobid=134280588
2024-06-04 23:40:11 +0000 :: clogin09 :: Submission status of previously pending gfswaveinit is success, jobid=134280589
2024-06-04 23:40:12 +0000 :: clogin09 :: Task gfsstage_ic, jobid=134280588, in state UNKNOWN (F)
2024-06-04 23:40:12 +0000 :: clogin09 :: Task gfswaveinit, jobid=134280589, in state UNKNOWN (F)

I'm not sure what's going on.

@WalterKolczynski-NOAA
Contributor

Something internal to the CI system. Terry is monitoring it closely and manually adjusted some things.

@RussTreadon-NOAA
Contributor Author

OK, @WalterKolczynski-NOAA. I'll stand down.

@TerrenceMcGuinness-NOAA
Collaborator

TerrenceMcGuinness-NOAA commented Jun 5, 2024

@RussTreadon-NOAA as far as I can tell, it looks more like WCOSS2 is having PBS issues and is randomly showing UNKNOWN rocoto status in the CI experiments, which is reflected in those logs. This time it's with C48_S2SW.

terry.mcguinness (clogin02) C48_S2SW_ca024035 $ rocotostat   -w C48_S2SW_ca024035.xml -d C48_S2SW_ca024035.db | head -5
       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
202103231200             gfsstage_ic                   134280588             UNKNOWN                   -         0           0.0
202103231200             gfswaveinit                   134280589             UNKNOWN                   -         0           0.0
202103231200                 gfsfcst                           -                   -                   -         -             -
terry.mcguinness (clogin02) C48_S2SW_ca024035 $

@TerrenceMcGuinness-NOAA
Collaborator

TerrenceMcGuinness-NOAA commented Jun 5, 2024

@WalterKolczynski-NOAA we should add long waits in the rocotostat python code for when we get UNKNOWN. I'm going to restart it one more time and then review it in the morning.
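
The kind of wait suggested here could look roughly like the sketch below. It is an illustration only, not the CI system's actual rocotostat code: `wait_out_unknown`, `get_state`, `max_checks`, and `wait_seconds` are all hypothetical names, and the default values are placeholders rather than tuned settings. The caller supplies `get_state`, a function that returns the current task state as a string.

```python
import time
from typing import Callable


def wait_out_unknown(get_state: Callable[[], str],
                     max_checks: int = 5,
                     wait_seconds: float = 300.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Re-query a task state while it reads UNKNOWN.

    Returns the first state that is not UNKNOWN, or UNKNOWN if the state
    has still not settled after `max_checks` queries.
    """
    state = get_state()
    checks = 1
    while state == "UNKNOWN" and checks < max_checks:
        sleep(wait_seconds)      # give the scheduler time to recover
        state = get_state()
        checks += 1
    return state
```

The `sleep` parameter is injectable so the loop can be exercised in a unit test without actually waiting; a caller would only treat the experiment as failed if UNKNOWN is still returned after the last check.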

@emcbot

emcbot commented Jun 5, 2024

Experiment C48_ATM_ca024035 SUCCESS on Wcoss2 at 06/05/24 02:12:11 AM

@emcbot

emcbot commented Jun 5, 2024

Experiment C48_S2SW_ca024035 SUCCESS on Wcoss2 at 06/05/24 02:42:13 AM

@emcbot

emcbot commented Jun 5, 2024

Experiment C96C48_hybatmDA_ca024035 SUCCESS on Wcoss2 at 06/05/24 03:21:21 AM

@emcbot

emcbot commented Jun 5, 2024

Experiment C96C48_ufs_hybatmDA_ca024035 SUCCESS on Wcoss2 at 06/05/24 03:36:11 AM

@emcbot

emcbot commented Jun 5, 2024

Experiment C96_atm3DVar_extended_ca024035 SUCCESS on Wcoss2 at 06/05/24 09:21:30 AM

@emcbot

emcbot commented Jun 5, 2024

All CI Test Cases Passed on Wcoss2:

Experiment C48_ATM_ca024035 *** SUCCESS *** at 06/05/24 02:12:11 AM
Experiment C48_S2SW_ca024035 *** SUCCESS *** at 06/05/24 02:42:13 AM
Experiment C96C48_hybatmDA_ca024035 *** SUCCESS *** at 06/05/24 03:21:21 AM
Experiment C96C48_ufs_hybatmDA_ca024035 *** SUCCESS *** at 06/05/24 03:36:11 AM
Experiment C96_atm3DVar_extended_ca024035 *** SUCCESS *** at 06/05/24 09:21:30 AM

@RussTreadon-NOAA
Contributor Author

Yeah, WCOSS2-CI Passed!

@WalterKolczynski-NOAA , RussTreadon-NOAA:feature/wcoss2_ufsda is one commit behind the current head of develop. feature/wcoss2_ufsda does not include the two files committed at 67b833e.

Shall I update RussTreadon-NOAA:feature/wcoss2_ufsda? I have no problem doing so, but I am concerned that doing so might trigger another round of CI testing across supported platforms.

@WalterKolczynski-NOAA
Contributor

Nope, we're good.

@RussTreadon-NOAA
Contributor Author

Thank you @WalterKolczynski-NOAA for working with me on this PR and merging it into develop.
Thank you @TerrenceMcGuinness-NOAA for helping us resolve WCOSS2 CI issues.


Labels

CI-Hera-Passed: **Bot use only** CI testing on Hera for this PR has completed successfully
CI-Hercules-Passed: **Bot use only** CI testing on Hercules for this PR has completed successfully
CI-Orion-Passed: **Bot use only** CI testing on Orion for this PR has completed successfully
CI-Wcoss2-Passed: CI testing on WCOSS for this PR has completed successfully

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Update g-w to run JEDI ATM CI on WCOSS2
Activate JEDI ATM DA cycling as part of automated CI

7 participants