Move to contrib installation of spack-stack on Jet #2878

Merged
WalterKolczynski-NOAA merged 32 commits into NOAA-EMC:develop from InnocentSouopgui-NOAA:migration-jet-contrib
Sep 30, 2024
Conversation

@InnocentSouopgui-NOAA
Contributor

@InnocentSouopgui-NOAA InnocentSouopgui-NOAA commented Aug 29, 2024

Description

Migrates Global Workflow to use the contrib installation of spack-stack on Jet.
Following the failure of the /lfs4 storage on Jet, the spack-stack installation was moved to /contrib.
All software relying on spack-stack on Jet needs to be updated.

Resolves #2841
Refs NOAA-EMC/gfs-utils#78
Refs NOAA-EMC/GSI#786
Refs NOAA-EMC/GSI-Monitor#143
Refs NOAA-EMC/GSI-utils#51
Refs ufs-community/UFS_UTILS#977

Type of change

  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

How has this been tested?

Example:

  • Clone and build on Jet

  • Cycled experiments (48+ hours) at resolutions

    • 96/48 on kjet
    • 192/96 on kjet
    • 384/192 on kjet
  • Forecast only experiment (48+ hours) at resolutions

    • 48
    • 96
    • 192
    • 384

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • I have made corresponding changes to the documentation if necessary

@InnocentSouopgui-NOAA
Contributor Author

I am getting errors of the form below in the forecast steps (tasks gdasfcst_seg0 and all enkfgdasfcst_mem###)
while running C192/C96 on Jet. They occurred from the very first cycle, so not a single cycle completed.

@DavidHuber-NOAA

21: Warn_K=   6 (i,j)=   87   12 (lon,lat)=123.209 -43.765 VA = 264.64157
21:      K=   5    338.73022
21:      K=   7    217.40834
21: Warn_K=   6 (i,j)=   88   13 (lon,lat)=121.423 -43.705 VA = 250.69765
21:      K=   5    297.46832
21:      K=   7    213.49741
21: Warn_K=   6 (i,j)=   84   16 (lon,lat)=122.124 -46.976 VA = 251.07759
21:      K=   5    256.00562
21:      K=   7    241.60887
 0: PASS: fcstRUN phase 2, n_atmsteps =               27 time is         1.264142
 5:
 5: FATAL from PE     5: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
 5:
13:
13: FATAL from PE    13: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
13:
21:
21: FATAL from PE    21: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
21:
13: Image              PC                Routine            Line        Source
13: ufs_model.x        00000000086C51A7  Unknown               Unknown  Unknown
13: ufs_model.x        00000000078AD1B9  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc
13: ufs_model.x        0000000007B61AB6  mpp_efp_mod_mp_mp         195  mpp_efp.F90
13: ufs_model.x        0000000007AB2D99  mpp_domains_mod_m         143  mpp_global_sum.fh
13: ufs_model.x        0000000003E9EB6A  fv_grid_utils_mod        3077  fv_grid_utils.F90
13: ufs_model.x        0000000003F2ED3E  fv_mapz_mod_mp_la         794  fv_mapz.F90
13: libiomp5.so        0000146B48A6CBB3  __kmp_invoke_micr     Unknown  Unknown
13: libiomp5.so        0000146B489E8FAC  __kmp_fork_call       Unknown  Unknown
13: libiomp5.so        0000146B489AACB5  __kmpc_fork_call      Unknown  Unknown
13: ufs_model.x        0000000003F2A129  fv_mapz_mod_mp_la         683  fv_mapz.F90
13: ufs_model.x        0000000003E2EE61  fv_dynamics_mod_m         771  fv_dynamics.F90
13: ufs_model.x        0000000003C9236C  atmosphere_mod_mp         688  atmosphere.F90
13: ufs_model.x        0000000003A3490D  atmos_model_mod_m         879  atmos_model.F90
13: ufs_model.x        00000000035F688C  module_fcst_grid_        1335  module_fcst_grid_comp.F90
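When a forecast log interleaves output from many MPI ranks like the excerpt above, it can help to collect the FATAL lines and see which PEs reported which message. The following is a minimal sketch, assuming the log lines keep the `rank: FATAL from PE  N: message` layout shown here; the helper name `summarize_fatals` and the `SAMPLE` string are illustrative and not part of global-workflow or the UFS model.

```python
import re

# Matches lines such as:
#   " 5: FATAL from PE     5: NaN in input field of mpp_reproducing_sum(_2d), ..."
# Group 1 is the rank prefix, group 2 the PE number, group 3 the message.
FATAL_RE = re.compile(r"^\s*(\d+):\s*FATAL from PE\s+(\d+):\s*(.*)$")


def summarize_fatals(log_text):
    """Return {message: sorted list of PE numbers} for every FATAL line."""
    by_message = {}
    for line in log_text.splitlines():
        match = FATAL_RE.match(line)
        if match:
            pe = int(match.group(2))
            by_message.setdefault(match.group(3).strip(), set()).add(pe)
    return {msg: sorted(pes) for msg, pes in by_message.items()}


# A few lines copied from the log excerpt in this comment.
SAMPLE = """\
 0: PASS: fcstRUN phase 2, n_atmsteps =               27 time is         1.264142
 5:
 5: FATAL from PE     5: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
13: FATAL from PE    13: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
21: FATAL from PE    21: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
"""

if __name__ == "__main__":
    for message, pes in summarize_fatals(SAMPLE).items():
        print(f"PEs {pes}: {message}")
```

Grouping by message rather than by rank makes it easy to see whether all failing PEs hit the same error, as they do in this excerpt.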

@InnocentSouopgui-NOAA
Contributor Author

It seems that particular start date is just a bad day.

@InnocentSouopgui-NOAA
Contributor Author

I built initial conditions for other days and they cycled smoothly.

@InnocentSouopgui-NOAA InnocentSouopgui-NOAA marked this pull request as ready for review September 5, 2024 19:57
@RussTreadon-NOAA
Contributor

GSI PR #787 has been merged into GSI develop. Done at 9f44c87.

The sorc/gsi_enkf.fd hash in InnocentSouopgui-NOAA:migration-jet-contrib must be updated to 9f44c87 to bring these changes into g-w.

@InnocentSouopgui-NOAA InnocentSouopgui-NOAA marked this pull request as draft September 6, 2024 16:13
@InnocentSouopgui-NOAA
Contributor Author

A check is failing with the following message:

fatal: Fetched in submodule path 'ufs_utils.fd', but it did not contain 0426bf793051530794ec8f182e04f5cf129d0a90. Direct fetching of that commit failed.

How can I diagnose what is going on?

@DavidHuber-NOAA
Contributor

@InnocentSouopgui-NOAA I suspect that the hash you are pointing to is for your own branch. Update the hash instead to ufs-community/UFS_UTILS@06eec5b.
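For diagnosing this kind of failure, a first step is to pull the submodule path and the missing commit out of the git message so they can be checked against the upstream repository. The sketch below assumes the message has the wording quoted in the failing check above; the function name `parse_missing_submodule_commit` is hypothetical.

```python
import re

# Pattern for the message quoted in the failing check:
#   fatal: Fetched in submodule path '<path>', but it did not contain <sha>. ...
MISSING_COMMIT_RE = re.compile(
    r"Fetched in submodule path '(?P<path>[^']+)', "
    r"but it did not contain (?P<sha>[0-9a-f]{7,40})"
)


def parse_missing_submodule_commit(message):
    """Return (submodule_path, sha) from the fatal message, or None if absent."""
    match = MISSING_COMMIT_RE.search(message)
    if match is None:
        return None
    return match.group("path"), match.group("sha")


if __name__ == "__main__":
    msg = (
        "fatal: Fetched in submodule path 'ufs_utils.fd', but it did not contain "
        "0426bf793051530794ec8f182e04f5cf129d0a90. Direct fetching of that commit failed."
    )
    print(parse_missing_submodule_commit(msg))
```

With the path and hash in hand, running `git cat-file -e <sha>^{commit}` inside a clone of the upstream repository is one way to check whether the commit exists there. If it exists only on a fork, as suspected here, the pinned hash has to be replaced with one from upstream.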

@InnocentSouopgui-NOAA
Contributor Author

@InnocentSouopgui-NOAA I suspect that the hash you are pointing to is for your own branch. Update the hash instead to ufs-community/UFS_UTILS@06eec5b.

Thanks @DavidHuber-NOAA, that was the issue.

@DavidHuber-NOAA
Contributor

@InnocentSouopgui-NOAA Just a heads up, there is a bug in the newest GSI-utils that will cause the gdasanalcalc job to fail when performing GDASApp analyses (i.e. the C96C48_ufs_hybatmDA CI test) as noted in #2819 (comment).

@InnocentSouopgui-NOAA
Contributor Author

@KateFriedman-NOAA, @DavidHuber-NOAA The automated tests failed on WCOSS2. Can you have a look and investigate further? The error says "CPU oversubscription detected for application". I do not have access to WCOSS2 to check what is going on.

That looks similar to a problem that cropped up in PR #2895; @DavidHuber-NOAA had that PR include updates to the prep job resources to resolve it on WCOSS2. Sync your PR branch with g-w develop and we'll retry the WCOSS2 CI with those updated prep job resources.

I just updated my branch.

@KateFriedman-NOAA KateFriedman-NOAA added CI-Wcoss2-Ready PR is ready for CI testing on WCOSS2. and removed CI-Wcoss2-Failed CI testing on WCOSS for this PR has failed labels Sep 26, 2024
@emcbot emcbot added CI-Wcoss2-Building CI testing is cloning/building on WCOSS2 and removed CI-Wcoss2-Ready PR is ready for CI testing on WCOSS2. labels Sep 26, 2024
@emcbot

emcbot commented Sep 26, 2024

CI Update on Wcoss2 at 09/26/24 01:26:06 PM
============================================
Cloning and Building global-workflow PR: 2878
with PID: 111294 on host: clogin03

@emcbot emcbot added CI-Wcoss2-Running CI testing on WCOSS for this PR is in-progress and removed CI-Wcoss2-Building CI testing is cloning/building on WCOSS2 labels Sep 26, 2024
@emcbot

emcbot commented Sep 26, 2024

Automated global-workflow Testing Results:

Machine: Wcoss2
Start: Thu Sep 26 13:31:59 UTC 2024 on clogin03
---------------------------------------------------
Build: Completed at 09/26/24 02:09:45 PM
Case setup: Completed for experiment C48_ATM_e6c639aa
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_e6c639aa
Case setup: Skipped for experiment C48_S2SWA_gefs_e6c639aa
Case setup: Completed for experiment C48_S2SW_e6c639aa
Case setup: Completed for experiment C96_atm3DVar_extended_e6c639aa
Case setup: Skipped for experiment C96_atm3DVar_e6c639aa
Case setup: Completed for experiment C96C48_hybatmaerosnowDA_e6c639aa
Case setup: Completed for experiment C96C48_hybatmDA_e6c639aa
Case setup: Completed for experiment C96C48_ufs_hybatmDA_e6c639aa

@emcbot emcbot added CI-Wcoss2-Failed CI testing on WCOSS for this PR has failed and removed CI-Wcoss2-Running CI testing on WCOSS for this PR is in-progress labels Sep 26, 2024
@emcbot

emcbot commented Sep 26, 2024

Experiment C96_atm3DVar_extended_e6c639aa FAIL on Wcoss2 at 09/26/24 11:14:42 PM

Error logs:

/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f095.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f096.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f097.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f098.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f099.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f100.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f101.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f102.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f103.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f104.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f105.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f106.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f107.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f108.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsfcst_seg0.log

Follow link here to view the contents of the above file(s): (link)

@InnocentSouopgui-NOAA
Contributor Author

@KateFriedman-NOAA, @DavidHuber-NOAA

WCOSS2 has a disk space issue.

Disk quota exceeded in gfsfcst_seg0.log
FATAL ERROR: write error in gfsatmos_prod_f###.log
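Failures like these come from the platform rather than from the code under test, so it can be useful to flag them before digging into the model output. Here is a minimal sketch, assuming the log files contain the two strings quoted above; the names `INFRA_PATTERNS` and `classify_log` are hypothetical and not part of the CI scripts.

```python
import re

# Substrings taken from the two messages quoted in this comment. Whether the
# real logs use exactly this wording is an assumption.
INFRA_PATTERNS = {
    "disk quota": re.compile(r"Disk quota exceeded"),
    "write error": re.compile(r"FATAL ERROR: write error"),
}


def classify_log(text):
    """Return the sorted names of infrastructure patterns found in a log's text."""
    return sorted(name for name, pattern in INFRA_PATTERNS.items() if pattern.search(text))


if __name__ == "__main__":
    sample = "cp: error writing 'out.nc': Disk quota exceeded\n"
    print(classify_log(sample))
```

An empty result means neither pattern matched, which suggests the failure deserves a closer look at the job itself rather than a simple CI retry.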

@KateFriedman-NOAA
Contributor

Hmmmmmm the stmp quota on WCOSS2-Cactus is currently at 66%. It's possible it hit 100% and then fell overnight because of the scrubber. We'd need to retry the CI test again.

@KateFriedman-NOAA KateFriedman-NOAA added CI-Wcoss2-Ready PR is ready for CI testing on WCOSS2. and removed CI-Wcoss2-Failed CI testing on WCOSS for this PR has failed labels Sep 27, 2024
@emcbot emcbot added CI-Wcoss2-Building CI testing is cloning/building on WCOSS2 and removed CI-Wcoss2-Ready PR is ready for CI testing on WCOSS2. labels Sep 27, 2024
@emcbot

emcbot commented Sep 27, 2024

CI Update on Wcoss2 at 09/27/24 02:06:07 PM
============================================
Cloning and Building global-workflow PR: 2878
with PID: 122735 on host: clogin03

@emcbot emcbot added CI-Wcoss2-Running CI testing on WCOSS for this PR is in-progress and removed CI-Wcoss2-Building CI testing is cloning/building on WCOSS2 labels Sep 27, 2024
@emcbot

emcbot commented Sep 27, 2024

Automated global-workflow Testing Results:

Machine: Wcoss2
Start: Fri Sep 27 14:11:52 UTC 2024 on clogin03
---------------------------------------------------
Build: Completed at 09/27/24 02:49:30 PM
Case setup: Completed for experiment C48_ATM_db407437
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_db407437
Case setup: Skipped for experiment C48_S2SWA_gefs_db407437
Case setup: Completed for experiment C48_S2SW_db407437
Case setup: Completed for experiment C96_atm3DVar_extended_db407437
Case setup: Skipped for experiment C96_atm3DVar_db407437
Case setup: Completed for experiment C96C48_hybatmaerosnowDA_db407437
Case setup: Completed for experiment C96C48_hybatmDA_db407437
Case setup: Completed for experiment C96C48_ufs_hybatmDA_db407437

@emcbot emcbot added CI-Wcoss2-Passed CI testing on WCOSS for this PR has completed successfully and removed CI-Wcoss2-Running CI testing on WCOSS for this PR is in-progress labels Sep 28, 2024
@emcbot

emcbot commented Sep 28, 2024

All CI Test Cases Passed on Wcoss2:

Experiment C48_ATM_db407437 *** SUCCESS *** at 09/27/24 04:21:17 PM
Experiment C48_S2SW_db407437 *** SUCCESS *** at 09/27/24 04:35:12 PM
Experiment C96C48_hybatmDA_db407437 *** SUCCESS *** at 09/27/24 05:28:22 PM
Experiment C96C48_hybatmaerosnowDA_db407437 *** SUCCESS *** at 09/27/24 06:07:28 PM
Experiment C96C48_ufs_hybatmDA_db407437 *** SUCCESS *** at 09/27/24 07:21:21 PM
Experiment C96_atm3DVar_extended_db407437 *** SUCCESS *** at 09/28/24 03:56:31 AM

@WalterKolczynski-NOAA WalterKolczynski-NOAA merged commit 8f0541c into NOAA-EMC:develop Sep 30, 2024

Labels

CI-Hera-Passed: **Bot use only** CI testing on Hera for this PR has completed successfully
CI-Wcoss2-Passed: CI testing on WCOSS for this PR has completed successfully

Development

Successfully merging this pull request may close these issues.

Migrate Jet to /lfs5

7 participants