
Upgrade to spack-stack 1.9.2 #3798

Merged
DavidHuber-NOAA merged 204 commits into NOAA-EMC:develop from DavidHuber-NOAA:feature/191 on Aug 15, 2025

@DavidHuber-NOAA
Contributor

@DavidHuber-NOAA DavidHuber-NOAA commented Jun 13, 2025

Description

This updates the global workflow and its subcomponents to spack-stack 1.9.2. This PR also includes a partial port to Ursa, though no global workflow tests have been run there yet, nor can they be.

Fixes #2984
Fixes #3920
Fixes #3921
Fixes #3922
Fixes #3923
Fixes #3852
Fixes #2756
Fixes #3934

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

How has this been tested?

  • Full suite of tests on all platforms

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • Any new scripts have been added to the .github/CODEOWNERS file with owners
  • I have made corresponding changes to the system documentation if necessary

Comment thread ush/load_gw_run_modules.sh Fixed
Comment thread ush/load_gw_run_modules.sh Fixed
@DavidHuber-NOAA
Contributor Author

@JessicaMeixner-NOAA @bbakernoaa @aerorahul The C96mx100_S2S test is failing during the forecast step due to apparent instabilities on Hercules:

FATAL from PE     1: There were a total of     43254 locations detected with extreme surface values!

The log file can be found here: /work2/noaa/stmp/dhuber/HERCULES/para_192d/COMROOT/C96mx100_S2S_192d/logs/1994050100/sfs_fcst_mem000_seg0.log

I am inclined to disable this test on Hercules and open an issue to investigate it. Thoughts?
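
For anyone triaging similar failures, here is a minimal sketch of pulling the fatal messages out of a forecast log. It assumes the messages follow the `FATAL from PE <n>: <message>` shape of the line quoted above; the function name and sample text are illustrative only and are not part of global-workflow.

```python
import re

# Matches lines shaped like the one quoted above: "FATAL from PE     1: <message>"
FATAL_RE = re.compile(r"^FATAL from PE\s+(\d+):\s*(.*)$")


def find_fatal_lines(log_text):
    """Return (pe, message) tuples for every FATAL line in the log text."""
    hits = []
    for line in log_text.splitlines():
        match = FATAL_RE.match(line.strip())
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


sample = (
    "some earlier model output\n"
    "FATAL from PE     1: There were a total of     43254 locations "
    "detected with extreme surface values!\n"
)
print(find_fatal_lines(sample))
```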

@JessicaMeixner-NOAA
Contributor

I agree - however, SFS is using hercules for a lot of their runs, so we should also have @NeilBarton-NOAA and @XiaqiongZhou-NOAA weigh in.

@NeilBarton-NOAA
Contributor

This test should not be deleted. I'll look at the logs

@DavidHuber-NOAA
Contributor Author

@NeilBarton-NOAA Just clarifying that I'm not proposing that we delete the test. I would just make the following change:

--- a/dev/ci/cases/pr/C96mx100_S2S.yaml
+++ b/dev/ci/cases/pr/C96mx100_S2S.yaml
@@ -20,3 +20,4 @@ arguments:
 skip_ci_on_hosts:
   - gaeac5
   - awsepicglobalworkflow
+  - hercules

This would disable the test on Hercules while the issue can be investigated.
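
The effect of that list can be sketched as follows. This is a hypothetical illustration, not the workflow's actual CI code: the helper name `should_skip` and the case-insensitive host comparison are assumptions. Only the `skip_ci_on_hosts` key and the host names come from the diff above.

```python
# Case definition as a plain dict, mirroring the YAML fragment in the diff above.
case = {
    "skip_ci_on_hosts": ["gaeac5", "awsepicglobalworkflow", "hercules"],
}


def should_skip(case_config, host):
    """Return True if the CI case lists this host under skip_ci_on_hosts."""
    skip_hosts = case_config.get("skip_ci_on_hosts", [])
    # Assumption: host names are compared case-insensitively.
    return host.lower() in {h.lower() for h in skip_hosts}


print(should_skip(case, "hercules"))  # True
print(should_skip(case, "orion"))     # False
```

The case file itself stays in the repository, so removing the host from the list later re-enables the test there.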

@NeilBarton-NOAA
Contributor

SFS routinely runs on hercules.

@DavidHuber-NOAA
Contributor Author

DavidHuber-NOAA commented Aug 15, 2025

I wonder if this could be related to the change to the stack-oneapi module file seen on 8/13 (noted here). Could there be an issue with the SFS model building with Intel LLVM compilers?

@JessicaMeixner-NOAA
Contributor

JessicaMeixner-NOAA commented Aug 15, 2025

I have successful runs of S2S from 8/14: /work2/noaa/marine/jmeixner/hercules/gw3798_20250814/t3798/COMROOT/C48_S2SW_t3798

and the C96:

/work2/noaa/marine/jmeixner/hercules/gw3798_20250814/t3798/COMROOT/C96mx100_S2S_t3798/logs

@DavidHuber-NOAA
Contributor Author

@JessicaMeixner-NOAA Could you run these to open up permissions on the log directories?

find /work2/noaa/marine/jmeixner/hercules/gw3798_20250814/t3798/COMROOT/* -type d -exec chmod o+rx {} +
find /work2/noaa/marine/jmeixner/hercules/gw3798_20250814/t3798/COMROOT/* -type f -exec chmod o+r {} +
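
For reference, the two `find` commands add read and traverse permission for other users on directories, and read permission for other users on regular files. A rough Python equivalent is sketched below. The function name is made up for illustration, and unlike the shell glob (`COMROOT/*`) it also visits hidden entries directly under the root. Like the shell version, it leaves the root directory itself unchanged.

```python
import os
import stat


def open_for_others(root):
    """Add o+rx to directories and o+r to regular files below root."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # find -type d does not match symlinks
                mode = stat.S_IMODE(os.stat(path).st_mode)
                os.chmod(path, mode | stat.S_IROTH | stat.S_IXOTH)
        for name in filenames:
            path = os.path.join(dirpath, name)
            # find -type f matches regular files only, not symlinks
            if os.path.isfile(path) and not os.path.islink(path):
                mode = stat.S_IMODE(os.stat(path).st_mode)
                os.chmod(path, mode | stat.S_IROTH)
```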

@CatherineThomas-NOAA
Contributor

Tagging @EdwardSafford-NOAA for awareness:

> Ursa update:
>
> We need a minor change in ursa.env - I'll post in code with that update.
>
> vminmon jobs are failing
>
> sample logs: /scratch3/NCEPDEV/climate/Jessica.Meixner/gwpr3798/ut03/COMROOT/C48mx500_3DVarAOWCDA_ut03/logs/2021032500/gfs_vminmon.log.0


@JessicaMeixner-NOAA This indicates that an additional Perl module is required. I made the same request for other systems after the Rocky 8 updates. I'll make a request shortly to the RDHPCS help desk.

@JessicaMeixner-NOAA
Contributor

@DavidNew-NOAA @RussTreadon-NOAA

I'm running this branch on Ursa, and for C96C48_ufs_hybatmDA, atmanlvar and atmensanlobs both run out of memory. See logs here: /scratch3/NCEPDEV/climate/Jessica.Meixner/gwpr3798/ut03/COMROOT/C96C48_ufs_hybatmDA_ut03/logs/2024022400/

@JessicaMeixner-NOAA
Contributor

@CoryMartin-NOAA - An out of memory error on Ursa: /scratch3/NCEPDEV/climate/Jessica.Meixner/gwpr3798/ut03/COMROOT/C96_gcafs_cycled_ut03/logs/2021122018/gcdas_aeroanlvar.log.0

@bbakernoaa @lipan-NOAA - I don't quite understand this error, b/c I don't think these jobs actually ran out of walltime, but the gcafs forecasts are failing w/ "out of walltime" on Ursa using this branch.
Logs are here:
/scratch3/NCEPDEV/climate/Jessica.Meixner/gwpr3798/ut03/COMROOT/C96_gcafs_cycled_ut03/logs
/scratch3/NCEPDEV/climate/Jessica.Meixner/gwpr3798/ut03/COMROOT/C96_gcafs_cycled_noDA_ut03/logs

@CoryMartin-NOAA
Contributor

Are we only using one node on Ursa? I see 96 cores. Can we change it to max 48 tasks per node for Ursa for the aeroanl job?
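
The reasoning behind that suggestion is plain arithmetic: capping tasks per node spreads the same job over more nodes, so each task gets a larger share of a node's memory. A small sketch follows, in which the 96-task job size is an assumption based on the core count mentioned above and the node memory figure is a placeholder rather than Ursa's actual specification:

```python
import math


def nodes_and_memory(total_tasks, max_tasks_per_node, node_memory_gb):
    """Return (nodes needed, memory available per task in GB)."""
    nodes = math.ceil(total_tasks / max_tasks_per_node)
    # Tasks placed on the fullest node share that node's memory.
    tasks_on_fullest_node = min(total_tasks, max_tasks_per_node)
    return nodes, node_memory_gb / tasks_on_fullest_node


NODE_MEMORY_GB = 384  # placeholder value, not a published Ursa figure

print(nodes_and_memory(96, 96, NODE_MEMORY_GB))  # (1, 4.0)
print(nodes_and_memory(96, 48, NODE_MEMORY_GB))  # (2, 8.0)
```

Halving the tasks per node doubles the memory available to each task, at the cost of a second node.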

@bbakernoaa
Contributor

> @CoryMartin-NOAA - An out of memory error on Ursa: /scratch3/NCEPDEV/climate/Jessica.Meixner/gwpr3798/ut03/COMROOT/C96_gcafs_cycled_ut03/logs/2021122018/gcdas_aeroanlvar.log.0
>
> @bbakernoaa @lipan-NOAA - I don't quite understand this error, b/c I don't think these jobs actually ran out of walltime, but the gcafs forecasts are failing w/ "out of walltime" on Ursa using this branch.
> Logs are here:
> /scratch3/NCEPDEV/climate/Jessica.Meixner/gwpr3798/ut03/COMROOT/C96_gcafs_cycled_ut03/logs
> /scratch3/NCEPDEV/climate/Jessica.Meixner/gwpr3798/ut03/COMROOT/C96_gcafs_cycled_noDA_ut03/logs

/scratch3/NCEPDEV/climate/Jessica.Meixner/gwpr3798/ut03/COMROOT/C96_gcafs_cycled_ut03/logs/2021122012/gcdas_fcst_seg0.log looks like it completed. Am I missing something?

Comment thread dev/parm/config/gfs/config.resources.URSA
@DavidHuber-NOAA
Contributor Author

I've been informed that the spack-stack issues have been resolved on Hercules and Orion. I'm going to recompile the SFS model and relaunch the forecasts to see if that resolves the issue.

@RussTreadon-NOAA
Contributor

Confirmed that GSI builds on Orion and Hercules now load the correct fortran compiler.

-- The C compiler identification is IntelLLVM 2024.2.1
-- The Fortran compiler identification is Intel 2021.1.0.20240703

Test done with GSI develop at 3be73271. This is the commit prior to adding

setenv("CC","mpiicc")
setenv("CXX","mpiicpc")
setenv("FC","mpiifort")

to gsi_hercules.intel.lua and gsi_orion.intel.lua.
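
A quick way to check a build log for this is to pull out CMake's compiler identification lines. The sketch below assumes the `-- The <LANG> compiler identification is <ID> <VERSION>` line format shown above; the function name is illustrative and not part of GSI or global-workflow.

```python
import re

# Matches e.g. "-- The Fortran compiler identification is Intel 2021.1.0.20240703"
ID_RE = re.compile(r"^-- The (\w+) compiler identification is (\S+) (\S+)$")


def compiler_ids(cmake_output):
    """Map language name -> (compiler id, version) from CMake configure output."""
    found = {}
    for line in cmake_output.splitlines():
        match = ID_RE.match(line.strip())
        if match:
            lang, comp_id, version = match.groups()
            found[lang] = (comp_id, version)
    return found


log = """\
-- The C compiler identification is IntelLLVM 2024.2.1
-- The Fortran compiler identification is Intel 2021.1.0.20240703
"""
ids = compiler_ids(log)
print(ids["Fortran"])  # ('Intel', '2021.1.0.20240703')
```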

@DavidHuber-NOAA
Contributor Author

DavidHuber-NOAA commented Aug 15, 2025

I know the issue for the SFS forecasts now. I made a local change to config.ufs to increase layout_x and layout_y for the C96C48_hybatmDA test case and copied it to all tests. However, the GFS config.ufs is not compatible with SFS. I'm relaunching the C96mx100_S2S case now with a clean EXPDIR.

@DavidHuber-NOAA
Contributor Author

SFS forecasts have made it to the 12-hour mark without any signs of instability.

@NeilBarton-NOAA
Contributor

great. thanks @DavidHuber-NOAA

@DavidHuber-NOAA
Contributor Author

All tests passed on Hercules 🎉:

  • C48_ATM_192d
  • C48mx500_3DVarAOWCDA_192d
  • C48mx500_hybAOWCDA_192d
  • C48_S2SW_192d
  • C48_S2SWA_gefs_192d
  • C96_atm3DVar_192d
  • C96C48_hybatmDA_192d
  • C96C48mx500_S2SW_cyc_gfs_192d
  • C96mx100_S2S_192e

@DavidHuber-NOAA
Contributor Author

Requesting final approvals for this PR.

Contributor

@CoryMartin-NOAA CoryMartin-NOAA left a comment


Contributor

@JessicaMeixner-NOAA JessicaMeixner-NOAA left a comment


Yay!!!!!

@bbakernoaa - thanks for helping get all the gocart model update parts together.

Huge thanks to @DavidHuber-NOAA for all the work on getting all of the spack-stack pieces together!!

@aerorahul
Contributor

🎉

@aerorahul
Contributor

Thank you all for your effort in getting this and several other issues resolved with this PR.


Labels

CI-Gaeac6-Passed: **Bot use only** CI testing on Gaea C6 for this PR has completed successfully
CI-Hera-Passed: **Bot use only** CI testing on Hera for this PR has completed successfully
CI-Hercules-Passed: **Bot use only** CI testing on Hercules for this PR has completed successfully
CI-Orion-Passed: **Bot use only** CI testing on Orion for this PR has completed successfully
CI-Wcoss2-Passed: CI testing on WCOSS for this PR has completed successfully

Projects

Status: Done