Upgrade to spack-stack 1.9.2 on all RDHPCS platforms#890

Merged
RussTreadon-NOAA merged 43 commits into
NOAA-EMC:develop from
DavidHuber-NOAA:feature/ss_191
Jul 9, 2025

Conversation

@DavidHuber-NOAA
Collaborator

@DavidHuber-NOAA DavidHuber-NOAA commented Jun 4, 2025

Description
This updates library versions on all supported platforms (including WCOSS2) and points to new installations of spack-stack 1.9.1 on all supported research platforms. It also drops support for S4, Jet, and Gaea C5 as well as GNU support on Hera. This also updates the sp subroutine calls, which are now part of the ip library and accessed via use sp_mod.

Resolves #884
Resolves #886
Resolves #642
Resolves #662
Resolves #894
Partially addresses #665 (LLVM C compilers are used)

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Regression tests on all platforms.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

@RussTreadon-NOAA
Contributor

@DavidHuber-NOAA , let me test your branch on various RDHPCS machines. You tested Gaea C6, right? I have spack-stack/1.9.1 modulefiles for Gaea C6, Hera, Hercules, and Orion.

@DavidHuber-NOAA
Collaborator Author

@RussTreadon-NOAA Yes, testing on Gaea completed overnight. All tests passed versus develop with the exception of the global_4denvar test. The loproc_updat test completed in 395.289 seconds, which exceeded the time threshold of 318.423 seconds.

Yes, please go ahead with testing on Hera, Hercules, and Orion. I will start testing on WCOSS2.

Comment thread src/gsi/strong_fast_global_mod.f90 Outdated
@RussTreadon-NOAA
Contributor

Ursa test

Build DavidHuber-NOAA:feature/ss_191 at ebf1b95 on Ursa. Run global_4denvar ctest. Test failed with

 NST_INIT_NML_: Initializing default NST namelist variables
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libc.so.6          0000150A3A40C730  Unknown               Unknown  Unknown
gsi.x              0000000000F901A0  general_specmod_m         257  general_specmod.f90

Line 257 of general_specmod.f90 is the call to splegend

      do j=sp%jb,sp%je
        call splegend(sp%iromb,sp%jcap,sp%slat(j),sp%clat(j),sp%eps, &
          sp%epstop,sp%pln(1,j:j),sp%plntop(1,j:j))
      end do

This PR modifies the call to general_specmod.

Comment thread src/gsi/general_specmod.f90 Outdated
@RussTreadon-NOAA
Contributor

Hercules test

Ran rrfs_3denvar_rdasens. Forgot that spack-stack/1.9.1 gsi.x build aborts with mpi_allreduce error on Hercules.

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libc.so.6          00001491F954ED90  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00001491F9E1CC19  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00001491F9E310FA  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00001491F9D11F4E  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00001491F9D123A6  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00001491F9EF3BE4  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00001491F9A98658  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00001491F9A6F080  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00001491F9A5A779  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00001491F9B5F3ED  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00001491F993FD34  PMPI_Allreduce        Unknown  Unknown
libmpifort.so.12.  000014920372C455  mpi_allreduce_        Unknown  Unknown
gsi.x              0000000001446CAA  pcgsoimod_mp_pcgs         360  pcgsoi.f90

This problem has been reported and is being discussed in GSI issue #887

@RussTreadon-NOAA
Contributor

GSI issue #887 documents how modifying the mpi module loads yields a gsi.x which successfully runs rrfs_3denvar_rdasens on Hera and Hercules.

Fixing the global_4denvar failure remains to be done.

RussTreadon-NOAA:feature/ss_191 at c36fec9 was installed on Gaea C6 and Ursa.

Below are ctest results from Gaea C6

Test project /gpfs/f6/ira-sti/scratch/Russ.Treadon/git/gsi/ss_191/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #1: global_4denvar ...................***Failed   60.17 sec
2/6 Test #3: rrfs_3denvar_rdasens .............   Passed  482.56 sec
3/6 Test #6: global_enkf ......................***Failed  606.70 sec
4/6 Test #2: rtma .............................   Passed  724.53 sec
5/6 Test #5: hafs_3denvar_hybens ..............   Passed  964.66 sec
6/6 Test #4: hafs_4denvar_glbens ..............   Passed  1029.32 sec

67% tests passed, 2 tests failed out of 6

Total Test time (real) = 1029.33 sec

The global_enkf failure is due to different updat (spack-stack/1.9.1 build) and contrl (spack-stack/1.6.0 build) analysis results.

Below are ctest results from Ursa

Test project /scratch3/NCEPDEV/da/Russ.Treadon/git/gsi/ss_191/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #1: global_4denvar ...................***Failed   60.14 sec
2/6 Test #3: rrfs_3denvar_rdasens .............***Failed  485.43 sec
3/6 Test #6: global_enkf ......................   Passed  488.62 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  728.94 sec
5/6 Test #2: rtma .............................   Passed  785.84 sec
6/6 Test #4: hafs_4denvar_glbens ..............   Passed  849.34 sec

67% tests passed, 2 tests failed out of 6

Total Test time (real) = 849.36 sec

The rrfs_3denvar_rdasens failure is due to

The memory for rrfs_3denvar_rdasens_loproc_updat is 13549728 KBs.  This has exceeded maximum allowable memory of 12902731 KBs,
resulting in Failure memthresh of the regression test.

The memory check is not a robust test. This is not a fatal failure.
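
For scale, the overage reported in the memthresh message can be worked out from its two figures. The following is only an illustrative shell calculation, not part of the regression scripts; the variable names are made up here.

```shell
#!/usr/bin/env bash
# Values copied from the memthresh message above (in KB).
used_kb=13549728
max_kb=12902731

# Integer arithmetic only: overage in KB and in hundredths of a percent.
over_kb=$(( used_kb - max_kb ))
over_hundredths=$(( over_kb * 10000 / max_kb ))

printf 'over by %d KB (%d.%02d%%)\n' \
  "$over_kb" $(( over_hundredths / 100 )) $(( over_hundredths % 100 ))
```

With these inputs the overage is 646997 KB, roughly 5% above the allowed maximum.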

@TingLei-NOAA
Contributor

A check of fv3SAR01_ens_mem* in Cactus /lfs/h2/emc/ptmp/david.huber/ss_192/tmpreg_rrfs_3denvar_rdasens/rrfs_3denvar_rdasens_loproc_updat shows that the old, un-reformatted netcdf files were used in the test.

russ.treadon@clogin05:/lfs/h2/emc/ptmp/david.huber/ss_192/tmpreg_rrfs_3denvar_rdasens/rrfs_3denvar_rdasens_loproc_updat> ls -lL *ens*fv3*phy*
-rw-r--r-- 1 emc.lam lam 509225068 May 22  2024 fv3SAR01_ens_mem001-fv3_phyvars
-rw-r--r-- 1 emc.lam lam 509225068 May 22  2024 fv3SAR01_ens_mem002-fv3_phyvars
-rw-r--r-- 1 emc.lam lam 509225068 May 22  2024 fv3SAR01_ens_mem003-fv3_phyvars
-rw-r--r-- 1 emc.lam lam 509225068 May 22  2024 fv3SAR01_ens_mem004-fv3_phyvars
-rw-r--r-- 1 emc.lam lam 509225068 May 22  2024 fv3SAR01_ens_mem005-fv3_phyvars

@RussTreadon-NOAA Thanks. I hadn't had a chance to check them myself.

@TingLei-NOAA
Contributor

TingLei-NOAA commented Jul 3, 2025

@TingLei-NOAA I suppose I don't know for certain if the reformatted netcdf files are used or not. I'm also not sure which issue you are referring to: the hang with netCDF 4.9.2 or #894 or perhaps the combination.

The hanging issue I believe was documented in #766 and partially mitigated in #788 for Orion. Since this issue is replicated on WCOSS2, it appears that there is a requirement by the RRFS test to have a certain minimum number of processes.

For #894, it was requested that this PR close that issue.

@DavidHuber-NOAA Thanks for bringing up that old issue #766. Yes, those reports were also about the problem reading fed, so they should be resolved by the recent issue #894 too. I will add a link to that issue.

@TingLei-NOAA
Contributor

@ShunLiu-NOAA Could you also update the physics netCDF-4 files used by the rdas ctest on Cactus? Thanks.
(In fact, those files could be updated on all machines.)

@RussTreadon-NOAA
Contributor

@TingLei-NOAA : Are you satisfied with the changes in this PR?

  • YES: We need your approval to merge this PR into develop
  • NO: What remains to be done before you can approve this PR?

@ShunLiu-NOAA
Contributor

@ShunLiu-NOAA Could you also update the physics netcdf 4 files used by rdas ctest on cactus too? Thanks. (in fact, those files on all machines could be updated ).
@TingLei-daprediction I am syncing the data directory to Cactus now. It should be available on Cactus soon.

@TingLei-NOAA
Contributor

TingLei-NOAA commented Jul 7, 2025

@RussTreadon-NOAA Thanks for asking! I am happy to take this opportunity to explain where I am still hoping for further clarification.

1) It seems we are still not sure whether the failure of the rdas ctest was caused by several issues (including #766, as @DavidHuber-NOAA pointed out), which I believe should have been resolved by the recent issue #894. If not, I can work on them further, including incorporating the recent upgrade of the parallel reading in fv3-reg. So I hope that, with the updated physics files (to be done with help from @ShunLiu-NOAA), it will be confirmed that no regional GSI-specific issues are causing any failures for this PR.

2) Second, and maybe more difficult to address quickly, is the non-reproducibility issue with this PR. It exists for both global and regional GSI. The latter, from my investigation as reported in #894, does not involve any code changes from this PR. It appears that the different compiler caused the differences. Since we have no workaround that lets all ctests pass (and that would also help narrow down the reasons for the differences), I expect to see more discussion and investigation to convince me that we can accept these differences and take the results from the new compiler incorporated in this PR as the control. If I missed any discussion or work on this, please refer me to it. If the current compiler caused the differences, could we consider a different, maybe newer, version of the compiler? If we take the current compiler as the control (merged into develop), what can we say about possible differences between it and future newer compilers? I definitely hope to see more comments from experts on this aspect.

@ShunLiu-NOAA
Contributor

@TingLei-NOAA The data are now available on Cactus.

@RussTreadon-NOAA
Contributor

RussTreadon-NOAA commented Jul 7, 2025

Thank you @TingLei-NOAA for the update.

@mhu, @ShunLiu-NOAA and @DavidBurrows-NCO: In addition to WCOSS2, we need to update the RRFS physics files for the 2023061012 case on the following machines:

| machine | directory | owner |
| --- | --- | --- |
| Gaea C6 | /gpfs/f6/bil-fire8/world-shared/GSI_data/CASES/regtest/regional/rrfs | @DavidBurrows-NCO |
| Ursa | /scratch1/BMC/wrfruc/mhu/code/data/regional/rrfs | @hu5970 |
| MSU (Orion / Hercules) | /work/noaa/wrfruc/mhu/data/regional/rrfs | @hu5970 |
| WCOSS2 (Cactus / Dogwood) | /lfs/h2/emc/lam/noscrub/emc.lam/data/regional/rrfs | @ShunLiu-NOAA |
| Acorn | /lfs/h2/emc/da/noscrub/russ.treadon/CASES/regtest/regional/rrfs | @RussTreadon-NOAA |

I will rsync the WCOSS2 files to Acorn once I get the green light from Shun that the WCOSS2 rsync is complete.

@RussTreadon-NOAA
Contributor

RussTreadon-NOAA commented Jul 7, 2025

@ShunLiu-NOAA , @hu5970 , @DavidBurrows-NCO , please check the box below when the RRFS physics files for the 2023061012 case have been updated on the indicated machine

  • Gaea C6
  • Ursa
  • MSU
  • WCOSS2
  • Acorn

@RussTreadon-NOAA
Contributor

RRFS case missing on Hera

Attempts to run RRFS regression test on Hera fail. Directory /scratch1/BMC/wrfruc/mhu/code/data/regional/rrfs no longer exists. I see /scratch4/BMC/wrfruc/mhu/code. This directory does not contain data/regional/rrfs.

@hu5970 : Have you installed the RRFS case (with or without @TingLei-NOAA 's updated physics files) somewhere on Ursa? If so, where?

@RussTreadon-NOAA
Contributor

@DavidBurrows-NCO and @hu5970 : The following has been done to move this PR forward

  1. rsync /gpfs/f6/bil-fire8/world-shared/GSI_data/CASES/regtest to /gpfs/f6/ira-sti/world-shared/Russ.Treadon/CASES/regtest. Populate rrfs with WCOSS2 files
  2. rsync WCOSS2 rrfs files to /scratch3/NCEPDEV/da/Russ.Treadon/CASES/regtest/regional/rrfs
  3. rsync WCOSS2 rrfs files to /work/noaa/da/rtreadon/CASES/regtest/regional/rrfs

@DavidHuber-NOAA : If you want, you can replace

export casesdir="/gpfs/f6/bil-fire8/world-shared/GSI_data/CASES/regtest"

in the gaeac6 section of regression/regression_var.sh with

export casesdir="/gpfs/f6/ira-sti/world-shared/Russ.Treadon/CASES/regtest"

@DavidBurrows-NCO
Contributor

Thanks @RussTreadon-NOAA I was waiting for Ursa to populate then transfer from there. I don't have WCOSS access. Let me know if you need anything else. Thanks.

@hu5970
Collaborator

hu5970 commented Jul 7, 2025

@RussTreadon-NOAA Thanks for populating the RRFS cases on Hera and Gaea. I did not realize the GSI RRFS case data was in my space and removed it. Did you also populate the RRFS cases on Ursa?

@RussTreadon-NOAA
Contributor

Yes, @hu5970 , the new rrfs case has been rsync'd to Ursa (/scratch3/NCEPDEV/da/Russ.Treadon/CASES/regtest/regional/rrfs).

@hu5970
Collaborator

hu5970 commented Jul 7, 2025

@RussTreadon-NOAA Do you still need me to populate the RRFS case on Hera?

@RussTreadon-NOAA
Contributor

@hu5970 : I copied the WCOSS2 RRFS case directory to Hera.

I do not view Hera as a viable machine for GSI testing. /scratch1 and /scratch2 will be inaccessible at the end of July 2025.

07/31  Hera scratch 1-2 disk access ends (file moves to Ursa must be complete)
08/05*  Hera scratch 1-2 disks decommissioned (Hera CPU nodes still available)

@hu5970
Collaborator

hu5970 commented Jul 7, 2025

@RussTreadon-NOAA Thanks. I agree we should move to Ursa as the major test machine.

@RussTreadon-NOAA
Contributor

@DavidBurrows-NCO

  1. Do you want to update the rrfs files in /gpfs/f6/bil-fire8/world-shared/GSI_data/CASES/regtest/regional/rrfs/2023061012, or
  2. have us update casesdir to /gpfs/f6/ira-sti/world-shared/Russ.Treadon/CASES/regtest/regional/rrfs/2023061012?

If 1, the files to copy are

cp $SOURCE/ens/mem0001/fcst_fv3lam/RESTART/20230610.120000.phy_data.nc $TARGET/ens/mem0001/fcst_fv3lam/RESTART/
cp $SOURCE/ens/mem0002/fcst_fv3lam/RESTART/20230610.120000.phy_data.nc $TARGET/ens/mem0002/fcst_fv3lam/RESTART/
cp $SOURCE/ens/mem0003/fcst_fv3lam/RESTART/20230610.120000.phy_data.nc $TARGET/ens/mem0003/fcst_fv3lam/RESTART/
cp $SOURCE/ens/mem0004/fcst_fv3lam/RESTART/20230610.120000.phy_data.nc $TARGET/ens/mem0004/fcst_fv3lam/RESTART/
cp $SOURCE/ens/mem0005/fcst_fv3lam/RESTART/20230610.120000.phy_data.nc $TARGET/ens/mem0005/fcst_fv3lam/RESTART/
cp $SOURCE/ges/20230610.120000.phy_data.nc $TARGET/ges/

where
SOURCE=/gpfs/f6/ira-sti/world-shared/Russ.Treadon/CASES/regtest/regional/rrfs/2023061012
and
TARGET=/gpfs/f6/bil-fire8/world-shared/GSI_data/CASES/regtest/regional/rrfs/2023061012

If 2, I'll work with @DavidHuber-NOAA to update casesdir in the gaeac6 section of regression/regression_var.sh.
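
The six copies listed for option 1 could equally be run as a loop. This is a minimal sketch, assuming the directory layout shown above already exists under both trees; the function name copy_rrfs_phy is hypothetical and not part of the GSI repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the GSI repository): copy the updated
# phy_data.nc restart files for ensemble members 0001-0005 and the guess.
copy_rrfs_phy() {
  local src=$1 dst=$2
  local f=20230610.120000.phy_data.nc
  local m
  for m in 0001 0002 0003 0004 0005; do
    cp "$src/ens/mem$m/fcst_fv3lam/RESTART/$f" \
       "$dst/ens/mem$m/fcst_fv3lam/RESTART/" || return 1
  done
  cp "$src/ges/$f" "$dst/ges/"
}
```

It would be called as copy_rrfs_phy "$SOURCE" "$TARGET" with the two paths given above.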

@DavidBurrows-NCO
Contributor

@RussTreadon-NOAA It would be great if you could work with @DavidHuber-NOAA to get those files copied into place. It might be even better to move /gpfs/f6/bil-fire8/world-shared/GSI_data out of bil-fire8 space. I'll leave that up to you though. Thanks.

@RussTreadon-NOAA
Contributor

RussTreadon-NOAA commented Jul 8, 2025

Thank you @DavidBurrows-NCO for the quick reply.

@DavidHuber-NOAA, can you replace

export casesdir="/gpfs/f6/bil-fire8/world-shared/GSI_data/CASES/regtest"

in the gaeac6 section of regression/regression_var.sh with

export casesdir="/gpfs/f6/ira-sti/world-shared/Russ.Treadon/CASES/regtest"

The above modification has been made in /gpfs/f6/ira-sti/scratch/Russ.Treadon/git/gsi/pr890/regression/regression_var.sh
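
For anyone applying the same change by hand, one way to make the swap is sketched below. It assumes the old path appears only in the gaeac6 casesdir line of regression_var.sh; the function name swap_casesdir is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the casesdir assignment in the given file.
# Assumption: the old path occurs only in the gaeac6 section.
swap_casesdir() {
  local file=$1
  local old='/gpfs/f6/bil-fire8/world-shared/GSI_data/CASES/regtest'
  local new='/gpfs/f6/ira-sti/world-shared/Russ.Treadon/CASES/regtest'
  # '|' as the sed delimiter avoids escaping the slashes in the paths.
  sed "s|casesdir=\"$old\"|casesdir=\"$new\"|" "$file" > "$file.tmp" \
    && mv "$file.tmp" "$file"
}
```

Run from the top of the clone, this would be swap_casesdir regression/regression_var.sh; a git diff afterwards should show a single changed line.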

@RussTreadon-NOAA
Contributor

@DavidHuber-NOAA : GSI PR #3 has been created to change the Gaea C6 path for casesdir from @DavidBurrows-NCO 's directory to my directory.

Once PR #3 is merged into DavidHuber-NOAA:feature/ss_191 we can forward this PR to the GSI handling review team.

Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Approve

10 participants