[develop] Jet switch from CentOS to Rocky #1045

Merged
MichaelLueken merged 19 commits into ufs-community:develop from RatkoVasic-NOAA:Jet-Rocky8 on Mar 13, 2024

@RatkoVasic-NOAA
Collaborator

DESCRIPTION OF CHANGES:

Jet is switching from CentOS to Rocky OS.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • hercules.intel
  • cheyenne.intel
  • cheyenne.gnu
  • derecho.intel
  • gaea.intel
  • gaeac5.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests

ISSUE:

Solves issue #1044

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

@MichaelLueken MichaelLueken changed the title Jet switch from CentOS to Rocky [develop] Jet switch from CentOS to Rocky Feb 29, 2024
@MichaelLueken MichaelLueken linked an issue Feb 29, 2024 that may be closed by this pull request
@MichaelLueken MichaelLueken added the enhancement New feature or request label Feb 29, 2024
Collaborator

@MichaelLueken MichaelLueken left a comment

@RatkoVasic-NOAA -

The fundamental tests were run on Jet Rocky8 and all successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              10.56
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              14.39
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.49
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.27
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022  COMPLETE              34.57
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240229185  COMPLETE              25.68
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024022918551  COMPLETE              23.57
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             132.53

Approving now.

@MichaelLueken
Collaborator

The fundamental tests were also successfully run on Jet using CentOS:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE               9.10
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              15.51
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               8.28
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.07
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022  COMPLETE              27.90
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240229203  COMPLETE              21.75
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024022920365  COMPLETE              21.27
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             119.88
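For anyone who wants to cross-check the reported totals, here is a minimal Python sketch that parses a summary table like the one above and re-adds the core hours. The function name `parse_summary` is hypothetical (it is not part of the SRW App), and the Rocky8 total is copied from the earlier comment in this thread.

```python
import re

# A data row looks like: "<experiment name>  <STATUS>  <core hours>".
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(\d+(?:\.\d+)?)\s*$")


def parse_summary(text):
    """Return (rows, reported_total) for a pasted WE2E summary table.

    rows is a list of (name, status, core_hours) tuples; header and
    separator lines are skipped because they do not match ROW.
    """
    rows, reported_total = [], None
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        name, status, hours = match.group(1), match.group(2), float(match.group(3))
        if name == "Total":
            reported_total = hours
        else:
            rows.append((name, status, hours))
    return rows, reported_total


# The CentOS table from this comment, pasted verbatim.
CENTOS = """\
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE               9.10
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              15.51
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               8.28
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.07
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022  COMPLETE              27.90
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240229203  COMPLETE              21.75
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024022920365  COMPLETE              21.27
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             119.88
"""

rows, reported = parse_summary(CENTOS)
computed = round(sum(hours for _, _, hours in rows), 2)
all_complete = all(status == "COMPLETE" for _, status, _ in rows)

rocky8_total = 132.53  # reported total from the Rocky8 run earlier in this thread
delta = rocky8_total - reported  # extra core hours used by the Rocky8 run

print(f"computed={computed} reported={reported} all_complete={all_complete}")
print(f"Rocky8 minus CentOS: {delta:.2f} core hours")
```

Rows are kept in a list rather than a dictionary because the experiment names in the summary are truncated and could, in principle, collide.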

@EdwardSnyder-NOAA
Collaborator

Built the SRW App on Rocky 8 using the changes from this PR and ensured the changes worked by running this case: /lfs4/HFIP/hfv3gfs/Edward.Snyder/PR_1045/expt_dirs/grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Mar 1, 2024
@natalie-perlin
Collaborator

Fundamental tests ran successfully on Jet (xjet):

All 7 experiments finished
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE               9.90
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              13.67
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.12
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.18
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024030  COMPLETE              30.38
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240301215  COMPLETE              22.14
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024030121531  COMPLETE              22.77
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             122.16

Detailed summary written to /mnt/lfs4/HFIP/hfv3gfs/Natalie.Perlin/SRW/expt_dirs/WE2E_summary_20240301223112.txt

Collaborator

@natalie-perlin natalie-perlin left a comment

Ran the tests, with the following changes made to allow testing for Rocky 8 OS:

modulefiles/build_jet_intel.lua:
uncommented
prepend_path("MODULEPATH","/mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.0/envs/unified-env-rocky8/install/modulefiles/Core")

commented out
prepend_path("MODULEPATH","/mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core")
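The comment/uncomment swap described above could be scripted roughly as follows. This is only a sketch: `switch_modulepath` is a hypothetical helper, and it assumes each `prepend_path` call sits on its own line and that the disabled line is marked with Lua's `--` comment prefix.

```python
def switch_modulepath(lua_text, enable_marker, disable_marker):
    """Uncomment lines containing enable_marker; comment out lines containing disable_marker."""
    out = []
    for line in lua_text.splitlines():
        stripped = line.lstrip()
        indent = line[: len(line) - len(stripped)]
        if enable_marker in line and stripped.startswith("--"):
            # Drop the leading "--" (and any space after it), keep the indentation.
            line = indent + stripped[2:].lstrip()
        elif disable_marker in line and not stripped.startswith("--"):
            line = indent + "--" + stripped
        out.append(line)
    return "\n".join(out) + "\n"


# The two lines quoted above; the initial commented/uncommented state is assumed.
SAMPLE = (
    '--prepend_path("MODULEPATH","/mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/'
    'spack-stack-1.5.0/envs/unified-env-rocky8/install/modulefiles/Core")\n'
    'prepend_path("MODULEPATH","/mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/'
    'spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core")\n'
)

result = switch_modulepath(
    SAMPLE,
    enable_marker="unified-env-rocky8/install",
    disable_marker="unified-env/install",
)
print(result)
```

The two markers do not overlap ("unified-env/" versus "unified-env-rocky8/"), so each line is matched by exactly one branch.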

In ./ush/machine/jet.yaml, set all the partitions to xjet:

PARTITION_DEFAULT: xjet
 ...
PARTITION_FCST: xjet
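A rough Python sketch of the same partition edit, for illustration only. The helper `set_partitions`, the sample YAML layout, and the original partition values are assumptions; only the two keys named above are rewritten by default, so any other `PARTITION_*` entries in the file are left alone.

```python
import re


def set_partitions(yaml_text, keys=("PARTITION_DEFAULT", "PARTITION_FCST"), partition="xjet"):
    """Rewrite the value of each listed key to `partition`, leaving all other lines untouched."""
    names = "|".join(re.escape(key) for key in keys)
    # [ \t]* instead of \s* so a match never spans more than one line.
    pattern = re.compile(rf"^([ \t]*(?:{names}):[ \t]*).*$", re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + partition, yaml_text)


# Hypothetical excerpt; the real jet.yaml may be laid out differently.
SAMPLE = """\
platform:
  PARTITION_DEFAULT: sjet,vjet,kjet,xjet
  QUEUE_DEFAULT: batch
  PARTITION_FCST: sjet,vjet,kjet,xjet
"""

updated = set_partitions(SAMPLE)
print(updated)
```

Editing the text with a regular expression rather than a YAML parser keeps comments and key order in the file unchanged.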

@MichaelLueken
Collaborator

The Hera Jenkins tests failed due to the system coming down yesterday for maintenance. These tests have been requeued.

There was also a failure on Jet. The get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h test failed in make_lbcs with an OOM error. Using rocotorewind/rocotoboot allowed this test to pass:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
community_20240304152101                                           COMPLETE              21.59
custom_ESGgrid_20240304152102                                      COMPLETE              18.35
custom_ESGgrid_Great_Lakes_snow_8km_20240304152104                 COMPLETE              13.40
custom_GFDLgrid_20240304152106                                     COMPLETE               9.45
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202403  COMPLETE              10.26
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              49.66
get_from_HPSS_ics_RAP_lbcs_RAP_20240304152110                      COMPLETE              15.30
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240304152111  COMPLETE             222.35
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              43.97
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               9.64
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             533.34
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              10.62
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             957.93
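For reference, the rocotorewind/rocotoboot retry mentioned above can be sketched as below. The script only builds and prints the two commands; it does not run them. The workflow file names, the cycle string, and the task name are placeholders assumed for illustration and should be replaced with the values from the actual experiment directory.

```python
import shlex


def rocoto_retry_commands(workflow_xml, database, cycle, task):
    """Build the rewind-then-boot command pair for one failed Rocoto task."""
    common = ["-w", workflow_xml, "-d", database, "-c", cycle, "-t", task]
    return [["rocotorewind"] + common, ["rocotoboot"] + common]


# All four arguments are assumptions, not values taken from the failed run.
commands = rocoto_retry_commands(
    workflow_xml="FV3LAM_wflow.xml",
    database="FV3LAM_wflow.db",
    cycle="202206011200",
    task="make_lbcs",
)

for command in commands:
    # Print a copy-pasteable shell line; pass `command` to subprocess.run to execute it.
    print(shlex.join(command))
```

Rewinding first clears the task's recorded state so that the subsequent boot resubmits it cleanly.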

Once the Hera tests complete, this PR can be merged.

@MichaelLueken
Collaborator

The Hera Intel tests were run on Rocky8 and all tests passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Peru_12km_20240308143348                            COMPLETE              18.07
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024030  COMPLETE               6.05
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             766.89
get_from_HPSS_ics_HRRR_lbcs_RAP_20240308143351                     COMPLETE              14.39
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               5.96
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              12.73
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240308143354  COMPLETE              10.19
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE               6.22
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202403  COMPLETE             235.54
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240308  COMPLETE             313.52
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202403081  COMPLETE             328.98
pregen_grid_orog_sfc_climo_20240308143359                          COMPLETE               7.09
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1725.63

@MichaelLueken
Collaborator

@RatkoVasic-NOAA -

Unfortunately, while running the WE2E tests with Rocky8 on Hera GNU, the issue that you noted during the UFS apps and components coordination meeting showed up - all tests are failing due to using srun and not being able to find libpmi.so.0 and libpmi2.so.0.

We will need to hope that the tests are able to run over the weekend on CentOS and no longer sit in the queue.
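One quick way to see whether the two libraries are visible on a given node is to ask the dynamic loader directly. This is a rough diagnostic sketch, not the check used for this PR; `can_load` is a hypothetical helper, and the result only reflects the environment (for example `LD_LIBRARY_PATH`) of the shell it is run from, which may differ from what srun sees.

```python
import ctypes


def can_load(lib_name):
    """Return True if the dynamic loader can find and open lib_name."""
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        return False
    return True


for name in ("libpmi.so.0", "libpmi2.so.0"):
    state = "loadable" if can_load(name) else "NOT found by the dynamic loader"
    print(f"{name}: {state}")
```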

@MichaelLueken
Collaborator

@RatkoVasic-NOAA -

Given that the Hera GNU tests have been sitting in the queue for days and cannot be run on Rocky8, the successful runs on Hera Intel and the rest of the platforms will be enough to get this work merged.

Since Rocky8 will be the default on the nodes following today's update, I will go ahead and set the spack-stack path to point at the rocky8 location and change the ush/machine/jet.yaml file to use xjet for the forecast tasks. Once Jet is back, Kris Booker and I will check that the Jet runner is using one of the Rocky8 front ends, then I will run the Jet tests one last time. Once they complete, this PR will be merged.

….lua and set PARTITION_FCST=xjet in ush/machine/jet.yaml
@MichaelLueken
Collaborator

The rerun of the Jenkins tests on Jet had one failure, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2. The run_fcst task was failing with:

FATAL from PE 1: compute_qs: saturation vapor pressure table overflow, nbad= 1

None of the changes made in this PR would cause this issue. Using rocotorewind/rocotoboot allowed the failed task to pass:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240312211355                                           COMPLETE              19.64
custom_ESGgrid_20240312211357                                      COMPLETE              18.79
custom_ESGgrid_Great_Lakes_snow_8km_20240312211358                 COMPLETE              14.27
custom_GFDLgrid_20240312211400                                     COMPLETE              10.02
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202403  COMPLETE              11.20
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              57.17
get_from_HPSS_ics_RAP_lbcs_RAP_20240312211404                      COMPLETE              17.22
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240312211405  COMPLETE             223.35
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              40.85
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.38
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             496.47
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              10.68
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             927.04

Moving forward with merging this PR now.

Labels

enhancement: New feature or request
run_we2e_coverage_tests: Run the coverage set of SRW end-to-end tests

Development

Successfully merging this pull request may close these issues.

Test SRW app with new OS on Jet (Rocky 8)

4 participants