[develop]: Update for Gaea software stack location and Lmod initialization script #627

Merged
MichaelLueken merged 4 commits into ufs-community:develop from natalie-perlin:develop-gaea-update on Mar 10, 2023

Conversation

@natalie-perlin
Collaborator

@natalie-perlin natalie-perlin commented Feb 23, 2023

DESCRIPTION OF CHANGES:

The Gaea hpc-stack location has been changed to match the one used by the UFS-WM, with which all regression tests passed successfully.
Lmod initialization has been updated as well, similar to the approach that allowed the UFS-WM to pass its RTs. It is now done by sourcing a single initialization script.

Files changed:
./modulefiles/build_gaea_intel.lua
./etc/lmod-setup.sh

UPDATE: ./etc/lmod-setup.csh is not changed

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

UFS-WM has passed all the regression tests using this updated hpc-stack location.
The SRW code has successfully compiled.

  • gaea.intel

DEPENDENCIES:

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests

Comment thread etc/lmod-setup.csh Outdated

setenv LMOD_SYSTEM_DEFAULT_MODULES "modules/3.2.11.4"
module --initial_load --no_redirect restore
source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh
Collaborator

This will not work since you would need a "csh" script.

Collaborator Author

Thank you, @danielabdi-noaa , I will prepare a separate Lmod_init.csh script!

Collaborator

Thank you, @natalie-perlin, for preparing a separate Lmod_init.csh script! Just to reiterate what @danielabdi-noaa noted, while attempting to source the etc/lmod-setup.csh file, I encountered the following:

gaea15 Michael.Lueken/ufs-srweather-app> source etc/lmod-setup.csh gaea
Illegal variable name.

Comment thread etc/lmod-setup.sh
elif [ "$L_MACHINE" = gaea ]; then
export BASH_ENV="/lustre/f2/dev/role.epic/contrib/apps/lmod/lmod/init/bash"
source $BASH_ENV
source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh
Collaborator

I think it is best to call the "bash" script directly instead of what looks like a custom "Lmod_init.sh" script.
Could you please print the contents of the script so that we can see what it does differently?

Collaborator Author

The initialization script for Lmod 8.7.12 stores the list of modules that are loaded by default into the user environment, purges the modules, sources /lustre/f2/dev/role.epic/contrib/apps/lmod/lmod/init/profile, and then loads the default modules in a preferred order because of some interdependencies. After Gaea's default module management package, modules/3.2.11.4, is loaded, $MODULESHOME is changed to point to a path set by modules/3.2.11.4. At the end of the updated Lmod initialization script, $MODULESHOME is reset to correspond to the Lmod 8.7.12 module management package.
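
The preferred load order described above can be illustrated in isolation. The following is a minimal sketch, not the contents of Lmod_init.sh: the function name `order_modules` and the sample module list (including its version strings) are hypothetical, and the function only prints the order in which modules would be loaded, without calling `module`.

```shell
#!/bin/bash
# Hypothetical helper (not part of Lmod_init.sh): print the modules from a
# colon-separated list in the order craype/*, then DefApps, then the rest.
order_modules() {
    local list m
    # Split the colon-separated list into an array.
    IFS=':' read -r -a list <<< "${1:-}"
    # First pass: craype modules.
    for m in "${list[@]}"; do
        [[ $m == craype/* ]] && echo "$m"
    done
    # Second pass: DefApps.
    for m in "${list[@]}"; do
        [[ $m == DefApps ]] && echo "$m"
    done
    # Third pass: everything else, in its original order.
    for m in "${list[@]}"; do
        [[ $m == craype/* || $m == DefApps ]] || echo "$m"
    done
    return 0
}

# Example with a made-up module list:
order_modules "modules/3.2.11.4:DefApps:craype/2.6.3:intel/19"
```

Because the remaining modules keep their original relative order, the sketch only moves craype and DefApps to the front; whether that is sufficient for every login-node or compute-node module list is an assumption.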

Collaborator Author

@natalie-perlin natalie-perlin Feb 23, 2023

The issue is that modules/3.2.11.4 is always a default module, but the rest of the default module list differs depending on whether you are on a login node or on a compute node during the model run.

Collaborator Author

The Lmod_init.sh has the following:

#!/bin/bash
  
loaded_modules=$(echo ${LOADEDMODULES:-} | tr ":" "\n")
module purge 2>/dev/null

echo "Initializing lua module environment Lmod 8.7.12, loading modules (wait...)"
export LMOD_SYSTEM_DEFAULT_MODULES=modules/3.2.11.4
export BASH_ENV=/lustre/f2/dev/role.epic/contrib/apps/lmod/lmod/init/profile
source $BASH_ENV
export PATH=$MODULESHOME/libexec:$MODULESHOME/init/ksh_funcs:$PATH
module --initial_load --no_redirect restore
#
if [[ -d /opt/cray/ari/modulefiles ]] ; then
    module use -a /opt/cray/ari/modulefiles
fi
if [[ -d /opt/cray/pe/ari/modulefiles ]] ; then
    module use -a /opt/cray/pe/ari/modulefiles
fi
if [[ -d /opt/cray/pe/craype/default/modulefiles ]] ; then
    module use -a /opt/cray/pe/craype/default/modulefiles
fi
# Load craype module first, then DefApps, then all others
for module in $loaded_modules
do
        [[ $module == craype/* ]] &&  module try-load $module
done
for module in $loaded_modules
do
        [[ $module == DefApps ]] &&  module try-load $module
done
for module in $loaded_modules
do
        [[ $module == craype/* || $module == DefApps ]] ||  module is-loaded $module || module try-load  $module
done

# Set environment variables
export MODULESHOME=$LMOD_ROOT/lmod
#
# Report when done loading
#echo "... done loading "        

Collaborator

@natalie-perlin Thanks for the details. I was not aware that the logic for replacing cray modules had grown; it makes sense to put it in its own script.

Comment thread etc/lmod-setup.sh
@natalie-perlin
Collaborator Author

@MichaelLueken @danielabdi-noaa -
The script for csh initialization has been corrected and can be invoked as follows:
source /lustre/f2/dev/role.epic/contrib/Lmod_init.csh

Note that there are some differences from the *.sh version, because the modules are not purged in Lmod_init.csh. They simply do not appear in the user environment when the Lmod init script is sourced. I avoided making a module purge call because C shell has no straightforward way to redirect stderr alone to /dev/null, as Bash does.
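
For context on the redirection point: in Bash, stderr can be discarded on its own with `2>/dev/null`, whereas csh's `>&` operator redirects stdout and stderr together. A minimal Bash sketch of the former, using a hypothetical `noisy` function as a stand-in for a command such as `module purge`:

```shell
#!/bin/bash
# Hypothetical stand-in for a command that writes to both streams.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Discard only stderr; stdout is still captured.
kept="$(noisy 2>/dev/null)"
echo "$kept"
```

In csh, the closest built-in form, `noisy >& /dev/null`, would silence both streams.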

The Lmod_init.csh has the following:

#!/bin/csh 
  
set loaded_modules = `echo $LOADEDMODULES | tr : '\n' `
#echo "loaded_modules = ${loaded_modules[*]} "

echo "Initializing lua module environment Lmod 8.7.12, loading modules (wait...)"
setenv LMOD_SYSTEM_DEFAULT_MODULES "modules/3.2.11.4"
setenv BASH_ENV "/lustre/f2/dev/role.epic/contrib/apps/lmod/lmod/init/csh"
source ${BASH_ENV}
setenv PATH "$MODULESHOME/libexec:$MODULESHOME/init/ksh_funcs:$PATH"
module --initial_load --no_redirect restore
#
if ( -d /opt/cray/ari/modulefiles ) then
    module use -a /opt/cray/ari/modulefiles
endif
if ( -d /opt/cray/pe/ari/modulefiles ) then
    module use -a /opt/cray/pe/ari/modulefiles
endif
if ( -d /opt/cray/pe/craype/default/modulefiles ) then
    module use -a /opt/cray/pe/craype/default/modulefiles
endif
# Load craype module first, then DefApps, then all others
foreach module ( $loaded_modules )
   set name = `echo $module | cut -d"/" -f1`
   if ( $name == "craype" ) then
      module try-load $module
   endif
end
foreach module ( $loaded_modules )
   set name = `echo $module | cut -d"/" -f1`
   if ( $name == "DefApps" ) then
      module try-load $module
   endif
end
foreach module ( $loaded_modules )
   set name = `echo $module | cut -d"/" -f1`
   if ! ( $name == "craype" || $name == "DefApps" ) then
       module is-loaded $module || module try-load  $module
   endif
end

# Set environment variables
setenv MODULESHOME $LMOD_ROOT/lmod
#
# Report when done loading
#echo "... done loading "

Collaborator

@danielabdi-noaa danielabdi-noaa left a comment

@natalie-perlin I was able to load srw modules from my default tcsh login now, so approving.

Collaborator

@MichaelLueken MichaelLueken left a comment

@natalie-perlin Thank you for updating the etc/lmod-setup.csh script. I was ultimately able to build the SRW and run the fundamental tests on Gaea. Having said that, I did note some weird behavior after using source etc/lmod-setup.csh gaea. It would be nice if the warning messages noted in my review could be addressed, but I'm not even sure why they are showing up with your changes (a test using the current develop shows no messages like I see with your fork's branch).

Comment thread etc/lmod-setup.csh Outdated
Collaborator

@MichaelLueken MichaelLueken left a comment

@natalie-perlin The SRW App builds without issue on Gaea. Loading the Lmod environment using csh works without issue, as do subsequent builds and WE2E test generation. Approving these changes now!

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests (Run the coverage set of SRW end-to-end tests) label Mar 3, 2023
@MichaelLueken
Collaborator

@natalie-perlin The failure of the Jenkins tests on Orion is due to the issue noted in issue #635. The Jenkins tests successfully built and ran on Gaea without issue.

@MichaelLueken
Collaborator

@natalie-perlin I was able to check out externals on Orion manually (the issue with the default Git version used on Orion). Your branch was built without issue and the WE2E fundamental tests ran through to completion without issue. It sounds like the ufs-weather-model is hoping to merge your changes in tomorrow. Once merged, I will move forward with this PR. Thanks!

@MichaelLueken
Collaborator

Since I will be away due to jury duty tomorrow, I will be unable to merge this work. Once PR #1645 has been merged, please go ahead and merge this PR into develop. The tests have successfully passed for Gaea and the Orion tests were manually run and passed as well. If you are fine with waiting until Friday (March 10) morning, I can merge this work at that time. If this work is merged before Friday, please make sure to replace the four commit messages with the following:

Gaea hpc-stack location has changed to match that for the UFS-WM that results in all regression test passed successfully.
Lmod initialization has been updated as well, similar to the one that worked for the UFS-WM to pass RTs. It is done by sourcing a single initialization script.

Thanks!

@natalie-perlin
Collaborator Author

@MichaelLueken
PR #1645 (ufs-community/ufs-weather-model#1645) has been merged, so this PR can now be merged and closed.

@MichaelLueken MichaelLueken merged commit 550f400 into ufs-community:develop Mar 10, 2023
@natalie-perlin natalie-perlin deleted the develop-gaea-update branch October 13, 2023 03:59
Labels

run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests

3 participants