
Installing Cardinal on Sawtooth #819

delcmo opened this issue Dec 4, 2023 · 16 comments

delcmo commented Dec 4, 2023

Bug Description

I am trying to compile Cardinal on Sawtooth and get the following error message with make -j8:

Cardinal is using HDF5 from    /home/delcmarc/cardinal/contrib/moose/petsc/arch-moose
Cardinal is using MOOSE from   /home/delcmarc/cardinal/contrib/moose
Cardinal is using NekRS from   /home/delcmarc/cardinal/contrib/nekRS
Cardinal is using OpenMC from  /home/delcmarc/cardinal/contrib/openmc
Cardinal is compiled with the following MOOSE modules
  FLUID_PROPERTIES
  HEAT_TRANSFER
  NAVIER_STOKES
  REACTOR
  SOLID_PROPERTIES
  STOCHASTIC_TOOLS
  TENSOR_MECHANICS
  THERMAL_HYDRAULICS
Linking libpng: -lpng16 -lz 
Linking Library /home/delcmarc/cardinal/contrib/moose/framework/libmoose-opt.la...
Linking Library /home/delcmarc/cardinal/contrib/moose/modules/solid_properties/lib/libsolid_properties-opt.la...
libtool: warning: '/home/delcmarc/cardinal/contrib/moose/scripts/../libmesh/installed/lib/libmesh_opt.la' seems to be moved
libtool: warning: '/home/delcmarc/cardinal/contrib/moose/scripts/../libmesh/installed/lib/libnetcdf.la' seems to be moved
libtool: warning: '/home/delcmarc/cardinal/contrib/moose/scripts/../libmesh/installed/lib/libtimpi_opt.la' seems to be moved
/usr/bin/grep: /apps/local/mvapich2/2.3.3-gcc-9.2.0/lib/libmpicxx.la: No such file or directory
/usr/bin/sed: can't read /apps/local/mvapich2/2.3.3-gcc-9.2.0/lib/libmpicxx.la: No such file or directory
libtool:   error: '/apps/local/mvapich2/2.3.3-gcc-9.2.0/lib/libmpicxx.la' is not a valid libtool archive
make: *** [/home/delcmarc/cardinal/contrib/moose/framework/moose.mk:397: /home/delcmarc/cardinal/contrib/moose/framework/libmoose-opt.la] Error 1
make: *** Waiting for unfinished jobs....

I did follow the installation instructions on the Cardinal webpage and loaded all modules as instructed. PETSc and libMesh compiled fine as far as I can tell. I also checked that the paths of the files mentioned in the libtool warning messages are all valid.

Steps to Reproduce

On Sawtooth, following the installation instructions at https://cardinal.cels.anl.gov/hpc.html.

Impact

I need Cardinal installed on Sawtooth for a project with the NEAMS Workbench.

aprilnovak (Collaborator) commented

Hi @delcmo - I have very recently built Cardinal on Sawtooth without issue, so I am confident we can get to the bottom of this :)

What modules are you using for MPI? The recommended modules on the Cardinal website you link are for OpenMPI, but it looks like the error you are seeing is from mvapich.
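
For reference, one quick way to check which MPI stack is actually active (a generic shell sketch, not Cardinal-specific; note that mpicc -show is the MPICH/MVAPICH spelling, while OpenMPI uses mpicc --showme):

module list 2>&1 | grep -i mpi               # any MPI modules currently loaded (module list prints to stderr)
which mpicc                                  # which compiler wrapper is first on PATH
mpicc -show 2>/dev/null || mpicc --showme    # print the underlying compiler and link line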

delcmo (Author) commented Dec 4, 2023

You are correct; I made sure to use the same modules as listed there. The modules I load are:

module purge
module load use.moose
module load moose-tools
module load openmpi/4.1.5_ucx1.14.1
module load cmake/3.27.7-oneapi-2023.2.1-4uzb
module load git-lfs
export CC=mpicc
export CXX=mpicxx
export FC=mpif90
export ENABLE_NEK=true
export ENABLE_OPENMC=true
export ENABLE_DAGMC=false
export CARDINAL_DIR=$HOME/cardinal
export OPENMC_CROSS_SECTIONS=$HOME/cross_sections/endfb-vii.1-hdf5/cross_sections.xml
export NEKRS_HOME=$CARDINAL_DIR/install
export MOOSE_DIR=$CARDINAL_DIR/contrib/moose
export LIBMESH_DIR=$CARDINAL_DIR/contrib/moose/libmesh/installed
export PYTHONPATH=$CARDINAL_DIR/contrib/moose/python:$PYTHONPATH

I used to load mvapich2/2.3.3-gcc-9.2.0-xpjm to build Cardinal last year. I cleared my build and install directories, but there seem to be left-over libraries from my previous builds.

Marco

aprilnovak (Collaborator) commented

Yes, that's certainly possible - I'd try also cleaning out the MOOSE submodule just to be sure we get everything:

cd cardinal
rm -rf build/ install/
cd contrib/moose
git clean -xfd
cd ../../
make

delcmo (Author) commented Dec 5, 2023 via email

aprilnovak (Collaborator) commented

Would you please attach the whole console output?

./run_tests > out.txt

delcmo (Author) commented Dec 5, 2023 via email

aprilnovak (Collaborator) commented

@delcmo I think the attachment did not go through properly - can you please attach it on GitHub instead of via an email reply? Or you can email it to me directly.

delcmo (Author) commented Dec 5, 2023

Here it is.

unit_tests.txt

aprilnovak (Collaborator) commented

Thanks - it looks like some tests are failing for MPI-related reasons (not normal - something is definitely wrong). Here's one case which fails; it looks like they all fail in the same way.

    File     : /home/delcmarc/cardinal/contrib/nekRS/3rd_party/occa/src/occa/internal/utils/sys.cpp
    Line     : 937
    Function : dlopen
    Message  : Error loading binary [d810f609fc22f78e/binary] with dlopen: libmpi.so.12: cannot open shared object file: No such file or directory

Perhaps @loganharbour has an idea?

delcmo (Author) commented Dec 7, 2023

Any idea on why I get these odd error messages when running the unit tests?

loganharbour (Member) commented

Is there some old state in your install from the previous build? Or did you forget to load the relevant modules?

That error comes from MPI no longer being in LD_LIBRARY_PATH - i.e., not "loaded".
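
For reference, a couple of generic shell checks (not Cardinal-specific) to confirm whether the loader can see an MPI library:

# List any MPI-related entries currently on LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i mpi

# Ask the loader cache whether it can resolve the library named in the error
ldconfig -p | grep libmpi.so.12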

aprilnovak (Collaborator) commented Dec 8, 2023

Thanks @loganharbour. In that case, @delcmo I'd suggest wiping out Cardinal (rm -rf cardinal) and rebuilding from scratch to make sure we don't have any old state.
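
A sketch of that from-scratch sequence, assuming the standard Cardinal/MOOSE helper scripts (the Sawtooth instructions on the Cardinal site remain the authoritative reference, and the module loads and exports from earlier in this thread need to be in place first):

cd $HOME
rm -rf cardinal
git clone https://github.com/neams-th-coe/cardinal.git
cd cardinal
./scripts/get-dependencies.sh                          # fetch the MOOSE/NekRS/OpenMC submodules
./contrib/moose/scripts/update_and_rebuild_petsc.sh    # rebuild PETSc
./contrib/moose/scripts/update_and_rebuild_libmesh.sh  # rebuild libMesh
make -j8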

delcmo (Author) commented Dec 8, 2023

@aprilnovak I followed your suggestions and was able to recompile Cardinal and run the unit tests. 5 of them failed:

test:nek_standalone/channel.test ...................................... [min_cpus=2,FINISHED] FAILED (TIMEOUT)
test:nek_stochastic/quiet_init.driver_multi_2 ......................... [min_cpus=2,FINISHED] FAILED (TIMEOUT)
utils/meshes/interassembly.specs ............................................................. FAILED (CODE 1)
utils/meshes/interassembly_w_structures.specs ................................................ FAILED (CODE 1)
utils/meshes/assembly.specs .................................................................. FAILED (CODE 1)
--------------------------------------------------------------------------------------------------------------
Ran 528 tests in 3254.8 seconds. Average test time 29.3 seconds, maximum test time 401.5 seconds.
523 passed, 89 skipped, 0 pending, 5 FAILED

which seems like more reasonable behavior.

aprilnovak (Collaborator) commented

That looks better! Those are normal - we have a few tests (on the order of 5) which take a long time to run. Depending on the parallel settings you used to launch the test suite, those may time out. NekRS has a very slow JIT process the first time you run a test case.

If you re-run the test suite, you should (hopefully) see everything pass because NekRS will be able to use the JIT cache produced on the first test run, saving lots of time on each individual test.
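
If it helps, the MOOSE TestHarness can re-run just the previously failing tests (a sketch; check ./run_tests --help for the options available in your checkout):

./run_tests --failed-tests    # re-run only the tests that failed on the last run
./run_tests -j 8 --re nek     # or re-run the NekRS-related tests matched by name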

delcmo (Author) commented Dec 8, 2023

I re-ran it. Only 4 tests failed, one of them with a TIMEOUT.

For the other three, I get a CODE 1 error because numpy is not found.

utils/meshes/assembly.specs: Working Directory: /home/delcmarc/cardinal/utils/meshes/assembly
utils/meshes/assembly.specs: Running command: python mesh.py
utils/meshes/assembly.specs: Traceback (most recent call last):
utils/meshes/assembly.specs:   File "mesh.py", line 5, in <module>
utils/meshes/assembly.specs:     import numpy as np
utils/meshes/assembly.specs: ModuleNotFoundError: No module named 'numpy'
utils/meshes/assembly.specs:
utils/meshes/assembly.specs: ################################################################################
utils/meshes/assembly.specs: Tester failed, reason: CODE 1

I updated the PYTHONPATH with export PYTHONPATH=$CARDINAL_DIR/contrib/moose/python:$PYTHONPATH but it does not seem to help.

I was able to run the tests that are in the documentation https://cardinal.cels.anl.gov/hpc.html.

aprilnovak (Collaborator) commented Dec 8, 2023

I would just try the following: pip install numpy, and then re-run. Those tests are running a Python script.
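
Something like the following, assuming the python on your PATH is the same interpreter the test harness invokes (the --re filter below is just one way to target the failing mesh tests):

python -m pip install --user numpy
python -c "import numpy; print(numpy.__version__)"   # verify the same interpreter can import it
cd $CARDINAL_DIR
./run_tests --re 'utils/meshes'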
