{numlib,chem,tollchain}[NVHPC/23.7-CUDA-12.1.1] nvompi-2023a + QuantumESPRESSO-7.3.1 (GPU enabled)#20364
Crivella wants to merge 12 commits into easybuilders:develop
|
Comparison of code efficiency when linked to EB numlibs (no prefix) VS linked to NVHPC math_libs ( |
|
Thanks for putting all of this together! Our site is interested in a GPU-enabled QuantumESPRESSO build, so we've been testing this. Were you able to get around the "other error"s that occur in LAPACK testing when building OpenBLAS? Using the OpenBLAS_0.3.24-NVHPC-23.7-CUDA-12.1.1.eb easyconfig as provided gives us 55 other errors. I saw that you had done some work on OpenBLAS issue #4652 to get some of the numerical failures down, but was wondering if you were ever able to get rid of the other errors that stop EasyBuild from finishing. |
|
@cgross95 What hardware are you trying this on? I think I was still getting some other errors as well with 0.3.27, but I didn't investigate much further, as I was aiming at 0.3.24 for this release (in that case I was getting 14 errors related to the ZHSQR and ZGEEV routines failing to find all eigenvalues). The logs should give you further details on which LAPACK routine failed and with what error code (each routine should have the meaning of its error codes documented as comments in the source/documentation). |
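As a rough aid for reading those logs, here is a minimal sketch that tallies the "numerical error" and "other error" columns from a LAPACK testing summary. The row layout it expects (precision name, tests run, then two error counts each followed by a percentage) is an assumption on my side and not copied from this thread, and the sample text is made up for illustration, so adapt the pattern to what your log actually contains.

```python
import re

# Assumed row layout of the LAPACK testing summary (not taken from this thread):
#   <precision>  <tests run>  <numerical errors> (<pct>%)  <other errors> (<pct>%)
ROW = re.compile(
    r"^\s*(?:-->\s*)?(?P<name>[A-Z][A-Z0-9 ]*?)\s+"
    r"(?P<run>\d+)\s+"
    r"(?P<num>\d+)\s+\([\d.]+%\)\s+"
    r"(?P<other>\d+)\s+\([\d.]+%\)\s*$"
)


def parse_summary(text):
    """Return {precision: (tests_run, numerical_errors, other_errors)}."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[m["name"].strip()] = (
                int(m["run"]),
                int(m["num"]),
                int(m["other"]),
            )
    return rows


# Made-up sample for illustration only; not real output from this PR.
SAMPLE = """\
SUMMARY             nb test run   numerical error     other error
================    ===========   =================   ================
REAL                1000          0 (0.000%)          0 (0.000%)
COMPLEX16           2000          3 (0.150%)          2 (0.100%)
--> ALL PRECISIONS  3000          3 (0.100%)          2 (0.067%)
"""

if __name__ == "__main__":
    for name, (run, num, other) in parse_summary(SAMPLE).items():
        print(f"{name}: {run} run, {num} numerical, {other} other")
```

Header and separator lines are skipped because they do not match the pattern, so the same function can be pointed at a whole log file read into a string.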
|
I'm compiling on a V100S with an Intel Xeon Skylake on Ubuntu 22.04. We also have some A100 cards, but we're in the midst of transferring everything in our cluster to Ubuntu, so they're not easily accessible at the moment. I'll dig into the LAPACK testing logs and see if I can produce some more useful debugging information. |
|
I finally got access to our A100 cards, and can report that there were no "other error"s in the LAPACK tests. I ended up with 152 numerical errors, so increased the |
|
Hi, I'm compiling on a V100S with an Intel Xeon Skylake on Ubuntu 22.04. What other changes do you think I should make to be able to use QuantumESPRESSO (GPU enabled)? When I use this PR with eb --from-pr 20364 -r, I get a checksum error in libxc, which I fixed; afterwards I get an error in OpenBLAS/0.3.24-NVHPC-23.7-CUDA-12.1.1.... The error I get is
|
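For the libxc checksum error mentioned above, one way to see whether the downloaded source or the easyconfig is at fault is to compute the SHA256 digest of the tarball yourself and compare it with the value in the easyconfig's `checksums` list. The sketch below uses only Python's standard `hashlib`; the helper names and any file path you pass in are hypothetical and not part of this PR.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large tarballs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with a value copied from an easyconfig's `checksums`."""
    return sha256_of(path) == expected_hex.strip().lower()
```

If the digests differ, it is worth checking whether upstream re-released the tarball before editing the easyconfig.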
|
@beeebiii The other error you are reporting seems related to OpenBLAS. In my tests on an A100 with an AMD zen2 CPU I did not encounter failures in the compilation (only some failures in the test suite). One weird thing is I am not sure |
|
Yeah, you are right, I think gcc is being used instead of nvcc. |
|
If you look at the OpenBLAS easyconfig, only NVHPC should be used, and EasyBuild should not be aware of other compiler toolchains in that instance. |
|
That is basically EasyBuild reporting the easyconfig file being used; the debug logging adds much more.
I would venture to guess the problem is in either of them. |
|
To summarize the problem @beeebiii was having, the |
This reverts commit 706e9d1.
Updated software
|
5869132 to 32ab56b
Added easyconfig files for nvofbf toolchain + QE 7.3.1

local compilers:
- GCC/12.3.0
- CUDA/12.1.1

Added toolchain/numlib:
- nvofbf-2023a
- nvompi-2023a
- NVHPC-23.7-CUDA-12.1.1
- OpenMPI-4.1.5
- FlexiBLAS-3.3.1
- OpenBLAS-0.3.24
- FFTW-3.3.10
- FFTW.MPI-3.3.10
- ScaLAPACK-2.2.0-fb

Added easyconfigs:
- HDF5-1.14.0-nvompi-2023a-CUDA-12.1.1.eb
- libxc-6.2.2-NVHPC-23.7-CUDA-12.1.1.eb
- QuantumESPRESSO-7.3.1-nvompi-2023a-CUDA-12.1.1.eb

NOTES:
- [...] cuda compilers, which require a specified compute capability (CC), while QE uses hpc-sdk compilers, which, if not specified, compile for all supported CCs

Solved issues:
- [...] v0.3.24 [...] v0.3.27 [...]

Open issue:
- [...] ZHEEV BLAS routine [...] cuda-gdb [...] nvompi [...] linking directly to OpenBLAS and the error was not present
- [...] RMM-DIS diagonalization with k points other than GAMMA, most likely a QE bug (https://gitlab.com/QEF/q-e/-/issues/675)
- [...] CMAKE and only experimental with autotools, and also not a really widely used feature of QE, it is ok to not have the libxc routines run on GPU