MKL_USE_STATIC_LIBS broken #18255
Thanks, @leezu, our team will look into the issue and get back to you soon.
Hi @leezu, I would expect the flag
MKL is used by default if it is available.
I cannot reproduce the issue. I pulled the latest master branch and built it with the command line below (I don't have LAPACK or Ninja on my system):
Try to reproduce:
OpenBLAS is linked. MKL is installed in /opt/intel/mkl, so I assume CMake can find it.
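A quick way to confirm which BLAS/LAPACK (and OpenMP) libraries a build actually links is to inspect the `ldd` output of libmxnet.so. The following is a minimal sketch; `parse_ldd`, `blas_related` and `linked_blas` are hypothetical helper names (not part of MXNet), and the parser assumes the usual glibc `ldd` line format `name => path (address)`.

```python
import re
import subprocess

# Matches "libfoo.so.1 => /path/to/libfoo.so.1 (0x...)" and "libfoo.so => not found".
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)")

# Assumed substrings identifying BLAS/LAPACK and OpenMP runtime libraries.
MARKERS = ("mkl", "blas", "lapack", "gomp", "iomp")


def parse_ldd(output):
    """Map each dependency name to its resolved path (None if 'not found')."""
    deps = {}
    for line in output.splitlines():
        m = _LDD_LINE.match(line)
        if not m:
            continue  # e.g. linux-vdso or the dynamic loader line
        name, target = m.groups()
        deps[name] = None if target == "not" else target
    return deps


def blas_related(deps):
    """Keep only dependencies whose name looks BLAS/LAPACK/OpenMP related."""
    return {n: p for n, p in deps.items() if any(k in n.lower() for k in MARKERS)}


def linked_blas(lib_path):
    """Run ldd on lib_path (e.g. build/libmxnet.so) and return BLAS-related deps."""
    out = subprocess.run(["ldd", lib_path], capture_output=True, text=True, check=True).stdout
    return blas_related(parse_ldd(out))
```

An entry mapped to None means the dynamic loader could not resolve that library at all.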
@TaoLv, you can reproduce it with the version of MKL installed in the CI environment: https://github.com/apache/incubator-mxnet/blob/68cb9555c4b4779aaae90e593b745270cbb59033/ci/docker/Dockerfile.build.ubuntu#L36-L61
I don't know why MKL is not detected on your system. What happens if you set
See #17794, where the error happens on CI.
@leezu, with adding
In my environment, the problem can be mitigated by preloading the libraries. I'm trying to see whether we can fix it at the build or link stage.
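Preloading can be scripted when re-running the reproducer from this issue. The sketch below assumes MKL is installed under /opt/intel/mkl/lib/intel64 (the path used in this issue) and that preloading libmkl_rt.so is the workaround being tested; `preload_env` and `run_repro` are hypothetical helpers. glibc accepts colon-separated entries in LD_PRELOAD.

```python
import os
import subprocess
import sys

# The reproducer from the "To Reproduce" section of this issue.
REPRO = "import mxnet as mx; print(mx.nd.square(mx.nd.random.uniform(shape=(1024,))))"


def preload_env(libs, base_env=None):
    """Return a copy of base_env (default: os.environ) with libs prepended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    entries = list(libs)
    if env.get("LD_PRELOAD"):
        entries.append(env["LD_PRELOAD"])  # keep any existing preloads after ours
    if entries:
        env["LD_PRELOAD"] = ":".join(entries)
    return env


def run_repro(libs, base_env=None):
    """Run the reproducer in a child process with `libs` preloaded."""
    return subprocess.run(
        [sys.executable, "-c", REPRO],
        env=preload_env(libs, base_env),
        capture_output=True,
        text=True,
    )
```

For example, `run_repro(["/opt/intel/mkl/lib/intel64/libmkl_rt.so"]).returncode` can be compared with `run_repro([]).returncode`; on POSIX, a negative return code means the child process was killed by a signal (such as SIGABRT) rather than exiting normally.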
Hi @leezu and @TaoLv, I wasn't sure whether this issue had been solved, so I tried to reproduce it on the latest master (and also on some builds from March 2020), but couldn't. I compiled MXNet with the following command line:
Should I set anything else to trigger this MKL error?
Thanks, @akarbown, for following up on this issue. I first noticed the failure reported here because of the CI failure of #17794. Previously, it would abort immediately. I rebased the PR and included an update to the latest MKL version. Now the CI no longer aborts immediately, but there are still 20 failed (aborted) tests. Perhaps there has been a change in MKL or MXNet that works around the immediate issue?
I haven't looked into these failures in more detail.
I did some research and got a repro of the issue. I determined that static linking of the MKL libraries (
When I run the tests with the following command line:
I've checked that LD_PRELOAD-ing libmkl_rt.so is enough to fix the issue. However, when I link libmkl_rt.so (or compile MXNet with MKL_USE_SINGLE_DYNAMIC_LIBRARY=1), I get the same problem as described in #17641. It's caused by multiple OpenMP libraries being linked into MXNet, so it seems to be a catch-22 situation. Assuming that we compile MXNet with MKL_USE_SINGLE_DYNAMIC_LIBRARY=1, we get the problem of multiple linked OpenMP libraries, which could probably be worked around with the following solutions:

index 07075d752..1555f3f40 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -411,6 +411,7 @@ if(USE_OPENMP)
AND SYSTEM_ARCHITECTURE STREQUAL "x86_64"
AND NOT CMAKE_BUILD_TYPE STREQUAL "Distribution"
AND NOT BLAS STREQUAL "MKL"
+ AND NOT BLAS STREQUAL "mkl"
AND NOT MSVC
AND NOT CMAKE_CROSSCOMPILING)
load_omp()

All of the ~20 tests that were failing passed without any issues (except for one test case: test_optimizer.py::test_lamb). To be more precise, it's compiled with the following command line:
Conclusions:
Now I want to concentrate on root-causing the hang issue.
I have compiled the MXNet library (libmxnet.so) so that it uses MKL via the Single Dynamic Library (libmxnet.so links with libmkl_rt.so) and uses OpenMP separately (libmxnet.so also links with libgomp.so.1). Since MXNet depends on GNU OpenMP, I'm forcing MKL to use the GNU threading layer (MKL_THREADING_LAYER=GNU), so that there is only a single (GNU) OpenMP runtime in the process.
However, I end up with two different OpenMP implementations in a single process, and that causes runtime issues: I'm observing hangs during unit tests. When I create a symlink (libgomp.so -> libgomp.so.1), I can confirm that MKL opens GNU OpenMP and my tests pass.
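To check which OpenMP runtimes actually end up in a process (for example, after importing mxnet with MKL_THREADING_LAYER=GNU set and running an MKL-backed operation), one can scan /proc/self/maps. This is a Linux-only sketch with hypothetical helper names; the set of runtime names it looks for (libgomp for GNU, libiomp5 for Intel, libomp for LLVM) is an assumption covering the common cases. More than one key in the result indicates the mixed-runtime situation described above.

```python
import re

# Assumed OpenMP runtime library names: GNU, Intel and LLVM.
_OMP_RUNTIME = re.compile(r"/(libgomp|libiomp5|libomp)\.so")


def openmp_runtimes(maps_text):
    """Return {runtime_name: sorted list of paths} found in /proc/<pid>/maps text."""
    found = {}
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # anonymous mapping without a pathname
        path = parts[5].strip()
        m = _OMP_RUNTIME.search(path)
        if m:
            # A library is usually mapped several times (r-x, r--, rw-); the set dedups it.
            found.setdefault(m.group(1), set()).add(path)
    return {name: sorted(paths) for name, paths in found.items()}


def current_openmp_runtimes():
    """Inspect the OpenMP runtimes mapped into the current process."""
    with open("/proc/self/maps") as f:
        return openmp_runtimes(f.read())
```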
Description
If we compile MXNet via
cmake -GNinja -DUSE_MKLDNN=0 -DMKL_USE_STATIC_LIBS=0 -DUSE_CUDA=0 ..; ninja
running an operation that requires MKL will cause termination.
Error Message
To Reproduce
python3 -c 'import mxnet as mx; print(mx.nd.square(mx.nd.random.uniform(shape=(1024,))))'
Discussion
The missing symbol is defined in /opt/intel/mkl/lib/intel64/libmkl_core.so:
000000000095ad00 T mkl_lapack_dspevd
libmxnet.so does depend on libmkl_core.so.
Thus, it's unclear why MKL complains that the symbol is missing when attempting to dlopen libmkl_vml_avx512.so.
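The symbol claims above can be double-checked from Python with ctypes, e.g. `resolves_symbol("/opt/intel/mkl/lib/intel64/libmkl_core.so", "mkl_lapack_dspevd")`. The sketch below uses hypothetical helper names. It also shows an in-process alternative to LD_PRELOAD: loading libraries with RTLD_GLOBAL before importing mxnet. One possible explanation for the failure, not confirmed in this thread, is symbol visibility: ctypes loads libraries with RTLD_LOCAL by default on Linux, and a library that MKL dlopens later (such as libmkl_vml_avx512.so) may not see symbols from libraries that were only loaded into a local scope.

```python
import ctypes
import ctypes.util


def resolves_symbol(lib, symbol, mode=ctypes.RTLD_LOCAL):
    """dlopen `lib` and report whether `symbol` resolves through its handle.

    dlsym on a handle searches the library and its dependency tree, so a
    True result may come from one of the library's dependencies.
    """
    handle = ctypes.CDLL(lib, mode=mode)
    try:
        getattr(handle, symbol)
    except AttributeError:
        return False
    return True


def preload_global(paths):
    """Load each library with RTLD_GLOBAL so its symbols join the global scope.

    Hypothetical workaround: call this before `import mxnet`, keeping the
    returned handles alive for the lifetime of the process.
    """
    return [ctypes.CDLL(p, mode=ctypes.RTLD_GLOBAL) for p in paths]
```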
cc: @pengzhao-intel