This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MKL_USE_STATIC_LIBS broken #18255

Open
leezu opened this issue May 7, 2020 · 12 comments

@leezu
Contributor

leezu commented May 7, 2020

Description

If we compile mxnet via cmake -GNinja -DUSE_MKLDNN=0 -DMKL_USE_STATIC_LIBS=0 -DUSE_CUDA=0 ..; ninja, running an operation that requires MKL will cause termination.

Error Message

INTEL MKL ERROR: /opt/intel/mkl/lib/intel64/libmkl_vml_avx512.so: undefined symbol: mkl_lapack_dspevd.
Intel MKL FATAL ERROR: cannot load libmkl_vml_avx512.so or libmkl_vml_def.so.
terminate called without an active exception

To Reproduce

python3 -c 'import mxnet as mx; print(mx.nd.square(mx.nd.random.uniform(shape=(1024,))))'

Discussion

The missing symbol is defined in /opt/intel/mkl/lib/intel64/libmkl_core.so: 000000000095ad00 T mkl_lapack_dspevd. libmxnet.so does depend on libmkl_core.so:

% ldd libmxnet.so
[...]
        libmkl_intel_lp64.so => /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007fe082883000)
        libmkl_intel_thread.so => /opt/intel/mkl/lib/intel64/libmkl_intel_thread.so (0x00007fe080317000)
        libmkl_core.so => /opt/intel/mkl/lib/intel64/libmkl_core.so (0x00007fe07bff7000)
[...]

Thus it's unclear why MKL reports the symbol as missing when it attempts to dlopen libmkl_vml_avx512.so.
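One way to double-check which library defines a symbol and which one merely references it is to scan `nm -D` output. The following is a minimal Python sketch, not part of MXNet or MKL: the helper names (`parse_nm_line`, `defines_symbol`, `dynamic_symbols`) are hypothetical, and the second sample line is an illustrative undefined reference rather than output quoted in this thread.

```python
import subprocess

# First line is the nm entry quoted in this issue; the second line is an
# illustrative example of how an undefined reference looks (assumption).
SAMPLE_NM = """\
000000000095ad00 T mkl_lapack_dspevd
                 U mkl_serv_getenv
"""


def parse_nm_line(line):
    """Split one line of `nm -D` output into (address, type, name)."""
    parts = line.split()
    if len(parts) == 3:   # defined symbol: address, type letter, name
        address, sym_type, name = parts
    elif len(parts) == 2:  # undefined symbol: type letter, name (no address)
        address, (sym_type, name) = None, parts
    else:
        return None
    # Newer binutils may append a version suffix such as name@@VERSION.
    return address, sym_type, name.split("@")[0]


def defines_symbol(nm_output, symbol):
    """True if the nm output lists `symbol` with an address and a non-U type."""
    for line in nm_output.splitlines():
        parsed = parse_nm_line(line)
        if parsed is None:
            continue
        address, sym_type, name = parsed
        if name == symbol and address is not None and sym_type != "U":
            return True
    return False


def dynamic_symbols(path):
    """Run `nm -D` on a shared library and return its raw text output."""
    result = subprocess.run(["nm", "-D", path], capture_output=True,
                            text=True, check=True)
    return result.stdout


print(defines_symbol(SAMPLE_NM, "mkl_lapack_dspevd"))
print(defines_symbol(SAMPLE_NM, "mkl_serv_getenv"))
```

On a machine with MKL installed, `defines_symbol(dynamic_symbols("/opt/intel/mkl/lib/intel64/libmkl_core.so"), "mkl_lapack_dspevd")` would repeat the check described above. It only shows where the symbol lives, not why the loader fails to resolve it.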

cc: @pengzhao-intel

@pengzhao-intel
Contributor

Thanks, @leezu, our team will look into the issue and get back soon.

@TaoLv
Member

TaoLv commented May 7, 2020

Hi @leezu, I would expect the flag -DMKL_USE_STATIC_LIBS to take effect only when -DUSE_BLAS is set to mkl. That means that, with your command line, the MKL libraries should not be linked at all.

@leezu
Contributor Author

leezu commented May 7, 2020

MKL is used by default if it is available.

@TaoLv
Member

TaoLv commented May 8, 2020

I cannot reproduce the issue. I pulled the latest master branch and built it with the command line below (I don't have lapack and ninja on my system):

cmake -DUSE_MKLDNN=0 -DMKL_USE_STATIC_LIBS=0 -DUSE_CUDA=0 .. -DUSE_LAPACK=0
make -j40

Try to reproduce:

(mxnet) [lvtao@mlt2-clx017 ~]$ python3 -c 'import mxnet as mx; print(mx.nd.square(mx.nd.random.uniform(shape=(1024,))))'

[0.30119628 0.35146472 0.51149577 ... 0.12286104 0.28916994 0.35195467]
<NDArray 1024 @cpu(0)>

openblas is linked. MKL is installed in /opt/intel/mkl, so I assume cmake can find it.

(mxnet) [lvtao@mlt2-clx017 build]$ ldd libmxnet.so
        linux-vdso.so.1 =>  (0x00007ffd5e5bf000)
        libopenblas.so.0 => /home/lvtao/miniconda3/envs/mxnet/lib/libopenblas.so.0 (0x00007f73d6030000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f73d5e28000)
        libopencv_imgproc.so.2.4 => /lib64/libopencv_imgproc.so.2.4 (0x00007f73d59aa000)
        libopencv_highgui.so.2.4 => /lib64/libopencv_highgui.so.2.4 (0x00007f73d5763000)
        libopencv_core.so.2.4 => /lib64/libopencv_core.so.2.4 (0x00007f73d532a000)
        libomp.so => /home/lvtao/Workspace/mxnet-temp/build/3rdparty/openmp/runtime/src/libomp.so (0x00007f73d5046000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f73d4e2a000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f73d4c26000)
        libgomp.so.1 => /home/lvtao/miniconda3/envs/mxnet/lib/libgomp.so.1 (0x00007f73d483c000)
        libstdc++.so.6 => /home/lvtao/miniconda3/envs/mxnet/lib/libstdc++.so.6 (0x00007f73dc258000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f73d453a000)
        libgcc_s.so.1 => /home/lvtao/miniconda3/envs/mxnet/lib/libgcc_s.so.1 (0x00007f73dc243000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f73d416c000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f73dc1c7000)
        libgfortran.so.4 => /home/lvtao/miniconda3/envs/mxnet/lib/./libgfortran.so.4 (0x00007f73d403e000)
        libz.so.1 => /home/lvtao/miniconda3/envs/mxnet/lib/libz.so.1 (0x00007f73dc222000)
        libjpeg.so.62 => /lib64/libjpeg.so.62 (0x00007f73d3de9000)
        libpng15.so.15 => /lib64/libpng15.so.15 (0x00007f73d3bbe000)
        libtiff.so.5 => /lib64/libtiff.so.5 (0x00007f73d394a000)
        libjasper.so.1 => /lib64/libjasper.so.1 (0x00007f73d36f0000)
        libImath.so.6 => /lib64/libImath.so.6 (0x00007f73d34de000)
        libIlmImf.so.7 => /lib64/libIlmImf.so.7 (0x00007f73d3216000)
        libIex.so.6 => /lib64/libIex.so.6 (0x00007f73d2ff7000)
        libHalf.so.6 => /lib64/libHalf.so.6 (0x00007f73d2db4000)
        libIlmThread.so.6 => /lib64/libIlmThread.so.6 (0x00007f73d2bad000)
        libgtk-x11-2.0.so.0 => /lib64/libgtk-x11-2.0.so.0 (0x00007f73d252b000)
        libgdk-x11-2.0.so.0 => /lib64/libgdk-x11-2.0.so.0 (0x00007f73d226a000)
        libatk-1.0.so.0 => /lib64/libatk-1.0.so.0 (0x00007f73d2044000)
        libgio-2.0.so.0 => /lib64/libgio-2.0.so.0 (0x00007f73d1ca5000)
        libpangoft2-1.0.so.0 => /lib64/libpangoft2-1.0.so.0 (0x00007f73d1a8f000)
        libpangocairo-1.0.so.0 => /lib64/libpangocairo-1.0.so.0 (0x00007f73d1881000)
        libgdk_pixbuf-2.0.so.0 => /lib64/libgdk_pixbuf-2.0.so.0 (0x00007f73d1659000)
        libcairo.so.2 => /lib64/libcairo.so.2 (0x00007f73d1322000)
        libpango-1.0.so.0 => /lib64/libpango-1.0.so.0 (0x00007f73d10dc000)
        libfontconfig.so.1 => /lib64/libfontconfig.so.1 (0x00007f73d0e9a000)
        libgobject-2.0.so.0 => /lib64/libgobject-2.0.so.0 (0x00007f73d0c49000)
        libglib-2.0.so.0 => /lib64/libglib-2.0.so.0 (0x00007f73d0933000)
        libfreetype.so.6 => /lib64/libfreetype.so.6 (0x00007f73d0674000)
        libgthread-2.0.so.0 => /lib64/libgthread-2.0.so.0 (0x00007f73d0472000)
        libgstbase-0.10.so.0 => /lib64/libgstbase-0.10.so.0 (0x00007f73d021e000)
        libgstreamer-0.10.so.0 => /lib64/libgstreamer-0.10.so.0 (0x00007f73cff35000)
        libgmodule-2.0.so.0 => /lib64/libgmodule-2.0.so.0 (0x00007f73cfd31000)
        libxml2.so.2 => /lib64/libxml2.so.2 (0x00007f73cf9c7000)
        libgstapp-0.10.so.0 => /lib64/libgstapp-0.10.so.0 (0x00007f73cf7bb000)
        libgstvideo-0.10.so.0 => /lib64/libgstvideo-0.10.so.0 (0x00007f73cf59e000)
        libv4l1.so.0 => /lib64/libv4l1.so.0 (0x00007f73cf398000)
        libquadmath.so.0 => /home/lvtao/miniconda3/envs/mxnet/lib/./libquadmath.so.0 (0x00007f73cf35e000)
        libjbig.so.2.0 => /lib64/libjbig.so.2.0 (0x00007f73cf152000)
        libIexMath.so.6 => /lib64/libIexMath.so.6 (0x00007f73cef4d000)
        libX11.so.6 => /lib64/libX11.so.6 (0x00007f73cec0f000)
        libXfixes.so.3 => /lib64/libXfixes.so.3 (0x00007f73cea09000)
        libXrender.so.1 => /lib64/libXrender.so.1 (0x00007f73ce7fe000)
        libXinerama.so.1 => /lib64/libXinerama.so.1 (0x00007f73ce5fb000)
        libXi.so.6 => /lib64/libXi.so.6 (0x00007f73ce3eb000)
        libXrandr.so.2 => /lib64/libXrandr.so.2 (0x00007f73ce1e0000)
        libXcursor.so.1 => /lib64/libXcursor.so.1 (0x00007f73cdfd5000)
        libXcomposite.so.1 => /lib64/libXcomposite.so.1 (0x00007f73cddd2000)
        libXdamage.so.1 => /lib64/libXdamage.so.1 (0x00007f73cdbcf000)
        libXext.so.6 => /lib64/libXext.so.6 (0x00007f73cd9bd000)
        libffi.so.6 => /home/lvtao/miniconda3/envs/mxnet/lib/libffi.so.6 (0x00007f73cd7b4000)
        libpcre.so.1 => /lib64/libpcre.so.1 (0x00007f73cd552000)
        libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f73cd32b000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f73cd112000)
        libmount.so.1 => /lib64/libmount.so.1 (0x00007f73ccecf000)
        libharfbuzz.so.0 => /lib64/libharfbuzz.so.0 (0x00007f73ccc32000)
        libpixman-1.so.0 => /lib64/libpixman-1.so.0 (0x00007f73cc989000)
        libEGL.so.1 => /lib64/libEGL.so.1 (0x00007f73cc775000)
        libxcb-shm.so.0 => /lib64/libxcb-shm.so.0 (0x00007f73cc571000)
        libxcb.so.1 => /lib64/libxcb.so.1 (0x00007f73cc349000)
        libxcb-render.so.0 => /lib64/libxcb-render.so.0 (0x00007f73cc13b000)
        libGL.so.1 => /lib64/libGL.so.1 (0x00007f73cbeaf000)
        libthai.so.0 => /lib64/libthai.so.0 (0x00007f73cbca3000)
        libfribidi.so.0 => /lib64/libfribidi.so.0 (0x00007f73cba87000)
        libexpat.so.1 => /home/lvtao/miniconda3/envs/mxnet/lib/libexpat.so.1 (0x00007f73cba53000)
        libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f73cb84e000)
        libbz2.so.1 => /lib64/libbz2.so.1 (0x00007f73cb63e000)
        liblzma.so.5 => /home/lvtao/miniconda3/envs/mxnet/lib/liblzma.so.5 (0x00007f73cb418000)
        liborc-0.4.so.0 => /lib64/liborc-0.4.so.0 (0x00007f73cb194000)
        libv4l2.so.0 => /lib64/libv4l2.so.0 (0x00007f73caf86000)
        libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f73cad46000)
        libgraphite2.so.3 => /lib64/libgraphite2.so.3 (0x00007f73cab18000)
        libGLdispatch.so.0 => /lib64/libGLdispatch.so.0 (0x00007f73ca862000)
        libXau.so.6 => /lib64/libXau.so.6 (0x00007f73ca65e000)
        libGLX.so.0 => /lib64/libGLX.so.0 (0x00007f73ca42c000)
        libv4lconvert.so.0 => /lib64/libv4lconvert.so.0 (0x00007f73ca1b3000)
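Long `ldd` listings like the one above are easier to compare when reduced to the BLAS libraries they contain. The following is a minimal Python sketch under that assumption; `parse_ldd` and `linked_blas` are hypothetical helpers written for this illustration, and the sample text is a short excerpt of the two `ldd` outputs posted in this thread.

```python
# Excerpts of the two ldd outputs quoted in this issue.
LDD_MKL_BUILD = """\
        libmkl_intel_lp64.so => /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007fe082883000)
        libmkl_intel_thread.so => /opt/intel/mkl/lib/intel64/libmkl_intel_thread.so (0x00007fe080317000)
        libmkl_core.so => /opt/intel/mkl/lib/intel64/libmkl_core.so (0x00007fe07bff7000)
"""

LDD_OPENBLAS_BUILD = """\
        linux-vdso.so.1 =>  (0x00007ffd5e5bf000)
        libopenblas.so.0 => /home/lvtao/miniconda3/envs/mxnet/lib/libopenblas.so.0 (0x00007f73d6030000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f73d5e28000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f73dc1c7000)
"""


def parse_ldd(output):
    """Map each library name in ldd output to its resolved path (or None)."""
    deps = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue  # e.g. the dynamic loader line, which has no arrow
        name, _, rest = line.partition("=>")
        # Drop the trailing load address "(0x...)"; vdso lines have no path.
        path = rest.split("(")[0].strip()
        deps[name.strip()] = path or None
    return deps


def linked_blas(deps):
    """Return which BLAS families appear among the linked libraries."""
    found = []
    if any(name.startswith("libmkl_") for name in deps):
        found.append("mkl")
    if any(name.startswith("libopenblas") for name in deps):
        found.append("openblas")
    return found


print(linked_blas(parse_ldd(LDD_MKL_BUILD)))
print(linked_blas(parse_ldd(LDD_OPENBLAS_BUILD)))
```

The name prefixes `libmkl_` and `libopenblas` are a heuristic chosen for this example; a build that links `libmkl_rt.so` would still match the first prefix.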

@leezu
Contributor Author

leezu commented May 8, 2020

@TaoLv you can reproduce it with the version of MKL installed in the CI environment: https://github.com/apache/incubator-mxnet/blob/68cb9555c4b4779aaae90e593b745270cbb59033/ci/docker/Dockerfile.build.ubuntu#L36-L61

I don't know why mkl is not detected on your system. What happens if you set -DUSE_BLAS=mkl?

See #17794 where the error happens on CI.

@TaoLv
Member

TaoLv commented May 9, 2020

@leezu, after adding -DUSE_BLAS=mkl to the cmake line, I get a similar crash:

$ python3 -c 'import mxnet as mx; print(mx.nd.square(mx.nd.random.uniform(shape=(1024,))))'
python3: symbol lookup error: /opt/intel/mkl/lib/intel64/libmkl_vml_avx512.so: undefined symbol: mkl_serv_getenv

@TaoLv
Member

TaoLv commented May 11, 2020

In my environment, the problem can be mitigated by preloading the libraries. I'm trying to see if we can fix it at the build or link stage.

(mxnet) [lvtao@mlt2-clx020 mxnet-temp]$ export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_core.so:/opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so:/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so:/opt/intel/lib/intel64_lin/libiomp5.so
(mxnet) [lvtao@mlt2-clx020 mxnet-temp]$ python3 -c 'import mxnet as mx; print(mx.nd.square(mx.nd.random.uniform(shape=(1024,))))'

[0.30119628 0.35146472 0.51149577 ... 0.12286104 0.28916994 0.35195467]
<NDArray 1024 @cpu(0)>
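The `LD_PRELOAD` workaround can also be applied to a single child process instead of the whole shell session. The following is a minimal Python sketch, assuming the library paths from the `export` line above; `preload_env` is a hypothetical helper, not an MXNet or MKL API.

```python
import os

# Library paths copied from the LD_PRELOAD line in this comment.
MKL_LIBS = [
    "/opt/intel/mkl/lib/intel64/libmkl_core.so",
    "/opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so",
    "/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so",
    "/opt/intel/lib/intel64_lin/libiomp5.so",
]


def preload_env(libraries, base_env=None):
    """Return a copy of the environment with `libraries` appended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    # Keep anything already preloaded, then add the requested libraries.
    existing = [p for p in env.get("LD_PRELOAD", "").split(":") if p]
    env["LD_PRELOAD"] = ":".join(existing + list(libraries))
    return env


print(preload_env(MKL_LIBS, base_env={})["LD_PRELOAD"])
```

Passing the result as `env=` to `subprocess.run` would start a child interpreter with the libraries preloaded. Whether that avoids the crash on a given system depends on the installed MKL version, so treat it as a mitigation to test rather than a fix.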

@akarbown
Contributor

akarbown commented Nov 2, 2020

Hi @leezu and @TaoLv, I didn't know whether this issue had been solved, so I tried to reproduce and check it on the latest master (and also on some builds from March 2020) but couldn't. I compiled mxnet with the following command line: 'cmake -GNinja -DUSE_MKLDNN=0 -DMKL_USE_STATIC_LIBS=0 -DUSE_CUDA=0 -DUSE_BLAS=mkl ..; ninja' and got the following output:

MKL_VERBOSE=1 python3 -c 'import mxnet as mx; print(mx.nd.square(mx.nd.random.uniform(shape=(1024,))))'
mkl-service + Intel(R) MKL: THREADING LAYER: (null)
mkl-service + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
mkl-service + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2020.0 Update 2 Product build 20200624 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with supprt of Vector Neural Network Instructions enabled processors, Lnx 2.70GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x561209eba390,1,0x561209eba390,1) 17.30us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56
[12:23:30] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU

[0.30119628 0.35146472 0.51149577 ... 0.12286104 0.28916994 0.35195467]
<NDArray 1024 @cpu(0)>

Should I set something more to get this MKL error?
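The `MKL_VERBOSE` call lines show which MKL routine ran and with how many threads, which helps confirm that MKL is actually being exercised. The following is a minimal Python sketch that extracts those fields; the regular expression is an assumption based only on the single call line quoted above, and `parse_call` is a hypothetical helper.

```python
import re

# Call line and header line copied from the MKL_VERBOSE output above.
SAMPLE_CALL = ("MKL_VERBOSE SDOT(2,0x561209eba390,1,0x561209eba390,1) "
               "17.30us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56")
SAMPLE_HEADER = "MKL_VERBOSE Intel(R) MKL 2020.0 Update 2 Product build 20200624"

# Pattern inferred from the sample line: routine name, arguments, elapsed
# time with unit, and the thread count at the end (NThr).
CALL = re.compile(
    r"^MKL_VERBOSE (?P<func>\w+)\((?P<args>[^)]*)\) "
    r"(?P<time>[\d.]+)(?P<unit>ns|us|ms|s)\b.*?NThr:(?P<nthr>\d+)"
)


def parse_call(line):
    """Return routine name, elapsed time and thread count, or None."""
    match = CALL.match(line)
    if match is None:
        return None  # header lines and unrelated log lines do not match
    return {
        "func": match.group("func"),
        "time": float(match.group("time")),
        "unit": match.group("unit"),
        "threads": int(match.group("nthr")),
    }


print(parse_call(SAMPLE_CALL))
print(parse_call(SAMPLE_HEADER))
```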

@leezu
Contributor Author

leezu commented Nov 4, 2020

Thanks @akarbown for following up on this issue. I first noticed the failure reported here through the CI failure of #17794. Previously, the run would abort immediately.

I rebased the PR and included an update to the latest MKL version. Now the CI does not immediately abort but there are still 20 failed (aborted) tests. Perhaps there has been a change in MKL / MXNet that works around the immediate issue?

https://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-17794/runs/14/nodes/284/steps/417/log/?start=0

[2020-11-02T17:50:40.151Z] 
[2020-11-02T17:50:40.151Z] =================================== FAILURES ===================================
[2020-11-02T17:50:40.151Z] ____________________ tests/python/unittest/test_autograd.py ____________________
[2020-11-02T17:50:40.151Z] [gw2] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.151Z] worker 'gw2' crashed while running 'tests/python/unittest/test_autograd.py::test_unary_func'
[2020-11-02T17:50:40.151Z] ____________________ tests/python/unittest/test_autograd.py ____________________
[2020-11-02T17:50:40.151Z] [gw3] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.151Z] worker 'gw3' crashed while running 'tests/python/unittest/test_autograd.py::test_training'
[2020-11-02T17:50:40.151Z] ___________ tests/python/unittest/test_contrib_gluon_data_vision.py ____________
[2020-11-02T17:50:40.151Z] [gw1] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.151Z] worker 'gw1' crashed while running 'tests/python/unittest/test_contrib_gluon_data_vision.py::TestImage::test_imageiter'
[2020-11-02T17:50:40.151Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:50:40.151Z] [gw4] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.151Z] worker 'gw4' crashed while running 'tests/python/unittest/test_numpy_op.py::test_np_linalg_svd[False-float32-shape3]'
[2020-11-02T17:50:40.151Z] ________________ tests/python/unittest/test_contrib_intgemm.py _________________
[2020-11-02T17:50:40.151Z] [gw0] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.151Z] worker 'gw0' crashed while running 'tests/python/unittest/test_contrib_intgemm.py::test_contrib_intgemm_multiply[api0-8-64-1]'
[2020-11-02T17:50:40.151Z] ________________ tests/python/unittest/test_gluon_estimator.py _________________
[2020-11-02T17:50:40.151Z] [gw7] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.151Z] worker 'gw7' crashed while running 'tests/python/unittest/test_gluon_estimator.py::test_trainer'
[2020-11-02T17:50:40.151Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:50:40.151Z] [gw8] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.151Z] worker 'gw8' crashed while running 'tests/python/unittest/test_numpy_op.py::test_np_matmul[float64-True-add-write-shape_a9-shape_b9]'
[2020-11-02T17:50:40.151Z] ___________________ tests/python/unittest/test_optimizer.py ____________________
[2020-11-02T17:50:40.151Z] [gw6] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.151Z] worker 'gw6' crashed while running 'tests/python/unittest/test_optimizer.py::test_lamb'
[2020-11-02T17:50:40.151Z] ________________ tests/python/unittest/test_contrib_intgemm.py _________________
[2020-11-02T17:50:40.151Z] [gw9] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.151Z] worker 'gw9' crashed while running 'tests/python/unittest/test_contrib_intgemm.py::test_contrib_intgemm_multiply[api0-8-128-1]'
[2020-11-02T17:50:40.405Z] _______________ tests/python/unittest/test_higher_order_grad.py ________________
[2020-11-02T17:50:40.405Z] [gw5] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.405Z] worker 'gw5' crashed while running 'tests/python/unittest/test_higher_order_grad.py::test_arctan'
[2020-11-02T17:50:40.405Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:50:40.405Z] [gw11] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.405Z] worker 'gw11' crashed while running 'tests/python/unittest/test_numpy_op.py::test_np_mixed_mxnp_op_funcs'
[2020-11-02T17:50:40.405Z] ________________ tests/python/unittest/test_contrib_intgemm.py _________________
[2020-11-02T17:50:40.405Z] [gw12] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.405Z] worker 'gw12' crashed while running 'tests/python/unittest/test_contrib_intgemm.py::test_contrib_intgemm_multiply[api0-8-64-2]'
[2020-11-02T17:50:40.405Z] ___________________ tests/python/unittest/test_optimizer.py ____________________
[2020-11-02T17:50:40.405Z] [gw13] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.405Z] worker 'gw13' crashed while running 'tests/python/unittest/test_optimizer.py::test_lans'
[2020-11-02T17:50:40.405Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:50:40.405Z] [gw10] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.405Z] worker 'gw10' crashed while running 'tests/python/unittest/test_numpy_op.py::test_npx_batch_dot'
[2020-11-02T17:50:40.405Z] ________________ tests/python/unittest/test_contrib_intgemm.py _________________
[2020-11-02T17:50:40.405Z] [gw15] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.405Z] worker 'gw15' crashed while running 'tests/python/unittest/test_contrib_intgemm.py::test_contrib_intgemm_multiply[api0-8-192-1]'
[2020-11-02T17:50:40.405Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:50:40.405Z] [gw17] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.405Z] worker 'gw17' crashed while running 'tests/python/unittest/test_numpy_op.py::test_np_matmul[float32-True-add-null-shape_a6-shape_b6]'
[2020-11-02T17:50:40.405Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:50:40.405Z] [gw18] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.405Z] worker 'gw18' crashed while running 'tests/python/unittest/test_numpy_op.py::test_np_matmul[float32-True-write-add-shape_a4-shape_b4]'
[2020-11-02T17:50:40.405Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:50:40.405Z] [gw19] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.405Z] worker 'gw19' crashed while running 'tests/python/unittest/test_numpy_op.py::test_np_matmul[float64-False-write-write-shape_a3-shape_b3]'
[2020-11-02T17:50:40.405Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:50:40.405Z] [gw14] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.405Z] worker 'gw14' crashed while running 'tests/python/unittest/test_numpy_op.py::test_np_linalg_svd[False-float32-shape4]'
[2020-11-02T17:50:40.405Z] _____________________ tests/python/unittest/test_metric.py _____________________
[2020-11-02T17:50:40.405Z] [gw16] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:50:40.405Z] worker 'gw16' crashed while running 'tests/python/unittest/test_metric.py::test_ce'
[2020-11-02T17:50:40.405Z] =============================== warnings summary ===============================
[2020-11-02T17:51:15.713Z] =================================== FAILURES ===================================
[2020-11-02T17:51:15.713Z] ____________________ tests/python/unittest/test_autograd.py ____________________
[2020-11-02T17:51:15.713Z] [gw3] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw3' crashed while running 'tests/python/unittest/test_autograd.py::test_unary_func'
[2020-11-02T17:51:15.713Z] ____________________ tests/python/unittest/test_autograd.py ____________________
[2020-11-02T17:51:15.713Z] [gw2] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw2' crashed while running 'tests/python/unittest/test_autograd.py::test_training'
[2020-11-02T17:51:15.713Z] ___________ tests/python/unittest/test_contrib_gluon_data_vision.py ____________
[2020-11-02T17:51:15.713Z] [gw1] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw1' crashed while running 'tests/python/unittest/test_contrib_gluon_data_vision.py::TestImage::test_imageiter'
[2020-11-02T17:51:15.713Z] ____________________ tests/python/unittest/test_executor.py ____________________
[2020-11-02T17:51:15.713Z] [gw0] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw0' crashed while running 'tests/python/unittest/test_executor.py::test_dot'
[2020-11-02T17:51:15.713Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:51:15.713Z] [gw5] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw5' crashed while running 'tests/python/unittest/test_numpy_op.py::test_np_mixed_mxnp_op_funcs'
[2020-11-02T17:51:15.713Z] ________ tests/python/unittest/test_numpy_contrib_gluon_data_vision.py _________
[2020-11-02T17:51:15.713Z] [gw8] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw8' crashed while running 'tests/python/unittest/test_numpy_contrib_gluon_data_vision.py::TestImage::test_bbox_augmenters'
[2020-11-02T17:51:15.713Z] _______________ tests/python/unittest/test_gluon_data_vision.py ________________
[2020-11-02T17:51:15.713Z] [gw9] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw9' crashed while running 'tests/python/unittest/test_gluon_data_vision.py::test_random_gray'
[2020-11-02T17:51:15.713Z] ___________________ tests/python/unittest/test_optimizer.py ____________________
[2020-11-02T17:51:15.713Z] [gw7] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw7' crashed while running 'tests/python/unittest/test_optimizer.py::test_lamb'
[2020-11-02T17:51:15.713Z] __________________ tests/python/unittest/test_subgraph_op.py ___________________
[2020-11-02T17:51:15.713Z] [gw6] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw6' crashed while running 'tests/python/unittest/test_subgraph_op.py::test_subgraph_exe4[sym1-op_names1-default]'
[2020-11-02T17:51:15.713Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:51:15.713Z] [gw4] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw4' crashed while running 'tests/python/unittest/test_numpy_op.py::test_npx_batch_dot'
[2020-11-02T17:51:15.713Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:51:15.713Z] [gw13] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw13' crashed while running 'tests/python/unittest/test_numpy_op.py::test_np_linalg_svd[True-float64-shape10]'
[2020-11-02T17:51:15.713Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:51:15.713Z] [gw11] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw11' crashed while running 'tests/python/unittest/test_numpy_op.py::test_np_matmul[float64-True-add-add-shape_a6-shape_b6]'
[2020-11-02T17:51:15.713Z] _____________________ tests/python/unittest/test_metric.py _____________________
[2020-11-02T17:51:15.713Z] [gw15] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw15' crashed while running 'tests/python/unittest/test_metric.py::test_ce'
[2020-11-02T17:51:15.713Z] ___________________ tests/python/unittest/test_optimizer.py ____________________
[2020-11-02T17:51:15.713Z] [gw16] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw16' crashed while running 'tests/python/unittest/test_optimizer.py::test_lans'
[2020-11-02T17:51:15.713Z] ___________________ tests/python/unittest/test_extensions.py ___________________
[2020-11-02T17:51:15.713Z] [gw10] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw10' crashed while running 'tests/python/unittest/test_extensions.py::test_subgraph'
[2020-11-02T17:51:15.713Z] _________________ tests/python/unittest/test_contrib_krprod.py _________________
[2020-11-02T17:51:15.713Z] [gw14] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw14' crashed while running 'tests/python/unittest/test_contrib_krprod.py::test_krprod_one_input'
[2020-11-02T17:51:15.713Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:51:15.713Z] [gw19] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw19' crashed while running 'tests/python/unittest/test_numpy_op.py::test_np_linalg_slogdet[a_shape6-False-float64-add]'
[2020-11-02T17:51:15.713Z] _________________ tests/python/unittest/test_contrib_krprod.py _________________
[2020-11-02T17:51:15.713Z] [gw17] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw17' crashed while running 'tests/python/unittest/test_contrib_krprod.py::test_krprod_two_inputs'
[2020-11-02T17:51:15.713Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:51:15.713Z] [gw18] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw18' crashed while running 'tests/python/unittest/test_numpy_op.py::test_np_linalg_qr'
[2020-11-02T17:51:15.713Z] ____________________ tests/python/unittest/test_numpy_op.py ____________________
[2020-11-02T17:51:15.713Z] [gw12] linux -- Python 3.6.9 /usr/bin/python3
[2020-11-02T17:51:15.713Z] worker 'gw12' crashed while running 'tests/python/unittest/test_numpy_op.py::test_npx_special_unary_func'
[2020-11-02T17:51:15.713Z] =============================== warnings summary ===============================

I haven't looked into these failures in more detail.
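To get an overview of such a log, the crashed-worker lines can be grouped by test file. The following is a minimal Python sketch, assuming the pytest-xdist line format seen in the log above; `crashed_tests` and `crashes_per_file` are hypothetical helpers, and the sample contains only three lines copied from that log.

```python
import re
from collections import Counter

# Three lines copied from the CI log quoted above.
SAMPLE_LOG = """\
[2020-11-02T17:50:40.151Z] worker 'gw2' crashed while running 'tests/python/unittest/test_autograd.py::test_unary_func'
[2020-11-02T17:50:40.151Z] worker 'gw3' crashed while running 'tests/python/unittest/test_autograd.py::test_training'
[2020-11-02T17:50:40.151Z] worker 'gw4' crashed while running 'tests/python/unittest/test_numpy_op.py::test_np_linalg_svd[False-float32-shape3]'
"""

CRASH = re.compile(
    r"worker '(?P<worker>gw\d+)' crashed while running '(?P<test>.+)'\s*$"
)


def crashed_tests(log_text):
    """Return (worker, test_id) pairs for every crashed-worker line."""
    pairs = []
    for line in log_text.splitlines():
        match = CRASH.search(line)
        if match:
            pairs.append((match.group("worker"), match.group("test")))
    return pairs


def crashes_per_file(pairs):
    """Count crashes per test file (the part of the test id before '::')."""
    return Counter(test.split("::", 1)[0] for _worker, test in pairs)


for path, count in crashes_per_file(crashed_tests(SAMPLE_LOG)).items():
    print(count, path)
```

Run over the full log, the same grouping would show which test files account for most of the aborted workers.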

@akarbown
Contributor

akarbown commented Nov 9, 2020

I did some investigation and reproduced the issue. Linking the MKL libraries statically (-DMKL_USE_STATIC_LIBS=1) gives the following pass rate: 1 failed, 10236 passed, 151 skipped, 1 xpassed, while turning off static linking of the MKL libraries gives: 23 failed, 7849 passed, 33 skipped.
Then I went back to the MKL shared libraries and LD_PRELOADed them before running the tests (the same way @TaoLv did), which seemed to solve the issue, with the following pass rate: 1 failed, 10236 passed, 151 skipped, 1 xpassed.
I'm going to find out why and will keep you posted on further findings.
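The two summaries differ not only in the number of failures but also in the total number of reported outcomes. The following is a minimal Python sketch that parses the summary strings quoted in this comment and compares the totals; `parse_summary` is a hypothetical helper.

```python
import re


def parse_summary(text):
    """Turn '1 failed, 10236 passed, ...' into {'failed': 1, 'passed': 10236, ...}."""
    return {kind: int(count) for count, kind in re.findall(r"(\d+) ([a-z]+)", text)}


# Summary strings copied from this comment.
STATIC = parse_summary("1 failed, 10236 passed, 151 skipped, 1 xpassed")
DYNAMIC = parse_summary("23 failed, 7849 passed, 33 skipped")

static_total = sum(STATIC.values())    # 1 + 10236 + 151 + 1 = 10389
dynamic_total = sum(DYNAMIC.values())  # 23 + 7849 + 33 = 7905

print(static_total, dynamic_total, static_total - dynamic_total)
```

The dynamically linked run reports 2484 fewer outcomes. That could mean that many tests never reported a result in that run, but this thread does not confirm the reason.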

@akarbown
Contributor

When I run the tests with the command line python3 -m pytest -sv failing_test_case, I get the following output:

File "/usr/local/lib/python3.6/dist-packages/flaky/flaky_pINTEL MKL ERROR: /opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_avx512.so: undefined symbol: mkl_sparse_optimize_bsr_trsm_i8.
Intel MKL FATAL ERROR: Cannot load libmkl_avx512.so or libmkl_def.so.

I've checked that LD_PRELOADing libmkl_rt.so is enough to fix the issue. However, when I link against libmkl_rt.so (or compile MXNet with MKL_USE_SINGLE_DYNAMIC_LIBRARY=1), I get the same problem as described in #17641, because multiple OpenMP libraries are linked into MXNet. It seems to be a catch-22 situation.

Assuming that we compile MXNet with MKL_USE_SINGLE_DYNAMIC_LIBRARY=1, we get the problem of multiple linked OpenMP libraries, which could probably be worked around in one of the following ways:
1. Setting KMP_DUPLICATE_LIB_OK=TRUE helps when running each test separately, but not when running all of the unit tests in a row (it results in a hang; see the conclusions).
2. I tried MKL_THREADING_LAYER: the test passes for 'tbb' and 'sequential' but not for 'intel' or 'GNU' (because it couldn't find libgomp.so and fell back to libiomp5.so).
3. Compiling MXNet with USE_OPENMP=0 would probably make the unit tests run "forever" (I assume it's not under consideration).
4. Finally, I'm checking the following solution, which requires adding the following line to CMakeLists.txt (or compiling with -DUSE_BLAS=MKL, because STREQUAL is case sensitive):

index 07075d752..1555f3f40 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -411,6 +411,7 @@ if(USE_OPENMP)
      AND SYSTEM_ARCHITECTURE STREQUAL "x86_64"
      AND NOT CMAKE_BUILD_TYPE STREQUAL "Distribution"
      AND NOT BLAS STREQUAL "MKL"
+     AND NOT BLAS STREQUAL "mkl"
      AND NOT MSVC
      AND NOT CMAKE_CROSSCOMPILING)
     load_omp()

All the ~20 tests that were failing passed without any issues (except for the one test case: test_optimizer.py::test_lamb). To be more precise, it's compiled with the following cmdline:
cmake -GNinja -DMKL_USE_SINGLE_DYNAMIC_LIBRARY=1 -DUSE_MKLDNN=1 -DMKL_USE_STATIC_LIBS=0 -DUSE_CUDA=0 -DUSE_BLAS=mkl -DUSE_OPENMP=0 -DCMAKE_BUILD_TYPE=Debug ..; ninja

Conclusions:

  • While running all the unit tests in a row with the command line: ../runtime_functions.sh unittest_ubuntu_python3_cpu after 84% of the executed tests it hangs (looks like deadlock or race condition, but I need more time to investigate the issue).
  • The failing test cases (the ~19 tests that I found as the reproduction of the issue) pass when run one by one.
  • Moreover, I've also found some symbol issues while loading libiomp5.so:
      error: symbol lookup error: undefined symbol: ompt_start_tool
      error: symbol lookup error: undefined symbol: scalable_malloc (fatal)
    
  • I've also collected unit test run times (just to compare), using the command line ../runtime_functions.sh unittest_ubuntu_python3_cpu with the following options:

| Options | Time |
| --- | --- |
| LD_PRELOAD=<path_to_the_library>/libmkl_rt.so | real 34m2.185s |
| MKL_THREADING_LAYER=sequential | real 33m19.842s |
| MKL_THREADING_LAYER=tbb | real 33m35.528s |
| MKL_THREADING_LAYER=sequential KMP_HW_SUBSET=64c,1t | real 25m26.099s |
| MKL_THREADING_LAYER=intel | hangs |
| KMP_DUPLICATE_LIB_OK=TRUE | hangs |
| Without any extraordinary options | hangs |

Now I want to concentrate on root causing the hang issue.
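When comparing environment settings where some runs hang, it helps to wrap each run in a timeout so that a hang is recorded instead of blocking the comparison. The following is a minimal Python sketch of that idea; `run_with_env` is a hypothetical helper, and the smoke test at the end uses a trivial command rather than the real unit test script.

```python
import os
import subprocess
import sys
import time


def run_with_env(cmd, extra_env, timeout_s):
    """Run `cmd` with extra environment variables; report status and duration."""
    env = dict(os.environ)
    env.update(extra_env)
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, env=env, timeout=timeout_s,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child when the timeout expires.
        return "timeout", time.monotonic() - start
    status = "ok" if proc.returncode == 0 else "failed"
    return status, time.monotonic() - start


# Smoke test with a trivial command; swap in the real test command to use it.
print(run_with_env([sys.executable, "-c", "pass"],
                   {"MKL_THREADING_LAYER": "sequential"}, timeout_s=60))
```

Note that only the direct child is killed on timeout, so a shell script that spawns its own workers may need extra cleanup.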

@akarbown
Contributor

I have compiled the MXNet library (libmxnet.so) so that it uses MKL via the Single Dynamic Library (libmxnet.so links with libmkl_rt.so) and uses OpenMP separately (libmxnet.so also links with libgomp.so.1). Since MXNet depends on GNU OpenMP, I'm forcing MKL to use the GNU threading layer (MKL_THREADING_LAYER=GNU), so that I have a single (GNU) OpenMP runtime in the process.
However, libmkl_rt fails to find GNU OpenMP (strace revealed that it looks for libgomp.so, which is not present in the filesystem, instead of libgomp.so.1) and then falls back to libiomp5.so:

[pid 29641] open("/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_core.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libgomp.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/rh/rh-python36/root/usr/bin/libgomp.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/../../../compiler/lib/intel64/libgomp.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/libgomp.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libiomp5.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/rh/rh-python36/root/usr/bin/libiomp5.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/../../../compiler/lib/intel64/libiomp5.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_avx2.so", O_RDONLY|O_CLOEXEC <unfinished ...>

So I end up with two different OpenMP implementations in a single process, and that causes runtime issues: I observe hangs during unit tests. When I create a symlink (libgomp.so -> libgomp.so.1), I can confirm that MKL opens GNU OpenMP and my tests pass.
I have reported the issue internally and I'm waiting for the solution of that issue.
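The lookup order is easier to read when the strace output is reduced to the library names being probed. The following is a minimal Python sketch, assuming the `open(...)` line format shown in the trace above; `probe_order` is a hypothetical helper, and the sample contains five lines copied from that trace.

```python
import posixpath
import re

# Five lines copied from the strace output in this comment.
SAMPLE_STRACE = """\
[pid 29641] open("/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_core.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libgomp.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/rh/rh-python36/root/usr/bin/libgomp.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libiomp5.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 29641] open("/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so", O_RDONLY|O_CLOEXEC <unfinished ...>
"""

# Matches open("path", ...) and also openat(AT_FDCWD, "path", ...).
OPEN_CALL = re.compile(r'open(?:at)?\((?:AT_FDCWD, )?"(?P<path>[^"]+)"')


def probe_order(strace_text):
    """Return library file names in the order they are first probed."""
    seen = []
    for match in OPEN_CALL.finditer(strace_text):
        name = posixpath.basename(match.group("path"))
        if name not in seen:
            seen.append(name)
    return seen


print(probe_order(SAMPLE_STRACE))
```

The result lists `libgomp.so` before `libiomp5.so`, which matches the description above: the unversioned name is tried first, and `libgomp.so.1` never appears in the probes.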
