
importing mxnet causing subprocess to crash #13875

Closed
dabraude opened this issue Jan 14, 2019 · 23 comments

@dabraude
Contributor

dabraude commented Jan 14, 2019

may be related to #13831

Description

importing mxnet causes OSErrors in subprocess

Environment info (Required)

Scientific Linux 7.5
Python 3.6.3
MXNet 1.5.0 (from packages)
(tried on multiple computers running different CUDA builds)

Error Message:

See the traceback under the minimum reproducible example below.

Minimum reproducible example

Using the following script (or just running the equivalent commands directly)

import mxnet
import subprocess

# Repeatedly spawn a subprocess; with mxnet imported this eventually fails
# with "OSError: [Errno 14] Bad address".
n = 0
while True:
    if not n % 1000:
        print("RUN", n)
    ret = subprocess.call(['ls', '/tmp'], stdout=subprocess.PIPE)
    n += 1

will eventually give this error message:

Traceback (most recent call last):
  File "subcrash.py", line 13, in <module>
    ret = subprocess.call(['ls', '/'], stdout=subprocess.PIPE)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 267, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 14] Bad address: 'ls'

It doesn't seem to matter which executable is called.

What have you tried to solve it?

Don't even know where to start.
If you try putting in a stack trace or running under pdb, it won't break.

@piyushghai
Contributor

@dabraude Thank you for submitting the issue! I'm labeling it so the MXNet community members can help resolve it.
@mxnet-label-bot add [Build]

I tried running the script you provided locally on my Mac (so essentially a non-CUDA build).
I did not face any crashes. My MXNet version is 1.5.0b20190112.

Can you try this on your machine and then run the script?
Run: pip install -U mxnet --pre

I'm trying to see if it's something to do with the CUDA builds of MXNet.
Also, what's the CUDA version that you tried it on?

@dabraude
Contributor Author

OK, we will try running that script and get back to you.

We are running CUDA 10.

@dabraude
Contributor Author

dabraude commented Jan 15, 2019

It still crashes with the --pre build.

We have found that it only happens with the MKL version:
mxnet-cu100mkl-1.5.0b20190115 - crashing (Intel or AMD CPU)
mxnet-cu100-1.5.0b20190115 - stable (up to ~4 million calls)

@piyushghai
Contributor

@mxnet-label-bot update [Build, MKL]

@azai91 @mseth10 Can we have a look at this crash?

marcoabreu added the MKL label on Jan 15, 2019
@aaronmarkham
Contributor

I'd like to note that the website CI pipeline has been intermittently failing with subprocess errors ever since the MKLDNN merge. This is when it started:
http://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/website/job/master/141/
It's really important that we have the website check in CI, but right now it is turned off because of the failures.

@pengzhao-intel
Contributor

@dabraude @aaronmarkham thanks for reporting the issues.

We will take a look at the potential issue. @ZhennanQin @TaoLv

@TaoLv
Member

TaoLv commented Jan 17, 2019

@aaronmarkham By "the MKLDNN merge" I guess you mean #13681, right? But I tried the script @dabraude shared in this issue, and it also crashes with mxnet-mkl==1.3.1.

@dabraude said this issue only happens with the MKL build, but the website build is not using MKL or MKL-DNN, so I'm afraid they are not the same issue.

BTW, @dabraude, have you ever tried it with Python 2?

@TaoLv
Member

TaoLv commented Jan 17, 2019

@dabraude Please try export KMP_INIT_AT_FORK=false before running your script. Let me know if it works for you. Thank you.
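
If exporting the variable from the shell is awkward, a minimal sketch of the same workaround is to set it from Python before importing mxnet (this assumes the OpenMP runtime only reads the variable when libiomp5 is loaded, which happens during import mxnet):

# Sketch of the workaround above: set KMP_INIT_AT_FORK before importing mxnet,
# so the OpenMP runtime (libiomp5) sees it when the library is loaded.
import os
os.environ["KMP_INIT_AT_FORK"] = "false"

import mxnet
import subprocess

# Same stress loop as the reproducer; with the variable set it should no longer
# raise "OSError: [Errno 14] Bad address".
for i in range(10000):
    subprocess.call(["ls", "/tmp"], stdout=subprocess.PIPE)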

@dabraude
Contributor Author

I can confirm it happens with Python 2 as well, and that
export KMP_INIT_AT_FORK=false seems to stop it, but with intermittent errors I can't be 100% sure it did.

@aaronmarkham
Contributor

aaronmarkham commented Jan 17, 2019

Researching the build logs from the first crash... I see that mkldnn is set to 0 in some of the earlier build routines, but when making the docs, mkldnn source files are still being compiled:

+ make docs SPHINXOPTS=-W
/work/mxnet /work/mxnet
make -C docs html
make[1]: Entering directory '/work/mxnet/docs'
export BUILD_VER=
Env var set for BUILD_VER: 
sphinx-build -b html -d _build/doctrees  -W . _build/html
Running Sphinx v1.5.6
making output directory...
Building version default
Document sets to generate:
scala_docs    : 1
java_docs     : 1
clojure_docs  : 1
doxygen_docs  : 1
r_docs        : 0
Building MXNet!
Building Doxygen!
Building Scala!
Building Scala Docs!
Building Java Docs!
Building Clojure Docs!
loading pickled environment... not yet created
make[2]: Entering directory '/work/mxnet'
g++ -std=c++11 -c -DMSHADOW_FORCE_STREAM -Wall -Wsign-compare -g -O0 -I/work/mxnet/3rdparty/mshadow/ -I/work/mxnet/3rdparty/dmlc-core/include -fPIC -I/work/mxnet/3rdparty/tvm/nnvm/include -I/work/mxnet/3rdparty/dlpack/include -I/work/mxnet/3rdparty/tvm/include -Iinclude -funroll-loops -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -msse3 -mf16c -DMSHADOW_USE_CUDA=0 -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -I/include -DMSHADOW_RABIT_PS=0 -DMSHADOW_DIST_PS=0 -DMSHADOW_USE_PASCAL=0 -DMXNET_USE_OPENCV=1 -I/usr/include/opencv -fopenmp -DMXNET_USE_OPERATOR_TUNING=1 -DMXNET_USE_LAPACK  -DMXNET_USE_NCCL=0 -DMXNET_USE_LIBJPEG_TURBO=0 -MMD -c \
src/operator/subgraph/mkldnn/mkldnn_conv_property.cc -o build/src/operator/subgraph/mkldnn/mkldnn_conv_property.o

Then further down I see more...

ar crv lib/libmxnet.a build/src/operator/subgraph/mkldnn/mkldnn_conv_property.o 
build/src/operator/subgraph/mkldnn/mkldnn_conv_post_quantize_property.o build/src/operator/subgraph/mkldnn/mkldnn_conv.o build/src/operator/nn/mkldnn/mkldnn_convolution.o build/src/operator/nn/mkldnn/mkldnn_concat.o build/src/operator/nn/mkldnn/mkldnn_base.o build/src/operator/nn/mkldnn/mkldnn_act.o build/src/operator/nn/mkldnn/mkldnn_softmax.o build/src/operator/nn/mkldnn/mkldnn_deconvolution.o build/src/operator/nn/mkldnn/mkldnn_copy.o 
...
a - build/src/operator/subgraph/mkldnn/mkldnn_conv_property.o
a - build/src/operator/subgraph/mkldnn/mkldnn_conv_post_quantize_property.o
a - build/src/operator/subgraph/mkldnn/mkldnn_conv.o
a - build/src/operator/nn/mkldnn/mkldnn_convolution.o
a - build/src/operator/nn/mkldnn/mkldnn_concat.o
...

So would this help reveal why the docs build is experiencing the same kind of crashing?

@TaoLv
Member

TaoLv commented Jan 23, 2019

@dabraude Can you confirm whether the issue is still there with the environment variable set?

@ZhennanQin
Contributor

@aaronmarkham Seeing the mkldnn files in the build log is expected even with USE_MKLDNN=0, because USE_MKLDNN is only used as a C macro inside those files rather than for Makefile source selection. In other words, USE_MKLDNN doesn't change which source files are collected for the build; it changes what the compiler sees inside the mkldnn files.
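
For reference, assuming MXNet 1.5+ where the mxnet.runtime feature-detection API is available, a quick sketch to check which features a given binary was actually built with:

# Sketch: query the compile-time feature flags of the installed MXNet binary.
# Assumes mxnet.runtime.Features() exists (MXNet 1.5 and later).
from mxnet.runtime import Features

features = Features()
print("MKLDNN enabled:", features.is_enabled("MKLDNN"))
print("CUDA enabled:", features.is_enabled("CUDA"))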

@dabraude
Contributor Author

@TaoLv It didn't crash when running overnight, so I assume it is working.

@leotac
Contributor

leotac commented Sep 23, 2019

Hi all,
this bug keeps biting us. It is easily reproducible, meaning that it occurs at a random point but with pretty high frequency, always within a few hundred attempts, though it is non-deterministic.
The code I'm using (essentially the same as above):

import mxnet
import subprocess

for i in range(1000):
    if not i%100: print(i)
    try:
        ret = subprocess.call(["ls","/tmp"], stdout=subprocess.PIPE)
    except Exception as e:
        print(i, e)
        exit()

and you always get a nice
OSError: [Errno 14] Bad address: 'ls'

I managed to isolate some requirements to recreate a conda environment where this issue occurs.
This is obtained with conda + pip as follows (I have conda 4.7.12):

conda create -n mxnet-test --file env_conda.txt
conda activate mxnet-test
#make sure we're using the pip in the env
echo $(which pip)
pip install -r env_pip.txt

where the content of the files is:

env_conda.txt:
mkl=2019.0=118
numpy=1.16.4=py36h99e49ec_0
env_pip.txt:
mxnet-cu80mkl==1.5.0

and this is the result of conda list -e

# platform: linux-64
_libgcc_mutex=0.1=main
blas=1.0=openblas
ca-certificates=2019.5.15=1
certifi=2019.9.11=py36_0
chardet=3.0.4=pypi_0
idna=2.8=pypi_0
intel-openmp=2019.4=243
libedit=3.1.20181209=hc058e9b_0
libffi=3.2.1=hd88cf55_4
libgcc-ng=9.1.0=hdf63c60_0
libgfortran-ng=7.3.0=hdf63c60_0
libopenblas=0.3.6=h5a2b251_1
libstdcxx-ng=9.1.0=hdf63c60_0
mkl=2019.0=118
mxnet-cu80mkl=1.5.0=pypi_0
ncurses=6.1=he6710b0_1
numpy=1.16.4=py36h99e49ec_0
numpy-base=1.16.4=py36h2f8d375_0
openssl=1.1.1d=h7b6447c_1
pip=19.2.2=py36_0
python=3.6.9=h265db76_0
python-graphviz=0.8.4=pypi_0
readline=7.0=h7b6447c_5
requests=2.22.0=pypi_0
setuptools=41.0.1=py36_0
sqlite=3.29.0=h7b6447c_0
tk=8.6.8=hbc83047_0
urllib3=1.25.5=pypi_0
wheel=0.33.4=py36_0
xz=5.2.4=h14c3975_4
zlib=1.2.11=h7b6447c_3

I was able to reproduce this issue both with and without GPU (mxnet-mkl) on Ubuntu 16.04, and also inside Docker containers.

Note: this is also non-deterministic at "build time": when creating environments with exactly the same requirements and exactly the same installed libraries, you can randomly end up with an environment where the issue does not occur.

@leotac
Contributor

leotac commented Sep 23, 2019

This does seem related to (or really, the same thing as) numpy/numpy#10060 and #12710,
and setting KMP_INIT_AT_FORK=FALSE as suggested there seems to fix the issue.
Not sure if something could be done by the libraries that use MKL, such as mxnet, to warn about this behavior and how to prevent it; something like the sketch below, for instance.
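
For illustration only, a hypothetical sketch of such a warning (this is not something mxnet currently does) could check the variable at import time:

# Hypothetical sketch: warn if KMP_INIT_AT_FORK is not set to "false".
import os
import warnings

if os.environ.get("KMP_INIT_AT_FORK", "").lower() != "false":
    warnings.warn(
        "MKL-enabled builds may crash forked subprocesses with "
        "'OSError: [Errno 14] Bad address'; consider setting "
        "KMP_INIT_AT_FORK=false before starting the process."
    )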

@TaoLv
Member

TaoLv commented Sep 24, 2019

@Sbebo I regard this issue as a defect of the OpenMP library, and that library will be excluded in the next minor release.

@leezu
Contributor

leezu commented Sep 24, 2019

@szha @eric-haibin-lin this bug report describes the cause of the OSErrors that happen from time to time on ci.mxnet.io (the CI used by gluon-nlp.mxnet.io, gluon-cv.mxnet.io, ...)

@leezu
Contributor

leezu commented Sep 24, 2019

Note that we didn't face the OSError-related crashes anymore after upgrading to Ubuntu 18.04 (more specifically, using the following Docker container: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/docker/Dockerfile)

@TaoLv
Member

TaoLv commented Sep 24, 2019

Hi @leezu, is it possible for you to try export KMP_INIT_AT_FORK=false in your CI environment?

@leotac
Contributor

leotac commented Sep 24, 2019

Note that we didn't face the OSError-related crashes anymore after upgrading to Ubuntu 18.04 (more specifically, using the following Docker container: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/docker/Dockerfile)

Oh, will try to move to Ubuntu 18.04 if possible. Thanks!

@leezu
Contributor

leezu commented Sep 24, 2019

@TaoLv the issue occurred only rarely (a few times a month) for us and has not occurred anymore in recent months. What would be the expectation of setting export KMP_INIT_AT_FORK=false? Should it fix the issue, or are you asking to confirm whether setting the env variable reintroduces the problem?

@TaoLv
Member

TaoLv commented Sep 26, 2019

The env variable fixed the problem reported in this issue. So if GluonNLP CI is facing the same issue, I think it can be fixed by this env variable too.

@TaoLv
Member

TaoLv commented Dec 9, 2019

@leezu, this is the same issue and the same fix as #14979. I'm closing this issue because:

  1. libiomp5.so has been removed from the pip releases of MXNet;
  2. The problem should have been fixed in a newer version of libiomp5.so.
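
For reference, a rough way to check whether an installed wheel still bundles libiomp5 (assuming mxnet.libinfo.find_lib_path() is available, as in the 1.x releases):

# Sketch: list the shared libraries shipped next to libmxnet.so and look for libiomp5.
import os
from mxnet.libinfo import find_lib_path

lib_dir = os.path.dirname(find_lib_path()[0])
print([name for name in os.listdir(lib_dir) if "iomp" in name])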

Feel free to reopen if you have any questions. Thanks!

TaoLv closed this as completed on Dec 9, 2019