
importing mxnet causing subprocess to crash #13875

Closed
dabraude opened this issue Jan 14, 2019 · 23 comments

@dabraude
Contributor

dabraude commented Jan 14, 2019

may be related to #13831

Description

importing mxnet causes OSErrors in subprocess

Environment info (Required)

Scientific Linux 7.5
Python 3.6.3
MXNet 1.5.0 (from packages)
(tried on multiple computers running different CUDA builds)

Error Message:

See the traceback under the minimum reproducible example below.

Minimum reproducible example

Using the following script (or just running the equivalent commands directly)

import mxnet
import subprocess

# Repeatedly spawn a subprocess; with mxnet imported this eventually fails
# with "OSError: [Errno 14] Bad address".
n = 0
while True:
    if not n % 1000:
        print("RUN", n)
    ret = subprocess.call(['ls', '/tmp'], stdout=subprocess.PIPE)
    n += 1

will eventually give this error message:

Traceback (most recent call last):
  File "subcrash.py", line 13, in <module>
    ret = subprocess.call(['ls', '/'], stdout=subprocess.PIPE)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 267, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 14] Bad address: 'ls'

It doesn't seem to matter which executable is called.

What have you tried to solve it?

Don't even know where to start.
If you try putting in a stack trace or running under pdb, it won't break.

@piyushghai
Contributor

@dabraude Thank you for submitting the issue! I'm labeling it so the MXNet community members can help resolve it.
@mxnet-label-bot add [Build]

I tried running the script you provided locally on my Mac (so essentially a non-CUDA build).
I did not face any crashes. My MXNet version is 1.5.0b20190112.

Can you try this on your machine and then run the script?
Run: pip install -U mxnet --pre

I'm trying to see if it's something to do with the CUDA builds of MXNet.
Also, what's the CUDA version that you tried it on?

@dabraude
Contributor Author

OK, we will try running that script and get back to you.

We are running CUDA 10.

@dabraude
Contributor Author

dabraude commented Jan 15, 2019

It still crashes with the --pre build.

We have found that it only happens with the MKL version:
mxnet-cu100mkl-1.5.0b20190115 - crashing (Intel or AMD CPU)
mxnet-cu100-1.5.0b20190115 - stable (up to ~4 million calls)

@piyushghai
Contributor

@mxnet-label-bot update [Build, MKL]

@azai91 @mseth10 Can we have a look at this crash?

marcoabreu added the MKL label on Jan 15, 2019
@aaronmarkham
Contributor

I'd like to note that the website CI pipeline has been intermittently failing with subprocess errors ever since the MKLDNN merge. This is when it started:
http://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/website/job/master/141/
It's really important that we have the website check in CI, but right now it is turned off because of the failures.

@pengzhao-intel
Contributor

@dabraude @aaronmarkham thanks for reporting the issues.

We will take a look at the potential issue. @ZhennanQin @TaoLv

@TaoLv
Member

TaoLv commented Jan 17, 2019

@aaronmarkham By "the MKLDNN merge" I guess you mean #13681, right? But I tried the script @dabraude shared in this issue, and it also crashes with mxnet-mkl==1.3.1.

@dabraude said this issue only happens with the MKL build, but the website build is not using MKL or MKL-DNN, so I'm afraid they are not the same issue.

BTW, @dabraude, have you ever tried it with Python 2?

@TaoLv
Member

TaoLv commented Jan 17, 2019

@dabraude Please try export KMP_INIT_AT_FORK=false before running your script. Let me know if it works for you. Thank you.
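
If exporting the variable from the shell is awkward, a minimal sketch of the same workaround is to set it from Python before importing mxnet (this assumes the OpenMP runtime only reads the variable when libiomp5 is loaded, which happens during import mxnet):

# Sketch of the workaround above: set KMP_INIT_AT_FORK before importing mxnet,
# so the OpenMP runtime (libiomp5) sees it when the library is loaded.
import os
os.environ["KMP_INIT_AT_FORK"] = "false"

import mxnet
import subprocess

# Same stress loop as the reproducer; with the variable set it should no longer
# raise "OSError: [Errno 14] Bad address".
for i in range(10000):
    subprocess.call(["ls", "/tmp"], stdout=subprocess.PIPE)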

@dabraude
Contributor Author

I can confirm it happens with Python 2 as well, and that
export KMP_INIT_AT_FORK=false seems to stop it, but with intermittent errors I can't be 100% sure it did.

@aaronmarkham
Contributor

aaronmarkham commented Jan 17, 2019

Researching the build logs from the first crash... I see that mkldnn is set to 0 in some of the earlier build routines, but when making the docs, mkldnn source files are still being compiled:

+ make docs SPHINXOPTS=-W
/work/mxnet /work/mxnet
make -C docs html
make[1]: Entering directory '/work/mxnet/docs'
export BUILD_VER=
Env var set for BUILD_VER: 
sphinx-build -b html -d _build/doctrees  -W . _build/html
Running Sphinx v1.5.6
making output directory...
Building version default
Document sets to generate:
scala_docs    : 1
java_docs     : 1
clojure_docs  : 1
doxygen_docs  : 1
r_docs        : 0
Building MXNet!
Building Doxygen!
Building Scala!
Building Scala Docs!
Building Java Docs!
Building Clojure Docs!
loading pickled environment... not yet created
make[2]: Entering directory '/work/mxnet'
g++ -std=c++11 -c -DMSHADOW_FORCE_STREAM -Wall -Wsign-compare -g -O0 -I/work/mxnet/3rdparty/mshadow/ -I/work/mxnet/3rdparty/dmlc-core/include -fPIC -I/work/mxnet/3rdparty/tvm/nnvm/include -I/work/mxnet/3rdparty/dlpack/include -I/work/mxnet/3rdparty/tvm/include -Iinclude -funroll-loops -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -msse3 -mf16c -DMSHADOW_USE_CUDA=0 -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -I/include -DMSHADOW_RABIT_PS=0 -DMSHADOW_DIST_PS=0 -DMSHADOW_USE_PASCAL=0 -DMXNET_USE_OPENCV=1 -I/usr/include/opencv -fopenmp -DMXNET_USE_OPERATOR_TUNING=1 -DMXNET_USE_LAPACK  -DMXNET_USE_NCCL=0 -DMXNET_USE_LIBJPEG_TURBO=0 -MMD -c \
src/operator/subgraph/mkldnn/mkldnn_conv_property.cc -o build/src/operator/subgraph/mkldnn/mkldnn_conv_property.o

Then further down I see more...

ar crv lib/libmxnet.a build/src/operator/subgraph/mkldnn/mkldnn_conv_property.o 
build/src/operator/subgraph/mkldnn/mkldnn_conv_post_quantize_property.o build/src/operator/subgraph/mkldnn/mkldnn_conv.o build/src/operator/nn/mkldnn/mkldnn_convolution.o build/src/operator/nn/mkldnn/mkldnn_concat.o build/src/operator/nn/mkldnn/mkldnn_base.o build/src/operator/nn/mkldnn/mkldnn_act.o build/src/operator/nn/mkldnn/mkldnn_softmax.o build/src/operator/nn/mkldnn/mkldnn_deconvolution.o build/src/operator/nn/mkldnn/mkldnn_copy.o 
...
a - build/src/operator/subgraph/mkldnn/mkldnn_conv_property.o
a - build/src/operator/subgraph/mkldnn/mkldnn_conv_post_quantize_property.o
a - build/src/operator/subgraph/mkldnn/mkldnn_conv.o
a - build/src/operator/nn/mkldnn/mkldnn_convolution.o
a - build/src/operator/nn/mkldnn/mkldnn_concat.o
...

So would this help reveal why the docs build is experiencing the same kind of crashing?

@TaoLv
Member

TaoLv commented Jan 23, 2019

@dabraude Can you confirm whether the issue is still there with the environment variable set?

@ZhennanQin
Contributor

@aaronmarkham Seeing the mkldnn files in the build log is expected even with USE_MKLDNN=0, because USE_MKLDNN is only used as a C macro inside those files rather than for Makefile source selection. In other words, USE_MKLDNN doesn't change which source files are collected for the build; it changes what the compiler sees inside the mkldnn files.
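
For reference, assuming MXNet 1.5+ where the mxnet.runtime feature-detection API is available, a quick sketch to check which features a given binary was actually built with:

# Sketch: query the compile-time feature flags of the installed MXNet binary.
# Assumes mxnet.runtime.Features() exists (MXNet 1.5 and later).
from mxnet.runtime import Features

features = Features()
print("MKLDNN enabled:", features.is_enabled("MKLDNN"))
print("CUDA enabled:", features.is_enabled("CUDA"))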

@dabraude
Contributor Author

@TaoLv It didn't crash when running overnight, so I assume it is working.

@leotac
Contributor

leotac commented Sep 23, 2019

Hi all,
this bug keeps biting us. It is easily reproducible, meaning that it occurs at a random point but with pretty high frequency, always within a few hundred attempts, though it is non-deterministic.
The code I'm using (essentially the same as above):

import mxnet
import subprocess

for i in range(1000):
    if not i%100: print(i)
    try:
        ret = subprocess.call(["ls","/tmp"], stdout=subprocess.PIPE)
    except Exception as e:
        print(i, e)
        exit()

and you always get a nice
OSError: [Errno 14] Bad address: 'ls'

I managed to isolate some requirements to recreate a conda environment where this issue occurs.
This is obtained with conda + pip as follows (I have conda 4.7.12):

conda create -n mxnet-test --file env_conda.txt
conda activate mxnet-test
#make sure we're using the pip in the env
echo $(which pip)
pip install -r env_pip.txt

where the content of the files is:

env_conda.txt:
mkl=2019.0=118
numpy=1.16.4=py36h99e49ec_0
env_pip.txt:
mxnet-cu80mkl==1.5.0

and this is the result of conda list -e

# platform: linux-64
_libgcc_mutex=0.1=main
blas=1.0=openblas
ca-certificates=2019.5.15=1
certifi=2019.9.11=py36_0
chardet=3.0.4=pypi_0
idna=2.8=pypi_0
intel-openmp=2019.4=243
libedit=3.1.20181209=hc058e9b_0
libffi=3.2.1=hd88cf55_4
libgcc-ng=9.1.0=hdf63c60_0
libgfortran-ng=7.3.0=hdf63c60_0
libopenblas=0.3.6=h5a2b251_1
libstdcxx-ng=9.1.0=hdf63c60_0
mkl=2019.0=118
mxnet-cu80mkl=1.5.0=pypi_0
ncurses=6.1=he6710b0_1
numpy=1.16.4=py36h99e49ec_0
numpy-base=1.16.4=py36h2f8d375_0
openssl=1.1.1d=h7b6447c_1
pip=19.2.2=py36_0
python=3.6.9=h265db76_0
python-graphviz=0.8.4=pypi_0
readline=7.0=h7b6447c_5
requests=2.22.0=pypi_0
setuptools=41.0.1=py36_0
sqlite=3.29.0=h7b6447c_0
tk=8.6.8=hbc83047_0
urllib3=1.25.5=pypi_0
wheel=0.33.4=py36_0
xz=5.2.4=h14c3975_4
zlib=1.2.11=h7b6447c_3

I was able to reproduce this issue both with and without GPU (mxnet-mkl) on Ubuntu 16.04, and also inside Docker containers.

Note: this is also non-deterministic at "build time": when creating environments with exactly the same requirements and exactly the same installed libraries, you can randomly end up with an environment where the issue does not occur.

@leotac
Contributor

leotac commented Sep 23, 2019

This does seem related to (or really, the same thing as) numpy/numpy#10060 and #12710,
and setting KMP_INIT_AT_FORK=FALSE as suggested there seems to fix the issue.
Not sure if something could be done by the libraries that use MKL, such as mxnet, to warn about this behavior and how to prevent it; something like the sketch below, for instance.
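
For illustration only, a hypothetical sketch of such a warning (this is not something mxnet currently does) could check the variable at import time:

# Hypothetical sketch: warn if KMP_INIT_AT_FORK is not set to "false".
import os
import warnings

if os.environ.get("KMP_INIT_AT_FORK", "").lower() != "false":
    warnings.warn(
        "MKL-enabled builds may crash forked subprocesses with "
        "'OSError: [Errno 14] Bad address'; consider setting "
        "KMP_INIT_AT_FORK=false before starting the process."
    )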

@TaoLv
Member

TaoLv commented Sep 24, 2019

@Sbebo I regard this issue as a defect of the OpenMP library, and that library will be excluded in the next minor release.

@leezu
Contributor

leezu commented Sep 24, 2019

@szha @eric-haibin-lin this bug report describes the cause of the OSErrors that happen from time to time on ci.mxnet.io (the CI used by gluon-nlp.mxnet.io, gluon-cv.mxnet.io, ...)

@leezu
Contributor

leezu commented Sep 24, 2019

Note that we didn't face the OSError-related crashes anymore after upgrading to Ubuntu 18.04 (more specifically, using the following Docker container: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/docker/Dockerfile)

@TaoLv
Member

TaoLv commented Sep 24, 2019

Hi @leezu, is it possible for you to try export KMP_INIT_AT_FORK=false in your CI environment?

@leotac
Contributor

leotac commented Sep 24, 2019

Note that we didn't face the OSError-related crashes anymore after upgrading to Ubuntu 18.04 (more specifically, using the following Docker container: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/docker/Dockerfile)

Oh, will try to move to Ubuntu 18.04 if possible. Thanks!

@leezu
Contributor

leezu commented Sep 24, 2019

@TaoLv the issue occurred only rarely (a few times a month) for us and has not occurred anymore in recent months. What would be the expectation of setting export KMP_INIT_AT_FORK=false? Should it fix the issue, or are you asking to confirm whether setting the env variable reintroduces the problem?

@TaoLv
Member

TaoLv commented Sep 26, 2019

The env variable fixed the problem reported in this issue. So if GluonNLP CI is facing the same issue, I think it can be fixed by this env variable too.

@TaoLv
Member

TaoLv commented Dec 9, 2019

@leezu, this is the same issue and the same fix as #14979. I'm closing this issue because:

  1. libiomp5.so has been removed from the pip releases of MXNet;
  2. The problem should have been fixed in a newer version of libiomp5.so.
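
For reference, a rough way to check whether an installed wheel still bundles libiomp5 (assuming mxnet.libinfo.find_lib_path() is available, as in the 1.x releases):

# Sketch: list the shared libraries shipped next to libmxnet.so and look for libiomp5.
import os
from mxnet.libinfo import find_lib_path

lib_dir = os.path.dirname(find_lib_path()[0])
print([name for name in os.listdir(lib_dir) if "iomp" in name])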

Feel free to reopen if you have any questions. Thanks!

TaoLv closed this as completed on Dec 9, 2019