importing mxnet causing subprocess to crash #13875
@dabraude Thank you for submitting the issue! I'm labeling it so the MXNet community members can help resolve it. I tried running the script you provided locally on my Mac (so, essentially a non-CUDA build). Can you try this on your machine and then run the script? I'm trying to see if it's something to do with the CUDA builds of MXNet.
OK, we will try running that script and get back to you. We are running CUDA 10.
It still crashes with the --pre build. We have found that it only happens with the MKL version:
@mxnet-label-bot update [Build, MKL]
I'd like to note that the website CI pipeline has been intermittently failing with subprocess errors ever since the MKLDNN merge. This is when it started:
@dabraude @aaronmarkham thanks for reporting the issues. We will take a look at the potential issue. @ZhennanQin @TaoLv
@aaronmarkham By "the MKLDNN merge" I guess you mean #13681, right? But I tried the script @dabraude shared in this issue, and it also crashes with mxnet-mkl==1.3.1. @dabraude said this issue only happens with the MKL build, but the website build is not using MKL or MKL-DNN, so I'm afraid they are not the same issue. BTW, @dabraude, have you ever tried it with python2?
@dabraude Please try
I can confirm it happens with python2 and that
Researching the build logs from the first crash... I see that mkldnn is set to 0 in some of the earlier build routines, but when making the docs, mkldnn files are being built by mshadow:
Then further down I see more...
So would this help reveal why the docs build is experiencing the same kind of crashing?
@dabraude Can you confirm whether the issue is still there with the environment variable set?
@aaronmarkham The log showing mkldnn files being built is expected even with USE_MKLDNN=0, because
@TaoLv It didn't crash when running overnight, so I assume it is working.
Hi all, with

```python
import mxnet
import subprocess

for i in range(1000):
    if not i % 100:
        print(i)
    try:
        ret = subprocess.call(["ls", "/tmp"], stdout=subprocess.PIPE)
    except Exception as e:
        print(i, e)
        exit()
```

you always get a nice

I managed to isolate some requirements to recreate a conda environment where this issue occurs:

```shell
conda create -n mxnet-test --file env_conda.txt
conda activate mxnet-test
# make sure we're using the pip in the env
echo $(which pip)
pip install -r env_pip.txt
```

where the content of the files is:

env_conda.txt:

```
mkl=2019.0=118
numpy=1.16.4=py36h99e49ec_0
```

env_pip.txt:

```
mxnet-cu80mkl==1.5.0
```

and this is the result of

I was able to reproduce this issue both with and without gpu (mxnet-mkl), on Ubuntu 16.04, and also inside Docker containers. Note: this is non-deterministic also at "build time", in the sense that, creating environments with exactly the same requirements and exactly the same installed libraries, you can randomly end up with an environment where the issue does not occur.
This does seem related to (or really, the same thing as) numpy/numpy#10060 and #12710.
@Sbebo I take this issue as a defect of the openmp library, and the library will be excluded in the next minor release.
@szha @eric-haibin-lin this bug report describes the cause of the
Note that we didn't face the OSError-related crashes anymore after upgrading to Ubuntu 18.04 (more specifically, using the following Docker container: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/docker/Dockerfile).
Hi @leezu, is it possible for you to try
Oh, will try to move to Ubuntu 18.04 if possible. Thanks!
@TaoLv the issue occurred only rarely (a few times a month) for us and has not occurred anymore during the recent months. What would be the expectation of setting
The env variable fixed the problem reported in this issue. So if the GluonNLP CI is facing the same issue, I think it can be fixed by this env variable too.
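The name of the environment variable did not survive in this copy of the thread. For Intel OpenMP (libiomp5) fork-related crashes of this kind, the knob usually discussed is KMP_INIT_AT_FORK; whether that is the exact variable suggested here is an assumption, not something the visible text confirms. A minimal sketch of setting it before the interpreter starts:

```shell
# Assumption: KMP_INIT_AT_FORK is the variable the thread refers to. It is a
# real Intel OpenMP (libiomp5) setting, but its name is truncated out of this
# copy of the thread. It must be exported before libiomp5 is loaded, i.e.
# before `import mxnet` runs in the child process.
export KMP_INIT_AT_FORK=FALSE

# Verify the variable is visible to the Python interpreter.
python3 -c "import os; print(os.environ['KMP_INIT_AT_FORK'])"
```

Because the variable is read when the OpenMP runtime initializes, setting it from inside an already-running Python process is too late; it has to come from the launching shell or the CI job's environment.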
may be related to #13831
Description
Importing mxnet causes OSErrors in subprocess calls.
Environment info (Required)
Scientific Linux 7.5
Python 3.6.3
MXnet 1.5.0 (from packages)
(tried on multiple computers running different cuda builds)
Error Message:
Minimum reproducible example
Using the following script (or just using the appropriate commands)
will eventually give this error message:
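The script body itself was lost in this copy of the issue. Judging from the traceback below (subcrash.py, calling `ls` on `/`) and the nearly identical script shared elsewhere in the thread, it was presumably close to the following sketch; the guarded import is an addition here so the sketch also runs on machines without mxnet installed:

```python
import subprocess

# The bare `import mxnet` is what triggers the bug; the guard below is our
# addition (not in the original report) so the sketch runs where mxnet is
# not installed.
try:
    import mxnet  # noqa: F401
except ImportError:
    pass

for i in range(1000):
    if not i % 100:
        print(i)  # progress marker every 100 iterations
    try:
        ret = subprocess.call(["ls", "/"], stdout=subprocess.PIPE)
    except OSError as e:
        # On an affected build this eventually raises
        # "OSError: [Errno 14] Bad address".
        print(i, e)
        break
```

On an unaffected machine the loop runs to completion; on an affected MKL build it eventually raises the OSError shown below.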
```
Traceback (most recent call last):
  File "subcrash.py", line 13, in <module>
    ret = subprocess.call(['ls', '/'], stdout=subprocess.PIPE)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 267, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 14] Bad address: 'ls'
```
Doesn't seem to matter which executable.
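For reference, errno 14 has a standard POSIX meaning that can be decoded from Python; this is general errno behavior, not anything MXNet-specific:

```python
import errno
import os

# Errno 14 is EFAULT: a pointer handed to a system call referenced memory the
# kernel could not access. The thread attributes this to the bundled OpenMP
# runtime leaving broken state in the fork()ed child, so the exec arguments
# look like a bad address to the kernel.
print(errno.EFAULT)               # 14
print(os.strerror(errno.EFAULT))  # Bad address
```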
What have you tried to solve it?
Don't even know where to start.
If you try putting in a stack trace or pdb, it won't break.