
SegFault while testing MXNet binaries for CUDA-11.0 using pytest #19360

Closed
mseth10 opened this issue Oct 16, 2020 · 4 comments · Fixed by #19378

Comments


mseth10 commented Oct 16, 2020

Description

The nightly CD pipeline fails for CUDA 11.0 during testing of MXNet binaries using pytest. All tests run successfully; the error is thrown during cleanup, after pytest finishes running a test module. This error was first recorded when commit 480d027 was merged, which removed the explicit pytest setup/teardown functions. Before this commit, the CD pipeline was running successfully for all flavors.

This error is specific to CUDA 11.0 and is not observed for CUDA 10.0 and 10.1, as can be seen here:
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1848/pipeline/361/

Error Message

Stack trace:
  /usr/lib64/libcudnn_ops_infer.so.8 (                                           + 0x15cb68f)  [0x7f7f4ce3e68f]
  /usr/lib64/libcudnn_ops_infer.so.8 ( cudnnDestroy                              + 0x6f  )  [0x7f7f4ba78ddf]
  /work/mxnet/python/mxnet/../../lib/libmxnet.so ( mshadow::Stream<mshadow::gpu>::DestroyDnnHandle()  + 0x2c  )  [0x7f81869a29ec]
  /work/mxnet/python/mxnet/../../lib/libmxnet.so ( void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*)  + 0x13b )  [0x7f81869a2c3b]
  /work/mxnet/python/mxnet/../../lib/libmxnet.so ( void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)  + 0x1bb )  [0x7f81869b83ab]
  /work/mxnet/python/mxnet/../../lib/libmxnet.so ( std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)  + 0x36  )  [0x7f81869b86f6]
  /work/mxnet/python/mxnet/../../lib/libmxnet.so ( std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run()  + 0x32  )  [0x7f81869b3db2]
/work/runtime_functions.sh: line 747:     6 Segmentation fault      (core dumped) pytest -m 'serial' -s --durations=50 --verbose tests/python/gpu/test_gluon_gpu.py
2020-10-16 07:44:31,682 - root - INFO - Waiting for status of container a8b282e29adf for 600 s.
2020-10-16 07:44:31,853 - root - INFO - Container exit status: {'Error': None, 'StatusCode': 139}
2020-10-16 07:44:31,854 - root - ERROR - Container exited with an error 😞
2020-10-16 07:44:31,854 - root - INFO - Executed command for reproduction:

ci/build.py -e BRANCH=null --docker-registry mxnetci --nvidiadocker --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh cd_unittest_ubuntu cu110

Steps to reproduce

I was able to reproduce the error by following these steps on an AWS Ubuntu 18.04 Deep Learning Base AMI:

alias python=python3

git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet
pip3 install -r ci/requirements.txt --user

sudo curl -L "https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose

python ci/build.py -e BRANCH=null --docker-registry mxnetci --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_static_libmxnet cu110

python ci/build.py -e BRANCH=null --docker-registry mxnetci --nvidiadocker --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh cd_unittest_ubuntu cu110

What have you tried to solve it?

  1. The above script takes a long time to run because it executes a lot of tests. I reduced the reproduction time by reducing the number of tests. Here's the code diff:
diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
index 40405b961..6992caa36 100755
--- a/ci/docker/runtime_functions.sh
+++ b/ci/docker/runtime_functions.sh
@@ -756,7 +756,9 @@ cd_unittest_ubuntu() {
     export DMLC_LOG_STACK_TRACE_DEPTH=10
 
     local mxnet_variant=${1:?"This function requires a mxnet variant as the first argument"}
+    pytest -m 'serial' -s --durations=50 --verbose tests/python/gpu/test_gluon_gpu.py
 
+    : '
     OMP_NUM_THREADS=$(expr $(nproc) / 4) pytest -m 'not serial' -n 4 --durations=50 --verbose tests/python/unittest
     pytest -m 'serial' --durations=50 --verbose tests/python/unittest
 
@@ -782,6 +784,7 @@ cd_unittest_ubuntu() {
     if [[ ${mxnet_variant} = *mkl ]]; then
         OMP_NUM_THREADS=$(expr $(nproc) / 4) pytest -n 4 --durations=50 --verbose tests/python/mkl
     fi
+    '
 }
  2. I put a print statement before the waitall call to check whether it gets executed, and observed that it runs after the module ends, as expected.

  3. I tried replacing mx.npx.waitall() with mx.nd.waitall(), but that doesn't solve the problem.

Environment

We recommend using our script for collecting the diagnostic information with the following command
curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3

Environment Information
----------Python Info----------
Version      : 3.6.9
Compiler     : GCC 8.4.0
Build        : ('default', 'Oct  8 2020 12:12:24')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 20.2.3
Directory    : /usr/local/lib/python3.6/dist-packages/pip
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
Platform     : Linux-5.4.0-1028-aws-x86_64-with-Ubuntu-18.04-bionic
system       : Linux
node         : ip-172-31-5-167
release      : 5.4.0-1028-aws
version      : #29~18.04.1-Ubuntu SMP Tue Oct 6 17:14:23 UTC 2020
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:            7
CPU MHz:             3109.947
BogoMIPS:            5000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-15,32-47
NUMA node1 CPU(s):   16-31,48-63
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0088 sec, LOAD: 0.6286 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1193 sec, LOAD: 0.1101 sec.
Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>, DNS finished in 0.055264949798583984 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0012 sec, LOAD: 0.1100 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0014 sec, LOAD: 0.3008 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.0010542869567871094 sec.
----------Environment----------
@leezu @TristonC
mseth10 referenced this issue Oct 16, 2020
* Remove duplicate setup and teardown functions

faccd91 introduced automatic pytest hooks for
handling MXNET_MODULE_SEED adapted from
dmlc/gluon-nlp@66e926a
but didn't remove the existing seed handling via explicit setup and teardown
functions.

This commit removes the explicit setup and teardown functions in favor of the
automatic pytest version, and thereby also ensures that the seed handling code
is not executed twice. As a side benefit, seed handling now works correctly even
if contributors forget to add the magic setup_module and teardown_module
imports in new test files.

If pytest is run with --capture=no (or the -s shorthand), the output of the module-level
fixtures is shown to the user.

* Fix locale setting

ptrendx commented Oct 16, 2020

We recently saw this issue too and I am looking for a fix now. I do not believe it is CUDA 11 specific; rather, it is code layout/timing/environment specific - e.g. in our setup we did not see this issue on Ubuntu 18.04 but encountered it on 20.04. The problem is that MXNet does not actually wait for the side thread to finish before program teardown. During the main thread's teardown, CUDA deinitializes itself. If the side thread is still running at this point and tries to destroy its mshadow stream, it calls cudnnDestroy on the cuDNN handle, which internally calls cudaStreamDestroy on cuDNN's internal CUDA streams (CUDA is statically linked into cuDNN, which is why the segfault appears to come from libcudnn_ops_infer.so.8). When this call happens after CUDA deinitialization, the crash occurs.
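
For illustration, here is a minimal standalone C++ sketch of that failure mode, using made-up names (Runtime, DestroyHandle) rather than MXNet code: a global standing in for the CUDA/cuDNN runtime is torn down at process exit while a detached side thread may still try to destroy its handle. The sketch deliberately contains the bug (the detached thread can touch the global after its destructor has run), so it illustrates the pattern rather than serving as a reliable reproducer:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Runtime {                      // stands in for the CUDA/cuDNN runtime
  std::atomic<bool> alive{true};
  void DestroyHandle() {
    if (!alive.load()) {
      std::puts("handle destroyed after runtime deinit -> segfault in the real case");
    }
  }
  ~Runtime() { alive.store(false); }  // runs during static/program teardown
};

Runtime runtime;                      // global, destroyed at process exit

int main() {
  std::thread side_thread([] {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    runtime.DestroyHandle();          // cleanup path of the worker thread
  });
  side_thread.detach();               // main never waits for the side thread,
  return 0;                           // so program teardown can run first
}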

I started looking at this yesterday - a brief look at the destructors seems to imply that join should actually be called on the side threads, so I am not yet sure why this does not do the right thing. If anyone has more experience with the internals of ThreadedEnginePerDevice, I would be happy to leave this issue to them, but I will keep poking at it in the meantime.


ptrendx commented Oct 16, 2020

Ok, so I think I understand this issue better now - the problem is that the shared_ptr to the engine is a static variable here: https://github.com/apache/incubator-mxnet/blob/master/src/engine/engine.cc#L62, so the destruction timing of the engine itself is not specified (it depends on the order of binaries in the linked executable). This makes it possible for CUDA deinitialization to happen before or after the destruction of the engine. If it happens after, everything is OK, because as part of its destruction the engine joins the side threads. However, if the CUDA deinit happens before, then the side thread doing the cleanup triggers the segfault.
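
For illustration, a small sketch of that pattern with hypothetical names (the real code lives in src/engine/engine.cc): the engine is owned by a static shared_ptr whose destructor joins the worker threads, which only helps if it runs before the CUDA runtime's own static teardown - and that relative order is unspecified:

#include <memory>
#include <thread>
#include <vector>

struct Engine {
  std::vector<std::thread> workers;
  ~Engine() {
    // Joining here lets the side threads finish their GPU cleanup, but only
    // if this destructor runs *before* CUDA deinitializes. The relative order
    // of static destructors across the linked binaries is unspecified.
    for (auto& t : workers) t.join();
  }
};

std::shared_ptr<Engine>& GetEngine() {
  // Static shared_ptr, destroyed during static deinitialization at exit.
  static std::shared_ptr<Engine> engine = std::make_shared<Engine>();
  return engine;
}

int main() {
  GetEngine();  // engine is created; its static shared_ptr is destroyed at exit
  return 0;
}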

The easiest workaround would be to just skip cleanup on a side thread - @szha @mseth10 @leezu do you think that would be acceptable? Any other ideas?
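
For illustration only, one possible shape of the "skip cleanup on a side thread" workaround, with hypothetical names - this is not necessarily what PR #19378 implements. The idea is to leak the cuDNN handle instead of destroying it when the cleanup runs on a side thread during shutdown, and let the OS reclaim it at process exit:

struct GpuStreamCleanup {
  void* dnn_handle = nullptr;   // stands in for cudnnHandle_t

  void Destroy(bool on_side_thread_during_shutdown) {
    if (on_side_thread_during_shutdown) {
      // Skip cudnnDestroy here: CUDA may already be deinitialized, and calling
      // into it would crash. Leaking the handle is safe because the OS reclaims
      // all process/GPU resources at exit.
      dnn_handle = nullptr;
      return;
    }
    // Normal path (engine destroyed explicitly before program exit):
    // cudnnDestroy(static_cast<cudnnHandle_t>(dnn_handle));
    dnn_handle = nullptr;
  }
};

int main() {
  GpuStreamCleanup s;
  s.Destroy(/*on_side_thread_during_shutdown=*/true);  // shutdown path: skip real destroy
  return 0;
}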


szha commented Oct 19, 2020

As a short-term solution it's ok. At some point, we may benefit from being able to destroy the engine properly at runtime rather than only at exit; for example, this could enable switching the engine at runtime. Thus, it would still be better if we had an actual solution for the destruction order.


ptrendx commented Oct 19, 2020

Ok, I will then open a PR with the workaround, and let's open an issue for better handling of the engine's destruction order.
