SegFault while testing MXNet binaries for CUDA-11.0 using pytest #19360
Comments
* Remove duplicate setup and teardown functions

  faccd91 introduced automatic pytest hooks for handling MXNET_MODULE_SEED, adapted from dmlc/gluon-nlp@66e926a, but didn't remove the existing seed handling via explicit setup and teardown functions. This commit removes the explicit setup and teardown functions in favor of the automatic pytest version, and thereby ensures that the seed handling code is not executed twice. As a side benefit, seed handling now works correctly even if contributors forget to add the magic setup_module and teardown_module imports in new test files. If pytest is run with --capture=no (or the -s shorthand), output of the module-level fixtures is shown to the user.

* Fix locale setting
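The automatic per-module seed handling described in the commit message can be sketched with a module-scoped autouse fixture. This is an illustrative reconstruction, not MXNet's actual conftest code: the names `seed_module` and `module_seed` are hypothetical, and only the MXNET_MODULE_SEED environment variable comes from the source.

```python
# Hypothetical sketch of an automatic module-level seed fixture
# (illustrative names; not MXNet's actual conftest implementation).
import os
import random

import numpy as np
import pytest


def seed_module(seed=None):
    """Seed numpy and random; honor MXNET_MODULE_SEED if it is set."""
    if seed is None:
        seed = int(os.environ.get("MXNET_MODULE_SEED",
                                  random.randint(0, 2**31 - 1)))
    np.random.seed(seed)
    random.seed(seed)
    return seed


@pytest.fixture(autouse=True, scope="module")
def module_seed():
    # Runs once per test module, replacing explicit
    # setup_module/teardown_module functions in every test file.
    seed = seed_module()
    # Visible to the user when pytest runs with --capture=no / -s.
    print(f"Setting module random seeds to {seed}")
    yield seed
    # No teardown state to restore; the next module draws a fresh seed.
```

Because the fixture is `autouse`, test modules no longer need to import any magic setup/teardown helpers for the seeding to take effect.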
We recently saw this issue too and I am looking for a fix now. I do not believe it is CUDA 11 specific; rather, it is code layout/timing/environment specific - e.g. in our setup we did not see this issue on Ubuntu 18.04 but encounter it on 20.04.

The problem is that MXNet does not actually wait for the side thread to finish before program teardown. During the main thread's teardown, CUDA deinitializes itself. If the side thread is still running at this point and tries to destroy its mshadow stream, this calls into the already-deinitialized CUDA runtime.

I started looking at this yesterday - a brief look at the destructors seems to imply that the destruction order is the culprit.
Ok, so I think I understand this issue more - the problem is the destruction order of the engine during program exit.

The easiest workaround would be to just skip cleanup on a side thread - @szha @mseth10 @leezu, do you think that would be acceptable? Any other ideas?
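The proposed workaround - skip cleanup when not on the main thread - can be sketched as follows. This is a hedged illustration in Python (the real fix lives in MXNet's C++ engine teardown); `safe_teardown` is a hypothetical name, not an actual MXNet API.

```python
# Illustrative sketch of "skip cleanup on a side thread" (hypothetical
# helper; the actual workaround is in MXNet's C++ engine code).
import threading


def safe_teardown(cleanup):
    """Run cleanup only when invoked from the main thread.

    During process exit, CUDA may already be deinitialized on the main
    thread; destroying streams from a still-running side thread then
    crashes. Skipping the cleanup merely leaks resources that the OS
    reclaims at exit anyway.
    """
    if threading.current_thread() is threading.main_thread():
        cleanup()
        return True
    return False
```

The trade-off named in the thread applies: this avoids the crash but leaves the engine without a proper runtime destruction path.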
As a short-term solution it's ok. At some point, we may benefit from being able to destruct the engine properly at runtime rather than only at exit. For example, this could enable switching the engine at runtime. Thus, it would still be better if we had an actual solution for the destruction order.
Ok, I will then open a PR with the workaround, and let's open an issue for better handling of the destruction order of the engine.
Description
The nightly CD pipeline fails for CUDA 11.0 during testing of MXNet binaries using pytest. All tests run successfully; the error is thrown during cleanup after pytest is done running a testing module. This error was first recorded when commit 480d027 was merged, which dropped pytest's teardown function. Before this commit, the CD pipeline was running successfully for all flavors.

This error is specific to CUDA 11.0 and is not observed for CUDA 10.0 and 10.1, as can be seen here:
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1848/pipeline/361/
Error Message
Steps to reproduce
I was able to reproduce the error by following these steps on an AWS Ubuntu 18.04 Deep Learning Base AMI:
What have you tried to solve it?
I put a print statement before the waitall command to check whether it gets executed, and observed that it runs after the module ends, as expected.

I tried replacing mx.npx.waitall() with mx.nd.waitall(), but that doesn't solve this problem.

Environment
We recommend using our script for collecting the diagnostic information with the following command
curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3
Environment Information