This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

GPU tests are unstable #12453

Open
lebeg opened this issue Sep 4, 2018 · 9 comments

Comments

@lebeg
Contributor

lebeg commented Sep 4, 2018

Description

Multiple CI jobs were failing with CUDA memory problems:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10921/23/pipeline/

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1550/pipeline/

Message

Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered

Log with context

test_operator_gpu.test_countsketch ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=104987558 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_basic ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2134146737 to reproduce.
ERROR
test_operator_gpu.test_exc_multiple_waits ... ok
test_operator_gpu.test_lstm_bidirectional ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=200476953 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_setitem ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2082345391 to reproduce.
ERROR
test_operator_gpu.test_exc_post_fail ... ok
test_operator_gpu.test_gru_sym ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1532640391 to reproduce.
ERROR
test_operator_gpu.test_exc_mutable_var_fail ... ok
test_operator_gpu.test_sparse_nd_slice ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1828661033 to reproduce.
ERROR
test_operator_gpu.test_ndarray_elementwise ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1460065938 to reproduce.
ERROR
test_operator_gpu.test_gru_bidirectional ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=16762643 to reproduce.
ERROR
test_operator_gpu.test_ndarray_elementwisesum ... [06:59:47] src/operator/tensor/./.././../common/../operator/mxnet_op.h:622: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 639:     8 Aborted                 (core dumped) nosetests-2.7 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
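
The per-test seeds in the log can be used to try to reproduce a single failing test locally. A minimal sketch, assuming a GPU machine with the repository checked out and nose installed; the seed value is taken from the test_countsketch line above, and the test selector path is an assumption about the repo layout:

```python
# Reproduction sketch (not the project's official workflow): re-run one
# failing test with the seed printed in the CI log above. The nose
# selector syntax and the repo-relative path are assumptions.
import os
import subprocess

env = dict(os.environ, MXNET_TEST_SEED="104987558")  # seed for test_countsketch
subprocess.run(
    ["nosetests", "--verbose",
     "tests/python/gpu/test_operator_gpu.py:test_countsketch"],
    env=env,
    check=False,
)
```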

@vrakesh
Contributor

vrakesh commented Sep 4, 2018

@lebeg Thanks for reporting this
@mxnet-label-bot [Build, Breaking, Test]

@larroy
Contributor

larroy commented Oct 16, 2018

This is failing again on a p3.2xlarge GPU instance.

time ci/build.py --docker-registry mxnetci --platform ubuntu_build_cuda --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_ubuntu_gpu_mkldnn && time ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_ubuntu_python3_gpu

ERROR
test_operator_gpu.test_ndarray_equal ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1470664036 to reproduce.
ERROR
test_operator_gpu.test_size_array ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1016858059 to reproduce.
ERROR
test invalid sparse operator will throw a exception ... ok
test_operator_gpu.test_ndarray_not_equal ... ok
test_operator_gpu.test_nadam ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=440311246 to reproduce.
ERROR
test check_format for sparse ndarray ... [13:03:09] src/operator/tensor/./.././../common/../operator/mxnet_op.h:649: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:too many resources requested for launch
/work/runtime_functions.sh: line 722: 8 Aborted (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
build.py: 2018-10-16 13:03:10,500 Waiting for status of container dd18847ed3fd for 600 s.
build.py: 2018-10-16 13:03:10,644 Container exit status: {'StatusCode': 134, 'Error': None}
build.py: 2018-10-16 13:03:10,644 Stopping container: dd18847ed3fd
build.py: 2018-10-16 13:03:10,646 Removing container: dd18847ed3fd
build.py: 2018-10-16 13:03:10,716 Execution of ['/work/runtime_functions.sh', 'unittest_ubuntu_python3_gpu'] failed with status: 134

@ChaiBapchya
Contributor

CI failed with a similar error:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12749/67/pipeline/1111

FAIL
test_operator_gpu.test_ndarray_lesser ... [08:27:30] src/operator/tensor/./.././../common/../operator/mxnet_op.h:649: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 718:     8 Aborted                 (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu

@larroy
Contributor

larroy commented Dec 7, 2018

Can we close this for now?

@lebeg
Contributor Author

lebeg commented Dec 10, 2018

It seems it is still failing from time to time, right?

@larroy
Contributor

larroy commented Aug 8, 2019

Can we close this? @szha

@jzhou316

jzhou316 commented Aug 9, 2019

I had the same problem in some of my NMT experiments running on multiple GPUs on p3.2xlarge. The code ran sometimes but failed at other times, and the error was not consistent in where it occurred or what message it displayed. I tested every part of my code without finding any problems. It could still be my fault, but is it possible that the issue is with MXNet?

Some of the error messages:

[18:03:47] src/operator/tensor/./.././../common/../operator/mxnet_op.h:680: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
[18:15:03] src/resource.cc:313: Ignore CUDA Error [18:15:03] src/common/random_generator.cu:70: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
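
Since MXNet executes operators asynchronously, a CUDA illegal memory access is often reported at a later, unrelated call site, which may explain why the failure location looks inconsistent. A rough debugging sketch, not a fix: the environment variable and calls below are standard MXNet knobs, but whether they isolate this particular crash is an assumption.

```python
# Debugging sketch: run the engine synchronously so CUDA errors surface
# closer to the operator that triggered them, then synchronize explicitly.
# This only narrows down the failing operator; it is not a fix.
import os

# Must be set before mxnet is imported to take effect.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"

import mxnet as mx

ctx = mx.gpu(0)
x = mx.nd.random.uniform(shape=(1024, 1024), ctx=ctx)  # stand-in for the NMT workload
y = mx.nd.dot(x, x)
mx.nd.waitall()            # block until all pending GPU work has completed
print(y.asnumpy().sum())   # any deferred CUDA error is raised by now at the latest
```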

@larroy
Contributor

larroy commented Aug 9, 2019

@jzhou316 thanks for pointing this out. Could you give more info about the environment in which this happened? Is it running on EC2? How difficult do you think it is to reproduce? Is there a way to reproduce it every time?

Thanks.
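
For reference, a minimal sketch of the kind of environment report that would help triage; mx.context.num_gpus() is an assumption about the installed MXNet version:

```python
# Hypothetical environment report to attach when reporting the crash.
import platform
import mxnet as mx

print("mxnet version :", mx.__version__)
print("python version:", platform.python_version())
print("visible GPUs  :", mx.context.num_gpus())  # assumed available in this MXNet build
```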
