This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

GPU tests are unstable #12453

Open
lebeg opened this issue Sep 4, 2018 · 9 comments

Comments

@lebeg
Contributor

lebeg commented Sep 4, 2018

Description

Multiple CI jobs were failing with CUDA memory problems:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10921/23/pipeline/

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1550/pipeline/

Message

Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered

Log with context

test_operator_gpu.test_countsketch ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=104987558 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_basic ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2134146737 to reproduce.
ERROR
test_operator_gpu.test_exc_multiple_waits ... ok
test_operator_gpu.test_lstm_bidirectional ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=200476953 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_setitem ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2082345391 to reproduce.
ERROR
test_operator_gpu.test_exc_post_fail ... ok
test_operator_gpu.test_gru_sym ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1532640391 to reproduce.
ERROR
test_operator_gpu.test_exc_mutable_var_fail ... ok
test_operator_gpu.test_sparse_nd_slice ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1828661033 to reproduce.
ERROR
test_operator_gpu.test_ndarray_elementwise ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1460065938 to reproduce.
ERROR
test_operator_gpu.test_gru_bidirectional ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=16762643 to reproduce.
ERROR
test_operator_gpu.test_ndarray_elementwisesum ... [06:59:47] src/operator/tensor/./.././../common/../operator/mxnet_op.h:622: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 639:     8 Aborted                 (core dumped) nosetests-2.7 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
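
The per-test seeds in the log can be used to try to reproduce a single failing test locally. A minimal sketch, assuming a GPU machine with the repository checked out and nose installed; the seed value is taken from the test_countsketch line above, and the test selector path is an assumption about the repo layout:

```python
# Reproduction sketch (not the project's official workflow): re-run one
# failing test with the seed printed in the CI log above. The nose
# selector syntax and the repo-relative path are assumptions.
import os
import subprocess

env = dict(os.environ, MXNET_TEST_SEED="104987558")  # seed for test_countsketch
subprocess.run(
    ["nosetests", "--verbose",
     "tests/python/gpu/test_operator_gpu.py:test_countsketch"],
    env=env,
    check=False,
)
```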

@vrakesh
Contributor

vrakesh commented Sep 4, 2018

@lebeg Thanks for reporting this
@mxnet-label-bot [Build, Breaking, Test]

@larroy
Contributor

larroy commented Oct 16, 2018

This is failing again on a p3.2xlarge GPU instance.

time ci/build.py --docker-registry mxnetci --platform ubuntu_build_cuda --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_ubuntu_gpu_mkldnn && time ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_ubuntu_python3_gpu

ERROR
test_operator_gpu.test_ndarray_equal ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1470664036 to reproduce.
ERROR
test_operator_gpu.test_size_array ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1016858059 to reproduce.
ERROR
test invalid sparse operator will throw a exception ... ok
test_operator_gpu.test_ndarray_not_equal ... ok
test_operator_gpu.test_nadam ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=440311246 to reproduce.
ERROR
test check_format for sparse ndarray ... [13:03:09] src/operator/tensor/./.././../common/../operator/mxnet_op.h:649: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:too many resources requested for launch
/work/runtime_functions.sh: line 722: 8 Aborted (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
build.py: 2018-10-16 13:03:10,500 Waiting for status of container dd18847ed3fd for 600 s.
build.py: 2018-10-16 13:03:10,644 Container exit status: {'StatusCode': 134, 'Error': None}
build.py: 2018-10-16 13:03:10,644 Stopping container: dd18847ed3fd
build.py: 2018-10-16 13:03:10,646 Removing container: dd18847ed3fd
build.py: 2018-10-16 13:03:10,716 Execution of ['/work/runtime_functions.sh', 'unittest_ubuntu_python3_gpu'] failed with status: 134

@ChaiBapchya
Contributor

CI failed with a similar error:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12749/67/pipeline/1111

FAIL
test_operator_gpu.test_ndarray_lesser ... [08:27:30] src/operator/tensor/./.././../common/../operator/mxnet_op.h:649: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 718:     8 Aborted                 (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu

@larroy
Contributor

larroy commented Dec 7, 2018

Can we close this for now?

@lebeg
Contributor Author

lebeg commented Dec 10, 2018

It seems it is still failing from time to time, right?

@larroy
Contributor

larroy commented Aug 8, 2019

Can we close this? @szha

@jzhou316

jzhou316 commented Aug 9, 2019

I had the same problem in some of my NMT experiments running on multiple GPUs on p3.2xlarge. The code ran sometimes but failed at other times, and the error was not consistent in where it occurred or what message it displayed. I tested every part of my code without finding any problems. It could still be my fault, but is it possible that the issue is with MXNet?

Some of the error messages:

[18:03:47] src/operator/tensor/./.././../common/../operator/mxnet_op.h:680: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
[18:15:03] src/resource.cc:313: Ignore CUDA Error [18:15:03] src/common/random_generator.cu:70: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
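
Since MXNet executes operators asynchronously, a CUDA illegal memory access is often reported at a later, unrelated call site, which may explain why the failure location looks inconsistent. A rough debugging sketch, not a fix: the environment variable and calls below are standard MXNet knobs, but whether they isolate this particular crash is an assumption.

```python
# Debugging sketch: run the engine synchronously so CUDA errors surface
# closer to the operator that triggered them, then synchronize explicitly.
# This only narrows down the failing operator; it is not a fix.
import os

# Must be set before mxnet is imported to take effect.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"

import mxnet as mx

ctx = mx.gpu(0)
x = mx.nd.random.uniform(shape=(1024, 1024), ctx=ctx)  # stand-in for the NMT workload
y = mx.nd.dot(x, x)
mx.nd.waitall()            # block until all pending GPU work has completed
print(y.asnumpy().sum())   # any deferred CUDA error is raised by now at the latest
```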

@larroy
Contributor

larroy commented Aug 9, 2019

@jzhou316 thanks for pointing this out. Could you give more info about the environment in which this happened? Is it running on EC2? How difficult do you think it is to reproduce? Is there a way to reproduce it every time?

Thanks.
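
For reference, a minimal sketch of the kind of environment report that would help triage; mx.context.num_gpus() is an assumption about the installed MXNet version:

```python
# Hypothetical environment report to attach when reporting the crash.
import platform
import mxnet as mx

print("mxnet version :", mx.__version__)
print("python version:", platform.python_version())
print("visible GPUs  :", mx.context.num_gpus())  # assumed available in this MXNet build
```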
