This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Flaky test: test_operator_gpu.test_sequence_last causes 'CUDA: unspecified launch failure' #11395

Open
marcoabreu opened this issue Jun 25, 2018 · 21 comments

Comments

@marcoabreu
Contributor

Sometimes our slaves get corrupted and suddenly all tests start to fail. This is not directly related to the tests themselves.

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11377/5/pipeline/

======================================================================
ERROR: test_operator_gpu.test_op_roi_align
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\envs\py2\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Anaconda3\envs\py2\lib\site-packages\nose\util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\test_operator.py", line 6269, in test_op_roi_align
    test_roi_align_value()
  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\test_operator.py", line 6230, in test_roi_align_value
    data = mx.nd.array(np.arange(N*C*W*H).reshape((N,C,H,W)), ctx=ctx, dtype = dtype)
  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\utils.py", line 146, in array
    return _array(source_array, ctx=ctx, dtype=dtype)
  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 2357, in array
    arr[:] = source_array
  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 444, in __setitem__
    self._set_nd_basic_indexing(key, value)
  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 710, in _set_nd_basic_indexing
    self._sync_copyfrom(value)
  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 876, in _sync_copyfrom
    ctypes.c_size_t(source_array.size)))
  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\base.py", line 210, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [06:35:08] c:\jenkins_slave\workspace\build-gpu\3rdparty\mshadow\mshadow\./tensor_gpu-inl.h:69: Check failed: e == cudaSuccess CUDA: unspecified launch failure

-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1046236735 to reproduce.
--------------------- >> end captured logging << ---------------------
@marcoabreu
Contributor Author

@DickJC123

@DickJC123
Contributor

In a private communication, you indicated this was seen on all platforms. Here you tag it as 'Windows'. Please clarify.

@marcoabreu
Contributor Author

Sorry Dick, I just double-checked my database and it seems to only happen on Windows. It looks like I mixed something up; please excuse me for that.

Config: Windows Server 2016, G3.8xlarge, CUDA8, unknown driver version

@anirudhacharya
Member

@marcoabreu
Contributor Author

This seems to be caused by test_conv:
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/master/runs/1050/nodes/752/steps/1370/log/?start=0

test_operator_gpu.test_ndarray_crop ... ok
test_operator_gpu.test_cell_fill_shape ... ok
test_operator_gpu.test_conv ... [00:06:32] c:\jenkins_slave\workspace\build-gpu\src\operator\nn\cudnn\./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:148: This convolution is not supported by cudnn, MXNET convolution is applied.
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:227: This convolution is not supported by cudnn, MXNET convolution is applied.
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:148: This convolution is not supported by cudnn, MXNET convolution is applied.
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:227: This convolution is not supported by cudnn, MXNET convolution is applied.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=448545343 to reproduce.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1330733257 to reproduce.
ERROR
test_operator_gpu.test_layer_fill_shape ... ERROR
test_operator_gpu.test_normal_generator ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1592512793 to reproduce.
ERROR
test_operator_gpu.test_ndarray_concatenate ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=466881030 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_transpose ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=221598447 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_storage_fallback ... [00:06:32] c:\jenkins_slave\workspace\build-gpu\src\operator\../common/utils.h:417: 
Storage type fallback detected:
operator = broadcast_add
input storage types = [default, default, ]
output storage types = [csr, ]
params = {}
context.dev_mask = gpu
The operator with default storage type will be dispatched for execution. You're seeing this warning message because the operator above is unable to process the given ndarrays with specified storage types, context and parameter. Temporary dense ndarrays are generated in order to execute the operator. This does not affect the correctness of the programme. You can set environment variable MXNET_STORAGE_FALLBACK_LOG_VERBOSE to 0 to suppress this warning.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1026960206 to reproduce.
ERROR
test_operator_gpu.test_clip ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1647052128 to reproduce.
ERROR
test_operator_gpu.test_convolution_with_type ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1234 to reproduce.
ERROR

@marcoabreu marcoabreu changed the title Check failed: e == cudaSuccess CUDA: unspecified launch failure Flaky test: test_operator_gpu.test_conv causes 'CUDA: unspecified launch failure' Jun 27, 2018
@DickJC123
Contributor

DickJC123 commented Jun 28, 2018

I have a lead on the problem. There is an out-of-bounds read performed by the SequenceLastKernel. I'll stop here and let the person responsible for this kernel correct the problem. Kernels that read beyond their valid input tensor regions can be problematic, even if the random data read is never used in a subsequent kernel write. The problem surfaces when the reads are outside of valid mapped address ranges, which results in an unserviceable TLB miss. The problems can be non-deterministic since the input tensors may have non-deterministic placement within their mapped pages.
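
To make the failure mode concrete, here is a minimal, illustrative CUDA sketch. It is not the actual MXNet SequenceLastKernel; the kernel names, the [seq_len, batch, feat] layout, and the lengths handling are all invented for this example. It shows a "take the last valid timestep" gather whose index arithmetic can read past the end of its input buffer, plus a bounds-clamped variant. Run under cuda-memcheck, the unguarded kernel is flagged as an invalid global read, the same class of error as in the output below.

// Illustrative only: a "take the last valid timestep" gather in the style of
// a SequenceLast operator. This is NOT the actual MXNet SequenceLastKernel;
// every name and the memory layout here are hypothetical.
#include <cstdio>
#include <cuda_runtime.h>

// in      : [seq_len, batch, feat], flattened row-major
// lengths : per-batch valid sequence lengths (1-based), stored as float
// out     : [batch, feat]
__global__ void sequence_last_unsafe(const float* in, float* out,
                                     const float* lengths,
                                     int batch, int feat, int seq_len) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per (b, f)
  if (i >= batch * feat) return;
  int b = i / feat;
  int f = i % feat;
  // Hypothetical bug: nothing guarantees lengths[b] <= seq_len, so this index
  // can land outside the input tensor. The stray read may silently return
  // garbage, or, if it crosses into an unmapped page, abort the whole launch
  // with "unspecified launch failure". Because tensor placement within mapped
  // pages varies from run to run, the crash is intermittent.
  int last = static_cast<int>(lengths[b]) - 1;
  out[i] = in[(last * batch + b) * feat + f];
}

__global__ void sequence_last_guarded(const float* in, float* out,
                                      const float* lengths,
                                      int batch, int feat, int seq_len) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= batch * feat) return;
  int b = i / feat;
  int f = i % feat;
  // Clamp the timestep into [0, seq_len - 1] so the kernel can never read
  // outside its input, regardless of what the lengths array holds.
  int last = static_cast<int>(lengths[b]) - 1;
  last = max(0, min(last, seq_len - 1));
  out[i] = in[(last * batch + b) * feat + f];
}

int main() {
  const int seq_len = 4, batch = 2, feat = 3;
  const int in_elems = seq_len * batch * feat, out_elems = batch * feat;

  float h_in[in_elems];
  float h_len[batch] = {4.0f, 5.0f};   // 5 > seq_len: triggers the stray read
  float h_out[out_elems];
  for (int i = 0; i < in_elems; ++i) h_in[i] = static_cast<float>(i);

  float *d_in, *d_out, *d_len;
  cudaMalloc(&d_in,  in_elems  * sizeof(float));
  cudaMalloc(&d_out, out_elems * sizeof(float));
  cudaMalloc(&d_len, batch     * sizeof(float));
  cudaMemcpy(d_in,  h_in,  in_elems * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_len, h_len, batch    * sizeof(float), cudaMemcpyHostToDevice);

  // Under cuda-memcheck the first launch is reported as an
  // "Invalid __global__ read of size 4"; the guarded launch is clean.
  sequence_last_unsafe <<<1, 64>>>(d_in, d_out, d_len, batch, feat, seq_len);
  cudaDeviceSynchronize();
  sequence_last_guarded<<<1, 64>>>(d_in, d_out, d_len, batch, feat, seq_len);
  cudaDeviceSynchronize();

  cudaMemcpy(h_out, d_out, out_elems * sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < out_elems; ++i) printf("out[%d] = %g\n", i, h_out[i]);

  cudaFree(d_in); cudaFree(d_out); cudaFree(d_len);
  return 0;
}

Compile the sketch with nvcc and run it under cuda-memcheck to see the difference between the two launches.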

I debugged the problem by going to the first test that showed the failure in one of the above posts, capturing the MXNET_TEST_SEED, and then reproducing the error (on Linux, no less) with the following command:

MXNET_TEST_SEED=731510245 cuda-memcheck nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_sequence_last | c++filt
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1613755850 to reproduce.
[WARNING] *** test-level seed set: all "@with_seed()" tests run deterministically ***
test_operator_gpu.test_sequence_last ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=731510245 to reproduce.
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00000390 in void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>)
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x7f13f24003f8 is out of bounds
=========     Device Frame:void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>) (void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>) : 0x390)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24cc4d]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 [0x15680]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaLaunch + 0x14e) [0x33c9e]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::op::SequenceLastOp<mshadow::gpu, float>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) + 0xc3a) [0x53384da]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::op::OperatorState::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) + 0x363) [0x3214a53]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::exec::StatefulComputeExecutor::Run(mxnet::RunContext, bool) + 0x59) [0x3808f09]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so [0x37d5870]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) + 0x8e5) [0x372ea15]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&) + 0xeb) [0x3745b1b]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) + 0x4e) [0x3745d8e]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() + 0x4a) [0x372e01a]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0xb8c80]
=========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x10741d]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaStreamSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3496d3]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaStreamSynchronize + 0x176) [0x47336]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mshadow::Stream<mshadow::gpu>::Wait() + 0x26) [0x32635a6]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so [0x37d5945]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) + 0x8e5) [0x372ea15]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&) + 0xeb) [0x3745b1b]
ERROR

======================================================================
ERROR: test_operator_gpu.test_sequence_last
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/test_operator.py", line 2998, in test_sequence_last
    check_sequence_func("last", axis=0)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/test_operator.py", line 2989, in check_sequence_func
    numeric_eps=1e-2, rtol=1e-2)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/test_utils.py", line 906, in check_numeric_gradient
    eps=numeric_eps, use_forward_train=use_forward_train, dtype=dtype)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/test_utils.py", line 781, in numeric_grad
    f_neps = executor.outputs[0].asnumpy()
  File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/ndarray/ndarray.py", line 1894, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/base.py", line 210, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [11:31:45] /home/dcarter/mxnet_dev/dgx/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f14f0b6619b]
[bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f14f0b66d08]
[bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mshadow::Stream<mshadow::gpu>::Wait()+0xd8) [0x7f14f31bc658]
[bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(+0x37d5945) [0x7f14f372e945]
[bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f14f3687a15]
[bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f14f369eb1b]
[bt] (6) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f14f369ed8e]
[bt] (7) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f14f368701a]
[bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f151efb1c80]
[bt] (9) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f15269af6ba]


-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1613755850 to reproduce.
common: WARNING: *** test-level seed set: all "@with_seed()" tests run deterministically ***
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=731510245 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 22.915s

FAILED (errors=1)
terminate called after throwing an instance of 'dmlc::Error'
  what():  [11:31:45] src/storage/./pooled_storage_manager.h:77: CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f14f0b6619b]
[bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f14f0b66d08]
[bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::DirectFreeNoLock(mxnet::Storage::Handle)+0x8f) [0x7f14f36aa8cf]
[bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::ReleaseAll()+0x95) [0x7f14f36a2ef5]
[bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::~GPUPooledStorageManager()+0x1a) [0x7f14f36aa9ca]
[bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::_Sp_counted_ptr<mxnet::StorageImpl*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0xa23) [0x7f14f36a9bd3]
[bt] (6) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::shared_ptr<mxnet::Storage>::~shared_ptr()+0x52) [0x7f14f36aa822]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(+0x39ff8) [0x7f1526617ff8]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(+0x3a045) [0x7f1526618045]
[bt] (9) /usr/bin/python() [0x51dc1f]


=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) + 0x4e) [0x3745d8e]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() + 0x4a) [0x372e01a]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0xb8c80]
=========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x10741d]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaFree. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3496d3]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaFree + 0x1a0) [0x419b0]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::DirectFreeNoLock(mxnet::Storage::Handle) + 0x32) [0x3751872]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::ReleaseAll() + 0x95) [0x3749ef5]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::~GPUPooledStorageManager() + 0x1a) [0x37519ca]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Sp_counted_ptr<mxnet::StorageImpl*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0xa23) [0x3750bd3]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::shared_ptr<mxnet::Storage>::~shared_ptr() + 0x52) [0x3751822]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x39ff8]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x3a045]
=========     Host Frame:/usr/bin/python [0x11dc1f]
=========     Host Frame:/usr/bin/python [0x11b1b7]
=========     Host Frame:/usr/bin/python (PyErr_PrintEx + 0x2d) [0x11aadd]
=========     Host Frame:/usr/bin/python [0x309d5]
=========     Host Frame:/usr/bin/python (Py_Main + 0x612) [0x93ae2]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
=========     Host Frame:/usr/bin/python (_start + 0x29) [0x933e9]
=========
========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found

@aaronmarkham
Contributor

I'm still blocked by this. Was it really disabled?

@KellenSunderland
Contributor

I don't think anyone disabled the test. Opening a PR to do so here: #11485

@marcoabreu marcoabreu changed the title Flaky test: test_operator_gpu.test_conv causes 'CUDA: unspecified launch failure' Flaky test: test_operator_gpu.test_sequence_last causes 'CUDA: unspecified launch failure' Jun 29, 2018
@anirudh2290
Member

@DickJC123 Thank you for investigating this issue! I am not able to reproduce the test failure you pointed out at commit b786ead with the same test seed (731510245). Are you able to reproduce the issue with the latest master?

@marcoabreu
Contributor Author

marcoabreu commented Jul 12, 2018

(I'll answer on behalf of Dick because he told me that he will be quite busy around this time.)
Hey Anirudh, could you elaborate on how you tried to reproduce this issue? Please be aware that this is about memory corruption caused by out-of-bounds reads. This behaviour might not always be reproducible locally. Did you use cuda-memcheck to investigate?

@anirudh2290
Member

Yes, I am not able to reproduce similar errors with cuda-memcheck either.

@anirudh2290
Member

anirudh2290 commented Jul 12, 2018

Tried the following:

MXNET_TEST_SEED=731510245 nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_sequence_last

Also did:

MXNET_TEST_SEED=731510245 cuda-memcheck nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_sequence_last

Couldn't reproduce the "Invalid __global__ read of size 4" with cuda-memcheck.

@marcoabreu
Contributor Author

How often did you run?

@anirudh2290
Member

Ran it repeatedly, around 2k times.

@ChaiBapchya
Contributor

@ChaiBapchya
Contributor

@aaronmarkham
Contributor

Unspecified launch failure here:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17435/7/pipeline

What's the workaround? I need to get a PR through that has nothing to do with CUDA...

@ChaiBapchya
Contributor

I haven't found one. I kept retriggering and finally gave up.

@aaronmarkham
Contributor
