
cuda memcheck failures with different cuda versions #15273

Open
anirudh2290 opened this issue Jun 18, 2019 · 4 comments

anirudh2290 commented Jun 18, 2019

This was encountered during work on PR #15118. It is also related to #10988.

There are a lot of cuda-memcheck failures when MXNet is built with CUDA 10.0 that I don't see when it is built with CUDA 9.2.

On CUDA 9.2:

cuda-memcheck nosetests -v tests/python/gpu/test_operator_gpu.py:test_embedding
========= CUDA-MEMCHECK
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=809559325 to reproduce.
test_operator_gpu.test_embedding ... ok

----------------------------------------------------------------------
Ran 1 test in 26.204s

OK
========= ERROR SUMMARY: 0 errors
cuda-memcheck nosetests -v tests/python/gpu/test_operator_gpu.py:test_broadcast
========= CUDA-MEMCHECK
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1395797003 to reproduce.
test_operator_gpu.test_broadcast ... ok

----------------------------------------------------------------------
Ran 1 test in 75.909s

OK
========= ERROR SUMMARY: 0 errors
cuda-memcheck nosetests -v tests/python/gpu/test_operator_gpu.py:test_countsketch
========= CUDA-MEMCHECK
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=837487385 to reproduce.
test_operator_gpu.test_countsketch ... ok

----------------------------------------------------------------------
Ran 1 test in 44.046s

OK
========= ERROR SUMMARY: 0 errors

On CUDA 10.0:

/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=739806880 to reproduce.
test_operator_gpu.test_countsketch ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2146529091 to reproduce.
ERROR

======================================================================
ERROR: test_operator_gpu.test_countsketch
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/experimentals/1.4_release/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/home/ubuntu/experimentals/1.4_release/tests/python/gpu/test_operator_gpu.py", line 95, in test_countsketch
    check_countsketch(in_dim, out_dim, n)
  File "/home/ubuntu/experimentals/1.4_release/tests/python/gpu/test_operator_gpu.py", line 82, in check_countsketch
    check_symbolic_backward(sym, locations, [out_grad], [a], rtol=1e-3, atol=1e-5, ctx=mx.gpu(0))
  File "/home/ubuntu/experimentals/1.4_release/python/mxnet/test_utils.py", line 1191, in check_symbolic_backward
    grads = {k: v.asnumpy() for k, v in args_grad_data.items()}
  File "/home/ubuntu/experimentals/1.4_release/python/mxnet/test_utils.py", line 1191, in <dictcomp>
    grads = {k: v.asnumpy() for k, v in args_grad_data.items()}
  File "/home/ubuntu/experimentals/1.4_release/python/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/ubuntu/experimentals/1.4_release/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [23:04:54] ../include/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess: CUDA: unspecified launch failure
Stack trace:
  [bt] (0) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x34) [0x7f633d1d1642]
  [bt] (1) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(mshadow::Stream<mshadow::gpu>::Wait()+0x168) [0x7f633d3489d0]
  [bt] (2) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(+0x2a763f5) [0x7f633d3993f5]
  [bt] (3) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(+0x2a7be78) [0x7f633d39ee78]
  [bt] (4) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x56) [0x7f633d33f322]
  [bt] (5) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x3b1) [0x7f633d355df3]
  [bt] (6) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x231) [0x7f633d35bcef]
  [bt] (7) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}::operator()(dmlc::ManualEvent) const+0x50) [0x7f633d357990]
  [bt] (8) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x5c) [0x7f633d35ec56]


-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=739806880 to reproduce.
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2146529091 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 49.733s

FAILED (errors=1)

cuda-memcheck output: more than 1000 errors, for example:

Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaFuncSetAttribute.

When I change to CUDA 10.1 these errors go away. Note that I have only observed them when building with make and DEV=1 (especially with the --Werror cross-execution-space-call nvcc flag).
I think we should also update the centos7 docker image to run on CUDA 10.1.
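
For reference, cudaErrorInvalidDeviceFunction on cudaFuncSetAttribute usually just means the loaded fat binary has no device code for the compute capability of the GPU in use, which would be consistent with this depending on the CUDA toolkit and nvcc flags used for the build. A minimal standalone check, independent of MXNet (the file name and -gencode arch below are only placeholders):

// invalid_device_function_check.cu -- standalone sketch, not MXNet code.
// Build for a single arch to reproduce/diagnose a mismatch, e.g.:
//   nvcc -gencode arch=compute_70,code=sm_70 invalid_device_function_check.cu
// (adjust the arch to match or deliberately mismatch your GPU)
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel() {}

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  std::printf("GPU 0: %s, compute capability %d.%d\n",
              prop.name, prop.major, prop.minor);

  // Same API call cuda-memcheck flags in the logs above. If the fat binary
  // has no device code for this GPU, this returns cudaErrorInvalidDeviceFunction.
  cudaError_t err = cudaFuncSetAttribute(
      (const void*)dummy_kernel,
      cudaFuncAttributeMaxDynamicSharedMemorySize, 48 * 1024);
  std::printf("cudaFuncSetAttribute: %s\n", cudaGetErrorString(err));
  return err == cudaSuccess ? 0 : 1;
}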

EDIT: I still see issues for countsketch on 10.1 when it is run under cuda-memcheck, but those look like addressable problems in the operator itself. This is different from 10.0, where multiple operators are impacted and the failures seem difficult to address.

@marcoabreu @stu1130

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Cuda, Installation, Build


ptrendx commented Jun 19, 2019

Do you also see those errors when testing an operator that does not have issues?

@anirudh2290
Member Author

I tested with broadcast, countsketch, and embedding. All ops failed with this error:

Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaFuncSetAttribute.

On 10.1, broadcast and embedding have no issues; countsketch had an out-of-bounds read, which is an issue specific to the operator (see the illustrative sketch below).

But the cuda-memcheck "invalid device function" issue happened for all three ops I tested on CUDA 10.0.
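
Just to illustrate what I mean by an operator-specific issue (this is a made-up kernel, not the actual countsketch code): an out-of-bounds read shows up under cuda-memcheck as an "Invalid __global__ read", e.g.:

// oob_read.cu -- hypothetical example, not MXNet's countsketch kernel.
#include <cstdio>
#include <cuda_runtime.h>

// Reads one element past the end of `in` for the last thread; cuda-memcheck
// reports this as "Invalid __global__ read of size 4".
__global__ void sum_shifted(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = in[i] + in[i + 1];  // out of bounds when i == n - 1
  }
}

int main() {
  const int n = 1024;
  float *in = nullptr, *out = nullptr;
  cudaMalloc(&in, n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));
  sum_shifted<<<(n + 255) / 256, 256>>>(in, out, n);
  cudaError_t err = cudaDeviceSynchronize();
  std::printf("sync: %s\n", cudaGetErrorString(err));
  cudaFree(in);
  cudaFree(out);
  return 0;
}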

@leleamol
Contributor

@mxnet-label-bot add [Cuda]
