This was encountered during work on PR #15118; it is also related to #10988.
There are many cuda-memcheck failures when MXNet is built with CUDA-10.0 that I don't see when it is built with CUDA-9.2.
On CUDA-9.2:
cuda-memcheck nosetests -v tests/python/gpu/test_operator_gpu.py:test_embedding
========= CUDA-MEMCHECK
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=809559325 to reproduce.
test_operator_gpu.test_embedding ... ok
----------------------------------------------------------------------
Ran 1 test in 26.204s
OK
========= ERROR SUMMARY: 0 errors
cuda-memcheck nosetests -v tests/python/gpu/test_operator_gpu.py:test_broadcast
========= CUDA-MEMCHECK
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1395797003 to reproduce.
test_operator_gpu.test_broadcast ... ok
----------------------------------------------------------------------
Ran 1 test in 75.909s
OK
========= ERROR SUMMARY: 0 errors
cuda-memcheck nosetests -v tests/python/gpu/test_operator_gpu.py:test_countsketch
========= CUDA-MEMCHECK
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=837487385 to reproduce.
test_operator_gpu.test_countsketch ... ok
----------------------------------------------------------------------
Ran 1 test in 44.046s
OK
========= ERROR SUMMARY: 0 errors
On CUDA-10.0:
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=739806880 to reproduce.
test_operator_gpu.test_countsketch ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2146529091 to reproduce.
ERROR
======================================================================
ERROR: test_operator_gpu.test_countsketch
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/home/ubuntu/experimentals/1.4_release/tests/python/gpu/../unittest/common.py", line 177, in test_new
orig_test(*args, **kwargs)
File "/home/ubuntu/experimentals/1.4_release/tests/python/gpu/test_operator_gpu.py", line 95, in test_countsketch
check_countsketch(in_dim, out_dim, n)
File "/home/ubuntu/experimentals/1.4_release/tests/python/gpu/test_operator_gpu.py", line 82, in check_countsketch
check_symbolic_backward(sym, locations, [out_grad], [a], rtol=1e-3, atol=1e-5, ctx=mx.gpu(0))
File "/home/ubuntu/experimentals/1.4_release/python/mxnet/test_utils.py", line 1191, in check_symbolic_backward
grads = {k: v.asnumpy() for k, v in args_grad_data.items()}
File "/home/ubuntu/experimentals/1.4_release/python/mxnet/test_utils.py", line 1191, in <dictcomp>
grads = {k: v.asnumpy() for k, v in args_grad_data.items()}
File "/home/ubuntu/experimentals/1.4_release/python/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/ubuntu/experimentals/1.4_release/python/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [23:04:54] ../include/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess: CUDA: unspecified launch failure
Stack trace:
[bt] (0) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x34) [0x7f633d1d1642]
[bt] (1) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(mshadow::Stream<mshadow::gpu>::Wait()+0x168) [0x7f633d3489d0]
[bt] (2) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(+0x2a763f5) [0x7f633d3993f5]
[bt] (3) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(+0x2a7be78) [0x7f633d39ee78]
[bt] (4) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x56) [0x7f633d33f322]
[bt] (5) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x3b1) [0x7f633d355df3]
[bt] (6) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x231) [0x7f633d35bcef]
[bt] (7) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}::operator()(dmlc::ManualEvent) const+0x50) [0x7f633d357990]
[bt] (8) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x5c) [0x7f633d35ec56]
-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=739806880 to reproduce.
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2146529091 to reproduce.
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
Ran 1 test in 49.733s
FAILED (errors=1)
cuda-memcheck output: more than 1000 errors, mostly of this form:
Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaFuncSetAttribute.
When I change to CUDA 10.1 these errors go away. Note that I have only observed them with DEV=1 make builds (in particular with the -Werror cross-execution-space-call nvcc flag enabled).
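For reference, this is roughly the build configuration where I see the failures (a sketch of a typical make-based GPU build; the exact DEV=1 flag set lives in the Makefile and may differ between branches, and the CUDA path is an assumption for this example):

make -j$(nproc) DEV=1 USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1
# DEV=1 enables strict warning flags, including passing
# -Werror cross-execution-space-call to nvcc, which is the
# configuration where the CUDA-10.0 memcheck failures showed up.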
I think we should also update the centos7 Docker image to run on CUDA 10.1.
EDIT: I still see issues for countsketch on 10.1 when run under cuda-memcheck, but those appear to be addressable problems within the operator itself. This is different from 10.0, where multiple operators are impacted and the failures seem difficult to address.
Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Cuda, Installation, Build
@marcoabreu @stu1130