This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.x][CI]Flaky tests on Python3:GPU and cpp package GPU Makefile test suites #20011

Open
access2rohit opened this issue Mar 11, 2021 · 0 comments
@access2rohit
Contributor

access2rohit commented Mar 11, 2021

Description

The unix-gpu pipeline has flaky tests in the Python3:GPU and cpp package GPU Makefile suites. They fail quite frequently, even when no code change touches them.

Occurrences

Python3:GPU failing test:

[2021-03-11T18:04:29.187Z] test_operator_gpu.test_kernel_error_checking ... [18:04:24] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine

[2021-03-11T18:04:32.459Z] Process SpawnProcess-1:

[2021-03-11T18:04:32.460Z] Traceback (most recent call last):

[2021-03-11T18:04:32.460Z]   File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap

[2021-03-11T18:04:32.460Z]     self.run()

[2021-03-11T18:04:32.460Z]   File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run

[2021-03-11T18:04:32.460Z]     self._target(*self._args, **self._kwargs)

[2021-03-11T18:04:32.460Z]   File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 2238, in kernel_error_check_imperative

[2021-03-11T18:04:32.460Z]     c = (a / b).asnumpy()

[2021-03-11T18:04:32.460Z]   File "/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py", line 354, in __truediv__

[2021-03-11T18:04:32.460Z]     return divide(self, other)

[2021-03-11T18:04:32.460Z]   File "/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py", line 3820, in divide

[2021-03-11T18:04:32.460Z]     _internal._rdiv_scalar)

[2021-03-11T18:04:32.460Z]   File "/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py", line 3576, in _ufunc_helper

[2021-03-11T18:04:32.460Z]     return fn_array(lhs, rhs)

[2021-03-11T18:04:32.460Z]   File "<string>", line 52, in broadcast_div

[2021-03-11T18:04:32.460Z]   File "mxnet/cython/ndarray.pyx", line 219, in mxnet._cy3.ndarray._imperative_invoke

[2021-03-11T18:04:32.460Z]   File "mxnet/cython/./base.pyi", line 58, in mxnet._cy3.ndarray.CALL

[2021-03-11T18:04:32.460Z] mxnet.base.MXNetError: Traceback (most recent call last):

[2021-03-11T18:04:32.460Z]   [bt] (9) /usr/local/bin/python3(_PyEval_EvalFrameDefault+0x44b2) [0x561b1fe37ac2]

[2021-03-11T18:04:32.460Z]   [bt] (8) /usr/local/bin/python3(_PyCFunction_FastCallKeywords+0x20) [0x561b1fdc3de0]

[2021-03-11T18:04:32.460Z]   [bt] (7) /usr/local/bin/python3(_PyMethodDef_RawFastCallKeywords+0x250) [0x561b1fdc4050]

[2021-03-11T18:04:32.460Z]   [bt] (6) /work/mxnet/tests/python/unittest/../../../python/mxnet/_cy3/ndarray.cpython-37m-x86_64-linux-gnu.so(+0x14699) [0x7eff14049699]

[2021-03-11T18:04:32.460Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeEx+0x8b) [0x7eff8be0653b]

[2021-03-11T18:04:32.460Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**)+0x543) [0x7eff8be04c73]

[2021-03-11T18:04:32.460Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0xe6) [0x7eff8b566836]

[2021-03-11T18:04:32.460Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::SetShapeType(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, mxnet::DispatchMode*)+0x140e) [0x7eff8b560b6e]

[2021-03-11T18:04:32.460Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::BinaryBroadcastShape(nnvm::NodeAttrs const&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*)+0x38e) [0x7eff86af62ae]

[2021-03-11T18:04:32.460Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72) [0x7eff8682df82]

[2021-03-11T18:04:32.460Z]   File "src/operator/numpy/linalg/./../../tensor/elemwise_binary_broadcast_op.h", line 68

[2021-03-11T18:04:32.460Z] MXNetError: Check failed: l == 1 || r == 1: operands could not be broadcast together with shapes [3] [0]

[2021-03-11T18:04:32.460Z] [18:04:28] src/engine/naive_engine.cc:74: Engine shutdown

[2021-03-11T18:04:34.985Z] [18:04:30] src/engine/engine.cc:55: MXNet start using engine: NaiveEngine

[2021-03-11T18:04:38.257Z] Process SpawnProcess-2:

[2021-03-11T18:04:38.257Z] Traceback (most recent call last):

[2021-03-11T18:04:38.257Z]   File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap

[2021-03-11T18:04:38.257Z]     self.run()

[2021-03-11T18:04:38.257Z]   File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run

[2021-03-11T18:04:38.257Z]     self._target(*self._args, **self._kwargs)

[2021-03-11T18:04:38.257Z]   File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 2247, in kernel_error_check_symbolic

[2021-03-11T18:04:38.257Z]     'b':mx.nd.array([],ctx=mx.gpu(0))})

[2021-03-11T18:04:38.257Z]   File "/work/mxnet/tests/python/unittest/../../../python/mxnet/symbol/symbol.py", line 2119, in bind

[2021-03-11T18:04:38.257Z]     ctypes.byref(handle)))

[2021-03-11T18:04:38.257Z]   File "/work/mxnet/tests/python/unittest/../../../python/mxnet/base.py", line 246, in check_call

[2021-03-11T18:04:38.257Z]     raise get_last_ffi_error()

[2021-03-11T18:04:38.257Z] mxnet.base.MXNetError: Traceback (most recent call last):

[2021-03-11T18:04:38.257Z]   [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXExecutorBindEX+0x8f5) [0x7f1e070e99f5]

[2021-03-11T18:04:38.257Z]   [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Executor::Bind(nnvm::Symbol, mxnet::Context const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet::Context, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, mxnet::Executor*)+0x219) [0x7f1e071f1139]

[2021-03-11T18:04:38.257Z]   [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::GraphExecutor::Init(nnvm::Symbol, mxnet::Context const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet::Context, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&)+0x120c) [0x7f1e071e4a0c]

[2021-03-11T18:04:38.257Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::InferShape(nnvm::Graph&&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x69) [0x7f1e071c08a9]

[2021-03-11T18:04:38.257Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x6f05d99) [0x7f1e071bdd99]

[2021-03-11T18:04:38.257Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x6f0242b) [0x7f1e071ba42b]

[2021-03-11T18:04:38.257Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(bool mxnet::op::ElemwiseShape<2, 1>(nnvm::NodeAttrs const&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*)+0x5ab) [0x7f1e0266305b]

[2021-03-11T18:04:38.257Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::ElemwiseAttrHelper<mxnet::TShape, &mxnet::op::shape_is_none, &mxnet::op::shape_assign, true, &mxnet::op::shape_string[abi:cxx11], -1, -1>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >*, mxnet::TShape const&)::{lambda(std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, unsigned long, char const*)#1}::operator()(std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, unsigned long, char const*) const+0x1276) [0x7f1e01bc6126]

[2021-03-11T18:04:38.257Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72) [0x7f1e01b59f82]

[2021-03-11T18:04:38.257Z] MXNetError: Error in operator _div0: [18:04:33] src/operator/numpy/linalg/./../../tensor/../elemwise_op_common.h:135: Check failed: assign(&dattr, vec.at(i)): Incompatible attr in node _div0 at 1-th input: expected [3], got [0]
[2021-03-11T18:04:38.257Z] ok (11.0016s)
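
For readers unfamiliar with the check quoted in the first traceback (`l == 1 || r == 1`), the following is a minimal sketch of a NumPy-style broadcast shape rule in plain Python. It is not MXNet's implementation; the function name `broadcast_shape` and the exact error text are assumptions made for illustration. It shows why shapes `[3]` and `[0]` are rejected: the dimensions differ and neither one is 1.

```python
def broadcast_shape(lhs, rhs):
    """Return the broadcast shape of two shapes, or raise ValueError.

    Hypothetical helper for illustration only; not part of MXNet.
    """
    lhs, rhs = tuple(lhs), tuple(rhs)
    ndim = max(len(lhs), len(rhs))
    # Left-pad the shorter shape with 1s so both have the same rank.
    padded_l = (1,) * (ndim - len(lhs)) + lhs
    padded_r = (1,) * (ndim - len(rhs)) + rhs

    out = []
    for l, r in zip(padded_l, padded_r):
        if l == r:
            out.append(l)
        elif l == 1 or r == 1:
            # A dimension of size 1 stretches to match the other side.
            out.append(r if l == 1 else l)
        else:
            raise ValueError(
                "operands could not be broadcast together with shapes "
                f"{list(lhs)} {list(rhs)}"
            )
    return tuple(out)


if __name__ == "__main__":
    print(broadcast_shape((2, 3), (3,)))
    try:
        broadcast_shape((3,), (0,))
    except ValueError as err:
        print("rejected:", err)
```

Under this rule, dividing a 3-element array by an empty array fails at shape inference, before any kernel runs, which matches the frames `BinaryBroadcastShape` and `SetShapeType` in the backtrace above.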

cpp package GPU Makefile failing test:

[2021-03-11T18:29:20.262Z] [18:29:15] cpp-package/example/test_regress_label.cpp:32: Running LinearRegressionOutput symbol testing, executor should be able to bind without label.

[2021-03-11T18:29:20.262Z] 

[2021-03-11T18:29:20.262Z] Segmentation fault: 11

[2021-03-11T18:29:20.262Z] 

[2021-03-11T18:29:20.262Z] 

[2021-03-11T18:29:20.262Z] Segmentation fault: 11

[2021-03-11T18:29:20.262Z] 

[2021-03-11T18:29:20.262Z] 

[2021-03-11T18:29:20.262Z] Segmentation fault: 11

Next Steps

These tests are blocking PRs and making CI unstable, so the immediate action is to disable them and then investigate.
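
One way to disable a test while keeping a pointer to this issue is a skip decorator with a reason string. The sketch below uses only the standard-library `unittest` module; the class name `TestOperatorGpu` and the helper `run_suite` are hypothetical, and the commits that reference this issue may have used a different mechanism.

```python
import unittest


class TestOperatorGpu(unittest.TestCase):
    # Hypothetical stand-in for the flaky test named in the log above.
    @unittest.skip("Flaky on unix-gpu CI; tracked in issue #20011")
    def test_kernel_error_checking(self):
        # Never executed while the skip decorator is in place.
        raise AssertionError("skipped test body should not run")


def run_suite():
    """Run the test case and return the collected TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestOperatorGpu)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    result = run_suite()
    for test, reason in result.skipped:
        print(f"SKIPPED {test.id()}: {reason}")
```

Putting the issue number in the reason string keeps the skip visible in test reports, so the test is easier to find and re-enable once the root cause is understood.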

@access2rohit access2rohit changed the title Flaky tests on Python3:GPU and cpp package GPU Makefile test suites [v1.x][CI]Flaky tests on Python3:GPU and cpp package GPU Makefile test suites Mar 11, 2021
access2rohit pushed a commit to access2rohit/incubator-mxnet that referenced this issue Mar 11, 2021
Zha0q1 pushed a commit that referenced this issue Mar 12, 2021
…eline (#19974)

* migrating cd builds to ninja + removing static links to nvidia libs and leagacy cuda versions

* installing NCCL manually for cuda11.2 container

* set MSHADOW_USE_CUDNN=1 in CMakelists of mshadow to build properly for CUDNN support

* adding coverage to cd requirements file to fix cu100, cu101 and cu102 tests

* updating cd_test containers to ubuntu 18

* adding cmake config for linux native and adding USE_KV_STORE in linux_cpu

* updating zmq builds to statically link to libmxnet.so

* updating toolchains for r, clang and llvm for ubuntu18. OpenBlas Static link for 'distribution' build type only. Fix caffe build to use openCV 3. Remove leagacy Clang 3.9 from CI

* fix versions for pip install in ubuntu_core_sh add new search path for cuDNN

* finxing cudnn link problem for CUDA<=11.0

* adding library paths for libjpegturbo and lapack to fix failing CI on ubuntu 18 images

* removing ASAN integration test from miscellaneous CI as its not required

* fix lapack path for gpu builds

* correctly installing libjpegturbo for ubuntu 18

* updating docker images of r,jekyll,julia etc test containers+ fix java version to 8

* installing libomp.so

* removing debug test as its not required. Code clean-up

* adding alternate URL source for MNIST dataset as original website is down

* skipping flaky tests issue tracked #20011

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
access2rohit added a commit to access2rohit/incubator-mxnet that referenced this issue Mar 12, 2021
mseth10 added a commit that referenced this issue Mar 14, 2021
…20015)

* [BACKPORT]Enable CUDA 11.0 on nightly + CUDA 11.2 on pip (#19295)(#19764) (#19930)

* Enable CUDA 11.0 on nightly development builds (#19295)

Remove CUDA 9.2 and CUDA 10.0

* [PIP] add build variant for cuda 11.2 (#19764)

* adding ci docker files for cu111 and cu112

* removing previous CUDA make versions and adding support for cuda11.2

Co-authored-by: waytrue17 <[email protected]>
Co-authored-by: Sheng Zha <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* [FEATURE]Migrating all CD pipelines to Ninja build + fix cu112 CD pipeline (#19974)

* migrating cd builds to ninja + removing static links to nvidia libs and leagacy cuda versions

* installing NCCL manually for cuda11.2 container

* set MSHADOW_USE_CUDNN=1 in CMakelists of mshadow to build properly for CUDNN support

* adding coverage to cd requirements file to fix cu100, cu101 and cu102 tests

* updating cd_test containers to ubuntu 18

* adding cmake config for linux native and adding USE_KV_STORE in linux_cpu

* updating zmq builds to statically link to libmxnet.so

* updating toolchains for r, clang and llvm for ubuntu18. OpenBlas Static link for 'distribution' build type only. Fix caffe build to use openCV 3. Remove leagacy Clang 3.9 from CI

* fix versions for pip install in ubuntu_core_sh add new search path for cuDNN

* finxing cudnn link problem for CUDA<=11.0

* adding library paths for libjpegturbo and lapack to fix failing CI on ubuntu 18 images

* removing ASAN integration test from miscellaneous CI as its not required

* fix lapack path for gpu builds

* correctly installing libjpegturbo for ubuntu 18

* updating docker images of r,jekyll,julia etc test containers+ fix java version to 8

* installing libomp.so

* removing debug test as its not required. Code clean-up

* adding alternate URL source for MNIST dataset as original website is down

* skipping flaky tests issue tracked #20011

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* update cudnn from 7 to 8 for cu102 (#19506)

* update cudnn from 7 to 8 for cu102 (#19522)

* downloading MNIST dataset from alternate URL (#20014)

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* fixing CI issue with v1.8.x

* addressing review comments

Co-authored-by: waytrue17 <[email protected]>
Co-authored-by: Sheng Zha <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>
Co-authored-by: Manu Seth <[email protected]>
mseth10 added a commit to mseth10/incubator-mxnet that referenced this issue Mar 15, 2021