This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Flaky Test Issue of GPU Operator #11592

Open
zhanghang1989 opened this issue Jul 6, 2018 · 9 comments

Comments

@zhanghang1989
Contributor

Description

kernel_error_check_imperative() and kernel_error_check_symbolic() in test_operator_gpu.py are flaky.

test_operator_gpu.test_kernel_error_checking ... Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 1832, in kernel_error_check_imperative
    c = (a / b).asnumpy()
  File "/work/mxnet/tests/python/unittest/../../../python/mxnet/ndarray/ndarray.py", line 1910, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/tests/python/unittest/../../../python/mxnet/base.py", line 210, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:32:53] src/operator/tensor/././../mxnet_op.h:586: Check failed: err == cudaSuccess (9 vs. 0) Name: mxnet_generic_kernel_ex ErrStr:invalid configuration argument
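
The `Process SpawnProcess-1:` line in the log shows that the failing function ran in a spawned child process rather than in the test runner itself. As a rough illustration of that pattern (not MXNet's actual test code), the sketch below runs a snippet in a separate interpreter and inspects its exit code. `run_isolated` and both snippets are hypothetical names, and `subprocess` is used here as a stand-in for the `multiprocessing` spawn seen in the traceback.

```python
import subprocess
import sys


def run_isolated(snippet):
    """Run a Python snippet in a fresh interpreter and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,  # discard the child's traceback, if any
    )
    return result.returncode


# Hypothetical stand-ins: one check that raises, one that completes normally.
FAILING_CHECK = "raise RuntimeError('simulated kernel launch failure')"
PASSING_CHECK = "pass"

failing_code = run_isolated(FAILING_CHECK)
passing_code = run_isolated(PASSING_CHECK)

print("failing check exit code:", failing_code)
print("passing check exit code:", passing_code)
```

An uncaught exception in the child yields a nonzero exit code, so the parent can tell whether the check raised without sharing an address space with it.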

@marcoabreu
Contributor

Hi, would you mind linking a CI run where this occurred?

@zhanghang1989
Contributor Author

This fails at "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 1832, in kernel_error_check_imperative: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11502/25/pipeline/799

@haojin2
Contributor

haojin2 commented Jul 11, 2018

I cannot reproduce this error on my local machine. Are you sure that this failure is not related to your PR? It seems to be happening in other environments for the same build as well.

@zhanghang1989
Contributor Author

I am closing this issue since I no longer see the failure. My PR just passed the CI tests.

@marcoabreu
Contributor

I agree with @haojin2; the error seems to be related to the PR. Thanks, everybody!

@szha
Member

szha commented Jul 27, 2018

I just had the exact same error in another PR. http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11482/17/pipeline/859#step-1530-log-1018

@zhanghang1989 what did you do to resolve the problem?

@ChaiBapchya
Contributor

ChaiBapchya commented Nov 2, 2018

@ChaiBapchya
Contributor

@szha Can we reopen this issue? Your latest PR on Notice #14043 has the same CI issue.


5 participants