This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Flaky test: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm #12767

Closed
lebeg opened this issue Oct 9, 2018 · 18 comments

Comments

lebeg (Contributor) commented Oct 9, 2018

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1728/pipeline

======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
    check_layer_forward_withinput(net, x)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
    mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))

mxnet.base.MXNetError: [04:48:16] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: 
Failed to find any forward convolution algorithm.  with workspace size of 1073741824 
bytes, please consider reducing batch/model size or increasing the workspace size
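
For context, the suggestions in the error message map to knobs MXNet already exposes: the Convolution operator's workspace argument (in MB; the 1024 MB default is the 1073741824 bytes reported above) and the MXNET_CUDNN_AUTOTUNE_DEFAULT environment variable / cudnn_tune argument that control cuDNN algorithm selection. A rough sketch of where those knobs live, with made-up shapes that are not taken from the failing test:

import os
import mxnet as mx

# Disable cuDNN algorithm autotuning globally (assumption: this is set before
# the convolution is first executed so the setting is picked up).
os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'

data = mx.sym.Variable('data')
# workspace is given in MB; raising it above the 1024 MB default gives cuDNN
# more room when searching for a usable forward algorithm. cudnn_tune='off'
# disables autotuning for this one operator only.
conv = mx.sym.Convolution(data=data, num_filter=16, kernel=(3, 3),
                          workspace=2048, cudnn_tune='off', name='conv')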
lebeg (Contributor, Author) commented Oct 9, 2018

Possibly related to:
Failing test: test_gluon_gpu.test_slice_batchnorm: #12715

larroy (Contributor) commented Oct 9, 2018

I'm not sure this is a flaky test; I think it's a CUDA/cuDNN or CI environment problem. Could you reproduce it?

piyushghai (Contributor) commented

@mxnet-label-bot [flaky, Gluon]

lebeg (Contributor, Author) commented Oct 9, 2018

Another consecutive run failed on master CI:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1728/pipeline

======================================================================
FAIL: test_mkldnn.test_Deconvolution
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 346, in test_Deconvolution
    check_Deconvolution_training(stype)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 342, in check_Deconvolution_training
    check_numeric_gradient(test, in_location, numeric_eps=1e-2, rtol=0.16, atol=1e-4)
  File "/work/mxnet/python/mxnet/test_utils.py", line 915, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 3.121914 exceeds tolerance rtol=0.160000, atol=0.000100.  Location of maximum error:(2, 1, 5), a=-0.000381, b=-0.001386
 NUMERICAL_data: array([[[-0.6184697 , -0.50860643, -0.6415248 , ..., -0.7978529 ,
         -0.8801222 , -0.7802248 ],
        [-0.26806593, -0.1953423 , -0.14332533, ..., -0.17287433,...
 BACKWARD_data: array([[[-0.6174789 , -0.5086705 , -0.6417394 , ..., -0.79945517,
         -0.88075024, -0.77997565],
        [-0.26776323, -0.19459067, -0.14422962, ..., -0.1742437 ,...
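
For reference, check_numeric_gradient compares the analytical backward pass of an operator against a finite-difference approximation of the gradient; that comparison is what fails above. A stripped-down sketch of the same kind of check for Deconvolution, reusing the tolerances from the traceback but with illustrative shapes that are not the ones used by test_mkldnn.check_Deconvolution_training:

import numpy as np
import mxnet as mx

data = mx.sym.Variable('data')
weight = mx.sym.Variable('weight')
deconv = mx.sym.Deconvolution(data=data, weight=weight, kernel=(3, 3),
                              num_filter=4, no_bias=True, name='deconv')

data_np = np.random.uniform(-1, 1, (2, 4, 10, 10))
weight_np = np.random.uniform(-1, 1, (4, 4, 3, 3))

# Perturb each input element by numeric_eps, recompute the forward pass, and
# check that the resulting numerical gradient matches the gradient returned
# by backward() within rtol/atol.
mx.test_utils.check_numeric_gradient(deconv, [data_np, weight_np],
                                     numeric_eps=1e-2, rtol=0.16, atol=1e-4)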

lebeg (Contributor, Author) commented Oct 9, 2018

The deconvolution failure is tracked in #12579

gaurav-gireesh (Contributor) commented

Flaky test failure.
Please refer to the Jenkins log below:
Log

lebeg (Contributor, Author) commented Oct 10, 2018

Another failure:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12766/2/pipeline

======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
    check_layer_forward_withinput(net, x)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
    mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))

mxnet.base.MXNetError: [23:05:11] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm.  with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
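
For anyone unfamiliar with the failing check: check_layer_forward_withinput runs the same Gluon block imperatively and hybridized and asserts that the input gradients match, which is the assert_almost_equal call in the traceback. A minimal sketch of that pattern; the Conv2D/BatchNorm layers and shapes below are placeholders rather than the slice/reshape network from the actual test, and a GPU context is assumed to be available:

import mxnet as mx
from mxnet import autograd, gluon

ctx = mx.gpu()
net = gluon.nn.HybridSequential()
net.add(gluon.nn.Conv2D(8, kernel_size=3),
        gluon.nn.BatchNorm())
net.initialize(ctx=ctx)

# Imperative run: record the input gradient without hybridization.
x = mx.nd.random.uniform(shape=(4, 3, 32, 32), ctx=ctx)
x.attach_grad()
with autograd.record():
    out = net(x)
out.backward()
grad_imperative = x.grad.asnumpy()

# Hybridized run on a copy of the same input, then compare input gradients.
net.hybridize()
x_hybrid = x.copy()
x_hybrid.attach_grad()
with autograd.record():
    out_hybrid = net(x_hybrid)
out_hybrid.backward()

mx.test_utils.assert_almost_equal(grad_imperative, x_hybrid.grad.asnumpy(),
                                  rtol=1e-5, atol=1e-6)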

lebeg mentioned this issue Oct 10, 2018
piyushghai (Contributor) commented

Another failure can be seen here:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12826/3/pipeline/996

======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
    check_layer_forward_withinput(net, x)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
    mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [21:28:14] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm.  with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size

lanking520 (Member) commented

@lebeg is anybody working on this? Tests are still failing.

ChaiBapchya (Contributor) commented

Another failure for me here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12749/18/pipeline/996

======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
    check_layer_forward_withinput(net, x)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
    mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [00:03:15] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm.  with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size

lebeg (Contributor, Author) commented Oct 17, 2018

@lanking520 I proposed a mitigation in #12768 until this is fixed. You are welcome to join the discussion and help get it merged. It will not fix the underlying problem, but it could reduce the failure rate.

As far as I know, @nswamy was investigating the root cause.

We have been working on updating the CUDA drivers (#12850), but that is blocked until new AMIs with updated CUDA drivers are deployed.

lebeg (Contributor, Author) commented Oct 17, 2018

@larroy is currently doing the driver updates.

lanking520 (Member) commented

#12887 is a duplicate of this issue.

aaronmarkham (Contributor) commented

nswamy (Member) commented Nov 1, 2018 via email

lebeg (Contributor, Author) commented Nov 1, 2018

lebeg reopened this Nov 1, 2018
lebeg (Contributor, Author) commented Nov 2, 2018

#12986 re-enabled the test.

lebeg closed this as completed Nov 2, 2018