This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Flaky test: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm #12767

Closed
lebeg opened this issue Oct 9, 2018 · 18 comments

Comments

lebeg (Contributor) commented Oct 9, 2018

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1728/pipeline

======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
    check_layer_forward_withinput(net, x)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
    mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))

mxnet.base.MXNetError: [04:48:16] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: 
Failed to find any forward convolution algorithm.  with workspace size of 1073741824 
bytes, please consider reducing batch/model size or increasing the workspace size
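
For context, the suggestions in the error message map to knobs MXNet already exposes: the Convolution operator's workspace argument (in MB; the 1024 MB default is the 1073741824 bytes reported above) and the MXNET_CUDNN_AUTOTUNE_DEFAULT environment variable / cudnn_tune argument that control cuDNN algorithm selection. A rough sketch of where those knobs live, with made-up shapes that are not taken from the failing test:

import os
import mxnet as mx

# Disable cuDNN algorithm autotuning globally (assumption: this is set before
# the convolution is first executed so the setting is picked up).
os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'

data = mx.sym.Variable('data')
# workspace is given in MB; raising it above the 1024 MB default gives cuDNN
# more room when searching for a usable forward algorithm. cudnn_tune='off'
# disables autotuning for this one operator only.
conv = mx.sym.Convolution(data=data, num_filter=16, kernel=(3, 3),
                          workspace=2048, cudnn_tune='off', name='conv')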
lebeg (Contributor, Author) commented Oct 9, 2018

Possibly related to:
Failing test: test_gluon_gpu.test_slice_batchnorm: #12715

larroy (Contributor) commented Oct 9, 2018

I'm not sure this is a flaky test; I think it's a CUDA/cuDNN or CI environment problem. Could you reproduce it?

piyushghai (Contributor) commented

@mxnet-label-bot [flaky, Gluon]

lebeg (Contributor, Author) commented Oct 9, 2018

Another consecutive run failed on master CI:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1728/pipeline

======================================================================
FAIL: test_mkldnn.test_Deconvolution
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 346, in test_Deconvolution
    check_Deconvolution_training(stype)
  File "/work/mxnet/tests/python/mkl/test_mkldnn.py", line 342, in check_Deconvolution_training
    check_numeric_gradient(test, in_location, numeric_eps=1e-2, rtol=0.16, atol=1e-4)
  File "/work/mxnet/python/mxnet/test_utils.py", line 915, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 3.121914 exceeds tolerance rtol=0.160000, atol=0.000100.  Location of maximum error:(2, 1, 5), a=-0.000381, b=-0.001386
 NUMERICAL_data: array([[[-0.6184697 , -0.50860643, -0.6415248 , ..., -0.7978529 ,
         -0.8801222 , -0.7802248 ],
        [-0.26806593, -0.1953423 , -0.14332533, ..., -0.17287433,...
 BACKWARD_data: array([[[-0.6174789 , -0.5086705 , -0.6417394 , ..., -0.79945517,
         -0.88075024, -0.77997565],
        [-0.26776323, -0.19459067, -0.14422962, ..., -0.1742437 ,...
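
For reference, check_numeric_gradient compares the analytical backward pass of an operator against a finite-difference approximation of the gradient; that comparison is what fails above. A stripped-down sketch of the same kind of check for Deconvolution, reusing the tolerances from the traceback but with illustrative shapes that are not the ones used by test_mkldnn.check_Deconvolution_training:

import numpy as np
import mxnet as mx

data = mx.sym.Variable('data')
weight = mx.sym.Variable('weight')
deconv = mx.sym.Deconvolution(data=data, weight=weight, kernel=(3, 3),
                              num_filter=4, no_bias=True, name='deconv')

data_np = np.random.uniform(-1, 1, (2, 4, 10, 10))
weight_np = np.random.uniform(-1, 1, (4, 4, 3, 3))

# Perturb each input element by numeric_eps, recompute the forward pass, and
# check that the resulting numerical gradient matches the gradient returned
# by backward() within rtol/atol.
mx.test_utils.check_numeric_gradient(deconv, [data_np, weight_np],
                                     numeric_eps=1e-2, rtol=0.16, atol=1e-4)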

lebeg (Contributor, Author) commented Oct 9, 2018

The deconvolution failure is tracked in #12579

gaurav-gireesh (Contributor) commented

Flaky test failure.
Please refer to the Jenkins log below:
Log

lebeg (Contributor, Author) commented Oct 10, 2018

Another failure:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12766/2/pipeline

======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
    check_layer_forward_withinput(net, x)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
    mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))

mxnet.base.MXNetError: [23:05:11] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm.  with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
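
For anyone unfamiliar with the failing check: check_layer_forward_withinput runs the same Gluon block imperatively and hybridized and asserts that the input gradients match, which is the assert_almost_equal call in the traceback. A minimal sketch of that pattern; the Conv2D/BatchNorm layers and shapes below are placeholders rather than the slice/reshape network from the actual test, and a GPU context is assumed to be available:

import mxnet as mx
from mxnet import autograd, gluon

ctx = mx.gpu()
net = gluon.nn.HybridSequential()
net.add(gluon.nn.Conv2D(8, kernel_size=3),
        gluon.nn.BatchNorm())
net.initialize(ctx=ctx)

# Imperative run: record the input gradient without hybridization.
x = mx.nd.random.uniform(shape=(4, 3, 32, 32), ctx=ctx)
x.attach_grad()
with autograd.record():
    out = net(x)
out.backward()
grad_imperative = x.grad.asnumpy()

# Hybridized run on a copy of the same input, then compare input gradients.
net.hybridize()
x_hybrid = x.copy()
x_hybrid.attach_grad()
with autograd.record():
    out_hybrid = net(x_hybrid)
out_hybrid.backward()

mx.test_utils.assert_almost_equal(grad_imperative, x_hybrid.grad.asnumpy(),
                                  rtol=1e-5, atol=1e-6)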

lebeg mentioned this issue Oct 10, 2018
piyushghai (Contributor) commented

Another failure can be seen here:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12826/3/pipeline/996

======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
    check_layer_forward_withinput(net, x)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
    mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [21:28:14] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm.  with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size

lanking520 (Member) commented

@lebeg is anybody working on this? Tests are still failing.

ChaiBapchya (Contributor) commented

Another failure for me here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12749/18/pipeline/996

======================================================================
ERROR: test_gluon_gpu.test_slice_batchnorm_reshape_batchnorm
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 2033, in test_slice_batchnorm_reshape_batchnorm
    check_layer_forward_withinput(net, x)
  File "/work/mxnet/tests/python/gpu/../unittest/test_gluon.py", line 1507, in check_layer_forward_withinput
    mx.test_utils.assert_almost_equal(x.grad.asnumpy(), x_hybrid.grad.asnumpy(), rtol=1e-5, atol=1e-6)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [00:03:15] src/operator/nn/./cudnn/cudnn_convolution-inl.h:875: Failed to find any forward convolution algorithm.  with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size

lebeg (Contributor, Author) commented Oct 17, 2018

@lanking520 I proposed a mitigation in #12768 until this is fixed. You are welcome to join the discussion and help get it merged. It will not fix the underlying problem, but it could reduce the failure rate.

As far as I know, @nswamy was investigating the root cause.

We have been working on updating the CUDA drivers (#12850), but that is blocked until new AMIs with updated CUDA drivers are deployed.

lebeg (Contributor, Author) commented Oct 17, 2018

@larroy is currently doing the driver updates.

lanking520 (Member) commented

#12887 is a duplicate of this issue.

aaronmarkham (Contributor) commented

nswamy (Member) commented Nov 1, 2018 via email

lebeg (Contributor, Author) commented Nov 1, 2018

lebeg reopened this Nov 1, 2018
lebeg (Contributor, Author) commented Nov 2, 2018

#12986 re-enabled the test.

lebeg closed this as completed Nov 2, 2018