flaky test: test_operator_gpu.test_depthwise_convolution #12203

Closed
lebeg opened this issue Aug 16, 2018 · 14 comments · Fixed by #14016
Comments

@lebeg
Contributor

lebeg commented Aug 16, 2018

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12181/4/pipeline

======================================================================
FAIL: test_operator_gpu.test_depthwise_convolution
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 1663, in test_depthwise_convolution
    np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(), rtol=1e-3, atol=1e-3)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/nose_tools/utils.py", line 1396, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/nose_tools/utils.py", line 779, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=0.001

(mismatch 2.040816326530617%)
 x: array([[[[  7.347052,  -1.722254,   7.837829,   4.21605 ,  -1.359475,
            1.55463 ,   6.701931],
         [ 11.283103,  12.302897,  -9.111632,  -3.390831,  -4.708895,...
 y: array([[[[  7.348634,  -1.720118,   7.836634,   4.217753,  -1.361165,
            1.552893,   6.695152],
         [ 11.284711,  12.303747,  -9.112375,  -3.390379,  -4.709879,...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=640580015 to reproduce.
--------------------- >> end captured logging << ---------------------
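For reference, np.testing.assert_allclose (the call in test_operator.py above) flags every element where |x - y| > atol + rtol * |y| and reports the fraction of flagged elements as the mismatch percentage. A minimal NumPy sketch of that check, using a few values copied from the failure output above (not MXNet code, just the tolerance arithmetic):

import numpy as np

# Values copied from the x/y excerpts in the failure output.
x = np.array([7.347052, -1.722254, 7.837829, 4.21605, -1.359475, 1.55463, 6.701931])
y = np.array([7.348634, -1.720118, 7.836634, 4.217753, -1.361165, 1.552893, 6.695152])

rtol, atol = 1e-3, 1e-3
# assert_allclose flags an element when |x - y| > atol + rtol * |y|.
flagged = np.abs(x - y) > atol + rtol * np.abs(y)
print("mismatch: %.2f%% of elements" % (100.0 * flagged.mean()))
# For this small excerpt nothing exceeds the tolerance; the full arrays
# differ in about 2% of elements, per the report above.

# The test itself makes the equivalent call on the full arrays:
# np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(), rtol=1e-3, atol=1e-3)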

@ankkhedia
Contributor

@lebeg Thanks for filing the issue. We will look into it.

@mseth10
Contributor

mseth10 commented Aug 30, 2018

Fix in #12402

@haojin2
Contributor

haojin2 commented Sep 1, 2018

@lebeg
Contributor Author

lebeg commented Oct 9, 2018

As far as I know this is still an issue:
#12441

@lebeg
Contributor Author

lebeg commented Dec 7, 2018

During testing we found another failure seed:

======================================================================
FAIL: test_operator_gpu.test_depthwise_convolution
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 173, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 1676, in test_depthwise_convolution
    np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(), rtol=1e-3, atol=1e-3)
  File "/usr/local/lib/python2.7/dist-packages/numpy/testing/_private/utils.py", line 1452, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python2.7/dist-packages/numpy/testing/_private/utils.py", line 789, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=0.001

(mismatch 3.0612244898%)
 x: array([ 4.791068e+00,  1.593453e+01,  1.434397e+01,  1.545888e+01,
        1.460622e+01, -3.660450e+00,  8.265715e+00, -1.411026e+00,
        2.041084e+01,  1.641194e+01,  6.190044e+00,  2.084945e+01,...
 y: array([ 4.790918e+00,  1.593650e+01,  1.434332e+01,  1.545896e+01,
        1.460644e+01, -3.660306e+00,  8.265750e+00, -1.410326e+00,
        2.041158e+01,  1.641292e+01,  6.189894e+00,  2.084980e+01,...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=689972485 to reproduce.
common: INFO: 1 of 100: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=471847954 to reproduce.
--------------------- >> end captured logging << ---------------------

@lebeg
Contributor Author

lebeg commented Dec 7, 2018

But even with a fixed seed, the test seems to fail non-deterministically.

@mseth10
Contributor

mseth10 commented Dec 11, 2018

This flaky test was previously identified (#8712) and fixed (#10365) for Python2: MKLDNN-CPU. During that fix (see the PR discussion), it was noted that the problem still exists for Python2: MKLDNN-GPU.

PR #10578 supposedly fixed the issue, but it appears the test still fails non-deterministically. Can you please have a look? @nihui @xinyu-intel @pengzhao-intel @zheng-da

@mseth10
Contributor

mseth10 commented Dec 11, 2018

Reproduction steps (from https://cwiki.apache.org/confluence/display/MXNET/Reproducing+test+results):

Spin up a p3.8xlarge instance with the Ubuntu base DLAMI and at least 150 GB of EBS storage.

Clone and build MXNet.

Enable the test: comment out *line 1634 in tests/python/unittest/test_operator.py:
# @unittest.skip("Flaky test https://github.com/apache/incubator-mxnet/issues/12203")

Run only this particular test 10,000 times (a seed-pinned single run is sketched after these steps): modify *line 735 in ci/docker/runtime_functions.sh to
MXNET_TEST_COUNT=10000 nosetests-2.7 $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu/test_operator_gpu.py:test_depthwise_convolution

Run the test:
ci/build.py --nvidiadocker --platform ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_python2_gpu

*Line numbers correspond to commit e25e18f
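If a single seed-pinned run is enough (instead of the 10,000-iteration loop above), the seed from the captured log can be exported before invoking nose. A rough sketch, assuming it is run from the MXNet source root inside the same GPU container the CI job uses, and that MXNET_TEST_COUNT=1 runs the test a single time:

import os
import subprocess

# Sketch only: pin the seed reported in the captured log (MXNET_TEST_SEED=640580015)
# and run just the failing test once instead of 10,000 times.
env = dict(os.environ, MXNET_TEST_SEED="640580015", MXNET_TEST_COUNT="1")
subprocess.check_call(
    ["nosetests-2.7", "--verbose",
     "tests/python/gpu/test_operator_gpu.py:test_depthwise_convolution"],
    env=env,
)

As noted further down in the thread, though, the failure may not reproduce deterministically even with the seed fixed.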

@pengzhao-intel
Contributor

@juliusshufan could you help take a look at this test case?

@mseth10
Contributor

mseth10 commented Dec 17, 2018

@juliusshufan ping

@juliusshufan
Contributor

@mseth10 @lebeg sorry for the late response.
I don't have access to a p3.8xlarge instance on AWS, so I made a GPU-MKLDNN build on a CentOS 7.4 server with an NVIDIA V100. The build command was:
make -j USE_MKLDNN=1 USE_BLAS=mkl USE_CUDA=1 USE_CUDA_PATH=XX USE_CUDNN=1 USE_OPENCV=1

I ran the test case 10,000 times with the same seed mentioned in the issue description, but could not reproduce the failure. May I have your comments?

@lebeg
Contributor Author

lebeg commented Dec 21, 2018

@juliusshufan We use dockerized builds and tests, so the host system shouldn't matter.

You should be able to reproduce the failure by following the steps @mseth10 mentioned above:

Checkout and build

git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet
pip3 install -r ci/requirements.txt
ci/build.py --platform ubuntu_build_cuda /work/runtime_functions.sh build_ubuntu_gpu_mkldnn

Enable the test

Comment out this line in file tests/python/unittest/test_operator.py:

# @unittest.skip("Flaky test https://github.com/apache/incubator-mxnet/issues/12203")

Speed up the testing

Running only this particular test 10,000 times: Modify the unittest_ubuntu_python2_gpu function in file ci/docker/runtime_functions.sh:

MXNET_TEST_COUNT=10000 nosetests-2.7 $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu/test_operator_gpu.py:test_depthwise_convolution

Run the test

In the exact environment where it fails:

ci/build.py --nvidiadocker --platform ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_python2_gpu
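For whoever picks this up, a quick reminder of what the test exercises: depthwise convolution is a grouped convolution with one kernel per input channel (num_group equal to the number of channels). The following plain-NumPy illustration of that computation, with the hypothetical helper depthwise_conv2d, is purely for context; it is not the MXNet code path under test:

import numpy as np

def depthwise_conv2d(x, w):
    # x: (C, H, W), w: (C, kH, kW) -> (C, H-kH+1, W-kW+1); stride 1, no padding.
    c, h, width = x.shape
    _, kh, kw = w.shape
    out = np.zeros((c, h - kh + 1, width - kw + 1), dtype=x.dtype)
    for ch in range(c):                      # each channel uses its own kernel
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[ch, i, j] = np.sum(x[ch, i:i + kh, j:j + kw] * w[ch])
    return out

x = np.random.randn(3, 9, 9).astype(np.float32)
w = np.random.randn(3, 3, 3).astype(np.float32)

# Compare a float32 result against a float64 reference with the same
# tolerances the test uses; small rounding differences should pass.
out32 = depthwise_conv2d(x, w)
out64 = depthwise_conv2d(x.astype(np.float64), w.astype(np.float64))
np.testing.assert_allclose(out32, out64, rtol=1e-3, atol=1e-3)

The flaky assertion compares two such results (the specialised GPU kernel versus a reference path) at rtol=1e-3, atol=1e-3, which occasionally trips on a small fraction of elements.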

@mseth10
Contributor

mseth10 commented Jan 14, 2019

@juliusshufan did you try the dockerized build and test commands on your system? Were you able to reproduce the failure?

@mseth10
Contributor

mseth10 commented Jan 29, 2019

Fix in #14016
