flaky test: test_operator_gpu.test_depthwise_convolution #12203

Closed
lebeg opened this issue Aug 16, 2018 · 14 comments · Fixed by #14016
Comments

@lebeg
Contributor

lebeg commented Aug 16, 2018

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12181/4/pipeline

======================================================================
FAIL: test_operator_gpu.test_depthwise_convolution
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 1663, in test_depthwise_convolution
    np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(), rtol=1e-3, atol=1e-3)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/nose_tools/utils.py", line 1396, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/nose_tools/utils.py", line 779, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=0.001

(mismatch 2.040816326530617%)
 x: array([[[[  7.347052,  -1.722254,   7.837829,   4.21605 ,  -1.359475,
            1.55463 ,   6.701931],
         [ 11.283103,  12.302897,  -9.111632,  -3.390831,  -4.708895,...
 y: array([[[[  7.348634,  -1.720118,   7.836634,   4.217753,  -1.361165,
            1.552893,   6.695152],
         [ 11.284711,  12.303747,  -9.112375,  -3.390379,  -4.709879,...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=640580015 to reproduce.
--------------------- >> end captured logging << ---------------------
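For reference, np.testing.assert_allclose (the call in test_operator.py above) flags every element where |x - y| > atol + rtol * |y| and reports the fraction of flagged elements as the mismatch percentage. A minimal NumPy sketch of that check, using a few values copied from the failure output above (not MXNet code, just the tolerance arithmetic):

import numpy as np

# Values copied from the x/y excerpts in the failure output.
x = np.array([7.347052, -1.722254, 7.837829, 4.21605, -1.359475, 1.55463, 6.701931])
y = np.array([7.348634, -1.720118, 7.836634, 4.217753, -1.361165, 1.552893, 6.695152])

rtol, atol = 1e-3, 1e-3
# assert_allclose flags an element when |x - y| > atol + rtol * |y|.
flagged = np.abs(x - y) > atol + rtol * np.abs(y)
print("mismatch: %.2f%% of elements" % (100.0 * flagged.mean()))
# For this small excerpt nothing exceeds the tolerance; the full arrays
# differ in about 2% of elements, per the report above.

# The test itself makes the equivalent call on the full arrays:
# np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(), rtol=1e-3, atol=1e-3)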

@ankkhedia
Contributor

@lebeg Thanks for filing the issue. We will look into it.

@mseth10
Contributor

mseth10 commented Aug 30, 2018

Fix in #12402

@haojin2
Contributor

haojin2 commented Sep 1, 2018

@lebeg
Contributor Author

lebeg commented Oct 9, 2018

As far as I know this is still an issue:
#12441

@lebeg
Contributor Author

lebeg commented Dec 7, 2018

During testing we found another failure seed:

======================================================================
FAIL: test_operator_gpu.test_depthwise_convolution
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 173, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 1676, in test_depthwise_convolution
    np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(), rtol=1e-3, atol=1e-3)
  File "/usr/local/lib/python2.7/dist-packages/numpy/testing/_private/utils.py", line 1452, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python2.7/dist-packages/numpy/testing/_private/utils.py", line 789, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=0.001

(mismatch 3.0612244898%)
 x: array([ 4.791068e+00,  1.593453e+01,  1.434397e+01,  1.545888e+01,
        1.460622e+01, -3.660450e+00,  8.265715e+00, -1.411026e+00,
        2.041084e+01,  1.641194e+01,  6.190044e+00,  2.084945e+01,...
 y: array([ 4.790918e+00,  1.593650e+01,  1.434332e+01,  1.545896e+01,
        1.460644e+01, -3.660306e+00,  8.265750e+00, -1.410326e+00,
        2.041158e+01,  1.641292e+01,  6.189894e+00,  2.084980e+01,...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=689972485 to reproduce.
common: INFO: 1 of 100: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=471847954 to reproduce.
--------------------- >> end captured logging << ---------------------

@lebeg
Contributor Author

lebeg commented Dec 7, 2018

But even with a fixed seed, the test seems to fail non-deterministically.

@mseth10
Contributor

mseth10 commented Dec 11, 2018

This flaky test was previously identified (#8712) and fixed (#10365) for Python2: MKLDNN-CPU. During that fix (see the PR discussion), it was noted that the problem still exists for Python2: MKLDNN-GPU.

PR #10578 supposedly fixed the issue, but it appears the test still fails non-deterministically. Can you please have a look? @nihui @xinyu-intel @pengzhao-intel @zheng-da

@mseth10
Contributor

mseth10 commented Dec 11, 2018

Reproduction steps (from https://cwiki.apache.org/confluence/display/MXNET/Reproducing+test+results):

Spin up a p3.8xlarge instance with the Ubuntu base DLAMI and at least 150 GB of EBS storage.

Clone and build MXNet.

Enable the test: comment out *line 1634 in tests/python/unittest/test_operator.py:
# @unittest.skip("Flaky test https://github.com/apache/incubator-mxnet/issues/12203")

Run only this particular test 10,000 times (a seed-pinned single run is sketched after these steps): modify *line 735 in ci/docker/runtime_functions.sh to
MXNET_TEST_COUNT=10000 nosetests-2.7 $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu/test_operator_gpu.py:test_depthwise_convolution

Run the test:
ci/build.py --nvidiadocker --platform ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_python2_gpu

*Line numbers correspond to commit e25e18f
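If a single seed-pinned run is enough (instead of the 10,000-iteration loop above), the seed from the captured log can be exported before invoking nose. A rough sketch, assuming it is run from the MXNet source root inside the same GPU container the CI job uses, and that MXNET_TEST_COUNT=1 runs the test a single time:

import os
import subprocess

# Sketch only: pin the seed reported in the captured log (MXNET_TEST_SEED=640580015)
# and run just the failing test once instead of 10,000 times.
env = dict(os.environ, MXNET_TEST_SEED="640580015", MXNET_TEST_COUNT="1")
subprocess.check_call(
    ["nosetests-2.7", "--verbose",
     "tests/python/gpu/test_operator_gpu.py:test_depthwise_convolution"],
    env=env,
)

As noted further down in the thread, though, the failure may not reproduce deterministically even with the seed fixed.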

@pengzhao-intel
Contributor

@juliusshufan could you help take a look at this test case?

@mseth10
Contributor

mseth10 commented Dec 17, 2018

@juliusshufan ping

@juliusshufan
Contributor

@mseth10 @lebeg sorry for the late response.
I don't have access to a p3.8xlarge instance on AWS, so I made a GPU-MKLDNN build on a CentOS 7.4 server with an NVIDIA V100. The build command was:
make -j USE_MKLDNN=1 USE_BLAS=mkl USE_CUDA=1 USE_CUDA_PATH=XX USE_CUDNN=1 USE_OPENCV=1

I ran the test case 10,000 times with the same seed mentioned in the issue description, but could not reproduce the failure. May I have your comments?

@lebeg
Contributor Author

lebeg commented Dec 21, 2018

@juliusshufan We use dockerized builds and tests, so the host system shouldn't matter.

You should be able to reproduce the failure by following the steps @mseth10 mentioned above:

Checkout and build

git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet
pip3 install -r ci/requirements.txt
ci/build.py --platform ubuntu_build_cuda /work/runtime_functions.sh build_ubuntu_gpu_mkldnn

Enable the test

Comment out this line in file tests/python/unittest/test_operator.py:

# @unittest.skip("Flaky test https://github.com/apache/incubator-mxnet/issues/12203")

Speed up the testing

Running only this particular test 10,000 times: Modify the unittest_ubuntu_python2_gpu function in file ci/docker/runtime_functions.sh:

MXNET_TEST_COUNT=10000 nosetests-2.7 $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu/test_operator_gpu.py:test_depthwise_convolution

Run the test

In the exact environment where it fails:

ci/build.py --nvidiadocker --platform ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_python2_gpu
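For whoever picks this up, a quick reminder of what the test exercises: depthwise convolution is a grouped convolution with one kernel per input channel (num_group equal to the number of channels). The following plain-NumPy illustration of that computation, with the hypothetical helper depthwise_conv2d, is purely for context; it is not the MXNet code path under test:

import numpy as np

def depthwise_conv2d(x, w):
    # x: (C, H, W), w: (C, kH, kW) -> (C, H-kH+1, W-kW+1); stride 1, no padding.
    c, h, width = x.shape
    _, kh, kw = w.shape
    out = np.zeros((c, h - kh + 1, width - kw + 1), dtype=x.dtype)
    for ch in range(c):                      # each channel uses its own kernel
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[ch, i, j] = np.sum(x[ch, i:i + kh, j:j + kw] * w[ch])
    return out

x = np.random.randn(3, 9, 9).astype(np.float32)
w = np.random.randn(3, 3, 3).astype(np.float32)

# Compare a float32 result against a float64 reference with the same
# tolerances the test uses; small rounding differences should pass.
out32 = depthwise_conv2d(x, w)
out64 = depthwise_conv2d(x.astype(np.float64), w.astype(np.float64))
np.testing.assert_allclose(out32, out64, rtol=1e-3, atol=1e-3)

The flaky assertion compares two such results (the specialised GPU kernel versus a reference path) at rtol=1e-3, atol=1e-3, which occasionally trips on a small fraction of elements.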

@mseth10
Contributor

mseth10 commented Jan 14, 2019

@juliusshufan did you try the dockerized build and test commands on your system? Were you able to reproduce the failure?

@mseth10
Contributor

mseth10 commented Jan 29, 2019

Fix in #14016
