This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Flaky test test_gluon_rnn.test_layer_bidirectional #13103

Open
KellenSunderland opened this issue Nov 3, 2018 · 15 comments

Comments

@KellenSunderland
Contributor

Example failure: https://travis-ci.org/apache/incubator-mxnet/builds/450064964?utm_source=github_status&utm_medium=notification

======================================================================
FAIL: test_gluon_rnn.test_layer_bidirectional

Traceback (most recent call last):
File "/usr/local/Cellar/numpy/1.14.5/libexec/nose/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/Users/travis/build/apache/incubator-mxnet/tests/python/unittest/common.py", line 106, in test_new
orig_test(*args, **kwargs)
File "/Users/travis/build/apache/incubator-mxnet/tests/python/unittest/test_gluon_rnn.py", line 282, in test_layer_bidirectional
assert_allclose(net(data).asnumpy(), ref_net(data).asnumpy())
File "/usr/local/lib/python2.7/site-packages/numpy/testing/nose_tools/utils.py", line 1396, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python2.7/site-packages/numpy/testing/nose_tools/utils.py", line 779, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0
(mismatch 0.0649350649351%)
x: array([[[0.682853, 0.674969, 0.547395, ..., 0.997481, 0.998059
0.994295]
[0.652577, 0.653787, 0.478821, ..., 0.997182, 0.996606,...
y: array([[[0.682853, 0.674969, 0.547395, ..., 0.997481, 0.998059
0.994295]
[0.652577, 0.653787, 0.478821, ..., 0.997182, 0.996606,...
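
For context on what that error means: numpy's assert_allclose passes elementwise when abs(actual - desired) <= atol + rtol * abs(desired), so with atol=0 and rtol=1e-7 a difference of only a few float32 ULPs on a value near 1.0 is already a failure. A minimal sketch with made-up numbers in the same ballpark as the output above:

import numpy as np

# Made-up values of the same magnitude as the arrays above, chosen only to
# illustrate the check that assert_allclose performs.
actual, desired = 0.9942950, 0.9942953   # differ by ~3e-7, a few float32 ULPs

tol = 0.0 + 1e-7 * abs(desired)          # atol + rtol * |desired| with atol=0, rtol=1e-7
print(abs(actual - desired) <= tol)      # False: this pair alone fails the comparison

np.testing.assert_allclose(actual, desired, rtol=1e-7, atol=0)  # raises AssertionError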

@KellenSunderland
Contributor Author

@pengzhao-intel
Contributor

pengzhao-intel commented Nov 4, 2018

AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

Do you think the tolerance is too strict?

@frankfliu
Contributor

@mxnet-label-bot [Gluon, Flaky, Test]

@ZhennanQin
Contributor

@pengzhao-intel
Contributor

@rongzha1 please take a look at this issue.

@szha
Member

szha commented Apr 11, 2019

@perdasilva could you elaborate on the settings for these pipelines? Are they failing because of CPU tests or GPU tests?

@haojin2
Contributor

haojin2 commented Apr 12, 2019

@perdasilva Seems like the mismatch rate is very low (0.xxx%) while rtol=1e-7 and atol is just 0. I wonder if we could simply bump the tolerances up instead of disabling this test?

@perdasilva
Contributor

perdasilva commented Apr 12, 2019

@szha in the two cases I've linked to, it was tested against a binary compiled with your tools for static linking, and the variants used were cu80mkl and cu92mkl.

@haojin2 I'm happy to bump them, but I just wouldn't know what to bump them to =S. I'm not familiar with this side of the code and don't really know what reasonable tolerance levels would be.

@haojin2
Contributor

haojin2 commented Apr 12, 2019

@perdasilva Even rtol=2e-7 would suffice; please try running that particular test 10000 times. If you don't know how to do that, I can run it on my side.

@perdasilva
Contributor

@haojin2 I'll give it a go, and let you know how it goes. Thanks for the help!

@perdasilva
Contributor

@haojin2 no good:

======================================================================
FAIL: test_gluon_rnn.test_layer_bidirectional
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/unittest/common.py", line 110, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/unittest/test_gluon_rnn.py", line 283, in test_layer_bidirectional
    assert_allclose(net(data).asnumpy(), ref_net(data).asnumpy(), rtol=2e-7)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1452, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 789, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=2e-07, atol=0

(mismatch 0.06493506493507084%)
 x: array([0.424288, 0.560531, 0.600333, ..., 0.402131, 0.560952, 0.505039],
      dtype=float32)
 y: array([0.424288, 0.560531, 0.600333, ..., 0.402131, 0.560952, 0.505039],
      dtype=float32)
-------------------- >> begin captured logging << --------------------
tests.python.unittest.common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1305130208 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 0.030s

What I did to the test code (the rest of the test body is unchanged):

# Added import for the with_seed decorator
from tests.python.unittest.common import with_seed

# Added the with_seed decorator to the test function
@with_seed()
def test_layer_bidirectional():
    ...  # existing test body, which builds net, ref_net and data

    # Updated rtol in the assertion as suggested
    assert_allclose(net(data).asnumpy(), ref_net(data).asnumpy(), rtol=2e-7)

To test the changes, I used a g3.8xlarge instance with NVIDIA driver 418 and nvidia-docker:

# On host
$ docker run -ti -v `pwd`:/work/mxnet mxnetcd/build.ubuntu_cpu_static /bin/bash

# Within container
$ source tools/staticbuild/build.sh cu92mkl pip
$ exit

# On host
$ docker run -ti --runtime=nvidia -v `pwd`:/work/mxnet mxnetcd/build.ubuntu_gpu_cu92 /bin/bash

# Within container
$ export PYTHONPATH=./python/
$ MXNET_TEST_COUNT=10000 nosetests --logging-level=DEBUG --verbose -s tests/python/unittest/test_gluon_rnn.py:test_layer_bidirectional
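
A note on the harness, since it shows up in these logs: the with_seed decorator in tests/python/unittest/common.py prints the "Setting test np/mx/python random seeds, use MXNET_TEST_SEED=... to reproduce" line and, together with MXNET_TEST_COUNT, repeats the decorated test that many times. Roughly, each repetition seeds the three RNGs along these lines (a simplified sketch, not the actual implementation):

import os
import random
import numpy as np
import mxnet as mx

def reseed():
    # Simplified sketch of the per-run seeding that with_seed performs:
    # honour MXNET_TEST_SEED if it is set, otherwise draw a fresh seed,
    # then seed Python, NumPy and MXNet so a failure can be reproduced.
    seed = int(os.getenv('MXNET_TEST_SEED', np.random.randint(0, 2**31 - 1)))
    random.seed(seed)
    np.random.seed(seed)
    mx.random.seed(seed)
    return seed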

@pengzhao-intel
Contributor

Could you try rtol=1e-04, atol=1e-02?
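
Concretely, that suggestion amounts to relaxing the assertion in the test to something like this (same net, ref_net and data as in the existing test body); either term comfortably covers the ~1e-7-scale differences seen in the failures above:

assert_allclose(net(data).asnumpy(), ref_net(data).asnumpy(), rtol=1e-4, atol=1e-2)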

@perdasilva
Contributor

@pengzhao-intel

[DEBUG] 10000 of 10000: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=796240428 to reproduce.
ok

----------------------------------------------------------------------
Ran 1 test in 159.016s

OK

I'll close my skip-test PR and post a PR that fixes the test =)

@perdasilva
Contributor

@pengzhao-intel forgot to say thank you. Thank you! =D
