This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Flaky test: test_operator.test_op_roi_align #11064

Closed
eric-haibin-lin opened this issue May 25, 2018 · 9 comments · Fixed by #13609

Comments

@eric-haibin-lin
Member

======================================================================
FAIL: test_operator.test_op_roi_align
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/unittest/test_operator.py", line 6170, in test_op_roi_align
    test_roi_align_value()
  File "/work/mxnet/tests/python/unittest/test_operator.py", line 6149, in test_roi_align_value
    assert np.allclose(data.grad.asnumpy(), dx, atol = 1e-6), np.abs(data.grad.asnumpy() - dx).max()
AssertionError: 1.3150275e-06
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1619190489 to reproduce.
--------------------- >> end captured logging << ---------------------

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11058/1/pipeline

@zhreshold
Member

Relaxing the check with rtol should be fine; the diff is acceptable.

@haojin2
Contributor

haojin2 commented May 25, 2018

@zhreshold I increased the rtol to 1e-5 and the test passed 500 consecutive runs. The change is included in #11058.
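For context on why the original assertion is sensitive to values near zero, here is a minimal sketch of the element-wise check that np.allclose documents: an element passes when |actual - desired| <= atol + rtol * |desired|. The helper name within_tolerance and the two reference values are hypothetical and chosen only for illustration; the only number taken from the report is the maximum difference of 1.3150275e-06.

```python
import numpy as np

def within_tolerance(actual, desired, rtol=1e-5, atol=1e-8):
    """Element-wise form of the np.allclose criterion (hypothetical helper)."""
    actual = np.asarray(actual, dtype=np.float64)
    desired = np.asarray(desired, dtype=np.float64)
    return np.abs(actual - desired) <= atol + rtol * np.abs(desired)

diff = 1.3150275e-06  # maximum difference from the failure report

# Hypothetical reference values: one at zero, one of order 1.
desired = np.array([0.0, 1.0])
actual = desired + diff

# With atol=1e-6 (as in the failing assert) and the default rtol=1e-5,
# the element at zero fails while the element of order 1 passes.
near_zero_ok, order_one_ok = within_tolerance(actual, desired, atol=1e-6)
print(near_zero_ok, order_one_ok)
```

Under this criterion rtol contributes nothing where the reference value is zero, so only atol decides the outcome for such elements.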

@marcoabreu
Contributor

Hi @anirudhacharya, please note that the problem is not this test; many tests are failing, as you can see if you scroll up. This is documented in #11395.

@ThomasDelteil
Contributor

Does it still happen? This seems different from what @anirudhacharya reported, in that I cannot see any other failures above that test: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12542/3/pipeline

@jlcontreras
Contributor

Seems to still happen:

======================================================================
FAIL: test_operator_gpu.test_op_roi_align
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 173, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 6994, in test_op_roi_align
    test_roi_align_value()
  File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 6970, in test_roi_align_value
    assert np.allclose(output.asnumpy(), real_output)
AssertionError:
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=35650200 to reproduce.
--------------------- >> end captured logging << ---------------------

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/master/45/pipeline

@zhreshold
Member

Maybe we should take another look. However, the log doesn't show the detailed error anymore?

@wkcn
Member

wkcn commented Dec 10, 2018

Maybe there is some difference between the C++ implementation and the Python implementation. I will check it.

There is a floating-point precision problem.
I may have fixed it.

@wkcn
Member

wkcn commented Dec 11, 2018

In this unit test, real_output is computed in float64, so there is a floating-point precision problem.

When MXNET_TEST_SEED=35650200, the error is:

Error 4.049840 exceeds tolerance rtol=0.000010, atol=0.000000.  Location of maximum error:(6, 0, 0, 1), a=0.005887, b=0.005887
 a: array([[[[  172.60527039,   173.5171814 ,   174.42907715,   175.34100342],
         [  195.92706299,   196.83895874,   197.75085449,   198.6627655 ],
         [  219.24882507,   220.16073608,   221.07263184,   221.98455811]],...
 b: array([[[[  172.60527039,   173.5171814 ,   174.42907715,   175.34100342],
         [  195.92706299,   196.83895874,   197.75085449,   198.6627655 ],
         [  219.24884033,   220.16075134,   221.07266235,   221.98455811]],...
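As a rough check on the printed arrays above, the sketch below reads the first entries of the last visible row (219.24882507 in a, 219.24884033 in b) as float32 numbers and compares their gap to the float32 spacing at that magnitude. Treating the printed decimals as float32 values is an assumption made for illustration.

```python
import numpy as np

# First entries of the last visible row of a and b, read as float32
# (an assumption: the printout does not state the dtype).
a_val = np.float32(219.24882507)
b_val = np.float32(219.24884033)

# Distance from a_val to the next representable float32 value (one ULP).
ulp = np.spacing(a_val)

# Gap between the two values, computed in float64 to avoid further rounding.
gap = np.float64(b_val) - np.float64(a_val)
relative_gap = gap / np.float64(b_val)

print(ulp, gap, relative_gap)
```

Under that assumption the two values are neighbouring float32 numbers, and their relative gap is far below rtol=1e-5, so the large elements shown here are not the ones that trip the check.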

I think atol should not be 0.
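To illustrate why a zero atol is fragile, here is a minimal sketch that assumes the reported error is the ratio |a - b| / (atol + rtol * |b|), with values above 1 meaning failure. The helper max_violation is hypothetical, not the test utility itself. The small element uses the printed value 0.005887 with an illustrative difference of 2.4e-7, roughly what a ratio of about 4 would imply under that assumption, and atol=1e-6 is an example value rather than the one chosen in the fix.

```python
import numpy as np

def max_violation(a, b, rtol, atol):
    """Return (index, ratio) of the worst element; ratio > 1 means failure.

    Hypothetical helper. Assumes the error is |a - b| / (atol + rtol * |b|).
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # The tiny constant only guards against division by zero.
    ratio = np.abs(a - b) / (atol + rtol * np.abs(b) + 1e-20)
    idx = int(np.argmax(ratio))
    return idx, float(ratio[idx])

# One large element from the printed arrays and one small element near
# the reported location; the 2.4e-7 offset is illustrative.
b = np.array([221.07266235, 0.005887])
a = np.array([221.07263184, 0.005887 + 2.4e-7])

idx_rel_only, ratio_rel_only = max_violation(a, b, rtol=1e-5, atol=0.0)
idx_with_atol, ratio_with_atol = max_violation(a, b, rtol=1e-5, atol=1e-6)

print(idx_rel_only, ratio_rel_only)
print(idx_with_atol, ratio_with_atol)
```

With atol=0 the small element dominates and exceeds the tolerance even though its absolute difference is tiny; a small nonzero atol brings the same element back within bounds.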


8 participants