
Flaky test on Ubuntu: test_operator_gpu.test_batchnorm_with_type #10087

Closed
anirudhacharya opened this issue Mar 13, 2018 · 10 comments

Comments

@anirudhacharya
Member

anirudhacharya commented Mar 13, 2018

Description

Flaky test on ubuntu_gpu: test_operator_gpu.test_batchnorm_with_type fails intermittently.

It is a precision error.

Environment info (Required)

Package used (Python/R/Scala/Julia):
Python

MXNet commit hash:
8bf1ff1

Link to the CI run log:
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-9963/runs/47/nodes/486/log/?start=0

Error Message:

FAIL: test_operator_gpu.test_batchnorm_with_type

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 320, in test_batchnorm_with_type
    check_consistency(sym, ctx_list_v2_2D)
  File "/work/mxnet/python/mxnet/test_utils.py", line 1346, in check_consistency
    raise e
  File "/work/mxnet/python/mxnet/test_utils.py", line 1341, in check_consistency
    equal_nan=equal_nan)
  File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1.588932 exceeds tolerance rtol=0.100000, atol=0.100000. Location of maximum error: (1,), a=-0.082301, b=0.091003
 a: array([-1308.3785 , -0.08230112], dtype=float32)
 b: array([-1310. , 0.091], dtype=float16)

Steps to reproduce

  1. Not able to reproduce locally
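For context on the reported error number: the check below is a sketch that assumes the violation metric used by mxnet.test_utils.assert_almost_equal is max(|a - b| / (atol + rtol * |b|)); plugging in the values from the log above gives roughly the reported 1.59, which is why even a 10% tolerance is exceeded.

```python
import numpy as np

# Values at the reported location of maximum error: "a" from the float32
# context, "b" from the float16 context (taken from the log above).
a = np.array([-1308.3785, -0.08230112], dtype=np.float32)
b = np.array([-1310.0, 0.091], dtype=np.float16)
rtol, atol = 0.1, 0.1

# Assumed violation metric: |a - b| / (atol + rtol * |b|), evaluated elementwise.
tol = atol + rtol * np.abs(b)
violation = np.abs(a - b) / tol
print(violation)   # roughly [0.01, 1.59]; the element at index (1,) exceeds 1.0
```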
@cjolivier01
Member

Did this test ever fail before the MKL changes?

@anirudhacharya
Member Author

@cjolivier01 This failure was yesterday. I am not sure which MKL changes you are referring to. If you mean #9862, then yes, it was after the MKL changes.

@cjolivier01
Member

@marcoabreu, do you know if this test used to fail intermittently before the MKL changes?

@zheng-da
Contributor

I have observed this problem before: #9916.
It's a precision problem in batchnorm.
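For anyone curious about the size of the mismatch: the float16 context runs the whole forward/backward pass in half precision, so the error compounds well beyond a single rounding step. A small, generic illustration of float16 behaviour (not the batchnorm kernel itself):

```python
import numpy as np

# float16 keeps only a 10-bit mantissa (~3 significant decimal digits):
print(np.finfo(np.float16).eps)     # ~0.000977
print(np.float16(-1308.3785))       # -1308.0; the grid spacing near 1308 is 1.0

# Long float16 accumulations (as in a gradient reduction) compound rounding
# far beyond a single cast; the classic example is a counter that stalls:
acc = np.float16(0.0)
for _ in range(3000):
    acc += np.float16(1.0)
print(acc)                          # 2048.0, not 3000.0: 2049 is not representable
```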

@reminisce
Contributor

We are going to make the unit test stable. See here for comments and action items.
#9916 (comment)

@cjolivier01
Member

Great! Thanks!

@aaronmarkham
Contributor

aaronmarkham commented Jun 7, 2018

Another set of failures here (2nd time in two days for two different PRs):
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11180/1/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11157/2/pipeline/

It has been almost three months without an improvement, so can we disable this test in the meantime? A possible way to do that is sketched below.
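A minimal sketch of one way to disable it temporarily: nose honors unittest's skip decorator, so the existing test_batchnorm_with_type in tests/python/gpu/test_operator_gpu.py could be marked like this (illustrative only; the body stays unchanged):

```python
import unittest

@unittest.skip("Flaky due to float16 precision, tracked in #10087; re-enable once fixed")
def test_batchnorm_with_type():
    # ... existing test body unchanged ...
    pass
```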

@larroy
Contributor

larroy commented Jun 10, 2018

@marcoabreu
Contributor

marcoabreu commented Jul 13, 2018

Flakiness has not been solved: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1187/pipeline

======================================================================
FAIL: test_operator_gpu.test_batchnorm_with_type
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\envs\py3\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\common.py", line 175, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\test_operator_gpu.py", line 339, in test_batchnorm_with_type
    check_consistency(sym, ctx_list_v2_3D)
  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu_mkldnn\python\mxnet\test_utils.py", line 1354, in check_consistency
    raise e
  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu_mkldnn\python\mxnet\test_utils.py", line 1349, in check_consistency
    equal_nan=equal_nan)
  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu_mkldnn\python\mxnet\test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1.114559 exceeds tolerance rtol=0.100000, atol=0.100000. Location of maximum error: (1,), a=-0.217708, b=-0.370361
 a: array([-96.87876892,  -0.21770805], dtype=float32)
 b: array([-97.        ,  -0.37036133], dtype=float16)
-------------------- >> begin captured stdout << ---------------------
Train Err: ctx 0 vs ctx 2 at norm_gamma
--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=84210791 to reproduce.
--------------------- >> end captured logging << ---------------------
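The captured log gives the seed, so the failing configuration can at least be replayed locally. A sketch, assuming nose is installed and this is run from the MXNet source root (per the log line above, the seeding helper in common.py reads MXNET_TEST_SEED):

```python
import os
import subprocess

# Replay the failing run with the seed from the captured log; assumes nose is
# installed and this is executed from the MXNet source root.
env = dict(os.environ, MXNET_TEST_SEED="84210791")
subprocess.run(
    ["nosetests", "--verbose",
     "tests/python/gpu/test_operator_gpu.py:test_batchnorm_with_type"],
    env=env,
    check=False,   # report the result without raising on a test failure
)
```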

@sandeep-krishnamurthy
Contributor

Resolving via the fix from @hcho3 in PR #11873.
