
Flaky test on Ubuntu: test_operator_gpu.test_batchnorm_with_type #10087

Closed
anirudhacharya opened this issue Mar 13, 2018 · 10 comments

Comments

@anirudhacharya
Member

anirudhacharya commented Mar 13, 2018

Description

Flaky test on ubuntu_gpu: test_operator_gpu.test_batchnorm_with_type fails intermittently.

It is a precision error.

Environment info (Required)

Package used (Python/R/Scala/Julia):
Python

MXNet commit hash:
8bf1ff1

Link to the CI run log:
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-9963/runs/47/nodes/486/log/?start=0

Error Message:

FAIL: test_operator_gpu.test_batchnorm_with_type

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 320, in test_batchnorm_with_type
    check_consistency(sym, ctx_list_v2_2D)
  File "/work/mxnet/python/mxnet/test_utils.py", line 1346, in check_consistency
    raise e
  File "/work/mxnet/python/mxnet/test_utils.py", line 1341, in check_consistency
    equal_nan=equal_nan)
  File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1.588932 exceeds tolerance rtol=0.100000, atol=0.100000. Location of maximum error: (1,), a=-0.082301, b=0.091003
 a: array([-1308.3785 , -0.08230112], dtype=float32)
 b: array([-1310. , 0.091], dtype=float16)

Steps to reproduce

  1. Not able to reproduce locally
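For context on the reported error number: the check below is a sketch that assumes the violation metric used by mxnet.test_utils.assert_almost_equal is max(|a - b| / (atol + rtol * |b|)); plugging in the values from the log above gives roughly the reported 1.59, which is why even a 10% tolerance is exceeded.

```python
import numpy as np

# Values at the reported location of maximum error: "a" from the float32
# context, "b" from the float16 context (taken from the log above).
a = np.array([-1308.3785, -0.08230112], dtype=np.float32)
b = np.array([-1310.0, 0.091], dtype=np.float16)
rtol, atol = 0.1, 0.1

# Assumed violation metric: |a - b| / (atol + rtol * |b|), evaluated elementwise.
tol = atol + rtol * np.abs(b)
violation = np.abs(a - b) / tol
print(violation)   # roughly [0.01, 1.59]; the element at index (1,) exceeds 1.0
```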
@cjolivier01
Member

Did this test ever fail before the MKL changes?

@anirudhacharya
Member Author

@cjolivier01 This failure was yesterday. I am not sure which MKL changes you are referring to. If you mean #9862, then yes, it was after the MKL changes.

@cjolivier01
Member

@marcoabreu, do you know if this test used to fail intermittently before the MKL changes?

@zheng-da
Contributor

I have observed this problem before: #9916.
It's a precision problem in batchnorm.
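For anyone curious about the size of the mismatch: the float16 context runs the whole forward/backward pass in half precision, so the error compounds well beyond a single rounding step. A small, generic illustration of float16 behaviour (not the batchnorm kernel itself):

```python
import numpy as np

# float16 keeps only a 10-bit mantissa (~3 significant decimal digits):
print(np.finfo(np.float16).eps)     # ~0.000977
print(np.float16(-1308.3785))       # -1308.0; the grid spacing near 1308 is 1.0

# Long float16 accumulations (as in a gradient reduction) compound rounding
# far beyond a single cast; the classic example is a counter that stalls:
acc = np.float16(0.0)
for _ in range(3000):
    acc += np.float16(1.0)
print(acc)                          # 2048.0, not 3000.0: 2049 is not representable
```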

@reminisce
Contributor

We are going to make the unit test stable. See here for comments and action items.
#9916 (comment)

@cjolivier01
Member

Great! Thanks!

@aaronmarkham
Contributor

aaronmarkham commented Jun 7, 2018

Another set of failures here (2nd time in two days for two different PRs):
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11180/1/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11157/2/pipeline/

It has been almost three months without an improvement, so can we disable this test in the meantime? A possible way to do that is sketched below.
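A minimal sketch of one way to disable it temporarily: nose honors unittest's skip decorator, so the existing test_batchnorm_with_type in tests/python/gpu/test_operator_gpu.py could be marked like this (illustrative only; the body stays unchanged):

```python
import unittest

@unittest.skip("Flaky due to float16 precision, tracked in #10087; re-enable once fixed")
def test_batchnorm_with_type():
    # ... existing test body unchanged ...
    pass
```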

@larroy
Contributor

larroy commented Jun 10, 2018

@marcoabreu
Contributor

marcoabreu commented Jul 13, 2018

Flakiness has not been solved: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1187/pipeline

======================================================================
FAIL: test_operator_gpu.test_batchnorm_with_type
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\envs\py3\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\common.py", line 175, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\test_operator_gpu.py", line 339, in test_batchnorm_with_type
    check_consistency(sym, ctx_list_v2_3D)
  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu_mkldnn\python\mxnet\test_utils.py", line 1354, in check_consistency
    raise e
  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu_mkldnn\python\mxnet\test_utils.py", line 1349, in check_consistency
    equal_nan=equal_nan)
  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu_mkldnn\python\mxnet\test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1.114559 exceeds tolerance rtol=0.100000, atol=0.100000. Location of maximum error: (1,), a=-0.217708, b=-0.370361
 a: array([-96.87876892,  -0.21770805], dtype=float32)
 b: array([-97.        ,  -0.37036133], dtype=float16)
-------------------- >> begin captured stdout << ---------------------
Train Err: ctx 0 vs ctx 2 at norm_gamma
--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=84210791 to reproduce.
--------------------- >> end captured logging << ---------------------
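The captured log gives the seed, so the failing configuration can at least be replayed locally. A sketch, assuming nose is installed and this is run from the MXNet source root (per the log line above, the seeding helper in common.py reads MXNET_TEST_SEED):

```python
import os
import subprocess

# Replay the failing run with the seed from the captured log; assumes nose is
# installed and this is executed from the MXNet source root.
env = dict(os.environ, MXNET_TEST_SEED="84210791")
subprocess.run(
    ["nosetests", "--verbose",
     "tests/python/gpu/test_operator_gpu.py:test_batchnorm_with_type"],
    env=env,
    check=False,   # report the result without raising on a test failure
)
```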

@sandeep-krishnamurthy
Contributor

Resolving via the fix from @hcho3 in PR #11873.
