This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

flaky test_bf16_operator.test_bf16_bn #17669

Closed
haojin2 opened this issue Feb 24, 2020 · 5 comments · Fixed by #17673

Comments

@haojin2
Contributor

haojin2 commented Feb 24, 2020

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-17619/6/pipeline

[2020-02-22T02:22:11.175Z] ======================================================================
[2020-02-22T02:22:11.175Z] FAIL: test_bf16_operator.test_bf16_bn
[2020-02-22T02:22:11.175Z] ----------------------------------------------------------------------
[2020-02-22T02:22:11.175Z] Traceback (most recent call last):
[2020-02-22T02:22:11.175Z]   File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
[2020-02-22T02:22:11.175Z]     self.test(*self.arg)
[2020-02-22T02:22:11.175Z]   File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 215, in test_new
[2020-02-22T02:22:11.175Z]     orig_test(*args, **kwargs)
[2020-02-22T02:22:11.175Z]   File "/work/mxnet/tests/python/mkl/test_bf16_operator.py", line 130, in test_bf16_bn
[2020-02-22T02:22:11.175Z]     check_operator_accuracy(sym_fp32=bn_fp32, sym_bf16=bn_bf16, data_shape=(32, 16, 64, 64), bf16_use_fp32_params=True, etol=1e-3)
[2020-02-22T02:22:11.175Z]   File "/work/mxnet/tests/python/mkl/test_bf16_operator.py", line 118, in check_operator_accuracy
[2020-02-22T02:22:11.175Z]     assert_almost_equal_with_err(output_bf16_2_fp32, output_fp32, rtol=rtol, atol=atol, etol=etol)
[2020-02-22T02:22:11.175Z]   File "/work/mxnet/python/mxnet/test_utils.py", line 697, in assert_almost_equal_with_err
[2020-02-22T02:22:11.175Z]     raise AssertionError(msg)
[2020-02-22T02:22:11.175Z] AssertionError:
[2020-02-22T02:22:11.175Z] Items are not equal:
[2020-02-22T02:22:11.175Z] Error 5.486012 exceeds tolerance rtol=1.000000e-01, atol=5.000000e-01 (mismatch at least 0.000525%).
[2020-02-22T02:22:11.175Z] Location of maximum error: (8, 10, 57, 30), a=7.40625000, b=3.01126194
[2020-02-22T02:22:11.175Z]  ACTUAL: array([[[[ 6.5       ,  6.71875   ,  6.5625    , ...,  7.3125    ,
[2020-02-22T02:22:11.175Z]            6.53125   ,  7.3125    ],
[2020-02-22T02:22:11.175Z]          [ 7.3125    ,  6.71875   ,  6.84375   , ...,  6.53125   ,...
[2020-02-22T02:22:11.175Z]  DESIRED: array([[[[ 6.4971795 ,  6.7077727 ,  6.5601    , ...,  7.2980003 ,
[2020-02-22T02:22:11.175Z]            6.5235353 ,  7.306244  ],
[2020-02-22T02:22:11.175Z]          [ 7.3047543 ,  6.7267437 ,  6.8369255 , ...,  6.541499  ,...
[2020-02-22T02:22:11.175Z] -------------------- >> begin captured stdout << ---------------------
[2020-02-22T02:22:11.175Z]
[2020-02-22T02:22:11.175Z] *** Maximum errors for vector of size 2097152:  rtol=0.1, atol=0.5
[2020-02-22T02:22:11.175Z]
[2020-02-22T02:22:11.175Z]   1: Error 5.486012  Location of error: (8, 10, 57, 30), a=7.40625000, b=3.01126194
[2020-02-22T02:22:11.175Z]   2: Error 5.464664  Location of error: (29, 10, 37, 13), a=-1.37500000, b=2.99279308
[2020-02-22T02:22:11.175Z]   3: Error 5.440769  Location of error: (11, 10, 42, 27), a=-1.37500000, b=2.95090246
[2020-02-22T02:22:11.175Z]   4: Error 5.436756  Location of error: (3, 10, 58, 39), a=7.40625000, b=3.03682470
[2020-02-22T02:22:11.175Z]   5: Error 5.431772  Location of error: (22, 10, 47, 30), a=-1.37500000, b=2.93524361
[2020-02-22T02:22:11.175Z]   6: Error 5.425523  Location of error: (18, 10, 28, 4), a=-1.37500000, b=2.92440319
[2020-02-22T02:22:11.175Z]   7: Error 5.420064  Location of error: (23, 10, 48, 4), a=7.40625000, b=3.04552412
[2020-02-22T02:22:11.175Z]   8: Error 5.419101  Location of error: (6, 10, 14, 20), a=-1.37500000, b=2.91329479
[2020-02-22T02:22:11.175Z]   9: Error 5.410485  Location of error: (12, 10, 42, 59), a=-1.37500000, b=2.89843893
[2020-02-22T02:22:11.175Z]  10: Error 5.396444  Location of error: (23, 10, 33, 15), a=-1.37500000, b=2.87434864
[2020-02-22T02:22:11.175Z]
[2020-02-22T02:22:11.175Z] --------------------- >> end captured stdout << ----------------------
[2020-02-22T02:22:11.175Z] -------------------- >> begin captured logging << --------------------
[2020-02-22T02:22:11.175Z] root: INFO: TVM op config has been loaded
[2020-02-22T02:22:11.175Z] common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=259105799 to reproduce.
[2020-02-22T02:22:11.175Z] --------------------- >> end captured logging << ---------------------
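
For anyone reading the numbers in the log: the reported "Error" values appear consistent with the ratio |a - b| / (atol + rtol * |b|), where anything above 1 is out of tolerance. The sketch below is only a sanity check on the logged values. The function name `violation` is made up for illustration and is not claimed to be the code in `mxnet/test_utils.py`.

```python
import numpy as np

def violation(a, b, rtol, atol):
    """Hypothetical helper: absolute difference divided by the allowed tolerance.

    A result above 1 means the pair (a, b) is outside rtol/atol.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return np.abs(a - b) / (atol + rtol * np.abs(b))

# (a, b) pairs copied from entries 1 and 2 of the captured stdout above
first = float(violation(7.40625, 3.01126194, rtol=0.1, atol=0.5))
second = float(violation(-1.375, 2.99279308, rtol=0.1, atol=0.5))
print(first, second)
```

With the logged rtol=0.1 and atol=0.5 this gives roughly 5.486 and 5.465, in line with the first two "Error" entries.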
@TaoLv
Member

TaoLv commented Feb 24, 2020

@rongzha1 Could you please take a look at this issue?

@rongzha1
Contributor

rongzha1 commented Feb 24, 2020

@ElaineBao @ZhennanQin Could you please double-check this? As we discussed, we can't avoid this kind of cast error. What about skipping this case, or using a fixed seed?

@ElaineBao
Contributor

Yes, this happens when the variance of BN takes certain values for which the bf16 computation cannot match the fp32 result. As you can see, these errors all occur in the same channel (channel 10 in the case above), because the elements of a channel share the same variance. This case can be fixed either by enlarging the error tolerance or by using a fixed seed.
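
To give a feel for the precision gap between bf16 and fp32, here is a minimal NumPy sketch that emulates bf16 rounding by keeping only the upper 16 bits of a float32 (round to nearest, ties to even). `to_bf16` is a hypothetical helper written for this illustration, not an MXNet or oneDNN API, and it assumes finite inputs. It does not reproduce the channel-10 failure itself; it only shows the rounding granularity involved.

```python
import numpy as np

def to_bf16(x):
    """Hypothetical helper: round float32 values to the nearest bfloat16.

    The result is returned as float32 so it can be compared directly.
    Assumes finite inputs (no NaN/Inf handling).
    """
    x = np.ascontiguousarray(x, dtype=np.float32)
    bits = x.view(np.uint32)
    # Adding 0x7FFF plus the lowest kept bit implements round-to-nearest-even
    rounding = ((bits >> 16) & np.uint32(1)) + np.uint32(0x7FFF)
    rounded = (bits + rounding) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)

# First visible row of the DESIRED (fp32) and ACTUAL (bf16) arrays in the log
desired = np.array([6.4971795, 6.7077727, 6.5601, 7.2980003, 6.5235353, 7.306244],
                   dtype=np.float32)
actual = np.array([6.5, 6.71875, 6.5625, 7.3125, 6.53125, 7.3125], dtype=np.float32)

emulated = to_bf16(desired)
rel_err = np.abs(emulated - desired) / np.abs(desired)
print(emulated)
print(rel_err.max())
```

bf16 keeps 8 significant bits, so values in [4, 8) are spaced 0.03125 apart and the relative rounding error is at most 2^-8 (about 0.4%). For these few elements, rounding the fp32 reference to bf16 gives the same values as the ACTUAL row, so the ordinary elements are within expected bf16 rounding and only the channel-10 outliers are far off.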

@pengzhao-intel
Contributor

Could we increase the tolerance for this case to align with BF16 precision?

@rongzha1
Contributor

The tolerance for this case is increased in #17673.


5 participants