flaky test_bf16_operator.test_bf16_bn #17669

haojin2 · 2020-02-24T05:14:13Z

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-17619/6/pipeline

[2020-02-22T02:22:11.175Z] ======================================================================

[2020-02-22T02:22:11.175Z] FAIL: test_bf16_operator.test_bf16_bn

[2020-02-22T02:22:11.175Z] ----------------------------------------------------------------------

[2020-02-22T02:22:11.175Z] Traceback (most recent call last):

[2020-02-22T02:22:11.175Z]   File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest

[2020-02-22T02:22:11.175Z]     self.test(*self.arg)

[2020-02-22T02:22:11.175Z]   File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 215, in test_new

[2020-02-22T02:22:11.175Z]     orig_test(*args, **kwargs)

[2020-02-22T02:22:11.175Z]   File "/work/mxnet/tests/python/mkl/test_bf16_operator.py", line 130, in test_bf16_bn

[2020-02-22T02:22:11.175Z]     check_operator_accuracy(sym_fp32=bn_fp32, sym_bf16=bn_bf16, data_shape=(32, 16, 64, 64), bf16_use_fp32_params=True, etol=1e-3)

[2020-02-22T02:22:11.175Z]   File "/work/mxnet/tests/python/mkl/test_bf16_operator.py", line 118, in check_operator_accuracy

[2020-02-22T02:22:11.175Z]     assert_almost_equal_with_err(output_bf16_2_fp32, output_fp32, rtol=rtol, atol=atol, etol=etol)

[2020-02-22T02:22:11.175Z]   File "/work/mxnet/python/mxnet/test_utils.py", line 697, in assert_almost_equal_with_err

[2020-02-22T02:22:11.175Z]     raise AssertionError(msg)

[2020-02-22T02:22:11.175Z] AssertionError: 

[2020-02-22T02:22:11.175Z] Items are not equal:

[2020-02-22T02:22:11.175Z] Error 5.486012 exceeds tolerance rtol=1.000000e-01, atol=5.000000e-01 (mismatch at least 0.000525%).

[2020-02-22T02:22:11.175Z] Location of maximum error: (8, 10, 57, 30), a=7.40625000, b=3.01126194

[2020-02-22T02:22:11.175Z]  ACTUAL: array([[[[ 6.5       ,  6.71875   ,  6.5625    , ...,  7.3125    ,

[2020-02-22T02:22:11.175Z]            6.53125   ,  7.3125    ],

[2020-02-22T02:22:11.175Z]          [ 7.3125    ,  6.71875   ,  6.84375   , ...,  6.53125   ,...

[2020-02-22T02:22:11.175Z]  DESIRED: array([[[[ 6.4971795 ,  6.7077727 ,  6.5601    , ...,  7.2980003 ,

[2020-02-22T02:22:11.175Z]            6.5235353 ,  7.306244  ],

[2020-02-22T02:22:11.175Z]          [ 7.3047543 ,  6.7267437 ,  6.8369255 , ...,  6.541499  ,...

[2020-02-22T02:22:11.175Z] -------------------- >> begin captured stdout << ---------------------

[2020-02-22T02:22:11.175Z] 

[2020-02-22T02:22:11.175Z] *** Maximum errors for vector of size 2097152:  rtol=0.1, atol=0.5

[2020-02-22T02:22:11.175Z] 

[2020-02-22T02:22:11.175Z]   1: Error 5.486012  Location of error: (8, 10, 57, 30), a=7.40625000, b=3.01126194

[2020-02-22T02:22:11.175Z]   2: Error 5.464664  Location of error: (29, 10, 37, 13), a=-1.37500000, b=2.99279308

[2020-02-22T02:22:11.175Z]   3: Error 5.440769  Location of error: (11, 10, 42, 27), a=-1.37500000, b=2.95090246

[2020-02-22T02:22:11.175Z]   4: Error 5.436756  Location of error: (3, 10, 58, 39), a=7.40625000, b=3.03682470

[2020-02-22T02:22:11.175Z]   5: Error 5.431772  Location of error: (22, 10, 47, 30), a=-1.37500000, b=2.93524361

[2020-02-22T02:22:11.175Z]   6: Error 5.425523  Location of error: (18, 10, 28, 4), a=-1.37500000, b=2.92440319

[2020-02-22T02:22:11.175Z]   7: Error 5.420064  Location of error: (23, 10, 48, 4), a=7.40625000, b=3.04552412

[2020-02-22T02:22:11.175Z]   8: Error 5.419101  Location of error: (6, 10, 14, 20), a=-1.37500000, b=2.91329479

[2020-02-22T02:22:11.175Z]   9: Error 5.410485  Location of error: (12, 10, 42, 59), a=-1.37500000, b=2.89843893

[2020-02-22T02:22:11.175Z]  10: Error 5.396444  Location of error: (23, 10, 33, 15), a=-1.37500000, b=2.87434864

[2020-02-22T02:22:11.175Z] 

[2020-02-22T02:22:11.175Z] --------------------- >> end captured stdout << ----------------------

[2020-02-22T02:22:11.175Z] -------------------- >> begin captured logging << --------------------

[2020-02-22T02:22:11.175Z] root: INFO: TVM op config has been loaded

[2020-02-22T02:22:11.175Z] common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=259105799 to reproduce.

[2020-02-22T02:22:11.175Z] --------------------- >> end captured logging << ---------------------


TaoLv · 2020-02-24T07:05:49Z

@rongzha1 Could you please take a look at this issue?

rongzha1 · 2020-02-24T08:20:00Z

@ElaineBao @ZhennanQin Could you please double-check this? As we discussed, we can't avoid this kind of cast error. Should we skip this case, or use a fixed seed?

ElaineBao · 2020-02-24T08:50:37Z

Yes, this happens when the variance of bn takes a specific value for which bf16 cannot reproduce the fp32 computation. As you can see, all of these errors occur in the same channel (channel 10 in the case above), because every element in a channel shares the same variance. This case can be fixed either by enlarging the error tolerance or by using a fixed seed.
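To illustrate the precision gap described above, here is a minimal NumPy sketch that emulates bf16 by rounding fp32 values to their top 16 bits. The helper name `to_bf16` is hypothetical and this is not how MXNet or MKL-DNN performs the cast; it only shows that bf16 keeps about 8 significant bits, so a single rounded value can be off by a relative error of up to 2^-8. Any such error in a per-channel statistic is then shared by every element of that channel.

```python
import numpy as np

def to_bf16(x):
    """Emulate bf16: round float32 values to the nearest value that fits in
    the upper 16 bits (ties to even), and return the result as float32.
    Hypothetical helper for illustration; NaN/inf are not handled."""
    x = np.asarray(x, dtype=np.float32)
    bits = x.view(np.uint32)
    # Round-to-nearest-even on the 16 low bits that bf16 discards.
    lsb = (bits >> np.uint32(16)) & np.uint32(1)
    rounded = (bits + np.uint32(0x7FFF) + lsb) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)

# The two bf16 outputs that appear in the failure log are exactly representable.
log_values = np.array([7.40625, -1.375], dtype=np.float32)
exact = np.array_equal(to_bf16(log_values), log_values)

# For arbitrary fp32 inputs, the relative rounding error stays below 2**-8.
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=1000).astype(np.float32)
max_rel_err = float((np.abs(to_bf16(x) - x) / np.abs(x)).max())
print(exact, max_rel_err)
```

The seed and the input range are arbitrary choices made for this sketch, not values taken from the test.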

pengzhao-intel · 2020-02-24T09:55:41Z

Could we increase the tolerance for this case to align with BF16 precision?
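For context on what the tolerance controls, the "Error" values in the log are consistent with the ratio |a - b| / (atol + rtol * |b|), where a value above 1 means the element is out of tolerance. The sketch below recomputes the first two reported errors from the logged `a`, `b`, `rtol`, and `atol`. The function name `violation` is hypothetical, and the formula is inferred from the numbers in the log rather than quoted from `mxnet.test_utils`.

```python
import numpy as np

def violation(a, b, rtol, atol):
    """Ratio of the absolute difference to the allowed tolerance.
    Values greater than 1 exceed the tolerance. Hypothetical helper whose
    formula is inferred from the logged numbers."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return np.abs(a - b) / (atol + rtol * np.abs(b))

# First two entries of the "Maximum errors" list in the log above.
a = [7.40625, -1.375]           # bf16 output converted to fp32
b = [3.01126194, 2.99279308]    # fp32 reference output
err = violation(a, b, rtol=1e-1, atol=5e-1)
print(err)  # compare with the logged 5.486012 and 5.464664
```

Under this reading, raising `rtol` or `atol` enlarges the denominator and lowers the reported error for the same pair of outputs.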

rongzha1 · 2020-02-24T14:27:01Z

Increased the tolerance for this case in #17673.

haojin2 added Flaky CI MKLDNN labels Feb 24, 2020

rongzha1 mentioned this issue Feb 24, 2020

change error tolerance for bf16 bn test #17673

Merged

7 tasks

pengzhao-intel closed this as completed in #17673 Mar 3, 2020
