
[MXNET-261]Update MKLDNN & Add CPP Test #10365

Closed
wants to merge 4 commits into from

Conversation

xinyu-intel
Contributor

@xinyu-intel xinyu-intel commented Apr 2, 2018

Description

This PR aims to fix the bugs in #8712 by updating MKLDNN to the newest version. CPP tests are added to monitor data-format changes of MKL-DNN (MXNET-98).

@pengzhao-intel

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Update MKLDNN
  • Add cpp tests

@zheng-da
Contributor

zheng-da commented Apr 2, 2018

Should we update MKLDNN to the latest commit in its master branch, or should we always pin it to a certain version tag? What rule should we follow?
@szha @cjolivier01 @piiswrong

xinyu-intel changed the title from "Update MKLDNN & Add CPP Test" to "[MXNET-261]Update MKLDNN & Add CPP Test" on Apr 2, 2018
@marcoabreu
Contributor

marcoabreu commented Apr 2, 2018

I'd vote for using a stable version instead of the latest master.

@xinyu-intel @pengzhao-intel what's the stability of the master branch?

@zheng-da
Contributor

zheng-da commented Apr 2, 2018

Are you sure this is an MKL problem? It fails in all configurations.


TEST(MKLDNN_UTIL_FUNC, MemFormat) {
  // Check whether the number of MKL-DNN memory formats is still what we expect.
  CHECK_EQ(mkldnn_format_last, 56);
}
Member


Is mkldnn_format_last an enum or constant? If so, you can use static_assert<> somewhere in the code and it doesn't have to be a unit test.

Member


Yes, it's an enum. This request is from @marcoabreu. Please refer to #9918 (comment) and jira#98 for the background of this. Thanks.

@pengzhao-intel
Contributor

@zheng-da will double-check. The python2/3 MKLDNN-CPU tests passed.

@xinyu-intel
Contributor Author

xinyu-intel commented Apr 3, 2018

@zheng-da
Contributor

zheng-da commented Apr 3, 2018

@xinyu-intel I guess you mean mklml?

@xinyu-intel
Contributor Author

@zheng-da Yes, I made a mistake. It's mklml, not mkldnn :)

@xinyu-intel
Contributor Author

I have tried the following four tests with seed(1):

First two passed:
exe1 = y1.simple_bind(mx.cpu(), x=shape)
exe2 = y2.simple_bind(mx.cpu(), x=shape, w=(num_filter, shape[1]//num_group)+kernel, b=(num_filter,))

exe1 = y1.simple_bind(mx.cpu(), x=shape)
exe2 = y2.simple_bind(mx.gpu(), x=shape, w=(num_filter, shape[1]//num_group)+kernel, b=(num_filter,))

Others failed:

exe1 = y1.simple_bind(mx.gpu(), x=shape)
exe2 = y2.simple_bind(mx.cpu(), x=shape, w=(num_filter, shape[1]//num_group)+kernel, b=(num_filter,))

(mismatch 94.4444444444%)
x: array([[[[ 11.774242, 10.667873, 37.325356],
[ -45.697014, 59.5456 , -50.37157 ],
[ -39.387352, -65.5543 , -1.68909 ]]],...
y: array([[[[ 11.774241, 59.5456 , -1.689087],
[ 41.73616 , 95.499115, 16.014626],
[ 12.258306, 22.502499, 45.119247]]],...

exe1 = y1.simple_bind(mx.gpu(), x=shape)
exe2 = y2.simple_bind(mx.gpu(), x=shape, w=(num_filter, shape[1]//num_group)+kernel, b=(num_filter,))

(mismatch 94.4444444444%)
x: array([[[[ 11.77424 , 10.667873, 37.32536 ],
[ -45.697014, 59.5456 , -50.37157 ],
[ -39.387352, -65.5543 , -1.68909 ]]],...
y: array([[[[ 11.774241, 59.545593, -1.689092],
[ 41.736153, 95.49912 , 16.014624],
[ 12.258305, 22.5025 , 45.119247]]],...

It seems that this test cannot pass when using GPU to compute exe1.
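
For context, here is a minimal sketch (not the actual test code) of the kind of cross-device consistency check being run above, assuming y1 is the grouped (depthwise) convolution symbol and y2 an equivalent reference network with matching argument shapes; the helper name and the input-copying loop are illustrative assumptions only:

import numpy as np
import mxnet as mx

def check_depthwise_consistency(y1, y2, shape, num_filter, num_group, kernel,
                                dev1, dev2):
    # Bind the depthwise convolution on dev1 and the reference network on dev2.
    exe1 = y1.simple_bind(dev1, x=shape)
    exe2 = y2.simple_bind(dev2, x=shape,
                          w=(num_filter, shape[1] // num_group) + kernel,
                          b=(num_filter,))
    # Feed both executors identical random inputs.
    for arr1, arr2 in zip(exe1.arg_arrays, exe2.arg_arrays):
        arr1[:] = np.random.normal(size=arr1.shape)
        arr2[:] = arr1.asnumpy()
    exe1.forward(is_train=True)
    exe2.forward(is_train=True)
    # The outputs should agree regardless of which device computed them.
    for arr1, arr2 in zip(exe1.outputs, exe2.outputs):
        np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(),
                                   rtol=1e-3, atol=1e-3)

In the results above, the mismatches appear only when exe1 is bound to mx.gpu(), which points at the GPU path rather than the MKL-DNN one.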

@pengzhao-intel
Contributor

@marcoabreu @cjolivier01 @zheng-da I think the conclusion (based on @xinyu-intel's analysis) is that the latest MKL-DNN fixed the problem, but the GPU results of exe1 are not correct. Could anyone look into the GPU side?

@marcoabreu Regarding MKL-DNN, there is a release version we can use, but MXNet development moves very fast, so new features (or bug fixes) are often needed. Thus, I think it's OK to select the master branch (pinned to a commit id). Every commit in MKL-DNN is fully verified and tested by its CI.
https://github.com/intel/mkl-dnn/releases

BTW, as we see it, enabling all test cases would be a great practice to improve quality.

@marcoabreu
Contributor

> @marcoabreu Regarding MKL-DNN, there is a release version we can use, but MXNet development moves very fast, so new features (or bug fixes) are often needed. Thus, I think it's OK to select the master branch (pinned to a commit id). Every commit in MKL-DNN is fully verified and tested by its CI.
> https://github.com/intel/mkl-dnn/releases

I'm indifferent about this one, at least for CI. My only concern is that when we make an MXNet release, we can't expect our users to use a (potentially) unstable master commit - usually people prefer a stable release. This means that users could run into problems because we're validating against a version of MKLDNN which is not even out yet. We have to consider this and find a solution - e.g. Intel making more frequent releases of the library, back-porting these fixes, or something else along those lines. In the end, we don't want to make a release of MXNet that requires a dependency which isn't released yet.

> BTW, as we see it, enabling all test cases would be a great practice to improve quality.

Definitely! I appreciate efforts in that direction a lot!

Thanks a lot everybody for all your efforts!

@zheng-da
Contributor

zheng-da commented Apr 4, 2018

@xinyu-intel @pengzhao-intel could you describe the root cause of this problem?
It's a little strange that both MKL-DNN and CuDNN have the same bug. Is the bug in the convolution operator? Does the native implementation of MXNet have the bug? Thanks.

@xinyu-intel
Contributor Author

@nihui Please help take a look at this PR. The GPU unit test 'depth_wise_conv', which was skipped in #10098, can't pass now. Thank you :)

@pengzhao-intel
Contributor

ping @nihui

@nihui
Contributor

nihui commented Apr 16, 2018

@xinyu-intel hello

I just played with the latest code f3c01d5

built with CUDA 8.0.61, and without MKL

I uncommented the skip line and removed all test cases except the depth_wise_conv one, and the test passed whether I bound the device to mx.cpu() or mx.gpu().

[nihuini@TENCENT64 ~/incubator-mxnet]$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 CUDA_VISIBLE_DEVICES=4,5,6,7 nosetests --verbose --nocapture tests/python/unittest
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1054985103 to reproduce.
test_operator.test_depthwise_convolution ... ok

----------------------------------------------------------------------
Ran 1 test in 4.753s

OK

@xinyu-intel
Contributor Author

@nihui Thanks. I'm a bit confused. I just tested on a Tesla P100 and got the same error as before.

@xinyu-intel
Contributor Author

xinyu-intel commented Apr 16, 2018

@nihui Can you please help double-check based on the following change in incubator-mxnet/tests/python/unittest/test_operator.py:

                             dev = default_context()
-                            exe1 = y1.simple_bind(dev, x=shape)
-                            exe2 = y2.simple_bind(mx.cpu(), x=shape, w=(num_filter, shape[1]//num_group)+kernel,
+                            exe1 = y1.simple_bind(mx.gpu(), x=shape)
+                            exe2 = y2.simple_bind(mx.gpu(), x=shape, w=(num_filter, shape[1]//num_group)+kernel,
                                     b=(num_filter,))

And I got the following error:

[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=896461014 to reproduce.
test_operator.test_depthwise_convolution ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1571766810 to reproduce.
FAIL

======================================================================
FAIL: test_operator.test_depthwise_convolution
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/nfs/pdx/home/zhaopen1/incubator-mxnet/tests/python/unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/nfs/pdx/home/zhaopen1/incubator-mxnet/tests/python/unittest/test_operator.py", line 1303, in test_depthwise_convolution
    np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(), rtol=1e-3, atol=1e-3)
  File "/nfs/pdx/home/zhaopen1/.local/lib/python2.7/site-packages/numpy/testing/utils.py", line 1395, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/nfs/pdx/home/zhaopen1/.local/lib/python2.7/site-packages/numpy/testing/utils.py", line 778, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=0.001

(mismatch 94.4444444444%)
 x: array([[[[ -31.054831, -108.180809,  -26.939766],
         [  10.257776,   15.99695 ,   96.046448],
         [   4.541703,   -8.48899 ,   44.320747]]],...
 y: array([[[[ -31.054831,   15.996953,   44.32074 ],
         [  28.470549,   10.288336,   -7.459843],
         [  56.939667,   36.969101,    2.033797]]],...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=896461014 to reproduce.
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1571766810 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 5.755s

FAILED (failures=1) 

Thank you!

@nihui
Contributor

nihui commented Apr 17, 2018

@xinyu-intel Issue reproduced on another machine... investigating...

@nihui
Contributor

nihui commented Apr 17, 2018

New pull request #10578 raised to fix this.

@xinyu-intel
Contributor Author

Thanks, I will retrigger the unit tests of this PR as soon as #10578 has been merged.

@xinyu-intel
Contributor Author

All commits have been merged with #10578, so I'm closing this PR.
