
Fall back when sparse arrays are passed to MKLDNN-enabled operators #11664

Merged: 24 commits into apache:master on Aug 24, 2018

Conversation

luobao-intel (Contributor)

Description

Currently, MKLDNN-enabled operators such as convolution and pooling can't handle sparse arrays correctly: the storage inference for these operators doesn't return the right dispatch mode. See #11448.
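
For illustration, here is a minimal sketch of the storage-inference shape this fix moves toward: dense CPU inputs take the MKLDNN path, anything else falls back. Helper names such as storage_type_assign follow MXNet conventions, but treat the snippet as an assumption rather than the exact patch.

```cpp
#include <algorithm>
#include <vector>

// Sketch only (MXNet-style; assumes mxnet internal headers are available):
// infer storage types for an MKLDNN-enabled operator.
static bool ConvStorageTypeSketch(const nnvm::NodeAttrs& attrs,
                                  const int dev_mask,
                                  DispatchMode* dispatch_mode,
                                  std::vector<int>* in_attrs,
                                  std::vector<int>* out_attrs) {
  // MKLDNN kernels only understand dense (default-storage) arrays.
  const bool all_dense =
      std::all_of(in_attrs->begin(), in_attrs->end(),
                  [](int s) { return s == kDefaultStorage; });
  if (dev_mask == mshadow::cpu::kDevMask && all_dense) {
    // Dense CPU inputs: dispatch to the MKLDNN FComputeEx path.
    return storage_type_assign(out_attrs, kDefaultStorage,
                               dispatch_mode, DispatchMode::kFComputeEx);
  }
  // Sparse inputs (or a non-CPU device): fall back, so the arrays are
  // converted to dense and the default implementation runs.
  return storage_type_assign(out_attrs, kDefaultStorage,
                             dispatch_mode, DispatchMode::kFComputeFallback);
}
```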

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Activation
  • Convolution
  • Deconvolution
  • LRN
  • Pooling
  • Softmax

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@pengzhao-intel @zheng-da @xinyu-intel

@@ -240,5 +240,135 @@ def check_batchnorm_training(stype):
for stype in stypes:
check_batchnorm_training(stype)

@with_seed()

@marcoabreu (Contributor) commented on Jul 12, 2018:

I don't think these tests are MKLDNN-specific; they should rather be put in test_operators or test_operators_sparse.
@eric-haibin-lin @haojin2

Contributor:

As we discussed earlier, we should still have such a test in place to exercise the fallback logic with USE_MKLDNN=1. Depending on the logic in InferStorage, we may need an extra test in test_sparse_operator/test_operator.

Contributor:

Right, but these tests can then be placed in the regular operator tests, where the fallback will be implicitly tested. I don't see any MKLDNN-specific code that would justify excluding them from running against the native MXNet or CUDA backends.

I had the same discussion in another PR from shufan (on mobile ATM). Would you mind joining in over there?

Contributor:

I definitely would like to; would you mind providing a pointer to that PR?

Contributor:

Thanks

@marcoabreu (Contributor):

Hello @luobao-intel, welcome to MXNet and thanks for your contribution!

@zheng-da (Contributor) commented on Jul 12, 2018:

This is the final fix for the fallback problem (function calls).
@luobao-intel (Contributor, Author):

The final commit deletes the incorrect test case I added earlier, and it follows the unified code structure of batchnorm. @zheng-da

}
if (dev_mask == mshadow::cpu::kDevMask && SupportMKLDNNAct(param))
return ElemwiseStorageType<1, 1, false, false, false>(
attrs, dev_mask, dispatch_mode, in_attrs, out_attrs);

Contributor:

If it's CPU and MKLDNN activation is supported, is dispatch_mode kFComputeEx?

@luobao-intel (Contributor, Author):

Thanks for reviewing again. @zheng-da

if (param.act_type != activation::kReLU) {
CHECK_EQ(in_attrs->size(), 3U);
ret = ElemwiseStorageType<3, 1, false, false, false>(
attrs, dev_mask, dispatch_mode, in_attrs, out_attrs);

Contributor:

I still have the same question here: can dispatch_mode be kFComputeEx? ElemwiseStorageType only uses kFComputeEx for sparse storage types.
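
For reference, the gist of what ElemwiseStorageType-style inference does (a simplified sketch based on my reading of MXNet's operator_common.h, not the exact template):

```cpp
// Sketch: the ElemwiseStorageType family only selects the kFComputeEx
// dispatch mode when a sparse storage type is involved; all-dense
// inputs stay on the plain kFCompute path.
bool has_sparse = false;
for (const int stype : *in_attrs)
  if (stype != kDefaultStorage) has_sparse = true;
*dispatch_mode = has_sparse ? DispatchMode::kFComputeEx
                            : DispatchMode::kFCompute;
```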

@zheng-da (Contributor):

Please benchmark the performance with this modification to make sure there is no performance regression.

@ZhennanQin (Contributor):

@marcoabreu, if cuDNN and MKLDNN are both used, the behavior is unchanged. This PR only covers MKLDNN-only scenarios.

@ZhennanQin (Contributor) commented on Aug 1, 2018:

@marcoabreu I did some investigation and found that Activation doesn't support FComputeEx&lt;GPU&gt;, so FInferStorageType should be used for MKLDNN only, just like the other ops. I refactored the code accordingly; please review. Thanks.
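
The "MKLDNN only" wiring described above typically amounts to guarding the attribute registration, along these lines (a sketch in MXNet's registration style; the exact lines in the patch may differ):

```cpp
#if MXNET_USE_MKLDNN == 1
// Only MKLDNN builds register the custom storage-type inference;
// other builds (including CUDA) keep the default dense inference.
NNVM_REGISTER_OP(Activation)
.set_attr<FInferStorageType>("FInferStorageType", ActivationStorageType);
#endif
```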

@sandeep-krishnamurthy added the Sparse, MKLDNN, Backend, and pr-awaiting-review labels on Aug 8, 2018.
@pengzhao-intel (Contributor):

@luobao-intel please help add the performance data.

@zheng-da @marcoabreu please take another look and merge if there are no further comments.

@luobao-intel (Contributor, Author):

The training and inference performance comparison used to verify that the fallback change causes no regression is shown below. The data was collected on commit 3ea67a7.
The training comparison:

| Topo | batchsize | 6148 (fallback) | 6148 (origin) |
| --- | --- | --- | --- |
| alexnet | 1 | 30.594143 | 31.317253 |
| alexnet | 2 | 41.044092 | 41.132786 |
| alexnet | 4 | 80.769897 | 81.800255 |
| alexnet | 8 | 133.454369 | 127.187418 |
| alexnet | 16 | 224.20456 | 221.552398 |
| alexnet | 32 | 302.540959 | 305.953074 |
| alexnet | 64 | 383.023917 | 330.386532 |
| alexnet | 128 | 447.751747 | 435.501388 |
| alexnet | 256 | 477.884483 | 474.643253 |
| vgg-16 | 1 | 9.848415 | 9.811869 |
| vgg-16 | 2 | 12.403094 | 12.305232 |
| vgg-16 | 4 | 17.652585 | 17.385885 |
| vgg-16 | 8 | 22.634356 | 20.478328 |
| vgg-16 | 16 | 27.60285 | 27.545312 |
| vgg-16 | 32 | 31.050534 | 30.916987 |
| vgg-16 | 64 | 33.113762 | 32.895624 |
| vgg-16 | 128 | 34.623087 | 34.604778 |
| vgg-16 | 256 | 35.412343 | 35.241626 |
| vgg-19 | 1 | 8.483369 | 8.561665 |
| vgg-19 | 2 | 10.670349 | 10.639146 |
| vgg-19 | 4 | 15.133063 | 15.205443 |
| vgg-19 | 8 | 19.301709 | 19.264988 |
| vgg-19 | 16 | 22.8069 | 23.004417 |
| vgg-19 | 32 | 25.658372 | 25.787008 |
| vgg-19 | 64 | 27.210528 | 27.148756 |
| vgg-19 | 128 | 28.396534 | 28.622843 |
| vgg-19 | 256 | 28.990824 | 28.944935 |
| inception-bn | 1 | 5.173929 | 5.113685 |
| inception-bn | 2 | 9.713608 | 9.612722 |
| inception-bn | 4 | 18.292555 | 17.625253 |
| inception-bn | 8 | 32.262617 | 32.123224 |
| inception-bn | 16 | 51.378436 | 50.281799 |
| inception-bn | 32 | 71.756796 | 77.19132 |
| inception-bn | 64 | 99.396415 | 101.789003 |
| inception-bn | 128 | 122.943234 | 125.025602 |
| inception-bn | 256 | 140.761579 | 140.559997 |
| inception-v3 | 1 | 3.647243 | 3.701843 |
| inception-v3 | 2 | 6.893006 | 7.003765 |
| inception-v3 | 4 | 12.292691 | 12.322165 |
| inception-v3 | 8 | 20.018688 | 20.248172 |
| inception-v3 | 16 | 29.733195 | 29.502392 |
| inception-v3 | 32 | 39.961187 | 39.015816 |
| inception-v3 | 64 | 46.334718 | 47.595459 |
| inception-v3 | 128 | 53.897898 | 54.346451 |
| inception-v3 | 256 | 58.247023 | 58.307715 |
| resnet-50 | 1 | 5.828596 | 5.758312 |
| resnet-50 | 2 | 9.999834 | 9.925494 |
| resnet-50 | 4 | 17.197613 | 16.971942 |
| resnet-50 | 8 | 27.336548 | 26.727552 |
| resnet-50 | 16 | 39.047838 | 37.635901 |
| resnet-50 | 32 | 51.535267 | 50.304822 |
| resnet-50 | 64 | 60.800547 | 59.694137 |
| resnet-50 | 128 | 68.016266 | 66.886088 |
| resnet-50 | 256 | 73.783432 | 72.118631 |
| resnet-152 | 1 | 2.073641 | 2.020319 |
| resnet-152 | 2 | 3.745274 | 3.577454 |
| resnet-152 | 4 | 6.488948 | 6.380499 |
| resnet-152 | 8 | 10.482311 | 10.37387 |
| resnet-152 | 16 | 15.354603 | 14.47856 |
| resnet-152 | 32 | 19.90136 | 19.666869 |
| resnet-152 | 64 | 25.205104 | 24.286143 |
| resnet-152 | 128 | 28.272118 | 28.342906 |
| resnet-152 | 256 | 30.751341 | 29.938136 |

The inference comparison:

| Topo | batchsize | 6148 (fallback) | 6148 (origin) |
| --- | --- | --- | --- |
| alexnet | 1 | 230.609201 | 272.974964 |
| alexnet | 2 | 459.347717 | 418.977067 |
| alexnet | 4 | 562.738327 | 500.14998 |
| alexnet | 8 | 757.427326 | 683.123778 |
| alexnet | 16 | 995.188787 | 1010.835509 |
| alexnet | 32 | 1236.91575 | 1220.31215 |
| alexnet | 64 | 1386.149118 | 1310.73248 |
| alexnet | 128 | 1492.969058 | 1059.182116 |
| alexnet | 256 | 1516.787892 | 1502.325897 |
| vgg-16 | 1 | 64.936414 | 62.079648 |
| vgg-16 | 2 | 84.860327 | 82.680198 |
| vgg-16 | 4 | 88.850584 | 89.668909 |
| vgg-16 | 8 | 98.573954 | 98.142158 |
| vgg-16 | 16 | 114.243305 | 114.564623 |
| vgg-16 | 32 | 117.061494 | 116.955362 |
| vgg-16 | 64 | 116.544261 | 119.016377 |
| vgg-16 | 128 | 117.818935 | 119.549992 |
| vgg-16 | 256 | 120.201841 | 119.661409 |
| vgg-19 | 1 | 56.805177 | 54.132415 |
| vgg-19 | 2 | 71.964449 | 70.584329 |
| vgg-19 | 4 | 77.130441 | 75.984336 |
| vgg-19 | 8 | 84.741283 | 83.674499 |
| vgg-19 | 16 | 94.627382 | 94.114408 |
| vgg-19 | 32 | 96.506882 | 95.726291 |
| vgg-19 | 64 | 95.06602 | 96.789072 |
| vgg-19 | 128 | 97.67915 | 96.973608 |
| vgg-19 | 256 | 97.465934 | 96.771837 |
| inception-bn | 1 | 103.438399 | 103.472951 |
| inception-bn | 2 | 166.834318 | 161.873053 |
| inception-bn | 4 | 255.293207 | 250.687353 |
| inception-bn | 8 | 340.244836 | 333.390418 |
| inception-bn | 16 | 410.579713 | 406.384 |
| inception-bn | 32 | 452.503148 | 447.816195 |
| inception-bn | 64 | 490.925741 | 467.791765 |
| inception-bn | 128 | 493.556792 | 497.755161 |
| inception-bn | 256 | 504.554601 | 483.210319 |
| inception-v3 | 1 | 51.675768 | 51.394543 |
| inception-v3 | 2 | 84.020412 | 83.020656 |
| inception-v3 | 4 | 117.211207 | 118.202891 |
| inception-v3 | 8 | 145.685021 | 144.755562 |
| inception-v3 | 16 | 164.314623 | 161.190974 |
| inception-v3 | 32 | 181.601478 | 177.47979 |
| inception-v3 | 64 | 187.68821 | 183.350203 |
| inception-v3 | 128 | 187.519656 | 186.671198 |
| inception-v3 | 256 | 190.363415 | 185.131233 |
| resnet-50 | 1 | 74.352261 | 71.926143 |
| resnet-50 | 2 | 105.118596 | 100.971862 |
| resnet-50 | 4 | 136.682507 | 129.251114 |
| resnet-50 | 8 | 160.903148 | 150.033564 |
| resnet-50 | 16 | 186.371277 | 184.590927 |
| resnet-50 | 32 | 210.353603 | 197.494244 |
| resnet-50 | 64 | 220.03369 | 205.50499 |
| resnet-50 | 128 | 222.752326 | 209.56644 |
| resnet-50 | 256 | 220.268782 | 202.993866 |
| resnet-152 | 1 | 31.555538 | 30.415254 |
| resnet-152 | 2 | 44.313238 | 41.896732 |
| resnet-152 | 4 | 56.402015 | 53.602173 |
| resnet-152 | 8 | 66.893727 | 61.240171 |
| resnet-152 | 16 | 79.20712 | 73.083458 |
| resnet-152 | 32 | 83.782056 | 82.280132 |
| resnet-152 | 64 | 93.311579 | 85.641401 |
| resnet-152 | 128 | 93.520913 | 87.348002 |
| resnet-152 | 256 | 89.06415 | 83.397316 |

@zheng-da (Contributor):

Thanks for providing the performance results.

@pengzhao-intel (Contributor):

@marcoabreu any other comments? Could you help merge?
It's a usability improvement, and I think it should go into 1.3.

NDArray data = inputs_[i];
inputs.emplace_back(data.shape(), ctx, false, data.dtype());
if (data.IsMKLDNNData() && data.IsView())
data = data.Reorder2Default();

Contributor:

Why do you remove the code here? I think the original code is correct.
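
For context, what the two calls in question do (as I understand MXNet's NDArray API; a sketch, not the patch itself):

```cpp
// A view of an array held in an MKLDNN-internal layout cannot be used
// like an ordinary slice, so the original code reorders it back to the
// default (plain dense) layout before it is consumed.
if (data.IsMKLDNNData() &&  // payload stored in an MKLDNN-specific layout
    data.IsView())          // and this NDArray is a view/slice of it
  data = data.Reorder2Default();  // copy back to the default dense layout
```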

if (is_excluded.get(attrs.op, false)) {
LOG(WARNING) << attrs.op->name << " not checked. TExcludeMKLDNNDebug flag present";
return;
}

Contributor:

Why is this removed?

Contributor (Author):

Sorry, I didn't mean to remove this code; it was removed by a bad merge. I'll fix it.

@zheng-da (Contributor):

@azai91 could you please review the code?

@piiswrong piiswrong merged commit 15e43c0 into apache:master Aug 24, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
…pache#11664)

* softmax_fallbach

* Fallback Amend
This is the final rectify for fallback problem(functions call)

* Lint amend

* test_try

* Patch for test fail

* Pooling amend

* Delete non_rectified_operation_test

* fallback_normal

* Fixed_dispatch

* activation-amend

* activation second

* activation backward

* activate_try

* activation_debug

* Act change.

* test_random

* mkldnn choice

* format_modify

* rebase
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request Sep 19, 2018
…pache#11664)
