
MKLDNN Backward op cache #11301

Merged
merged 27 commits into apache:master from ZhennanQin:backward_op_cache on Sep 13, 2018

Conversation

ZhennanQin
Contributor

Hi all,

For MKLDNN, creating an operator primitive and its corresponding memory takes a lot of time. In many scenarios, the created primitive can be reused if it meets the requirements of the next computation. In this PR, we implement a caching mechanism for most backward operators, just like the optimization already done for forward operators (a sketch of the pattern is shown after the commit list below). Correctness and accuracy tests all PASS. Performance tests show that most models get a training speedup. Please review, thanks.

@zheng-da, @marcoabreu, @azai91, @TaoLv, @pengzhao-intel

This PR covers the commits below:
Enable primitive allocation cache for _backward_LRN
Enable primitive allocation cache for _backward_Pooling.
Enable primitive allocation cache for _backward_Activation.
Enable primitive allocation cache for _backward_Deconvolution.
Enable primitive allocation cache for _backward_BatchNorm.
Enable primitive allocation cache for _backward_Fully_Connected.
Enable primitive allocation cache for _backward_Convolution.
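For readers who want the shape of the change without reading every diff, below is a minimal, hypothetical sketch of the caching pattern (assuming the MKLDNN 0.x C++ API): build the expensive primitive descriptor and primitive once, keep them in a thread-local map keyed by an input signature, and on each call only re-point the memory handles. `CachedEltwiseBwd`, `GetCachedEltwiseBwd`, and the plain string key are illustrative stand-ins, not the PR's actual classes or signature hashing.

```cpp
#include <mkldnn.hpp>
#include <string>
#include <unordered_map>

// Hypothetical cached backward op: descriptors, primitive, and memory objects
// are created once; only the data handles change between iterations.
class CachedEltwiseBwd {
 public:
  CachedEltwiseBwd(const mkldnn::memory::desc &data_md,
                   const mkldnn::memory::desc &diff_md,
                   const mkldnn::engine &eng)
      : fw_pd_(mkldnn::eltwise_forward::desc(mkldnn::prop_kind::forward_training,
                                             mkldnn::algorithm::eltwise_relu,
                                             data_md, 0.0f),
               eng),
        // The backward primitive_desc needs the forward primitive_desc as a hint.
        bw_pd_(mkldnn::eltwise_backward::desc(mkldnn::algorithm::eltwise_relu,
                                              diff_md, data_md, 0.0f),
               eng, fw_pd_),
        // Memory objects start with null handles; real buffers are attached
        // per call via set_data_handle().
        data_mem_(mkldnn::memory::primitive_desc(data_md, eng), nullptr),
        diff_dst_mem_(mkldnn::memory::primitive_desc(diff_md, eng), nullptr),
        diff_src_mem_(mkldnn::memory::primitive_desc(diff_md, eng), nullptr),
        bwd_(bw_pd_, data_mem_, diff_dst_mem_, diff_src_mem_) {}

  // Re-point the cached memory objects at this iteration's buffers and run
  // the already-created primitive.
  void Execute(void *data, void *diff_dst, void *diff_src) {
    data_mem_.set_data_handle(data);
    diff_dst_mem_.set_data_handle(diff_dst);
    diff_src_mem_.set_data_handle(diff_src);
    mkldnn::stream(mkldnn::stream::kind::eager).submit({bwd_}).wait();
  }

 private:
  mkldnn::eltwise_forward::primitive_desc fw_pd_;
  mkldnn::eltwise_backward::primitive_desc bw_pd_;
  mkldnn::memory data_mem_, diff_dst_mem_, diff_src_mem_;
  mkldnn::eltwise_backward bwd_;
};

// One cache per thread, keyed by a signature of the inputs (shapes, dtypes,
// op parameters). A plain string stands in for the real signature hashing.
CachedEltwiseBwd &GetCachedEltwiseBwd(const std::string &key,
                                      const mkldnn::memory::desc &data_md,
                                      const mkldnn::memory::desc &diff_md,
                                      const mkldnn::engine &eng) {
  static thread_local std::unordered_map<std::string, CachedEltwiseBwd> cache;
  auto it = cache.find(key);
  if (it == cache.end())
    it = cache.emplace(key, CachedEltwiseBwd(data_md, diff_md, eng)).first;
  return it->second;
}
```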

Correctness test result:

unit_test - PASS

Train model with dummy data when MXNET_MKLDNN_DEBUG=1

| Model | Status |
| --- | --- |
| alexnet | PASS |
| googlenet | PASS |
| vgg-16 | PASS |
| vgg-19 | PASS |
| inception-bn | PASS |
| inception-v3 | PASS |
| resnet-50 | PASS |
| resnet-152 | PASS |

Accuracy test result:

1. CIFAR10 + ResNet50 (convergence): validation accuracy after 99 epochs (top1: 67.88%, top5: 96.26%)
2. CIFAR10 + VGG16 (convergence): validation accuracy after 74 epochs (top1: 82.56%, top5: 98.52%)

Performance test result:

Skylake 8180, 2 sockets, training, BS=32 (img/s)

| Model | Baseline | After Caching Backward Op | Diff% |
| --- | --- | --- | --- |
| alexnet | 366.0699 | 412.3794 | 112.65% |
| googlenet | 104.3128 | 152.8598 | 146.54% |
| vgg-16 | 38.40072 | 39.65964 | 103.28% |
| vgg-19 | 32.35834 | 33.80966 | 104.49% |
| inception-bn | 85.71482 | 124.4676 | 145.21% |
| inception-v3 | 43.0482 | 57.82697 | 134.33% |
| resnet-50 | 53.15957 | 65.68513 | 123.56% |
| resnet-152 | 21.67982 | 28.26535 | 130.38% |
| inception-v4 | 23.82474 | 29.33314 | 123.12% |

Total time consumption breakdown (ms)

| Operator | Baseline | After Caching Backward Op | Diff% |
| --- | --- | --- | --- |
| Convolution | 19440.11 | 19441.49 | 100.01% |
| Activation | 11253.48 | 9493.82 | 84.36% |
| LRN | 43.325 | 44.281 | 102.21% |
| Pooling | 1002.97 | 1037.459 | 103.44% |
| Flatten | 36.84 | 37.138 | 100.81% |
| FullyConnected | 1135.342 | 1063.388 | 93.66% |
| Dropout | 210.459 | 207.743 | 98.71% |
| SoftmaxOutput | 6.944 | 6.428 | 92.57% |
| _backward_SoftmaxOutput | 3.627 | 4.424 | 121.97% |
| _backward_FullyConnected | 1792.938 | 1806.433 | 100.75% |
| _backward_Dropout | 3.977 | 3.586 | 90.17% |
| _backward_Activation | 11083.87 | 9536.887 | 86.04% |
| _zeros | 184.517 | 175.302 | 95.01% |
| _backward_copy | 1.076 | 0.917 | 85.22% |
| _backward_Pooling | 2291.866 | 2056.797 | 89.74% |
| _backward_Convolution | 63134.26 | 46223.86 | 73.22% |
| _backward_LRN | 185.551 | 130.467 | 70.31% |
| Concat | 594.403 | 618.61 | 104.07% |
| _backward_Concat | 703.62 | 708.137 | 100.64% |
| add_n | 2501.635 | 2556.075 | 102.18% |
| BatchNorm | 5815.348 | 5797.396 | 99.69% |
| _backward_BatchNorm | 10238.95 | 7818.107 | 76.36% |
| _copy | 339.458 | 351.205 | 103.46% |
| elemwise_add | 909.67 | 925.058 | 101.69% |

merge with official master
@pengzhao-intel
Contributor

ping @zheng-da

ZhennanQin and others added 8 commits June 29, 2018 08:44
Enable primitive allocation cache for _backward_LRN (Change-Id: Iefe9f720de719ec2e2f5d24a006602425136711b)
Enable primitive allocation cache for _backward_Pooling (Change-Id: Idbe94e21f1e2ddf711523767194b95beda19b120)
Enable primitive allocation cache for _backward_Activation (Change-Id: I545628ff68a54cb01b7fef323dc3de9bd47b1a19)
Enable primitive allocation cache for _backward_Deconvolution (Change-Id: I1e9bf1b9b44bae52068a9c564dff037851e896e5)
Enable primitive allocation cache for _backward_BatchNorm (Change-Id: I9e52651bd830b8cb5d2f193076ef51606c9056f9)
Enable primitive allocation cache for _backward_Convolution (Change-Id: I0496fa2394ee036d05c58f3abc1d74af544c7bca)
Enable primitive allocation cache for _backward_Fully_Connected (Change-Id: I8347527ec1271b1518921a74e3581d7d84187429)
@zheng-da
Contributor

zheng-da commented Jul 7, 2018

@ashokei could you also help review this PR?

@@ -129,6 +129,8 @@ void LRNComputeExCPU(const nnvm::NodeAttrs &attrs,
MKLDNN_OPCHECK_INIT(false, 1, inputs, outputs);
MKLDNNLRNForward(ctx, param, inputs[0], req[0], outputs[0]);
MKLDNN_OPCHECK_RUN(LRNCompute<cpu>, attrs, ctx, inputs, req, outputs);
// Copy outputs[1] from opcheck reference as backward check needs it.
MKLDNN_OPCHECK_COPY_RESULT(outputs, std::vector<size_t>{1});
Contributor

do we have to do it for all operators?

Contributor Author

LRN is a special operator: MKLDNN only generates outputs[0], but the default CPU backward computation requires outputs[1] as well, which eventually makes opcheck fail. After copying outputs[1], the problem is fixed.
Maybe you're wondering whether this is necessary for all operators. Firstly, we only found this issue with LRN. Secondly, it would make the results differ between runs with and without opcheck enabled, because of accumulated accuracy loss. Thirdly, it would hurt opcheck performance a lot. So I prefer not to apply this to all operators.

Contributor

Thanks for explaining. I just didn't understand why we needed this.

fw_pdesc);
return bw_pdesc;
});
LOG(INFO) << "Unsupported data type for MKLDNN activation";
Contributor

I think you should use LOG(FATAL) here.

Contributor Author

Agreed.

Contributor

it seems you changed the one above, but not this one.

mkldnn::eltwise_forward::primitive_desc fw_pdesc(fw_desc, cpu_engine);
mkldnn::eltwise_backward::desc bw_desc(alg, diff_md, data_md, 0.0);
mkldnn::eltwise_backward::primitive_desc bw_pdesc(bw_desc, cpu_engine,
fw_pdesc);
Contributor

the code shouldn't run here, right? can we return an empty descriptor?

Contributor Author

Yes, it should be unreachable code. Since it's hard to create an empty descriptor (MKLDNN doesn't provide a function to do that), I think a fatal error is enough.
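To make the outcome of this thread concrete, here is a small sketch (assuming the MKLDNN 0.x API and dmlc logging) of the pattern the review converged on: fail with LOG(FATAL) on an unsupported data type instead of logging at INFO level, since MKLDNN offers no empty descriptor to return. GetEltwiseBwdPd is an illustrative name, not the PR's function.

```cpp
#include <dmlc/logging.h>
#include <mkldnn.hpp>

mkldnn::eltwise_backward::primitive_desc GetEltwiseBwdPd(
    mkldnn::algorithm alg,
    const mkldnn::memory::desc &data_md,
    const mkldnn::memory::desc &diff_md,
    const mkldnn::engine &eng) {
  // MKLDNN has no "empty" primitive_desc to return, so fail loudly up front
  // instead of logging at INFO level and falling through with garbage.
  if (data_md.data.data_type != mkldnn_f32)
    LOG(FATAL) << "Unsupported data type for MKLDNN activation";
  mkldnn::eltwise_forward::desc fw_desc(mkldnn::prop_kind::forward_training,
                                        alg, data_md, 0.0f);
  mkldnn::eltwise_forward::primitive_desc fw_pd(fw_desc, eng);
  // The backward descriptor needs the forward primitive_desc as a hint.
  mkldnn::eltwise_backward::desc bw_desc(alg, diff_md, data_md, 0.0f);
  return mkldnn::eltwise_backward::primitive_desc(bw_desc, eng, fw_pd);
}
```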

}

void SetDataNewMem(const mkldnn::memory &out_grad, const mkldnn::memory &weight,
const mkldnn::memory &in_grad) {
Contributor

indent


void SetWeightNewMem(const mkldnn::memory &data,
const mkldnn::memory &out_grad,
const mkldnn::memory &in_grad_weight) {
Contributor

indent
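For context, here is a self-contained sketch (assuming the MKLDNN 0.x API) of the role helpers like SetDataNewMem and SetWeightNewMem above presumably play: the cached backward class keeps mkldnn::memory objects that were created once alongside the primitive, and each call only re-points their data handles. CachedConvBwdWeights and its members are hypothetical names, not the PR's actual code.

```cpp
#include <mkldnn.hpp>

struct CachedConvBwdWeights {
  // Created once when the backward-weights primitive is first built; null
  // handles until real buffers are attached below.
  mkldnn::memory data_, out_grad_, in_grad_weight_;

  CachedConvBwdWeights(const mkldnn::memory::primitive_desc &data_pd,
                       const mkldnn::memory::primitive_desc &out_grad_pd,
                       const mkldnn::memory::primitive_desc &in_grad_weight_pd)
      : data_(data_pd, nullptr),
        out_grad_(out_grad_pd, nullptr),
        in_grad_weight_(in_grad_weight_pd, nullptr) {}

  // Swap in this iteration's buffers without recreating any primitive.
  void SetWeightNewMem(const mkldnn::memory &data,
                       const mkldnn::memory &out_grad,
                       const mkldnn::memory &in_grad_weight) {
    data_.set_data_handle(data.get_data_handle());
    out_grad_.set_data_handle(out_grad.get_data_handle());
    in_grad_weight_.set_data_handle(in_grad_weight.get_data_handle());
  }
};
```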

@@ -133,13 +134,23 @@ void MKLDNNLRNFwd::_Init(const LRNParam &param,

void MKLDNNLRNFwd::SetDataHandle(const NDArray &in_data,
const NDArray &out_data) {
Contributor

why is in_data not used?

Contributor Author

It's used in line 140:
this->SetDataHandle(in_data, out_data_mem);

@@ -82,6 +82,101 @@ inline static mkldnn::inner_product_backward_weights::primitive_desc GetIPBwdWei
}
}

class MKLDNNFullyConnectForward {
Contributor

could you take the caching of FullyConnected forward out and submit it in a separate PR?
my concern is that this PR is for caching backward ops. mixing a forward change in with all the other backward operators might cause confusion later. thanks

Contributor Author

OK. I will move the FullyConnected forward change out of this PR.

Contributor

For the FullyConnected forward PR, see #11611.

@zheng-da
Contributor

zheng-da commented Jul 7, 2018

@ZhennanQin sorry for the late reply. I don't see big problems in the PR.

@marcoabreu
Contributor

Sorry, I might be missing something, but I can't find any tests. @zheng-da @azai91 could you check please?

@ashokei
Contributor

ashokei commented Jul 9, 2018

Changes look OK to me, great performance improvement! Since these are primarily a performance optimization, I assume the existing unit tests should be OK for functionality checking.

@ZhennanQin
Contributor Author

@marcoabreu Tests are covered by two parts:

  1. The existing unit tests.
  2. Training models with opcheck enabled via the environment variable MXNET_MKLDNN_DEBUG=1. This runs each op twice, once with the MKLDNN implementation and once with the default CPU implementation, and then compares the results; if they differ, an error message is generated (a conceptual sketch of the comparison is shown below).

This PR is essentially refactoring work that doesn't add any new feature or functionality, so the above tests are enough.
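Below is a conceptual sketch of the comparison step such an op check performs. It is not MXNet's actual OpCheck code; the function name, tolerances, and logging are illustrative.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Returns true if the MKLDNN output matches the reference output computed by
// the default CPU implementation, within relative/absolute tolerances.
bool OutputsMatch(const std::vector<float> &mkldnn_out,
                  const std::vector<float> &reference_out,
                  float rtol = 1e-3f, float atol = 1e-5f) {
  if (mkldnn_out.size() != reference_out.size()) return false;
  for (size_t i = 0; i < mkldnn_out.size(); ++i) {
    float diff = std::fabs(mkldnn_out[i] - reference_out[i]);
    if (diff > atol + rtol * std::fabs(reference_out[i])) {
      std::fprintf(stderr, "op check mismatch at %zu: %f vs %f\n",
                   i, mkldnn_out[i], reference_out[i]);
      return false;
    }
  }
  return true;
}
```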

@marcoabreu
Contributor

Excellent, thanks a lot!

@ZhennanQin
Contributor Author

@zheng-da @marcoabreu The code has been refactored based on the comments:

  1. Moved the FC forward change into another PR.
  2. Changed back to DispatchMode::kFCompute for FC backward. (We still implement the FC backward op cache in this PR, but it won't be used.)

If there are no more comments, could you please approve and merge this ASAP? Thanks.

@ZhennanQin
Contributor Author

Ping @zheng-da @anirudh2290

@anirudh2290 added the MKLDNN, Backend (Issues related to the backend of MXNet), and pr-awaiting-review (PR is waiting for code review) labels on Aug 9, 2018
@lupesko
Contributor

lupesko commented Aug 21, 2018

Thanks for the contribution @ZhennanQin !
Bouncing again for review: @anirudh2290 @zheng-da @azai91 @mseth10

@zheng-da
Contributor

This has been reviewed. We agreed to merge it into 1.4 instead of 1.3.

@pengzhao-intel
Contributor

pengzhao-intel commented Aug 26, 2018

@zheng-da The 1.3 branch has been cut. I think it's time to merge the PR into master now.
What's your opinion?

BTW, we will rebase the code soon :)

@zheng-da
Contributor

@pengzhao-intel yes. After you rebase successfully, I'll browse through the code again.

@ZhennanQin
Contributor Author

@zheng-da The code has been rebased, please have a look. Thanks for your engagement.

*diff_dst_memory,
*diff_src_memory.second));
});
TmpMemMgr::Get()->Init(ctx.requested[activation::kTempSpace]);
Contributor

why did you move it here?

Contributor Author

Can't remember the details; maybe it was changed by mistake. I've changed it back.

@zheng-da
Contributor

everything else looks fine to me.

@zheng-da
Contributor

@azai91 @eric-haibin-lin could you also help review the PR?
The PR has been under review for more than two months. We should merge it if it looks good to you as well.

@lupesko
Contributor

lupesko commented Sep 11, 2018

@azai91 @eric-haibin-lin @anirudh2290 please review. This has been out for review for a long time...

@eric-haibin-lin merged commit 741635a into apache:master on Sep 13, 2018
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request Sep 19, 2018
* Enable primitive allocation cache for _backward_Activation.

Change-Id: I545628ff68a54cb01b7fef323dc3de9bd47b1a19

* Enable primitive allocation cache for _backward_Deconvolution.

Change-Id: I1e9bf1b9b44bae52068a9c564dff037851e896e5

* Enable primitive allocation cache for _backward_Pooling.

Change-Id: Idbe94e21f1e2ddf711523767194b95beda19b120

* Enable primitive allocation cache for _backward_LRN.

Change-Id: Iefe9f720de719ec2e2f5d24a006602425136711b

* Enable primitive allocation cache for _backward_BatchNorm.

Change-Id: I9e52651bd830b8cb5d2f193076ef51606c9056f9

* Enable primitive allocation cache for _backward_Convolution

Change-Id: I0496fa2394ee036d05c58f3abc1d74af544c7bca

* Enable primitive allocation cache for _backward_Fully_Connected

Change-Id: I8347527ec1271b1518921a74e3581d7d84187429

* remove fc forward and fix indent problem

* remove fc forward and fix convolution indent problem

* Change log level to FATAL for unreachable code in mkldnn_act.cc

* remove fc forward and fix convolution indent problem

* remove useless hint in fc

* Merge branch 'master' into backward_op_cache

* Empty commit to retrigger the CI.

* Change LOG(INFO) to LOG(FATAL) for unreachable code in mkldnn_act.cc

* Fix build issue after code merge.

* Fix lint after merge

* Fix mkldnn act.
@ZhennanQin deleted the backward_op_cache branch on October 10, 2018 00:31