
[MXNET-264] Improve performance of MKLDNN in small batch sizes. #10317

Merged

merged 22 commits into apache:master from zheng-da:fix_mkldnn_perf
Apr 10, 2018

Conversation

zheng-da
Contributor

@zheng-da zheng-da commented Mar 29, 2018

Description

The current MKLDNN integration still has a lot of overhead when calling MKLDNN functions, and that overhead comes from many places. This PR reduces those overheads.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

@@ -74,6 +74,7 @@ enum NDArrayFormatErr {
kRSPIdxErr, // indices error for row sparse
};

class MKLDNNMemory;
Member

forward declarations are going out of style. Is there a reasonable way around this, or does it get messy without this?

Contributor Author

I don't think it's a good idea to include the header file that defines MKLDNNMemory in ndarray.h; it's better to forward-declare the class in this file. What is the reason for not using a forward declaration?

Member

One reason is that they cause annoying compile errors when used with smart-pointer members, when the compiler decides it needs the complete type, for instance to generate the destructor code, or during template instantiation of something that uses it directly or indirectly. I'm not going to block the PR over it, and if you feel strongly that you want to use it then fine, but it's not done much in the code base and that's probably not an accident.

Contributor Author

Between including mkldnn_base-inl.h and a forward declaration, I'll choose the latter. NDArray is such a basic class that its header file is included in almost every .cc file and many .h files, including mkldnn_base-inl.h. The other option is to define MKLDNNMemory inside NDArray, if that's the preferred way, but that seems a little weird to me.
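
(For readers following this thread, here is a minimal sketch of the incomplete-type issue being discussed, using hypothetical names Widget/Holder rather than MKLDNNMemory/NDArray; it is not code from this PR.)

```cpp
#include <memory>

class Widget;  // forward declaration only; the full definition lives in another header

class Holder {
 public:
  Holder();  // defined in a .cc file that includes the Widget header

 private:
  // With std::shared_ptr<Widget>, destruction goes through the type-erased
  // deleter captured when the pointer was created (in a .cc file where Widget
  // is complete), so this header compiles with just the forward declaration
  // and Holder's destructor can stay implicitly defined.
  std::shared_ptr<Widget> widget_;

  // If this member were std::unique_ptr<Widget> instead, the implicitly
  // generated ~Holder() in this header would instantiate
  // std::default_delete<Widget> on an incomplete type and fail to compile;
  // the usual fix is declaring ~Holder() here and defining it in the .cc file.
};
```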

auto format = GetDefaultFormat(ptr_->mkl_mem_->get_primitive_desc().desc());
CHECK_NE(format, ptr_->mkl_mem_->get_primitive_desc().desc().data.format);
auto def_pd = GetPrimitiveDesc(ptr_->mkl_mem_->get_primitive_desc(), format);
auto format = ptr_->mkl_mem_->GetDefaultFormat();
Member

Can we please reduce the usage of 'auto'? Usually, 'auto' is best for obvious types or very tedious declarations such as STL map iterators. Here, there are several calls where I have no idea what the object types are, which makes the code hard to understand. There are so many autos that it almost looks like JavaScript :)

// def_mem points to a memory region in the temp space. It's only valid
// inside an operator. As such, the returned NDArray can only be valid
// inside an operator, and the shared pointer doesn't need to do anything
// when it's destroyed.
ret.ptr_->mkl_mem_ = std::shared_ptr<mkldnn::memory>(def_mem,
[](mkldnn::memory *mem){});
auto tmp = std::shared_ptr<mkldnn::memory>(def_mem, [](mkldnn::memory *mem){});
Member

See, for example, this is a good use for auto, because the type is obvious from the assignment.
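
(A small illustration of that guideline, assuming the MKLDNN 0.x C++ API (mkldnn.hpp) used elsewhere in this PR; not code from the PR itself.)

```cpp
#include <mkldnn.hpp>
#include <memory>

void Example(const mkldnn::memory::primitive_desc &pd) {
  // Fine: the type is spelled out on the right-hand side, so auto hides nothing.
  auto mem = std::make_shared<mkldnn::memory>(pd);

  // Less helpful as auto: the return type is invisible at the call site, so
  // writing it out documents what the following lines operate on.
  mkldnn::memory::primitive_desc out_pd = mem->get_primitive_desc();
  (void)out_pd;
}
```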

@zheng-da zheng-da changed the title [WIP] This is to improve performance of MKLDNN in small batch sizes. [MXNET-264] Improve performance of MKLDNN in small batch sizes. Apr 2, 2018
@zheng-da
Contributor Author

zheng-da commented Apr 2, 2018

Could you please review this PR? @piiswrong @pengzhao-intel @TaoLv

@TaoLv
Member

TaoLv commented Apr 3, 2018

@zheng-da Do you have any performance update for this PR?

This reverts commit 58854be.
@zheng-da
Contributor Author

zheng-da commented Apr 4, 2018

Here is the performance before and after the optimizations. It is measured with python example/image-classification/benchmark_score.py on a C5.18xlarge instance.

| Model        | Batch size | Before  | After   |
|--------------|-----------:|--------:|--------:|
| AlexNet      | 1          | 268.63  | 378.96  |
| AlexNet      | 2          | 431.88  | 585.72  |
| AlexNet      | 4          | 580.75  | 701.16  |
| AlexNet      | 8          | 683.26  | 922.76  |
| AlexNet      | 16         | 932.89  | 1009.10 |
| AlexNet      | 32         | 866.39  | 1268.36 |
| Inception-BN | 1          | 102.64  | 124.97  |
| Inception-BN | 2          | 154.96  | 204.07  |
| Inception-BN | 4          | 255.78  | 300.68  |
| Inception-BN | 8          | 342.59  | 353.10  |
| Inception-BN | 16         | 380.03  | 423.39  |
| Inception-BN | 32         | 468.13  | 479.96  |
| Inception-V3 | 1          | 54.73   | 57.70   |
| Inception-V3 | 2          | 85.17   | 93.16   |
| Inception-V3 | 4          | 120.25  | 127.48  |
| Inception-V3 | 8          | 149.29  | 153.57  |
| Inception-V3 | 16         | 162.59  | 170.97  |
| Inception-V3 | 32         | 178.43  | 175.42  |
| Resnet-50    | 1          | 72.79   | 77.32   |
| Resnet-50    | 2          | 112.25  | 106.14  |
| Resnet-50    | 4          | 141.65  | 145.12  |
| Resnet-50    | 8          | 161.16  | 165.45  |
| Resnet-50    | 16         | 173.28  | 177.18  |
| Resnet-50    | 32         | 195.73  | 195.85  |

weight_buf[channels_ + i] = bias_ptr[i]; // bias
}
memcpy(weight_buf, weight_ptr, sizeof(weight_buf[0]) * channels_);
memcpy(&weight_buf[channels_], bias_ptr, sizeof(weight_buf[0]) * channels_);
Contributor

Nice optimization above; why are the OMP calls below causing overhead?

Contributor Author

Are you referring to all of the OMP directives? The number of channels is on the order of 100, so the parallelization overhead is usually larger than the actual computation.
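
(As an aside, a hedged sketch of the pattern in question, not the PR's exact code: with roughly 100 channels, copying gamma/beta is cheaper as plain memcpy calls than as an OpenMP-parallelized loop, because spinning up the thread team costs more than the copy.)

```cpp
#include <cstddef>
#include <cstring>

// Copy gamma (weights) and beta (bias) into one contiguous buffer.
// For channel counts around 100, two memcpy calls (as in the diff above)
// beat an OpenMP-parallelized per-element loop, whose thread-team startup
// and scheduling overhead exceeds the cost of the copy itself.
void CopyGammaBeta(float *weight_buf, const float *weight_ptr,
                   const float *bias_ptr, std::size_t channels) {
  std::memcpy(weight_buf, weight_ptr, sizeof(weight_buf[0]) * channels);
  std::memcpy(weight_buf + channels, bias_ptr, sizeof(weight_buf[0]) * channels);
}
```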

Contributor

Got it, thanks! We noticed the same performance issue for smaller networks too (e.g. MNIST). A lower OMP_NUM_THREADS (e.g. 4 vs. 36) was giving better performance.

@ashokei
Contributor

ashokei commented Apr 4, 2018

@cjolivier01 @piiswrong this PR resolves performance issues with small-batch-size inference; can we get it into the 1.2.0 release, please? Thanks.

Member

@cjolivier01 cjolivier01 left a comment

way too many autos.

mkldnn_memory_format_t GetDefaultFormat(int num_dims);
mkldnn::memory::primitive_desc GetPrimitiveDesc(mkldnn::memory::primitive_desc pd,
mkldnn_memory_format_t format);

static inline bool same_shape(const TShape &shape, const mkldnn_dims_t dims, int ndims) {
Member

Don't use static in a header; inline by itself is fine.
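
(For reference, a short sketch of the difference, in a hypothetical header rather than the PR's file.)

```cpp
// illustrative_header.h (hypothetical)

// "static inline": each translation unit that includes this header gets its
// own internal-linkage copy, which can duplicate code across object files and
// gives the function a different address in every translation unit.
static inline bool SameShapeStatic(int a, int b) { return a == b; }

// Plain "inline": may still be defined in every translation unit that includes
// the header, but the definitions are merged into one external-linkage entity,
// which is what the review comment asks for.
inline bool SameShapeInline(int a, int b) { return a == b; }
```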

@@ -58,8 +58,35 @@ struct LRNParam : public dmlc::Parameter<LRNParam> {
DMLC_DECLARE_FIELD(nsize)
.describe("normalization window width in elements.");
}

bool operator==(const LRNParam& other) const {
return (fabs(this->alpha - other.alpha) < 1e-6 &&
Member

It's better to check nsize first, because it's a far less expensive check than fabs().
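
(A sketch of the suggested reordering; it assumes LRNParam carries the usual alpha, beta, knorm, and nsize fields, which is an assumption since not all of them appear in this diff.)

```cpp
bool operator==(const LRNParam& other) const {
  // nsize is an integer equality test, so putting it first short-circuits the
  // three floating-point fabs() comparisons whenever the window sizes differ.
  return this->nsize == other.nsize &&
         fabs(this->alpha - other.alpha) < 1e-6 &&
         fabs(this->beta - other.beta) < 1e-6 &&
         fabs(this->knorm - other.knorm) < 1e-6;
}
```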

return true;
}

static inline bool same_shape(const TShape &shape, int dtype,
Member

same here

size = pd.get_size();
}

explicit MKLDNNMemory(std::shared_ptr<mkldnn::memory> mem): desc(
Member

can this pointer be passed by reference to reduce the interlocked operation?

Contributor Author

we need to use shared_ptr here. MKLDNNMemory needs to own the memory.
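
(A hedged sketch of how ownership can be kept while minimizing reference-count traffic, using a hypothetical stand-in class rather than the merged MKLDNNMemory.)

```cpp
#include <memory>
#include <utility>
#include <mkldnn.hpp>

class OwningMemory {  // hypothetical stand-in for the PR's MKLDNNMemory
 public:
  // Pass by value and move into the member: a caller handing over an rvalue
  // pays no extra atomic reference-count increment, while a caller that keeps
  // its own copy pays exactly one. Ownership is still shared, which is the
  // requirement stated above.
  explicit OwningMemory(std::shared_ptr<mkldnn::memory> mem)
      : mem_(std::move(mem)) {}

 private:
  std::shared_ptr<mkldnn::memory> mem_;
};
```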

explicit MKLDNNMemory(std::shared_ptr<mkldnn::memory> mem): desc(
mem->get_primitive_desc().desc()) {
this->mem = mem;
auto pd = mem->get_primitive_desc();
Member

nit: it isn’t clear what auto is here

@pengzhao-intel
Contributor

LGTM, thanks @zheng-da

auto format = GetDefaultFormat(mkl_mem_->get_primitive_desc().desc());
CHECK(format != mkl_mem_->get_primitive_desc().desc().data.format);
auto format = mkl_mem_->GetDefaultFormat();
CHECK(format != mkl_mem_->GetFormat());
Member

Use CHECK_NE() here instead of CHECK(a != b).

return GetFormat() != GetDefaultFormat();
}

bool SameFormat(mkldnn::memory::primitive_desc pd) const {
Member

nit: maybe HaveSameFormat is a better name.

void ReorderTo(mkldnn::memory *other) const {
std::vector<mkldnn::primitive> net;
net.push_back(mkldnn::reorder(*mem, *other));
mkldnn::stream(mkldnn::stream::kind::eager).submit(net).wait();
Member

Why not use MKLDNNStream here?

Contributor Author

We want immediate execution here. MKLDNNStream is designed to collect all MKLDNN operators and submit them in one call.

@@ -30,6 +30,67 @@
namespace mxnet {
namespace op {

class MKLDNNCcForward {
Member

Is 'Cc' a short name for concat? Please find a more proper name for this class.

std::vector<mkldnn::primitive::at> data_mem;
std::vector<const mkldnn::memory *> data_mem;
data_md.reserve(num_in_data);
data_mem.reserve(num_in_data);
for (int i =0; i < num_in_data; i++) {
auto tmp_mem = in_data[i].GetMKLDNNData();
Member

Please help fix the indents here.

@zheng-da
Contributor Author

zheng-da commented Apr 5, 2018

@cjolivier01 I have removed "auto" as much as possible.

@zheng-da
Contributor Author

zheng-da commented Apr 6, 2018

@cjolivier01 are you OK with the PR?

@anirudh2290
Member

@cjolivier01 is this good to merge ?

@anirudh2290
Member

@piiswrong can this be merged ?

@piiswrong piiswrong merged commit 5f94c9a into apache:master Apr 10, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
…he#10317)

* Create MKLDNNMemory to cache metadata.

* Fix lint error.

* Cache concat.

* Fix a bug in NDArray.

* improve hashing.

* don't use omp for gamma and beta in batchnorm.

* address the comments.

* Avoid computing out mean&var in batchnorm.

* Cache LRN.

* Fix a bug in LRN.

* Fix lint error.

* Revert "Avoid computing out mean&var in batchnorm."

This reverts commit 71d0dec.

* remove more omp in batchnorm.

* add comments for MKLDNNMemory.

* Revert "improve hashing."

This reverts commit 58854be.

* Remove unnecessary TODO.

* address comments.

* Remove additional auto.

* Fix compile error.

* remove more auto.
zheng-da added a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
…he#10317)

(same commit list as above)
@zheng-da zheng-da deleted the fix_mkldnn_perf branch July 5, 2018 06:20