
[MXNET-264] Improve performance of MKLDNN in small batch sizes. #10317

Merged

merged 22 commits into apache:master from zheng-da:fix_mkldnn_perf
Apr 10, 2018

Conversation

zheng-da
Contributor

@zheng-da zheng-da commented Mar 29, 2018

Description

The current MKLDNN integration still has a lot of overhead when calling MKLDNN functions, and that overhead comes from many places. This PR reduces those overheads.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

@@ -74,6 +74,7 @@ enum NDArrayFormatErr {
kRSPIdxErr, // indices error for row sparse
};

class MKLDNNMemory;
Member

forward declarations are going out of style. Is there a reasonable way around this, or does it get messy without this?

Contributor Author

I don't think it's a good idea to include the header file that defines MKLDNNMemory in ndarray.h; it's better to forward-declare the class in this file. What is the reason for not using a forward declaration?

Member

One reason is that they cause annoying compile errors when used with smart-pointer members, when the compiler decides it needs the complete type, for instance to generate the destructor code, or during template instantiation of something that uses it directly or indirectly. I'm not going to block the PR over it, and if you feel strongly that you want to use it then fine, but it's not done much in the code base and that's probably not an accident.

Contributor Author

Between including mkldnn_base-inl.h and a forward declaration, I'll choose the latter. NDArray is such a basic class that its header file is included in almost every .cc file and many .h files, including mkldnn_base-inl.h. The other option is to define MKLDNNMemory inside NDArray, if that's the preferred way, but that seems a little weird to me.
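
(For readers following this thread, here is a minimal sketch of the incomplete-type issue being discussed, using hypothetical names Widget/Holder rather than MKLDNNMemory/NDArray; it is not code from this PR.)

```cpp
#include <memory>

class Widget;  // forward declaration only; the full definition lives in another header

class Holder {
 public:
  Holder();  // defined in a .cc file that includes the Widget header

 private:
  // With std::shared_ptr<Widget>, destruction goes through the type-erased
  // deleter captured when the pointer was created (in a .cc file where Widget
  // is complete), so this header compiles with just the forward declaration
  // and Holder's destructor can stay implicitly defined.
  std::shared_ptr<Widget> widget_;

  // If this member were std::unique_ptr<Widget> instead, the implicitly
  // generated ~Holder() in this header would instantiate
  // std::default_delete<Widget> on an incomplete type and fail to compile;
  // the usual fix is declaring ~Holder() here and defining it in the .cc file.
};
```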

auto format = GetDefaultFormat(ptr_->mkl_mem_->get_primitive_desc().desc());
CHECK_NE(format, ptr_->mkl_mem_->get_primitive_desc().desc().data.format);
auto def_pd = GetPrimitiveDesc(ptr_->mkl_mem_->get_primitive_desc(), format);
auto format = ptr_->mkl_mem_->GetDefaultFormat();
Member

Can we please reduce the usage of 'auto'? Usually, 'auto' is best for obvious types or very tedious declarations such as STL map iterators. Here, there are several calls where I have no idea what the object types are, which makes the code hard to understand. There are so many autos that it almost looks like JavaScript :)

// def_mem points to a memory region in the temp space. It's only valid
// inside an operator. As such, the returned NDArray can only be valid
// inside an operator, and the shared pointer doesn't need to do anything
// when it's destroyed.
ret.ptr_->mkl_mem_ = std::shared_ptr<mkldnn::memory>(def_mem,
[](mkldnn::memory *mem){});
auto tmp = std::shared_ptr<mkldnn::memory>(def_mem, [](mkldnn::memory *mem){});
Member

See, for example, this is a good use for auto, because the type is obvious from the assignment.
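
(A small illustration of that guideline, assuming the MKLDNN 0.x C++ API (mkldnn.hpp) used elsewhere in this PR; not code from the PR itself.)

```cpp
#include <mkldnn.hpp>
#include <memory>

void Example(const mkldnn::memory::primitive_desc &pd) {
  // Fine: the type is spelled out on the right-hand side, so auto hides nothing.
  auto mem = std::make_shared<mkldnn::memory>(pd);

  // Less helpful as auto: the return type is invisible at the call site, so
  // writing it out documents what the following lines operate on.
  mkldnn::memory::primitive_desc out_pd = mem->get_primitive_desc();
  (void)out_pd;
}
```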

@zheng-da zheng-da changed the title [WIP] This is to improve performance of MKLDNN in small batch sizes. [MXNET-264] Improve performance of MKLDNN in small batch sizes. Apr 2, 2018
@zheng-da
Contributor Author

zheng-da commented Apr 2, 2018

Could you please review this PR? @piiswrong @pengzhao-intel @TaoLv

@TaoLv
Member

TaoLv commented Apr 3, 2018

@zheng-da Do you have any performance update for this PR?

This reverts commit 58854be.
@zheng-da
Contributor Author

zheng-da commented Apr 4, 2018

Here is the performance before and after the optimizations. It is measured with python example/image-classification/benchmark_score.py on a C5.18xlarge instance.

| Model        | Batch size | Before  | After   |
|--------------|-----------:|--------:|--------:|
| AlexNet      | 1          | 268.63  | 378.96  |
| AlexNet      | 2          | 431.88  | 585.72  |
| AlexNet      | 4          | 580.75  | 701.16  |
| AlexNet      | 8          | 683.26  | 922.76  |
| AlexNet      | 16         | 932.89  | 1009.10 |
| AlexNet      | 32         | 866.39  | 1268.36 |
| Inception-BN | 1          | 102.64  | 124.97  |
| Inception-BN | 2          | 154.96  | 204.07  |
| Inception-BN | 4          | 255.78  | 300.68  |
| Inception-BN | 8          | 342.59  | 353.10  |
| Inception-BN | 16         | 380.03  | 423.39  |
| Inception-BN | 32         | 468.13  | 479.96  |
| Inception-V3 | 1          | 54.73   | 57.70   |
| Inception-V3 | 2          | 85.17   | 93.16   |
| Inception-V3 | 4          | 120.25  | 127.48  |
| Inception-V3 | 8          | 149.29  | 153.57  |
| Inception-V3 | 16         | 162.59  | 170.97  |
| Inception-V3 | 32         | 178.43  | 175.42  |
| Resnet-50    | 1          | 72.79   | 77.32   |
| Resnet-50    | 2          | 112.25  | 106.14  |
| Resnet-50    | 4          | 141.65  | 145.12  |
| Resnet-50    | 8          | 161.16  | 165.45  |
| Resnet-50    | 16         | 173.28  | 177.18  |
| Resnet-50    | 32         | 195.73  | 195.85  |

weight_buf[channels_ + i] = bias_ptr[i]; // bias
}
memcpy(weight_buf, weight_ptr, sizeof(weight_buf[0]) * channels_);
memcpy(&weight_buf[channels_], bias_ptr, sizeof(weight_buf[0]) * channels_);
Contributor

Nice optimization above; why are the OMP calls below causing overhead?

Contributor Author

Are you referring to all of the OMP directives? The number of channels is on the order of 100, so the parallelization overhead is usually larger than the actual computation.
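
(As an aside, a hedged sketch of the pattern in question, not the PR's exact code: with roughly 100 channels, copying gamma/beta is cheaper as plain memcpy calls than as an OpenMP-parallelized loop, because spinning up the thread team costs more than the copy.)

```cpp
#include <cstddef>
#include <cstring>

// Copy gamma (weights) and beta (bias) into one contiguous buffer.
// For channel counts around 100, two memcpy calls (as in the diff above)
// beat an OpenMP-parallelized per-element loop, whose thread-team startup
// and scheduling overhead exceeds the cost of the copy itself.
void CopyGammaBeta(float *weight_buf, const float *weight_ptr,
                   const float *bias_ptr, std::size_t channels) {
  std::memcpy(weight_buf, weight_ptr, sizeof(weight_buf[0]) * channels);
  std::memcpy(weight_buf + channels, bias_ptr, sizeof(weight_buf[0]) * channels);
}
```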

Contributor

Got it, thanks! We noticed the same performance issue for smaller networks too (e.g. MNIST). A lower OMP_NUM_THREADS (e.g. 4 vs. 36) was giving better performance.

@ashokei
Contributor

ashokei commented Apr 4, 2018

@cjolivier01 @piiswrong this PR resolves performance issues with small-batch-size inference; can we get it into the 1.2.0 release, please? Thanks.

Member

@cjolivier01 cjolivier01 left a comment

way too many autos.

mkldnn_memory_format_t GetDefaultFormat(int num_dims);
mkldnn::memory::primitive_desc GetPrimitiveDesc(mkldnn::memory::primitive_desc pd,
mkldnn_memory_format_t format);

static inline bool same_shape(const TShape &shape, const mkldnn_dims_t dims, int ndims) {
Member

Don't use static in a header; inline by itself is fine.
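
(For reference, a short sketch of the difference, in a hypothetical header rather than the PR's file.)

```cpp
// illustrative_header.h (hypothetical)

// "static inline": each translation unit that includes this header gets its
// own internal-linkage copy, which can duplicate code across object files and
// gives the function a different address in every translation unit.
static inline bool SameShapeStatic(int a, int b) { return a == b; }

// Plain "inline": may still be defined in every translation unit that includes
// the header, but the definitions are merged into one external-linkage entity,
// which is what the review comment asks for.
inline bool SameShapeInline(int a, int b) { return a == b; }
```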

@@ -58,8 +58,35 @@ struct LRNParam : public dmlc::Parameter<LRNParam> {
DMLC_DECLARE_FIELD(nsize)
.describe("normalization window width in elements.");
}

bool operator==(const LRNParam& other) const {
return (fabs(this->alpha - other.alpha) < 1e-6 &&
Member

It's better to check nsize first, because it's a far less expensive check than fabs().
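
(A sketch of the suggested reordering; it assumes LRNParam carries the usual alpha, beta, knorm, and nsize fields, which is an assumption since not all of them appear in this diff.)

```cpp
bool operator==(const LRNParam& other) const {
  // nsize is an integer equality test, so putting it first short-circuits the
  // three floating-point fabs() comparisons whenever the window sizes differ.
  return this->nsize == other.nsize &&
         fabs(this->alpha - other.alpha) < 1e-6 &&
         fabs(this->beta - other.beta) < 1e-6 &&
         fabs(this->knorm - other.knorm) < 1e-6;
}
```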

return true;
}

static inline bool same_shape(const TShape &shape, int dtype,
Member

same here

size = pd.get_size();
}

explicit MKLDNNMemory(std::shared_ptr<mkldnn::memory> mem): desc(
Member

can this pointer be passed by reference to reduce the interlocked operation?

Contributor Author

we need to use shared_ptr here. MKLDNNMemory needs to own the memory.
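
(A hedged sketch of how ownership can be kept while minimizing reference-count traffic, using a hypothetical stand-in class rather than the merged MKLDNNMemory.)

```cpp
#include <memory>
#include <utility>
#include <mkldnn.hpp>

class OwningMemory {  // hypothetical stand-in for the PR's MKLDNNMemory
 public:
  // Pass by value and move into the member: a caller handing over an rvalue
  // pays no extra atomic reference-count increment, while a caller that keeps
  // its own copy pays exactly one. Ownership is still shared, which is the
  // requirement stated above.
  explicit OwningMemory(std::shared_ptr<mkldnn::memory> mem)
      : mem_(std::move(mem)) {}

 private:
  std::shared_ptr<mkldnn::memory> mem_;
};
```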

explicit MKLDNNMemory(std::shared_ptr<mkldnn::memory> mem): desc(
mem->get_primitive_desc().desc()) {
this->mem = mem;
auto pd = mem->get_primitive_desc();
Member

nit: it isn’t clear what auto is here

@pengzhao-intel
Contributor

LGTM, thanks @zheng-da

auto format = GetDefaultFormat(mkl_mem_->get_primitive_desc().desc());
CHECK(format != mkl_mem_->get_primitive_desc().desc().data.format);
auto format = mkl_mem_->GetDefaultFormat();
CHECK(format != mkl_mem_->GetFormat());
Member

Use CHECK_NE() here instead of CHECK(a != b).

return GetFormat() != GetDefaultFormat();
}

bool SameFormat(mkldnn::memory::primitive_desc pd) const {
Member

nit: maybe HaveSameFormat is a better name.

void ReorderTo(mkldnn::memory *other) const {
std::vector<mkldnn::primitive> net;
net.push_back(mkldnn::reorder(*mem, *other));
mkldnn::stream(mkldnn::stream::kind::eager).submit(net).wait();
Member

Why not use MKLDNNStream here?

Contributor Author

We want immediate execution here. MKLDNNStream is designed to collect all MKLDNN operators and submit them in one call.

@@ -30,6 +30,67 @@
namespace mxnet {
namespace op {

class MKLDNNCcForward {
Member

Is 'Cc' a short name for concat? Please find a more proper name for this class.

std::vector<mkldnn::primitive::at> data_mem;
std::vector<const mkldnn::memory *> data_mem;
data_md.reserve(num_in_data);
data_mem.reserve(num_in_data);
for (int i =0; i < num_in_data; i++) {
auto tmp_mem = in_data[i].GetMKLDNNData();
Member

Please help fix the indents here.

@zheng-da
Contributor Author

zheng-da commented Apr 5, 2018

@cjolivier01 I have removed "auto" as much as possible.

@zheng-da
Contributor Author

zheng-da commented Apr 6, 2018

@cjolivier01 are you OK with the PR?

@anirudh2290
Member

@cjolivier01 is this good to merge ?

@anirudh2290
Member

@piiswrong can this be merged ?

@piiswrong piiswrong merged commit 5f94c9a into apache:master Apr 10, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
…he#10317)

* Create MKLDNNMemory to cache metadata.

* Fix lint error.

* Cache concat.

* Fix a bug in NDArray.

* improve hashing.

* don't use omp for gamma and beta in batchnorm.

* address the comments.

* Avoid computing out mean&var in batchnorm.

* Cache LRN.

* Fix a bug in LRN.

* Fix lint error.

* Revert "Avoid computing out mean&var in batchnorm."

This reverts commit 71d0dec.

* remove more omp in batchnorm.

* add comments for MKLDNNMemory.

* Revert "improve hashing."

This reverts commit 58854be.

* Remove unnecessary TODO.

* address comments.

* Remove additional auto.

* Fix compile error.

* remove more auto.
zheng-da added a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
…he#10317)

(same commit list as above)
@zheng-da zheng-da deleted the fix_mkldnn_perf branch July 5, 2018 06:20