
Integrating the MKL VML functions to MXNET to speed-up the (element-wised) mathematic computation #14893

Merged
43 commits merged into apache:master on May 22, 2019

Conversation

@juliusshufan (Contributor) commented May 6, 2019

Description

Intel MKL provides a wide range of generic vector math (VML) functions that benefit from AVX-512 instructions. By integrating the VML functions, some of the element-wise operations are expected to be sped up.

Currently, the generic math computations are implemented by a series of mshadow OPs, which encapsulate the functions provided by the standard math library; the inputs are handled as dense/sparse tensors, and the computations are parallelized with OpenMP. Specifically,
element-wise OPs taking one or two inputs, with "write-to"/"write-inplace" request types, can be supported by MKL VML functions.
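For illustration, a minimal sketch of the idea, assuming MKL is available (the function name is hypothetical, not the PR's actual code): a single VML call replaces a scalar loop over a contiguous dense buffer.

#include <mkl.h>  // declares vsLn/vdLn and MKL_INT

void elementwise_log(const float *in, float *out, MKL_INT n) {
  vsLn(n, in, out);  // vectorized natural log over n contiguous elements
}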

@TaoLv @pengzhao-intel

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Unit tests are provided with the changes.

@juliusshufan juliusshufan changed the title Integrating the MKL VML functions to MXNET to speed-up the mathematic computation Integrating the MKL VML functions to MXNET to speed-up the (element-wised) mathematic computation May 6, 2019
@anirudhacharya (Member):

@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu added the pr-awaiting-review label on May 6, 2019
@pengzhao-intel (Contributor):

@eric-haibin-lin @szha please help to review.

src/operator/mkl_functions-inl.h (resolved review threads)
MXNET_MKL_BINARY_MATH_FUNC(sub, Sub);
MXNET_MKL_BINARY_MATH_FUNC(mul, Mul);
MXNET_MKL_BINARY_MATH_FUNC(pow, Pow);
MXNET_MKL_BINARY_MATH_FUNC(hypot, Hypot);
Contributor:

Will all of these functions be mapped automatically when MKL is enabled?

Member:

No. We just put all the VML functions here because we think they can be leveraged by MXNet in the future. Currently, the registration of each operator needs to be changed to use these functions. In this PR we only optimized some operators which are used in BERT; we propose to optimize the others when we face performance problems with them.

Contributor:

Thanks for the explanation. We can add it back when we use it; otherwise, it is a little confusing for other developers.

.set_attr<FInferStorageType>("FInferStorageType", ElemwiseStorageType<1, 1, \
false, true, true>) \
.set_attr<FCompute>("FCompute<" #__xpu$ ">", UnaryOp::MKL_Compute<__kernel$, __mkl_kernel$>) \
.set_attr<FComputeEx>("FComputeEx<" #__xpu$ ">", UnaryOp::MKL_ComputeEx<__kernel$, \
Member:

Why do you override the FComputeEx attribute?

@TaoLv (Member) commented May 14, 2019:

@eric-haibin-lin Thanks for the review. I'm not sure how sparse is handled in the original FComputeEx. Previously I thought sparse could also benefit from VML if its values are stored in a dense way, but we don't have much data to prove that.
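As an aside, a minimal sketch of that idea, assuming a CSR-style layout where the non-zero values sit in one contiguous buffer; the function name is hypothetical and not MXNet's actual sparse API:

#include <mkl.h>

void csr_values_log(const float *values, float *out_values, MKL_INT nnz) {
  // The same VML call used for dense tensors applies directly to the
  // contiguous value buffer of the sparse array.
  vsLn(nnz, values, out_values);
}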

Member:

Thanks for the explanation. Yeah, it should benefit.

Member:

I just reverted the change for FComputeEx as we don't have much data for that yet. We will revisit this part once we meet any performance issue for sparse.

Member:

Actually, I'd prefer not reverting it. I don't see any reason why it won't help. Let's undo the revert?

Member:

Can you provide a simple benchmark for a sparse unary operator? We can take a quick try. Thanks! cc @juliusshufan

Contributor (author):

@eric-haibin-lin @TaoLv Friendly ping, may I know your decision on the sparse part?


// LayerNorm on the last dimension
template <typename DType>
MSHADOW_XINLINE static void LayerNormLastDim(const index_t m,
Member:

Does this PR also intend to enable optimization for LayerNorm?

Member:

Yes. I'm working on enabling it and trying to understand the optimization and workflow from @sxjscience's PR.

@eric-haibin-lin dismissed their stale review on May 15, 2019: comments addressed

@pengzhao-intel (Contributor):

@eric-haibin-lin please help to review again :)

@pengzhao-intel (Contributor) left a review:

A minor comment is added.

LGTM


mul::Vectorize(n, out_offset, gamma, out_offset);
div_(n, out_offset, var[i], out_offset);
add::Vectorize(n, out_offset, beta, out_offset);
Contributor:

Any chance to fuse some of these operations to reduce the memory bandwidth?

Member:

How much faster is this version compared to the mshadow one?

@sxjscience (Member) commented May 20, 2019:

After reading the code, I think the current implementation, which relies on the vectorized operations, should be fast at scaling and shifting the data (data * gamma & data + beta). One possible improvement is to use Welford's online algorithm (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance) to calculate the mean/variance in one pass; the code would look like this:

template <typename DType>
MSHADOW_XINLINE static void mean_var_(index_t n, DType *in, DType *mean, DType* variance) {
  DType sigma2 = 0;
  DType mean_v = 0;
  DType old_mean_v = 0;
  for (index_t i = 0; i < n; i++) {
    DType x = in[i];
    old_mean_v = mean_v;
    mean_v += (x - old_mean_v) / (i + 1);
    sigma2 += (x - old_mean_v) * (x - mean_v);
  }    
  mean[0] = mean_v;
  variance[0] = sigma2 / n;
}


template <typename DType>
MSHADOW_XINLINE static void LayerNormLastDim(index_t m,
                                             index_t n,
                                             DType *a,
                                             DType *b,
                                             DType *ws,
                                             DType *gamma,
                                             DType *beta,
                                             DType *mean,
                                             DType *var,
                                             DType eps) {
  auto nthreads = engine::OpenMP::Get()->GetRecommendedOMPThreadCount();
#pragma omp parallel for num_threads(nthreads)
  for (index_t i = 0; i < m; i++) {
    DType ele_mean, ele_var;
    DType* in_offset = a + i * n;
    DType* out_offset = b + i * n;
    mean_var_(n, in_offset, &ele_mean, &ele_var);
    sub_(n, in_offset, ele_mean, out_offset);
    ele_var = math::sqrt(ele_var + eps);
    mul::Vectorize(n, out_offset, gamma, out_offset);
    div_(n, out_offset, ele_var, out_offset);
    add::Vectorize(n, out_offset, beta, out_offset);
    mean[i] = ele_mean;
    var[i] = ele_var;
  }
}

Member:

@pengzhao-intel @sxjscience Loops are fused in the latest commit. I also removed the required workspace, but that means we cannot leverage VML functions and need to rely on the compiler for vectorization.
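A minimal sketch of what such a fused, workspace-free loop could look like, reusing mean_var_ and the helpers from the snippets above and relying on the compiler (rather than VML) to vectorize the inner loop; the function name is illustrative, not the PR's final code:

// Illustrative sketch only. Assumes mean_var_, index_t, and math::sqrt
// from the snippets above are available.
template <typename DType>
MSHADOW_XINLINE static void LayerNormLastDimFused(index_t m,
                                                  index_t n,
                                                  DType *a,
                                                  DType *b,
                                                  DType *gamma,
                                                  DType *beta,
                                                  DType *mean,
                                                  DType *var,
                                                  DType eps) {
  auto nthreads = engine::OpenMP::Get()->GetRecommendedOMPThreadCount();
#pragma omp parallel for num_threads(nthreads)
  for (index_t i = 0; i < m; i++) {
    DType *in_offset = a + i * n;
    DType *out_offset = b + i * n;
    DType ele_mean, ele_var;
    // one-pass Welford mean/variance, so no extra workspace is needed
    mean_var_(n, in_offset, &ele_mean, &ele_var);
    DType ele_std = math::sqrt(ele_var + eps);
    // fused normalize/scale/shift; vectorization is left to the compiler
    for (index_t j = 0; j < n; j++) {
      out_offset[j] = (in_offset[j] - ele_mean) / ele_std * gamma[j] + beta[j];
    }
    mean[i] = ele_mean;
    var[i] = ele_std;
  }
}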

@TaoLv (Member) commented May 18, 2019:

@sxjscience Can you help to review? Here is an optimization for CPU LayerNorm.

@TaoLv (Member) commented May 21, 2019:

LayerNorm performance is measured on my SKL (Skylake) machine. The shapes are from the BERT base and large models, respectively. The speedup from this PR is around 3x to 10x. @eric-haibin-lin @sxjscience @pengzhao-intel

# mxnet-mkl==1.4.1
layernorm (1L, 128L, 768L): 0.23437 ms
layernorm (8L, 128L, 768L): 1.39641 ms
layernorm (32L, 128L, 768L): 5.18604 ms
layernorm (1L, 128L, 1024L): 0.35661 ms
layernorm (8L, 128L, 1024L): 1.80795 ms
layernorm (32L, 128L, 1024L): 6.76601 ms

# this PR built with USE_BLAS=mkl
layernorm (1, 128, 768): 0.07230 ms
layernorm (8, 128, 768): 0.21550 ms
layernorm (32, 128, 768): 0.51188 ms
layernorm (1, 128, 1024): 0.08863 ms
layernorm (8, 128, 1024): 0.25120 ms
layernorm (32, 128, 1024): 0.63479 ms

@pengzhao-intel (Contributor):

@sxjscience @eric-haibin-lin Any other comments? If not, I will merge this PR soon for the 1.5 release.

@pengzhao-intel merged commit b0be6c5 into apache:master on May 22, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
…ised) mathematic computation (apache#14893)

* mkl_func test with erf&log op, build success~

* fix lint and build issues

* Try to add support to sparse array

* fix build

* add functions

* Fix review comments

* remove unecessary code

* Update test case

* minor fix

* move the position of MKL_Compute

* mkl_func test with erf&log op, build success~

* fix lint and build issues

* Try to add support to sparse array

* fix build

* Fix review comments

* remove unecessary code

* Update test case

* minor fix

* add functions

* move the position of MKL_Compute

* fix cpplint

* cpp lint

* trigger ci

* address comments

* coding style

* enable layernorm

* fix windows build

* revert changes to FComputeEx

* int -> index_t

* remove workspace

* fix lint

* clean code
Labels: pr-awaiting-review (PR is waiting for code review)
8 participants