Integrating the MKL VML functions to MXNET to speed-up the (element-wised) mathematic computation #14893
Conversation
Fix review comments
@mxnet-label-bot add [pr-awaiting-review]
@eric-haibin-lin @szha please help review.
MXNET_MKL_BINARY_MATH_FUNC(sub, Sub);
MXNET_MKL_BINARY_MATH_FUNC(mul, Mul);
MXNET_MKL_BINARY_MATH_FUNC(pow, Pow);
MXNET_MKL_BINARY_MATH_FUNC(hypot, Hypot);
Will all of these functions be mapped automatically when MKL is enabled?
No. We just put all the VML functions here. We think these functions can be leveraged by MXNet in the future, but currently it requires changing the registration of each operator to use these functions. In this PR we only optimized some operators which are used in BERT. We propose to optimize the others when we face performance problems on them.
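For context, a minimal sketch of how one of these binary wrappers could map onto MKL's VML entry points; the macro body below is an assumption for illustration, not the exact code in this PR:

// Illustration only -- assumed macro body, not the PR's implementation.
// The wrapper dispatches a whole contiguous buffer to MKL's float/double
// VML entry points (e.g. vsMul/vdMul) in a single call instead of using a
// per-element loop.
#include <mkl.h>

#define MXNET_MKL_BINARY_MATH_FUNC_SKETCH(name, MKLNAME)                   \
  struct name {                                                            \
    static void Vectorize(const int n, const float *a, const float *b,     \
                          float *y) {                                       \
      vs##MKLNAME(n, a, b, y);   /* single precision, e.g. vsMul */        \
    }                                                                       \
    static void Vectorize(const int n, const double *a, const double *b,   \
                          double *y) {                                      \
      vd##MKLNAME(n, a, b, y);   /* double precision, e.g. vdMul */        \
    }                                                                       \
  };

MXNET_MKL_BINARY_MATH_FUNC_SKETCH(mul, Mul)  // would provide mul::Vectorize backed by VML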
Thanks for the explanation. We can add it back when we use it; otherwise, it is a little confusing for other developers.
.set_attr<FInferStorageType>("FInferStorageType", ElemwiseStorageType<1, 1, \
false, true, true>) \
.set_attr<FCompute>("FCompute<" #__xpu$ ">", UnaryOp::MKL_Compute<__kernel$, __mkl_kernel$>) \
.set_attr<FComputeEx>("FComputeEx<" #__xpu$ ">", UnaryOp::MKL_ComputeEx<__kernel$, \
Why do you override the FComputeEx attribute?
@eric-haibin-lin Thanks for the review. Not sure how sparse is handled in the original FComputeEx. Previously I thought sparse could also benefit from VML if its values are stored in a dense way, but we don't have much data to prove that.
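The intuition here, sketched with assumed names (not this PR's code) under the assumption that the CSR value array is one contiguous dense buffer:

// Illustration only: the nonzero values of a CSR tensor sit in a contiguous
// array, so a unary VML call could in principle run over that buffer directly.
#include <mkl.h>

void ErfOnCsrValues(const int nnz, const float *values, float *out_values) {
  vsErf(nnz, values, out_values);  // MKL VML vectorized erf over the value array
}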
Thanks for the explanation. Yeah it should benefit
I just reverted the change for FComputeEx as we don't have much data for that yet. Will revisit this part once we run into a performance issue for sparse.
Actually I'd prefer not reverting it. I don't see any reason why it won't help. Let's undo the revert?
Can you provide a simple benchmark for a sparse unary operator? We can take a quick try. Thanks! cc @juliusshufan
@eric-haibin-lin @TaoLv Friendly ping, may I know your decision on the sparse part?
src/operator/mkl_functions-inl.h
// LayerNorm on the last dimension
template <typename DType>
MSHADOW_XINLINE static void LayerNormLastDim(const index_t m,
Does this PR also intend to enable the optimization for LayerNorm?
Yes. I'm working on enabling it and trying to understand the optimization and workflow from @sxjscience 's PR.
@eric-haibin-lin please help to review again :)
A minor comment is added.
LGTM
src/operator/mkl_functions-inl.h
mul::Vectorize(n, out_offset, gamma, out_offset);
div_(n, out_offset, var[i], out_offset);
add::Vectorize(n, out_offset, beta, out_offset);
Any chance to fuse some of these operations to reduce the memory bandwidth?
How much faster is this version compared to the mshadow one?
After reading the code, I think the current implementation, which relies on the vectorized operations, should be fast at scaling and shifting the data (data * gamma & data + beta). One possible improvement is to use Welford's online algorithm (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance) to calculate the mean/variance in one pass. The code will look like this:
template <typename DType>
MSHADOW_XINLINE static void mean_var_(index_t n, DType *in, DType *mean, DType *variance) {
  DType sigma2 = 0;
  DType mean_v = 0;
  DType old_mean_v = 0;
  for (index_t i = 0; i < n; i++) {
    DType x = in[i];
    old_mean_v = mean_v;
    mean_v += (x - old_mean_v) / (i + 1);
    sigma2 += (x - old_mean_v) * (x - mean_v);
  }
  mean[0] = mean_v;
  variance[0] = sigma2 / n;
}

template <typename DType>
MSHADOW_XINLINE static void LayerNormLastDim(index_t m,
                                             index_t n,
                                             DType *a,
                                             DType *b,
                                             DType *ws,
                                             DType *gamma,
                                             DType *beta,
                                             DType *mean,
                                             DType *var,
                                             DType eps) {
  auto nthreads = engine::OpenMP::Get()->GetRecommendedOMPThreadCount();
#pragma omp parallel for num_threads(nthreads)
  for (index_t i = 0; i < m; i++) {
    DType ele_mean, ele_var;
    DType *in_offset = a + i * n;
    DType *out_offset = b + i * n;
    mean_var_(n, in_offset, &ele_mean, &ele_var);
    sub_(n, in_offset, ele_mean, out_offset);
    ele_var = math::sqrt(ele_var + eps);
    mul::Vectorize(n, out_offset, gamma, out_offset);
    div_(n, out_offset, ele_var, out_offset);
    add::Vectorize(n, out_offset, beta, out_offset);
    mean[i] = ele_mean;
    var[i] = ele_var;
  }
}
@pengzhao-intel @sxjscience loops are fused in the latest commit. I also removed the required workspace, but that means we cannot leverage VML functions and need to rely on the compiler for vectorization.
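A rough sketch of that direction, with assumed names and structure (not the committed code): Welford mean/variance in one pass, then a single fused normalize/scale/shift pass per row, so no workspace is required and the inner loops are left to the compiler to vectorize.

// Illustration only -- not the actual commit. One row of LayerNorm with fused
// loops and no temporary workspace.
#include <cmath>
#include <cstdint>

using index_t = int64_t;  // stand-in for MXNet's index type in this standalone sketch

template <typename DType>
static void LayerNormRowSketch(const index_t n, const DType *in, DType *out,
                               const DType *gamma, const DType *beta,
                               DType *mean, DType *var, const DType eps) {
  DType mean_v = 0, sigma2 = 0;
  for (index_t j = 0; j < n; ++j) {  // pass 1: Welford running mean/variance
    DType old_mean = mean_v;
    mean_v += (in[j] - old_mean) / (j + 1);
    sigma2 += (in[j] - old_mean) * (in[j] - mean_v);
  }
  const DType std_inv = DType(1) / std::sqrt(sigma2 / n + eps);
  for (index_t j = 0; j < n; ++j) {  // pass 2: fused normalize + scale + shift
    out[j] = (in[j] - mean_v) * std_inv * gamma[j] + beta[j];
  }
  mean[0] = mean_v;
  var[0] = sigma2 / n;
}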
@sxjscience Can you help to review? Here is an optimization for CPU LayerNorm.
…to vml Conflicts: src/operator/nn/layer_norm-inl.h
LayerNorm performance is measured on my Skylake (SKL) machine. Shapes are from the BERT base and large models respectively. The speedup from this PR is around 3x~10x. @eric-haibin-lin @sxjscience @pengzhao-intel
@sxjscience @eric-haibin-lin any other comments? If not, I will merge this PR soon for the 1.5 release.
…ised) mathematic computation (apache#14893)

* mkl_func test with erf&log op, build success~
* fix lint and build issues
* Try to add support to sparse array
* fix build
* add functions
* Fix review comments
* remove unecessary code
* Update test case
* minor fix
* move the position of MKL_Compute
* mkl_func test with erf&log op, build success~
* fix lint and build issues
* Try to add support to sparse array
* fix build
* Fix review comments
* remove unecessary code
* Update test case
* minor fix
* add functions
* move the position of MKL_Compute
* fix cpplint
* cpp lint
* trigger ci
* address comments
* coding style
* enable layernorm
* fix windows build
* revert changes to FComputeEx
* int -> index_t
* remove workspace
* fix lint
* clean code
Description
Intel MKL provides a wide range of generic vectorized math (VML) functions that benefit from AVX512 instructions. By integrating the VML functions, some of the element-wise operations are expected to be sped up.
Currently, the generic math computations are implemented by a series of mshadow OPs which encapsulate the functions provided by the standard math library; vectorization over the input is handled through dense/sparse tensors, and the computations are parallelized with OpenMP. Specifically,
Element-wise OPs taking one or two inputs, with "write-to"/"write-inplace" request types, can be supported by MKL VML functions.
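For illustration, a simplified contrast between a per-element libm loop and a single VML call over the whole buffer (function names below are assumed; the real operators are registered through mshadow kernel structs):

// Illustration only: scalar libm loop vs. one MKL VML call on a dense buffer.
#include <cmath>
#include <mkl.h>

void LogScalarLoop(const int n, const float *in, float *out) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    out[i] = std::log(in[i]);  // one libm call per element
  }
}

void LogWithVML(const int n, const float *in, float *out) {
  vsLn(n, in, out);  // one vectorized VML call over the whole buffer
}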
@TaoLv @pengzhao-intel
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
The unit tests are provided with the changes.