Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

support mkl log when dtype is fp32 or fp64 #13150

Merged
merged 6 commits into from
Dec 14, 2018
Merged

Conversation

XiaotaoChen
Copy link
Contributor

Description

  1. support MKL Log implemented with vectorization;
  2. when data's type is fp32 or fp64 and req is kWriteTo , mx.nd.log will call MKL Log. otherwise, it will call the default implementation.
  3. According to the statistics in c code, it gains 2-10 times speedup with different size data. in python layer, it can gain 1.2-4.4 times speedup. detail data as below.
  4. speedup in Sockeye model : according profile info: nd.log gains 2.67 times speedup compared with the default implementation. the total time of log operator dropped from 11.5s to 4.3s. profile info as below.

@pengzhao-intel

detail data

hardware: skx-8180 single socket(28 cores)

statistics in c++ code

shape size float32(us) double(us)
default-cal-time MKLLog-cal-time cal-speedup default-cal-time MKLLog-cal-time cal-speedup
(1000, 1) 1000 6.02333 0.7816 7.706409928 7.00213 1.58913 4.406266322
(1000, 10) 10000 9.4944 4.0589 2.339155929 18.8873 10.0668 1.876197004
(1000, 100) 100000 47.3687 35.2625 1.343316554 137.633 30.9184 4.451491668
(1000, 1000) 1000000 390.641 64.6105 6.04609158 1309 127.027 10.30489581
(1000, 10000) 10000000 4516.87 1601.08 2.821139481 13872.1 3071.23 4.51678969
(1000, 100000) 100000000 41938.7 15814.1 2.65198146 135445 30538.2 4.435264685

statistics in python front end

here is test script

shape size float32(us) double(us)
default-callback-time MKLLog-callback-time callback-speedup default-callback-time MKLLog-callback-time callback-speedup
(1000, 1) 1000 40.650368 31.105677 1.306847236 40.729841 31.542778 1.291257257
(1000, 10) 10000 43.606758 34.761429 1.254458152 53.819021 41.135152 1.308346229
(1000, 100) 100000 119.781494 76.246262 1.570981854 203.490257 77.493985 2.625884538
(1000, 1000) 1000000 1453.781128 1190.233231 1.221425423 2601.218224 1967.120171 1.322348407
(1000, 10000) 10000000 8755.636215 8626.850446 1.014928481 14128.685 8942.484856 1.579950677
(1000, 100000) 100000000 42178.8613 16119.05098 2.616708723 135685.5551 30782.89032 4.407823753

sockeye profile on skx-8180 single socket

offical master

Time of each OP:
FullyConnected         47266.636 ms 	0.77306329528     ms/call 	61142  calls 	27.81 %
CopyCPU2CPU            23471.975 ms 	0.640069128194    ms/call 	36671  calls 	13.81 %
SliceChannel           18618.516 ms 	0.611847387447    ms/call 	30430  calls 	10.96 %
Reshape                17875.549 ms 	2.03895848067     ms/call 	8767   calls 	10.52 %
where                  12352.738 ms 	2.36099732416     ms/call 	5232   calls 	7.27 %
log                    11476.426 ms 	6.58051949541     ms/call 	1744   calls 	6.75 %
take                   6019.863  ms 	0.153309810014    ms/call 	39266  calls 	3.54 %
softmax                4899.333  ms 	1.4046252867      ms/call 	3488   calls 	2.88 %
Activation             4117.32   ms 	0.0284441558262   ms/call 	144751 calls 	2.42 %
_mul_scalar            4105.616  ms 	2.35413761468     ms/call 	1744   calls 	2.42 %
elemwise_add           3983.712  ms 	0.0659664182812   ms/call 	60390  calls 	2.34 %
Concat                 3906.116  ms 	0.22158588609     ms/call 	17628  calls 	2.30 %
broadcast_add          2765.45   ms 	1.5440815187      ms/call 	1791   calls 	1.63 %
batch_dot              2552.298  ms 	0.731736811927    ms/call 	3488   calls 	1.50 %
DeleteVariable         1857.73   ms 	0.0294401128332   ms/call 	63102  calls 	1.09 %
LayerNorm              1476.755  ms 	0.591648637821    ms/call 	2496   calls 	0.87 %
repeat                 1177.079  ms 	1.19258257345     ms/call 	987    calls 	0.69 %
elemwise_mul           1133.412  ms 	0.0132791114548   ms/call 	85353  calls 	0.67 %
_slice_assign          364.709   ms 	0.121246343085    ms/call 	3008   calls 	0.21 %
SetupExec              70.361    ms 	0.000782510537496 ms/call 	89917  calls 	0.04 %
Dropout                57.91     ms 	0.0163818953324   ms/call 	3535   calls 	0.03 %
Embedding              52.36     ms 	0.0292350642099   ms/call 	1791   calls 	0.03 %
_full                  52.006    ms 	0.0282948857454   ms/call 	1838   calls 	0.03 %
sum                    41.332    ms 	0.0230776102736   ms/call 	1791   calls 	0.02 %
stack                  37.153    ms 	0.263496453901    ms/call 	141    calls 	0.02 %
SequenceMask           34.81     ms 	0.0199598623853   ms/call 	1744   calls 	0.02 %
expand_dims            22.422    ms 	0.00267949330784  ms/call 	8368   calls 	0.01 %
_equal_scalar          21.718    ms 	0.00311324541284  ms/call 	6976   calls 	0.01 %
WaitForVar             20.932    ms 	0.00427969740339  ms/call 	4891   calls 	0.01 %
broadcast_logical_or   17.47     ms 	0.00500860091743  ms/call 	3488   calls 	0.01 %
broadcast_logical_and  11.335    ms 	0.0064994266055   ms/call 	1744   calls 	0.01 %
_zeros                 10.968    ms 	0.0101461609621   ms/call 	1081   calls 	0.01 %
SwapAxis               10.414    ms 	0.110787234043    ms/call 	94     calls 	0.01 %
_greater_equal         9.397     ms 	0.00538818807339  ms/call 	1744   calls 	0.01 %
broadcast_logical_xor  9.289     ms 	0.00532626146789  ms/call 	1744   calls 	0.01 %
elemwise_div           8.07      ms 	0.00462729357798  ms/call 	1744   calls 	0.00 %
logical_not            7.188     ms 	0.00412155963303  ms/call 	1744   calls 	0.00 %
SequenceReverse        6.896     ms 	0.0733617021277   ms/call 	94     calls 	0.00 %
_div_scalar            6.207     ms 	0.00355905963303  ms/call 	1744   calls 	0.00 %
tile                   3.75      ms 	0.0797872340426   ms/call 	47     calls 	0.00 %
SequenceLast           1.559     ms 	0.033170212766    ms/call 	47     calls 	0.00 %
argsort                0.786     ms 	0.0167234042553   ms/call 	47     calls 	0.00 %
broadcast_to           0.763     ms 	0.0162340425532   ms/call 	47     calls 	0.00 %
broadcast_not_equal    0.513     ms 	0.010914893617    ms/call 	47     calls 	0.00 %
_slice_assign_scalar   0.423     ms 	0.009             ms/call 	47     calls 	0.00 %
_ones                  0.346     ms 	0.00736170212766  ms/call 	47     calls 	0.00 %
_unravel_index         0.333     ms 	0.00708510638298  ms/call 	47     calls 	0.00 %
Cast                   0.254     ms 	0.00540425531915  ms/call 	47     calls 	0.00 %
zeros_like             0.229     ms 	0.00243617021277  ms/call 	94     calls 	0.00 %

Total OP Time: 169938.42700000 ms

mkl log

Time of each OP:
FullyConnected         48077.779 ms     0.786329838736    ms/call       61142  calls    28.44 %
CopyCPU2CPU            23618.186 ms     0.644056229718    ms/call       36671  calls    13.97 %
SliceChannel           19328.791 ms     0.635188662504    ms/call       30430  calls    11.43 %
Reshape                17722.157 ms     2.02146195962     ms/call       8767   calls    10.48 %
where                  12821.864 ms     2.45066207951     ms/call       5232   calls    7.58 %
take                   9492.064  ms     0.24173748281     ms/call       39266  calls    5.61 %
softmax                5048.474  ms     1.44738360092     ms/call       3488   calls    2.99 %
Activation             4407.896  ms     0.0304515754641   ms/call       144751 calls    2.61 %
_mul_scalar            4269.909  ms     2.44834231651     ms/call       1744   calls    2.53 %
log                    4144.673  ms     2.37653268349     ms/call       1744   calls    2.45 %
elemwise_add           4128.241  ms     0.0683596787548   ms/call       60390  calls    2.44 %
Concat                 4081.049  ms     0.231509473565    ms/call       17628  calls    2.41 %
broadcast_add          2975.819  ms     1.66154048018     ms/call       1791   calls    1.76 %
batch_dot              2289.026  ms     0.656257454128    ms/call       3488   calls    1.35 %
DeleteVariable         1865.181  ms     0.0295567863085   ms/call       63105  calls    1.10 %
LayerNorm              1521.639  ms     0.609631009615    ms/call       2496   calls    0.90 %
elemwise_mul           1155.257  ms     0.013535048563    ms/call       85353  calls    0.68 %
repeat                 1141.611  ms     1.15664741641     ms/call       987    calls    0.68 %
_slice_assign          410.251   ms     0.136386635638    ms/call       3008   calls    0.24 %
Embedding              103.998   ms     0.058067001675    ms/call       1791   calls    0.06 %
SetupExec              73.053    ms     0.000812449258761 ms/call       89917  calls    0.04 %
Dropout                61.236    ms     0.0173227722772   ms/call       3535   calls    0.04 %
_full                  51.833    ms     0.0282007616975   ms/call       1838   calls    0.03 %
sum                    39.877    ms     0.0222652149637   ms/call       1791   calls    0.02 %
stack                  36.832    ms     0.261219858156    ms/call       141    calls    0.02 %
SequenceMask           36.207    ms     0.0207608944954   ms/call       1744   calls    0.02 %
expand_dims            22.735    ms     0.00271689770554  ms/call       8368   calls    0.01 %
_equal_scalar          22.159    ms     0.00317646215596  ms/call       6976   calls    0.01 %
WaitForVar             19.248    ms     0.00393378295524  ms/call       4893   calls    0.01 %
broadcast_logical_or   18.373    ms     0.00526748853211  ms/call       3488   calls    0.01 %
broadcast_logical_and  11.88     ms     0.0068119266055   ms/call       1744   calls    0.01 %
SwapAxis               10.695    ms     0.113776595745    ms/call       94     calls    0.01 %
_zeros                 10.567    ms     0.00977520814061  ms/call       1081   calls    0.01 %
_greater_equal         9.364     ms     0.00536926605505  ms/call       1744   calls    0.01 %
broadcast_logical_xor  9.056     ms     0.00519266055046  ms/call       1744   calls    0.01 %
elemwise_div           7.75      ms     0.00444380733945  ms/call       1744   calls    0.00 %
logical_not            7.266     ms     0.00416628440367  ms/call       1744   calls    0.00 %
SequenceReverse        7.196     ms     0.0765531914894   ms/call       94     calls    0.00 %
_div_scalar            6.596     ms     0.00378211009174  ms/call       1744   calls    0.00 %
tile                   3.689     ms     0.0784893617021   ms/call       47     calls    0.00 %
SequenceLast           1.646     ms     0.0350212765957   ms/call       47     calls    0.00 %
_slice_assign_scalar   1.079     ms     0.0229574468085   ms/call       47     calls    0.00 %
broadcast_to           1.013     ms     0.0215531914894   ms/call       47     calls    0.00 %
argsort                0.761     ms     0.0161914893617   ms/call       47     calls    0.00 %
broadcast_not_equal    0.547     ms     0.0116382978723   ms/call       47     calls    0.00 %
_ones                  0.395     ms     0.00840425531915  ms/call       47     calls    0.00 %
_unravel_index         0.327     ms     0.00695744680851  ms/call       47     calls    0.00 %
Cast                   0.262     ms     0.00557446808511  ms/call       47     calls    0.00 %
zeros_like             0.227     ms     0.00241489361702  ms/call       94     calls    0.00 %

Total OP Time: 169075.73400000 ms

@pengzhao-intel
Copy link
Contributor

@fhieber @tdomhan to aware :)

@szha @eric-haibin-lin could you help take a review?

@anirudhacharya
Copy link
Member

@mxnet-label-bot [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Nov 7, 2018
@kalyc
Copy link
Contributor

kalyc commented Nov 13, 2018

@nswamy @sandeep-krishnamurthy @szha could you please help with reviewing the PR?

@stu1130
Copy link
Contributor

stu1130 commented Nov 21, 2018

@azai91 @mseth10 @samskalicky ping for review

@vandanavk
Copy link
Contributor

@mxnet-label-bot add [MKL]

@marcoabreu marcoabreu added the MKL label Nov 28, 2018
@pengzhao-intel
Copy link
Contributor

@TaoLv please help take a review, thanks.


#if MSHADOW_USE_MKL == 1
#define MKLLOG(fname, DType) \
static void MKLLog(size_t size, const DType* pIn, DType* pOut) { \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Macro has the same name with function. Using macro here seems a little bit overkill for me. How about:

#if MSHADOW_USE_MKL == 1
static inline void MKL_Log(int size, const float* pIn, float* pOut) {
  vsLn(size, pIn, pOut);
}

static inline void MKL_Log(int size, const double* pIn, double* pOut) {
  vdLn(size, pIn, pOut);
}
#endif

Also, please note that vsLn/vdLn take MKL_INT as the first input. size_t might overflow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it's done

Copy link
Contributor

@sandeep-krishnamurthy sandeep-krishnamurthy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contributions.
Are there existing test cases to test log with mkl? Or, do we need more tests to be added?

@@ -940,7 +940,7 @@ The storage type of ``exp`` output is always dense
.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseOut{"_mul"});

// log
MXNET_OPERATOR_REGISTER_UNARY_WITH_SPARSE_DR(log, cpu, mshadow_op::log)
MXNET_OPERATOR_REGISTER_UNARY(log)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I do not understand this change. Can you please explain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the defination of MXNET_OPERATOR_REGISTER_UNARY_WITH_SPARSE_DR in elemwise_unary_op.h is here:

#define MXNET_OPERATOR_REGISTER_UNARY_WITH_SPARSE_DR(__name$, __xpu$, __kernel$)        \
  MXNET_OPERATOR_REGISTER_UNARY(__name$)                                                \
  .set_attr<FCompute>("FCompute<" #__xpu$ ">", UnaryOp::Compute<__xpu$, __kernel$>)

it will set FCompute attribute to UnaryOp::Compute automatically. To call Our LogCompute, we replace it with below codes.

MXNET_OPERATOR_REGISTER_UNARY(log)
.set_attr<FCompute>("FCompute<cpu>", UnaryOp::LogCompute<cpu, mshadow_op::log>)

Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm pending other comments to address

@@ -348,6 +351,40 @@ class UnaryOp : public OpBase {
LogUnimplementedOp(attrs, ctx, inputs, req, outputs);
}
}

#if MSHADOW_USE_MKL == 1
static inline void MKL_Log(int size, const float* pIn, float* pOut) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int: rename to MKLLog

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eric-haibin-lin Do we have any recommended convention for function/class name definition in MXNet? Just not feel very comfortable with 4 consecutive capital letters. :(

vsLn(size, pIn, pOut);
}

static inline void MKL_Log(int size, const double* pIn, double* pOut) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MKL_Log

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question (outside the scope of this PR) to the intel team. Is there any plan to add MKL specific tests to the CI ? Currently we are completely blind to it and we have also experienced issues reported but cannot be caught in PRs.

@pengzhao-intel
Copy link
Contributor

@anirudh2290 good question.
Yes, we have got the agreement with @marcoabreu @azai91 to add Intel MKL into CI for better coverage. @xinyu-intel will submit PR soon.

@anirudh2290
Copy link
Member

@pengzhao-intel Thanks for the info. Sounds good !

@xinyu-intel xinyu-intel mentioned this pull request Dec 11, 2018
7 tasks
@TaoLv
Copy link
Member

TaoLv commented Dec 12, 2018

@XiaotaoChen since #13607 is merged, please rebase code and retrigger CI. Make sure that unit test for log operator can correctly run into MKL BLAS.

const size_t MKL_INT_MAX = (sizeof(MKL_INT) == sizeof(int)) ? INT_MAX : LLONG_MAX;
size_t input_size = inputs[0].Size();
if (req[0] == kWriteTo && (type_flag == mshadow::kFloat32
|| type_flag == mshadow::kFloat64) && input_size <= MKL_INT_MAX) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix indent.

|| type_flag == mshadow::kFloat64) && input_size <= MKL_INT_MAX) {
MSHADOW_SGL_DBL_TYPE_SWITCH(type_flag, DType, {
MKLLog(input_size, inputs[0].dptr<DType>(), outputs[0].dptr<DType>());
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need ; here.

#if MSHADOW_USE_MKL == 1
#include "mkl.h"
#endif
#include<climits>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need space before < and put this include before #include "./cast_storage-inl.h"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all is done

Copy link
Member

@TaoLv TaoLv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Seems CI status is not reflected on the page. Could you retrigger it by an empty commit?

@TaoLv
Copy link
Member

TaoLv commented Dec 14, 2018

Thank you for the contribution. Now merging.

@TaoLv TaoLv merged commit 30655f9 into apache:master Dec 14, 2018
@XiaotaoChen XiaotaoChen deleted the mkl-log branch September 28, 2019 10:25
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
MKL pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.