performance degradation from 1.3.1 to 1.4.0 #14496
So, a question for the community: are others seeing performance issues moving from 1.3.1 to 1.4.0? And for which operators do you see issues?
@mxnet-label-bot add [Backend, Performance]
I could verify the performance degradation using Sam's script on the transpose operator. The slowdown is mainly due to the difference in arithmetic performance between the int32_t and int64_t data types. I changed the index data type in the mshadow Tensor object from index_t (a typedef for int64_t) to int32_t, and the runtime of the transpose operator came back to almost the same as in 1.3.1. I am doing the following experiment:
I will post the experiment results and, based on them, see how we can generalize one of these approaches. @pengzhao-intel Are there any compiler optimization flags for int64_t data types on Intel architectures? Also, how does MKLDNN handle int64_t performance? Your insight will be appreciated.
Although option 1 will enlarge the binary library, I am in favor of it in performance-sensitive code. I wrote an example: https://github.com/wkcn/incubator-mxnet/blob/support_large_array/src/operator/mxnet_op.h#L41
@apeforest |
Just to clarify: for the data type change from 32-bit to 64-bit, we're talking about the tensor shape data type, not the actual tensor values. So, for example, @apeforest's change allowed a 64-bit shape data type for a tensor with float elements. And this performance degradation for transpose was measured with pure-C++ (mshadow) implementations, not MKLDNN. @pengzhao-intel does MKLDNN support 64-bit tensor shapes? I can't imagine that MXNet's changes from 1.3.1 to 1.4.0 reduced the performance of MKLDNN operators, but has anyone verified whether there were any issues for MKLDNN ops compared to older versions?
@sun-dev The performance issue discussed here is NOT related to MKLDNN. We have verified the performance of MKLDNN; please see the link below. MKLDNN doesn't support 64-bit tensor shapes yet; it is supposed to fall back to the original CPU implementation. Feel free to let me know if you see any issues in MKLDNN. Please refer to this guide to check whether an MKLDNN op is used: https://mxnet.incubator.apache.org/versions/master/tutorials/mkldnn/MKLDNN_README.html#4
@sun-dev is correct. The data type in question is the shape/index data type, not the data type of the tensor elements. There are additions, multiplications, and divisions involved in the transpose operator here. I did a performance comparison of different arithmetic operations between 32-bit and 64-bit integers on CPU, and there are noticeable differences, shown below. FYI, you can use this code to reproduce.
Is there any fix for transpose? I noticed that transpose now takes a significant amount of time in BERT.
@eric-haibin-lin FYI, #14545 (comment) improved the performance of transpose a lot via MKLDNN.
@samskalicky could you try the latest master and see if the performance is improved? |
resolved by #14570 |
There appears to be some performance degradation between the 1.3.1 and 1.4.0 releases. So far we know of the imdecode and transpose operators having reduced performance.
We've tracked the responsible PRs for these operators to:
Transpose:
https://github.com/dmlc/mshadow/pull/359/files
#11742
Imdecode:
#8757
I'm using the cat.jpg from here as input data:
https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg
Here's the current performance benchmarking script for imdecode:
And here are the performance results using mxnet-cu90mkl for 1.4.0 and 1.3.1:
Setting flag = 1 + 128 (instead of just 1) as an argument to imdecode produces the following results:
So there is some additional work going on that makes imdecode take longer. This is a "feature" of the v1.4 release that changed the defaults, which is where some of the performance degradation is happening, at least in the imdecode function.
Here's the current performance benchmarking script for transpose:
And here are the performance results using mxnet-cu90mkl for 1.4.0 and 1.3.1: