
performance degradation from 1.3.1 to 1.4.0 #14496

Closed
samskalicky opened this issue Mar 21, 2019 · 15 comments · Fixed by #14570
Labels: Backend, Performance

Comments

@samskalicky
Contributor

samskalicky commented Mar 21, 2019

There appears to be some performance degradation between the 1.3.1 and 1.4.0 releases. So far we know of the imdecode and transpose operators having reduced performance.

We've tracked the responsible PRs for these operators to:
Transpose:
https://github.com/dmlc/mshadow/pull/359/files
#11742

Imdecode:
#8757

I'm using the cat.jpg from here as input data:
https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg

Here's the current performance benchmarking script for imdecode:

import time
import mxnet as mx
from mxnet import image as img
import numpy as np

flag=1
to_rgb=True
out=None

dtimes = []
for i in range(1000):
    buf = open("cat.jpg", 'rb').read()

    start = time.time()
    img.imdecode(buf, flag, to_rgb, out)
    decode = (time.time() - start)*1000
    dtimes.append(decode)

print('mxnet version: %s' % mx.__version__)
print('decode:---------')
print('p50: %4.2f ms' % np.percentile(dtimes,50))
print('p90: %4.2f ms' % np.percentile(dtimes,90))
print('p99: %4.2f ms' % np.percentile(dtimes,99))

And here are the performance results using mxnet-cu90mkl for 1.4.0 and 1.3.1:

mxnet version: 1.4.0
decode:--------- 
p50: 13.75 ms 
p90: 15.74 ms 
p99: 19.89 ms 

mxnet version: 1.3.1 
decode:--------- 
p50: 12.82 ms 
p90: 13.59 ms 
p99: 13.88 ms 

Setting flag = 1 + 128 (instead of just 1) as an argument to imdecode gives the following results:

mxnet version: 1.4.0 
load:--------- 
p50: 0.10 ms 
p90: 0.16 ms 
p99: 0.18 ms 
decode:--------- 
p50: 13.27 ms 
p90: 13.83 ms 
p99: 15.63 ms 

So there is some additional work going on that makes imdecode take longer. This "feature" of the v1.4 release changed the defaults, which is where at least some of the performance degradation in imdecode is coming from.

Here's the current performance benchmarking script for transpose:

import mxnet as mx
import time
import numpy as np

sizes = [10, 50, 100, 200, 500]
iters = [10000, 1000, 500, 200, 20]
times = []
for idx, s in enumerate(sizes):
    data = []
    print(s)
    for i in range(iters[idx]):
        x = mx.nd.ones((s, s, s))
        mx.nd.waitall()
        start = time.time()
        y = mx.nd.transpose(x, (2, 0, 1))
        mx.nd.waitall()
        data.append((time.time() - start) * 1000)
        # print(data[-1])
    times.append(data)

print('mxnet version: %s' % mx.__version__)
for idx, s in enumerate(sizes):
    print('--------------------')
    print('size: %s' % s)
    print('p50: %4.2f ms' % np.percentile(times[idx], 50))
    print('p90: %4.2f ms' % np.percentile(times[idx], 90))
    print('p99: %4.2f ms' % np.percentile(times[idx], 99))

And here are the performance results using mxnet-cu90mkl for 1.4.0 and 1.3.1:

mxnet version: 1.4.0 
-------------------- 
size: 10 
p50: 0.04 ms 
p90: 0.04 ms 
p99: 0.05 ms 
-------------------- 
size: 50 
p50: 1.08 ms 
p90: 1.09 ms 
p99: 1.13 ms 
-------------------- 
size: 100 
p50: 12.50 ms 
p90: 12.53 ms 
p99: 13.52 ms 
-------------------- 
size: 200 
p50: 123.06 ms 
p90: 125.63 ms 
p99: 125.95 ms 
-------------------- 
size: 500 
p50: 2768.49 ms 
p90: 2797.46 ms 
p99: 2809.45 ms 


mxnet version: 1.3.1 
-------------------- 
size: 10 
p50: 0.03 ms 
p90: 0.04 ms 
p99: 0.04 ms 
-------------------- 
size: 50 
p50: 0.36 ms 
p90: 0.37 ms 
p99: 0.37 ms 
-------------------- 
size: 100 
p50: 2.79 ms 
p90: 2.90 ms 
p99: 3.97 ms 
-------------------- 
size: 200 
p50: 46.05 ms 
p90: 50.07 ms 
p99: 50.11 ms 
-------------------- 
size: 500 
p50: 1094.50 ms 
p90: 1095.89 ms 
p99: 1096.44 ms 
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

@samskalicky
Contributor Author

So a question for the community: are others seeing performance issues moving from 1.3.1 to 1.4.0?

And for which operators do you see issues?

@zachgk
Contributor

zachgk commented Mar 21, 2019

@mxnet-label-bot add [Backend, Performance]

@marcoabreu added the Backend and Performance labels Mar 21, 2019
@pengzhao-intel
Contributor

@apeforest @zheng-da

@apeforest
Contributor

apeforest commented Mar 23, 2019

I could verify the performance degradation using Sam's script on the transpose operator.

The slowdown is mainly due to the difference in arithmetic performance between the int32_t and int64_t data types. I changed the index data type in the mshadow Tensor object from index_t (which is typedef'd to int64_t) to int32_t, and the runtime of the transpose operator comes back to almost the same as in 1.3.1.

I am doing the following experiment:

  1. casting tensor index data type to int32_t at runtime based on tensor size
  2. using an env variable to select between a large tensor object (indexed by int64_t) and a small tensor object (indexed by int32_t)

I will post the experiment results and, based on them, see how we can generalize one of these approaches.
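
As a rough illustration of option 1 (a minimal sketch with hypothetical helper names, not the actual MXNet/mshadow code), the idea is to pick the loop index type at the call site based on the tensor size:

#include <cstdint>

// Hypothetical dispatcher: use 32-bit indexing when the tensor is small
// enough, and only pay for 64-bit indexing on very large tensors.
template <typename Kernel, typename... Args>
void LaunchWithIndexType(int64_t size, Args... args) {
  if (size <= static_cast<int64_t>(INT32_MAX)) {
    Kernel::template Run<int32_t>(static_cast<int32_t>(size), args...);
  } else {
    Kernel::template Run<int64_t>(size, args...);
  }
}

// Example kernel whose loop index type is chosen by the dispatcher.
struct CopyKernel {
  template <typename IndexT>
  static void Run(IndexT n, const float* in, float* out) {
    for (IndexT i = 0; i < n; ++i) out[i] = in[i];
  }
};

int main() {
  float in[4] = {1, 2, 3, 4}, out[4];
  LaunchWithIndexType<CopyKernel>(4, in, out);  // small size -> int32_t path
  return 0;
}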

@pengzhao-intel Are there any compiler optimization flags for int64_t data types on Intel architectures? Also, how does MKLDNN handle int64_t performance? Your insight would be appreciated.

@wkcn
Member

wkcn commented Mar 23, 2019

Although option 1 will enlarge the size of the binary library, I agree with it for performance-sensitive code.

I wrote an example: https://github.com/wkcn/incubator-mxnet/blob/support_large_array/src/operator/mxnet_op.h#L41

@wkcn
Member

wkcn commented Mar 24, 2019

Although option 1 will enlarge the size of the binary library, I am in favor of it for performance-sensitive code.

@pengzhao-intel
Contributor

@apeforest
Regarding INT64, MKL-DNN only supports INT32 for now. The next major version, 1.0, will support INT64 by default.
On the other hand, the Intel MKL library can handle INT64 (libmkl_intel_ilp64.a); to use it, build MXNet with CFLAGS += -DMKL_ILP64.

@sun-dev

sun-dev commented Mar 25, 2019

Just to clarify, for the data type change from 32-bit to 64-bit, we're talking about the tensor shape data type, not the actual tensor values. So, for example, @apeforest's change allowed a 64-bit shape data type for a tensor with float elements.

And this performance degradation for transpose was measured with pure-C++ (mshadow) implementations, not MKLDNN.

@pengzhao-intel does MKLDNN support 64-bit tensor shapes?

I can't imagine that MXNet's changes from 1.3.1 to 1.4.0 reduced the performance of MKLDNN operators, but has anyone verified whether there are any issues for MKLDNN ops compared to older versions?

@pengzhao-intel
Contributor

@sun-dev The performance issue discussed here is NOT related to MKLDNN. We have verified the performance of MKLDNN; please see the link below.

https://cwiki.apache.org/confluence/display/MXNET/MXNet+with+Intel+MKL-DNN+-+Performance+Benchmarking

MKLDNN doesn't support 64-bit tensor shapes right now. It is supposed to fall back to the original CPU implementation.

Feel free to let me know if you see any issues in MKLDNN.

Please refer to this page to check whether an MKLDNN op is used: https://mxnet.incubator.apache.org/versions/master/tutorials/mkldnn/MKLDNN_README.html#4

@apeforest
Contributor

apeforest commented Mar 27, 2019

@sun-dev is correct. The data type in question is the shape/index data type, not the data type of the tensor elements. There are additions, multiplications and divisions on this type involved in the transpose operator here.
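
To make that concrete, here is a simplified sketch (not the actual mshadow kernel) of a 2-D transpose; every element copy does a divide, a modulo and two multiplies on the index type, so widening index_t from 32-bit to 64-bit makes all of that per-element arithmetic more expensive:

#include <cstdint>

using index_t = int64_t;  // simplified stand-in for mshadow's index type in 1.4.0

// Simplified 2-D transpose: each element copy performs a divide, a modulo
// and two multiplies on index_t to compute source and destination offsets.
void transpose2d(const float* src, float* dst, index_t rows, index_t cols) {
  for (index_t idx = 0; idx < rows * cols; ++idx) {
    index_t r = idx / cols;
    index_t c = idx % cols;
    dst[c * rows + r] = src[r * cols + c];
  }
}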

I did a performance comparison of different arithmetic operations between 32-bit and 64-bit integers on CPU. There are noticeable differences, shown below. FYI, you can use this code to reproduce.

result = 49995000
Add 32 time in clocks 24869
Add 32 time in ms 1359
result = 49995000
Add 64 time in clocks 6070
Add 64 time in ms 1971
result = 349965000
Add Mul 32 time in clocks 3601
Add Mul 32 time in ms 1196
result = 349965000
Add Mul 64 time in clocks 9967
Add Mul 64 time in ms 3477
result = 7137858
Add Div 32 time in clocks 8273
Add Div 32 time in ms 2878
result = 7137858
Add Div 64 time in clocks 24016
Add Div 64 time in ms 8499
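
For reference, a minimal sketch of this kind of microbenchmark (my own reconstruction under assumed details, not the exact code linked above) could look like:

#include <chrono>
#include <cstdint>
#include <cstdio>

// Time a simple add/multiply/divide loop for a given integer type.
template <typename IntT>
void bench(const char* name) {
  auto start = std::chrono::steady_clock::now();
  volatile IntT acc = 0;  // volatile keeps the loop from being optimized away
  for (IntT i = 1; i < 10000; ++i) {
    acc = acc + i * 7 / 3;
  }
  auto end = std::chrono::steady_clock::now();
  auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  std::printf("%s: %lld us (result = %lld)\n", name,
              static_cast<long long>(us), static_cast<long long>(acc));
}

int main() {
  bench<int32_t>("add/mul/div 32-bit");
  bench<int64_t>("add/mul/div 64-bit");
  return 0;
}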

@eric-haibin-lin
Member

Is there any fix for transpose? I noticed that transpose now takes a significant amount of time in BERT.

@pengzhao-intel
Contributor

@eric-haibin-lin FYI, #14545 (comment) improved the performance of transpose a lot via MKLDNN.

@pengzhao-intel
Contributor

@samskalicky could you try the latest master and see if the performance is improved?

@apeforest
Contributor

Resolved by #14570.
