performance degradation from 1.3.1 to 1.4.0 #14496
So, a question for the community: are others seeing performance issues moving from 1.3.1 to 1.4.0? And for which operators do you see issues?
@mxnet-label-bot add [Backend, Performance]
I could verify the performance degradation using Sam's script on the transpose operator. The slowdown is mainly due to the difference in arithmetic performance between the int32_t and int64_t data types. I changed the index data type in the mshadow Tensor object from index_t (a typedef for int64_t) to int32_t, and the runtime of the transpose operator came back to almost the same as in 1.3.1. I am doing the following experiment:
I will post the experiment results and, based on them, see how we can generalize one of these approaches. @pengzhao-intel Are there any compiler optimization flags for int64_t data types on Intel architectures? Also, how does MKLDNN handle int64_t performance? Your insight will be appreciated.
Although option 1 will enlarge the binary library, I am in favor of it in performance-sensitive code. I wrote an example: https://github.com/wkcn/incubator-mxnet/blob/support_large_array/src/operator/mxnet_op.h#L41
@apeforest |
Just to clarify: for the data type change from 32-bit to 64-bit, we're talking about the tensor shape data type, not the actual tensor values. So, for example, @apeforest's change allowed a 64-bit shape data type for a tensor with float elements. And this performance degradation for transpose was measured with pure-C++ (mshadow) implementations, not MKLDNN. @pengzhao-intel does MKLDNN support 64-bit tensor shapes? I can't imagine that MXNet's changes from 1.3.1 to 1.4.0 reduced the performance of MKLDNN operators, but has anyone verified whether there were any issues for MKLDNN ops compared to older versions?
@sun-dev The performance issue discussed here is NOT related to MKLDNN. We have verified the performance of MKLDNN; please see the link below. MKLDNN doesn't support 64-bit tensor shapes yet; it is supposed to fall back to the original CPU implementation. Feel free to let me know if you see any issues in MKLDNN. Please refer to this guide to check whether an MKLDNN op is used: https://mxnet.incubator.apache.org/versions/master/tutorials/mkldnn/MKLDNN_README.html#4
@sun-dev is correct. The data type in question is the shape/index data type, not the data type of the tensor elements. There are additions, multiplications, and divisions involved in the transpose operator here. I did a performance comparison of different arithmetic operations between 32-bit and 64-bit integers on CPU, and there are noticeable differences, shown below. FYI, you can use this code to reproduce.
Is there any fix for transpose? I noticed that transpose now takes a significant amount of time in BERT.
@eric-haibin-lin FYI, #14545 (comment) improved the performance of transpose a lot via MKLDNN.
@samskalicky could you try the latest master and see if the performance is improved? |
resolved by #14570 |
There appears to be some performance degradation between the 1.3.1 and 1.4.0 releases. So far we know of the imdecode and transpose operators having reduced performance.
We've tracked the responsible PRs for these operators to:
Transpose:
https://github.com/dmlc/mshadow/pull/359/files
#11742
Imdecode:
#8757
I'm using the cat.jpg from here as input data:
https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg
Here's the current performance benchmarking script for imdecode:
And here are the performance results using mxnet-cu90mkl for 1.4.0 and 1.3.1:
Setting flag = 1 + 128 (instead of just 1) as an argument to imdecode produces the following results:
So there is some additional work going on that makes imdecode take longer. This is a "feature" of the v1.4 release that changed the defaults, which is where some of the performance degradation is happening, at least in the imdecode function.
Here's the current performance benchmarking script for transpose:
And here are the performance results using mxnet-cu90mkl for 1.4.0 and 1.3.1: