This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

CachedOp performance regression #15067

Open
lanking520 opened this issue May 24, 2019 · 6 comments
Comments

@lanking520
Member

lanking520 commented May 24, 2019

Recently I have been benchmarking CachedOp performance and observed a regression in the results. Please see the table below:

| Instance   | Module API | CachedOp with static | CachedOp without static |
|------------|------------|----------------------|--------------------------|
| p2.8xlarge | 43ms       | 42ms                 | 51ms                     |
| p3.2xlarge | 11ms       | 19ms                 | 16ms                     |
| c5.4xlarge | 36ms       | 38ms                 | 42ms                     |

I would like to highlight the GPU performance comparison. On p2.8xlarge there is a performance gain with the flags set, but on p3.2xlarge there is a regression.

imported_net.hybridize(static_alloc = True, static_shape = True)

In theory, setting these two flags should give a performance boost since memory is reused. However, on the larger GPU instance it does not seem to perform well.
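
For reference, here is a minimal sketch (not from the original issue; the resnet18 stand-in, the input shape, and the timing helper are illustrative assumptions) of how the two CachedOp configurations in the table can be compared on the same block:

import time
import mxnet as mx
from mxnet import gluon, nd

def time_forward(net, data, runs=100):
  # Warm up once, then time steady-state forward passes
  net(data).wait_to_read()
  start = time.time()
  for _ in range(runs):
    net(data).wait_to_read()
  return (time.time() - start) * 1000 / runs

ctx = mx.cpu()  # switch to mx.gpu(0) to reproduce the GPU numbers
data = nd.random.uniform(shape=(1, 3, 224, 224), ctx=ctx)

# Hypothetical small model standing in for the imported ResNet-152
net = gluon.model_zoo.vision.resnet18_v1(pretrained=False)
net.initialize(ctx=ctx)

# CachedOp without static allocation
net.hybridize()
print('without static:', time_forward(net, data), 'ms')

# CachedOp with static allocation and static shape (re-hybridizing clears the old cached op)
net.hybridize(static_alloc=True, static_shape=True)
print('with static:', time_forward(net, data), 'ms')

The Module API column in the table would correspond to running the same symbol through mx.mod.Module instead; that path is omitted in this sketch.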

I used the nightly builds:

pip3 install mxnet-cu92mkl --pre
pip3 install mxnet-mkl --pre

Benchmark Script

import mxnet as mx
from mxnet import ndarray as nd
import numpy as np
import json, time, os
from mxnet import gluon

path='http://data.mxnet.io/models/imagenet/'
[mx.test_utils.download(path+'resnet/152-layers/resnet-152-0000.params'),
mx.test_utils.download(path+'resnet/152-layers/resnet-152-symbol.json'),
mx.test_utils.download(path+'synset.txt')]


def compute_stats(perf_results, results):
  results["average"] = np.average(perf_results)
  results['tp50'] = np.percentile(perf_results, 50)
  results['tp90'] = np.percentile(perf_results, 90)
  results['tp99'] = np.percentile(perf_results, 99)

# Select the benchmark device from the BENCHMARK_CTX environment variable ('GPU' or 'CPU')
ctx_str = os.environ.get('BENCHMARK_CTX', 'CPU')

if ctx_str == 'GPU':
  ctx = mx.gpu(0)
else:
  ctx = mx.cpu()

benchmark = {}

prefix = 'resnet-152'

# Model load time
t1 = time.time()
# Load the parameters onto the benchmark device so the inputs and weights share a context
imported_net = gluon.nn.SymbolBlock.imports(prefix + '-symbol.json', ['data', 'softmax_label'],
                                            prefix + '-0000.params', ctx=ctx)
t2 = time.time()
elapsed = (t2 - t1) * 1000

imported_net.hybridize(static_alloc = True, static_shape = True)

benchmark['ModelLoadTime'] = elapsed

fname = mx.test_utils.download('https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg?raw=true')
img = mx.image.imread(fname)


# convert into format (batch, RGB, width, height)
img = mx.image.imresize(img, 300, 300) # resize
img = img.transpose((2, 0, 1)) # Channel first
img = img.expand_dims(axis=0) # batchify
img = img.astype('float32')

# Move the inputs to the benchmark device so they match the network parameters
sf_label = nd.ones((1,), ctx=ctx)
img = img.as_in_context(ctx)

# First Inference
t1 = time.time()
op = imported_net(img, sf_label)
op.wait_to_read()
t2 = time.time()
elapsed = (t2 - t1) * 1000

benchmark['FirstInferCall'] = elapsed

times = 100
time_cost = []

for idx in range(0, times):
  t1 = time.time()
  op = imported_net(img, sf_label)
  op.wait_to_read()
  t2 = time.time()
  elapsed = (t2 - t1) * 1000
  time_cost.append(elapsed)
  print("time cost: ", elapsed, "ms")

# Extra cost of the first inference call relative to a steady-state call
benchmark['FirstInferOverhead'] = benchmark['FirstInferCall'] - time_cost[0]
compute_stats(time_cost, benchmark)

output = json.dumps(benchmark)

f = open('Inf.json', 'w')
f.write(output)
f.close()
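
For completeness, here is a small sketch (not part of the original script) for reading back the Inf.json the script writes; it assumes the file is in the working directory and that the script was launched with BENCHMARK_CTX set to GPU or CPU:

import json

# Load the benchmark results written by the script above
with open('Inf.json') as f:
  results = json.load(f)

# All values are wall-clock times in milliseconds
for key, value in results.items():
  print("{}: {:.2f} ms".format(key, value))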
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

@pengzhao-intel
Contributor

We have a recent PR that improves CachedOp, #14931, and I am not sure whether it causes this issue.
Do you mind giving it a try?

@ZhennanQin
Contributor

So the same benchmark with hybridize(static_alloc=True, static_shape=True) shows a different performance trend on different machines?

@lanking520
Member Author

So the same benchmark with hybridize(static_alloc=True, static_shape=True) shows a different performance trend on different machines?

Yeah, I suspect the problem is related to the GPU.

@lanking520
Member Author

We have a recent PR that improves CachedOp, #14931, and I am not sure whether it causes this issue.
Do you mind giving it a try?

Will do a test run on it

@sxjscience
Member

Has the issue been solved?
