This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

DOT product too slow on CPU and GPU compared to np and pytorch #17971

Closed
djaym7 opened this issue Apr 4, 2020 · 7 comments

djaym7 commented Apr 4, 2020

Given below are timings for CPU-np, mxnet-CPU, mxnet-GPU, pytorch-CPU, and pytorch-GPU.
CPU times for numpy (np) and pytorch are comparable, but mxnet is crazy slow.
GPU time on pytorch is way faster than mxnet.

hardware/software : sagemaker p3.2x latest (1.6.0 cu102mkl)

[screenshots: timing results for numpy, mxnet, and pytorch]
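The comparison above can be reproduced with a sketch along these lines (numpy only, as the baseline; the mxnet and pytorch variants follow the same pattern with their own array types, and GPU timings additionally need an explicit synchronization before stopping the timer):

```python
# Minimal dot-product benchmark sketch. Names and sizes here are
# illustrative, not taken from the original screenshots.
import time
import numpy as np

def bench_dot(n, iters=100):
    """Return the mean seconds per n x n float32 matrix product."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a.dot(b)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(iters):
        a.dot(b)
    return (time.perf_counter() - start) / iters

for n in (128, 512, 2048):
    print(f"{n}x{n}: {bench_dot(n) * 1e3:.3f} ms")
```

For asynchronous frameworks the result must be materialized inside the timed loop (e.g. mxnet's `nd.waitall()` or a blocking read), otherwise only kernel launch time is measured.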

@pengzhao-intel
Contributor

@anko-intel please take a look at this issue, thanks.

@anko-intel
Contributor

Hi @djaym7
I tried to reproduce the issue locally on a Skylake-X i9-7920X, but I am not sure I observed the same behavior. Could you run the following Python scripts in your environment and upload the results?
test_dot_issue.py.txt
test_dot_issue_logs.py.txt

@djaym7
Author

djaym7 commented May 11, 2020

@anko-intel here are the results
[screenshots: results of the test scripts]

@leezu
Contributor

leezu commented May 11, 2020

Any relation to #17980 on the CPU side? Are you comparing against MKL-enabled np / pytorch?

@djaym7
Author

djaym7 commented May 12, 2020

np is not using mkl
[screenshot: numpy build config]

Pytorch is using mkl
[screenshot: pytorch build config]
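The check shown in those screenshots can be done from a Python prompt; this is a sketch of the numpy side (the pytorch equivalent is noted below):

```python
# Print the BLAS/LAPACK configuration numpy was built against.
# An MKL-linked build lists mkl libraries in this output; an
# OpenBLAS build lists openblas instead.
import numpy as np

np.show_config()
```

For pytorch, `print(torch.__config__.show())` similarly reports whether the build includes MKL.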

@anko-intel
Contributor

Hi @djaym7
Thank you for your results. I observe a similar pattern in the results measured locally on a Skylake-X i9-7920X with the MXNet 1.6.0 cu102mkl binary. The only exception is the time for the 512x512 tensor on MXNet(?).
MXNet compiled from the master branch (at b214477 - fix (#18313)) uses MKL if available, and the results are much better, but MXNet is still slower than NumPy for smaller tensors.
[screenshot: results for the MKL-enabled master build]

Additional measurements on master with the MXNet Profiler enabled show that more than 80 us is spent between the Python call and the time recorded by the Profiler for the dot operation.
This seems to be an already known issue (#14883 and #17097) regarding crossing the Python/C++ boundary. To me it sounds like fixing the Python-MXNet binding overhead should also fix this issue.
[screenshot: profiler measurements]
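A fixed per-call binding overhead of this kind can be estimated without a profiler: for very small operands the arithmetic is negligible, so the measured time per call approximates the constant Python-to-C overhead. A rough sketch (numpy used here as a stand-in; the same idea applies to any framework):

```python
# Estimate fixed per-call overhead by comparing tiny and large dots.
# For the 4x4 case almost all of the measured time is call overhead;
# for the 512x512 case it is dominated by the actual GEMM work.
import time
import numpy as np

def time_dot(n, iters=1000):
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    start = time.perf_counter()
    for _ in range(iters):
        a.dot(b)
    return (time.perf_counter() - start) / iters

small = time_dot(4)
large = time_dot(512, iters=20)
print(f"~{small * 1e6:.1f} us per tiny call, {large * 1e6:.1f} us per 512x512 call")
```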

The results in the table below, neglecting measurement noise, show that the differences between the time measured in Python and in MKL are almost the same as those between Python and the MXNet Profiler, which confirms the Python <-> C++ API issue.
[screenshot: table comparing Python, Profiler, and MKL times]

The last table shows results for MXNet with both the profiler and MKL verbose mode enabled (which adds extra time to both measurements). The difference between the Python time and the profiler time is similar to the previous tables and remains the most significant component.
[screenshot: table of results with profiler and MKL verbose enabled]
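For reference, MKL's verbose mode prints one line per BLAS call with its arguments and elapsed time, which is how per-call MKL timings like those in the tables can be collected. A minimal invocation (the script name here is a placeholder):

```shell
# MKL_VERBOSE=1 makes any MKL-linked binary log each BLAS call
# (function, shapes, and time) to stdout.
MKL_VERBOSE=1 python your_benchmark.py
```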

The exact results of my measurements can be found in the logs: dot_issue_logs.zip

@pengzhao-intel
Contributor

@djaym7 please review whether @anko-intel answered your question :)

@djaym7 djaym7 closed this as completed Jun 3, 2020