This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

DOT product too slow on CPU and GPU compared to np and pytorch #17971

Closed
djaym7 opened this issue Apr 4, 2020 · 7 comments

djaym7 commented Apr 4, 2020

Given below are timings for CPU-np, mxnet-CPU, mxnet-GPU, pytorch-CPU, and pytorch-GPU.
CPU times for numpy (np) and pytorch are comparable, but mxnet is crazy slow.
GPU time on pytorch is way faster than mxnet.

hardware/software : sagemaker p3.2x latest (1.6.0 cu102mkl)

[screenshots: timing results for numpy, mxnet, and pytorch]
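The comparison above can be reproduced with a sketch along these lines (numpy only, as the baseline; the mxnet and pytorch variants follow the same pattern with their own array types, and GPU timings additionally need an explicit synchronization before stopping the timer):

```python
# Minimal dot-product benchmark sketch. Names and sizes here are
# illustrative, not taken from the original screenshots.
import time
import numpy as np

def bench_dot(n, iters=100):
    """Return the mean seconds per n x n float32 matrix product."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a.dot(b)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(iters):
        a.dot(b)
    return (time.perf_counter() - start) / iters

for n in (128, 512, 2048):
    print(f"{n}x{n}: {bench_dot(n) * 1e3:.3f} ms")
```

For asynchronous frameworks the result must be materialized inside the timed loop (e.g. mxnet's `nd.waitall()` or a blocking read), otherwise only kernel launch time is measured.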

@pengzhao-intel
Contributor

@anko-intel please take a look at this issue, thanks.

@anko-intel
Contributor

Hi @djaym7
I tried to reproduce the issue locally on a Skylake-X i9-7920X, but I am not sure I observed the same behavior. Could you run the following Python scripts in your environment and upload the results?
test_dot_issue.py.txt
test_dot_issue_logs.py.txt

@djaym7
Author

djaym7 commented May 11, 2020

@anko-intel here are the results
[screenshots: results of the test scripts]

@leezu
Contributor

leezu commented May 11, 2020

Any relation to #17980 on the CPU side? Are you comparing against MKL-enabled np / pytorch?

@djaym7
Author

djaym7 commented May 12, 2020

np is not using mkl
[screenshot: numpy build config]

Pytorch is using mkl
[screenshot: pytorch build config]
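The check shown in those screenshots can be done from a Python prompt; this is a sketch of the numpy side (the pytorch equivalent is noted below):

```python
# Print the BLAS/LAPACK configuration numpy was built against.
# An MKL-linked build lists mkl libraries in this output; an
# OpenBLAS build lists openblas instead.
import numpy as np

np.show_config()
```

For pytorch, `print(torch.__config__.show())` similarly reports whether the build includes MKL.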

@anko-intel
Contributor

Hi @djaym7
Thank you for your results. I observe a similar pattern in the results measured locally on a Skylake-X i9-7920X with the MXNet 1.6.0 cu102mkl binary. The only exception is the time for the 512x512 tensor on MXNet(?).
MXNet compiled from the master branch (at b214477 - fix (#18313)) uses MKL if available, and the results are much better, but MXNet is still slower than NumPy for smaller tensors.
[screenshot: results for the MKL-enabled master build]

Additional measurements on master with the MXNet Profiler enabled show that more than 80 us is spent between the Python call and the time recorded by the Profiler for the dot operation.
This seems to be an already known issue (#14883 and #17097) regarding crossing the Python/C++ boundary. To me it sounds like fixing the Python-MXNet binding overhead should also fix this issue.
[screenshot: profiler measurements]
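A fixed per-call binding overhead of this kind can be estimated without a profiler: for very small operands the arithmetic is negligible, so the measured time per call approximates the constant Python-to-C overhead. A rough sketch (numpy used here as a stand-in; the same idea applies to any framework):

```python
# Estimate fixed per-call overhead by comparing tiny and large dots.
# For the 4x4 case almost all of the measured time is call overhead;
# for the 512x512 case it is dominated by the actual GEMM work.
import time
import numpy as np

def time_dot(n, iters=1000):
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    start = time.perf_counter()
    for _ in range(iters):
        a.dot(b)
    return (time.perf_counter() - start) / iters

small = time_dot(4)
large = time_dot(512, iters=20)
print(f"~{small * 1e6:.1f} us per tiny call, {large * 1e6:.1f} us per 512x512 call")
```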

The results in the table below, neglecting measurement noise, show that the differences between the time measured in Python and in MKL are almost the same as those between Python and the MXNet Profiler, which confirms the Python <-> C++ API issue.
[screenshot: table comparing Python, Profiler, and MKL times]

The last table shows results for MXNet with both the profiler and MKL verbose mode enabled (which adds extra time to both measurements). The difference between the Python time and the profiler time is similar to the previous tables and remains the most significant component.
[screenshot: table of results with profiler and MKL verbose enabled]
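For reference, MKL's verbose mode prints one line per BLAS call with its arguments and elapsed time, which is how per-call MKL timings like those in the tables can be collected. A minimal invocation (the script name here is a placeholder):

```shell
# MKL_VERBOSE=1 makes any MKL-linked binary log each BLAS call
# (function, shapes, and time) to stdout.
MKL_VERBOSE=1 python your_benchmark.py
```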

The exact results of my measurements can be found in the logs: dot_issue_logs.zip

@pengzhao-intel
Contributor

@djaym7 please review whether @anko-intel answered your question :)

@djaym7 djaym7 closed this as completed Jun 3, 2020