This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Significant slowdown in some DGL models #16603

Open

zheng-da opened this issue Oct 24, 2019 · 14 comments

Comments

@zheng-da
Contributor

I recently compared the performance of DGL KGE models on MXNet 1.5 and the current master branch, and noticed a significant slowdown. On MXNet 1.5 it takes 12 seconds to run 1000 batches; on master it now takes 20 seconds. After some profiling, it seems there is no slowdown in the operators themselves.

To reproduce the problem, install DGL 0.4 and get the DGL KGE package by cloning the DGL repo; the package is under apps/kg. Then run the following command:

DGLBACKEND=mxnet python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
    --neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 \
    --batch_size_eval 16 --gpu 0 --valid --test -adv
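
A quick environment check before launching train.py can save a round trip on version questions later in this thread. The snippet below is a minimal sketch, not part of the original repro steps; it only assumes DGL and an MXNet wheel are already installed:

import os
os.environ.setdefault('DGLBACKEND', 'mxnet')   # DGL picks its backend at import time

import mxnet as mx
import dgl

print('mxnet:', mx.__version__)
print('dgl:', dgl.__version__)
print('visible GPUs:', mx.context.num_gpus())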
@zheng-da
Contributor Author

Here is the profiling result on MXNet 1.5:
[screenshot: profiler output on MXNet 1.5]

Here is the profiling result on MXNet 1.6:
[screenshot: profiler output on MXNet 1.6]
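
For anyone trying to regenerate operator-level profiles like the ones above, MXNet's built-in profiler can dump aggregated per-operator timings. A minimal sketch follows; the DGL KGE training loop itself is elided and the output filename is arbitrary:

import mxnet as mx

mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='kge_profile.json')
mx.profiler.set_state('run')

# ... run a fixed number of training batches here ...

mx.nd.waitall()                      # wait for all asynchronous work to finish
mx.profiler.set_state('stop')
print(mx.profiler.dumps())           # aggregated per-operator statistics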

@anirudh2290
Member

Which commit on 1.6?

@ChaiBapchya
Contributor

If there is no slowdown in the operators, but the models suffer significantly, what's your guess as to the cause? Any direction we can start deep diving?

@access2rohit
Contributor

access2rohit commented Oct 28, 2019

@zheng-da can you share the output of
python -c "from mxnet.runtime import Features; print(Features())"
for both mxnet 1.5 and mxnet 1.6?
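
For what it's worth, the same runtime check can also be queried flag by flag, which makes diffing two installs easier. A small sketch, using feature names that appear in the Features() listing:

from mxnet.runtime import Features

feats = Features()
for name in ('CUDA', 'CUDNN', 'MKLDNN', 'BLAS_MKL', 'OPENMP'):
    print(name, feats.is_enabled(name))   # True if the wheel was built with this flag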

@ddavydenko
Contributor

@zachgk, please assign to @access2rohit?

@anirudh2290
Member

@access2rohit, @ChaiBapchya, are you guys looking at this?

@ChaiBapchya
Contributor

Waiting on @zheng-da for some details
@mxnet-label-bot add [Pending requester info]

@anirudh2290
Member

@ChaiBapchya it would be worth checking whether this PR caused the slowdown: #16526.
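
Since the operators themselves show no slowdown, one rough way to check whether a change like that added per-call framework overhead is to time a loop of very cheap imperative ops under both wheels. The sketch below is only an illustration (shapes and iteration counts are arbitrary), not something that was actually run in this thread:

import time
import mxnet as mx

ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()
x = mx.nd.ones((8, 8), ctx=ctx)      # tiny tensor so dispatch cost dominates

mx.nd.waitall()
start = time.time()
for _ in range(10000):
    x = x + 1                        # cheap elementwise op issued 10k times
mx.nd.waitall()
print(mx.__version__, '%.3f s for 10000 small ops' % (time.time() - start))

Running the same script under mxnet 1.5.1 and the 1.6 nightly should make a pure dispatch-overhead regression visible even when individual operator timings look unchanged.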

@ChaiBapchya
Contributor

Tried reproducing and got a segfault at the very end of the run.

Build flags

[✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✖ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✖ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✔ DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✔ SIGNAL_HANDLER, ✖ DEBUG]

DGL & MXNet versions

python3 -c "import mxnet;import dgl;print(mxnet.__version__);print(dgl.__version__)"
1.5.1
0.4

Log

/workspace/dgl_issue/dgl/apps/kg$ DGLBACKEND=mxnet python3 train.py --model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 --batch_size_eval 16 --gpu 0 --valid --test -adv
Logs are being recorded at: ckpts/DistMult_FB15k_1/train.log
File not found. Downloading from https://s3.us-east-2.amazonaws.com/dgl.ai/dataset/FB15k.zip
8it [00:00, 11.36it/s]
Download finished. Unzipping the file...
Unzip finished.
|Train|: 483142
|valid|: 50000
|test|: 59071
eval on 50000 edges
eval on 50000 edges
eval on 59071 edges
eval on 59071 edges
[Train](0/100000) average pos_loss: 0.6942696571350098
[Train](0/100000) average neg_loss: 0.7048686146736145
[Train](0/100000) average loss: 0.6995691061019897
[Train](0/100000) average regularization: 0.05789175629615784
0.23716044425964355
[Train](1000/100000) average pos_loss: 0.2992950127273798
[Train](1000/100000) average neg_loss: 0.571561429977417
[Train](1000/100000) average loss: 0.43542822140455245
[Train](1000/100000) average regularization: 0.12885531912744044
16.814379453659058

...
Trains for rest of the steps

[Train](99000/100000) average pos_loss: 0.18922672630846502
[Train](99000/100000) average neg_loss: 0.3200621349364519
[Train](99000/100000) average loss: 0.2546444304138422
[Train](99000/100000) average regularization: 0.10238696759194135
13.827665090560913
training takes 1850.2152254581451 seconds
Test average MRR at [99999/100000]: 0.7777071453668755
Test average MR at [99999/100000]: 44.82855377427164
Test average HITS@1 at [99999/100000]: 0.7043134533019587
Test average HITS@3 at [99999/100000]: 0.8327182543041425
Test average HITS@10 at [99999/100000]: 0.8948722723502226

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2e6b140) [0x7f933e364140]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f9414ae14b0]
  [bt] (2) /usr/local/cuda/lib64/libcudart.so.10.0(+0x1d9fe) [0x7f94111fd9fe]
  [bt] (3) /usr/local/cuda/lib64/libcudart.so.10.0(+0x2296b) [0x7f941120296b]
  [bt] (4) /usr/local/cuda/lib64/libcudart.so.10.0(cudaSetDevice+0x47) [0x7f941122a087]
  [bt] (5) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25bf40a) [0x7f933dab840a]
  [bt] (6) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25cc21e) [0x7f933dac521e]
  [bt] (7) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b4d8e) [0x7f933daadd8e]
  [bt] (8) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25b5cd4) [0x7f933daaecd4]

@access2rohit
Contributor

@zheng-da can you share the build flags for the MXNet 1.6.x/master build that you used?
Also, was it mxnet 1.5.1 or 1.5.0 from PyPI? Can you confirm?
For the flags, you can run the following command directly from bash:
python -c "from mxnet.runtime import Features; print(Features())"

@access2rohit
Contributor

@anirudh2290 We need to know the exact build flags used for 1.6.x before we start our deep dive.
@ChaiBapchya let's wait for @zheng-da's reply.

@zheng-da
Contributor Author

I tested with the nightly build, so the build flags are whatever the nightly build usually ships with.

@access2rohit
Contributor

access2rohit commented Oct 31, 2019

@zheng-da which version was it, cu100 or cu101? Was it with or without MKL? It would be much easier if you could simply give us the output of:
python -c "from mxnet.runtime import Features; print(Features())"
Can you also clarify whether it was 1.5.0 or 1.5.1 that was used for the comparison?

@zheng-da
Contributor Author

zheng-da commented Nov 4, 2019

I just tried the experiment again and had no problem running it. The command to run the experiment:

python3 train.py --model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 2000 --gpu 0

You can use the following commands to install MXNet; the slowdown is very easy to reproduce. You can install the MKLDNN version if you want; it makes no difference.

pip3 install mxnet-cu100
pip3 install --pre mxnet-cu100
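
For clarity, the first command installs the latest stable mxnet-cu100 wheel from PyPI (1.5.x at the time of this issue), while the --pre flag pulls the nightly pre-release build from the 1.6 line, so the two commands give the two builds being compared here.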
