Performance Regression on CUDA10 #14725
Comments
Hey, this is the MXNet Label Bot.
@mxnet-label-bot add [performance, cuda]
Good catch, that's a big drop. Would you be able to run nvprof --summary before each run and paste the output here?
@KellenSunderland I was looking at this with @stu1130; from the MXNet profiler it looked like fully_connected_backward was taking more time. It could be that the cuBLAS GEMM (called internally by FC backward) regressed for certain input shapes. Agreed, nvprof would be worth looking at.
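For reference, here is a minimal sketch of how per-operator timings like fully_connected_backward can be collected with the MXNet profiler; the shapes, file name, and the standalone FullyConnected call are illustrative, not the exact benchmark used above.

```python
import mxnet as mx

# Configure the profiler before running the workload; aggregate_stats lets us
# dump a per-operator summary at the end (the file name is just an example).
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='fc_profile.json')

ctx = mx.gpu(0)
# Illustrative shapes only, not the shapes from this issue.
data = mx.nd.random.uniform(shape=(32, 512), ctx=ctx)
weight = mx.nd.random.uniform(shape=(1024, 512), ctx=ctx)
bias = mx.nd.random.uniform(shape=(1024,), ctx=ctx)
for arr in (data, weight, bias):
    arr.attach_grad()

mx.profiler.set_state('run')
with mx.autograd.record():
    out = mx.nd.FullyConnected(data, weight, bias, num_hidden=1024)
out.backward()
mx.nd.waitall()                 # finish all queued GPU work before stopping
mx.profiler.set_state('stop')

# Look for FullyConnected / _backward_FullyConnected in the aggregated stats.
print(mx.profiler.dumps())
```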
Good investigation so far, your explanation makes sense @anirudh2290 and @stu1130. I think that's enough information to get NVIDIA started on verifying the regression and looking for root causes and a fix. If we wanted to investigate further, there is one more step we could run to really lock down exactly why this kernel regressed. I've found API logging from cuDNN and cuBLAS to be quite useful. If you want to take a look, we could re-run without nvprof and add the env vars:
export CUBLAS_LOGINFO_DBG=1
export CUBLAS_LOGDEST_DBG=/tmp/cublas_api_logs.txt
Then grep /tmp/cublas_api_logs.txt for volta_sgemm_128x64_nt calls and see how many parameter variants we call that kernel with. We could then create a minimal reproducible call to cuBLAS that uses those params (hopefully there aren't many of them) and show the arguments for which the cuBLAS lib regresses.
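A hedged sketch of how this could be driven from Python: set the logging variables before MXNet (and hence cuBLAS) is loaded, run the workload, then list the distinct GEMM-related calls in the log. The log path reuses the one above; the line format and the substring filter are assumptions, so adjust as needed.

```python
import os

# cuBLAS picks these up from the environment, so set them before importing mxnet.
os.environ['CUBLAS_LOGINFO_DBG'] = '1'
os.environ['CUBLAS_LOGDEST_DBG'] = '/tmp/cublas_api_logs.txt'

import mxnet as mx

# ... run the workload here (e.g. one epoch of the LSTM benchmark) ...

# Afterwards, collect the distinct gemm-related API calls that were logged.
# The exact log line format is an assumption; tweak the filter as needed.
log_path = '/tmp/cublas_api_logs.txt'
seen = set()
if os.path.exists(log_path):
    with open(log_path) as f:
        for line in f:
            if 'gemm' in line.lower():
                seen.add(line.strip())
for entry in sorted(seen):
    print(entry)
```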
Thanks for the useful suggestion @KellenSunderland. @stu1130 obtained the logs and didn't see any volta_sgemm_128x64_nt calls. We do see cublasSgemmEx_internal calls, and we tried to grep for a configuration of m=128 n=64 but weren't able to find anything. I am not sure which cuBLAS call maps to this kernel (volta_sgemm_128x64_nt), and we have also asked NVIDIA for help. Let us know if you have any ideas.
Update on what @anirudh2290 and I have done so far.
And we got the result:
Re-ran the minimal reproducible script shown above with the shape combination data: (32, 2600), weight: (32, 650), bias: (2600, 650). Set num = 100000 and got the result along with nvprof -s.
We found that volta_sgemm_128x64_nt on CUDA 10 took almost 3 times longer than on CUDA 9.2. The reason the total time is similar is that volta_sgemm_32x32_sliced1x4_nn takes most of the execution time here, which is not the case in the LSTM.
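The minimal reproducible script itself is not included in this thread; below is a sketch of what such a FullyConnected forward/backward micro-benchmark might look like. The shapes and iteration count are placeholders loosely based on the comment above, and the script would be run under nvprof -s on each CUDA build to compare the per-kernel summaries (e.g. the volta_sgemm_128x64_nt time).

```python
import time
import mxnet as mx

ctx = mx.gpu(0)
num = 100000                     # iteration count mentioned above

# Placeholder shapes; substitute the exact data/weight/bias shapes from the report.
data = mx.nd.random.uniform(shape=(32, 650), ctx=ctx)
weight = mx.nd.random.uniform(shape=(2600, 650), ctx=ctx)
bias = mx.nd.random.uniform(shape=(2600,), ctx=ctx)
for arr in (data, weight, bias):
    arr.attach_grad()

mx.nd.waitall()
start = time.time()
for _ in range(num):
    with mx.autograd.record():
        out = mx.nd.FullyConnected(data, weight, bias, num_hidden=2600)
    out.backward()
mx.nd.waitall()                  # drain the async engine before stopping the clock
print('elapsed: %.3f s' % (time.time() - start))
```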
I am using mxnet-cu100 for CUDA 10, but I am seeing that irrespective of using mx.cpu() or mx.gpu(), the system always ends up using the GPU. Does anyone know why?
Tested the same model with CUDA 10.1 and got around 1027.52625 samples/sec throughput, which is similar to the 1005.25185 we got on CUDA 9.2.
After syncing up with NVIDIA, the solution is to either upgrade to CUDA 10.1 or downgrade to CUDA 9.2.
Description
I have observed a performance regression on an LSTM model. The performance drops from around 950 samples/sec to around 780 samples/sec. I compared the results between mxnet-cu100mkl and mxnet-cu92mkl; both are 1.4.0 releases, not nightly builds.
Environment info (Required)
Steps to reproduce
(Paste the commands you ran that produced the error.)
Sample result
cu100mkl result:
INFO:root:{'mxnet.mkl_lstm_ptb_symbolic.validation_perplexity.test': 340.001303, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_std.test': 632.3600240369404, 'mxnet.mkl_lstm_ptb_symbolic.speed.test': 787.0409230769234, 'mxnet.mkl_lstm_ptb_symbolic.total_training_time.test': 534.887, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_mean.test': 1131.2, 'mxnet.mkl_lstm_ptb_symbolic.cpu_memory_usage.test': 796, 'mxnet.mkl_lstm_ptb_symbolic.train_perplexity.test': 361.069259, 'mxnet.mkl_lstm_ptb_symbolic.uptime_in_seconds.test': 4132.82, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_max.test': 1414.0}
cu92mkl result:
INFO:root:{'mxnet.mkl_lstm_ptb_symbolic.validation_perplexity.test': 328.856996, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_std.test': 633.2544512279404, 'mxnet.mkl_lstm_ptb_symbolic.speed.test': 916.6658846153846, 'mxnet.mkl_lstm_ptb_symbolic.total_training_time.test': 458.66499999999996, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_mean.test': 1132.8, 'mxnet.mkl_lstm_ptb_symbolic.cpu_memory_usage.test': 736, 'mxnet.mkl_lstm_ptb_symbolic.train_perplexity.test': 351.549512, 'mxnet.mkl_lstm_ptb_symbolic.uptime_in_seconds.test': 9839.61, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_max.test': 1416.0}
More data points
G3dn.16xlarge - cu100mkl average throughput: 445
G3dn.16xlarge - cu92mkl average throughput: 444
On G3dn (Tesla M60) I don't see the performance regression.
P3.16xlarge - cu100 average throughput: 779
P3.16xlarge - cu92 average throughput: 930
Update suggested by @KellenSunderland
nvprof --profile-child-processes -s python2 benchmark_driver.py --framework mxnet --task-name mkl_lstm_ptb_symbolic --num-gpus 1 --epochs 1 --metrics-suffix test --kvstore local > nvprof_output_cu100mkl 2>&1
cu100mkl
cu92mkl
MXNet Profiler API result for 1 epoch
cu100mkl
cu92mkl