
[Dependency Update] Upgrade cuDNN & NCCL #14884

Merged 2 commits into apache:master from stu1130:upgrade_cudnn_nccl on May 6, 2019

Conversation

stu1130 (Contributor) commented May 5, 2019

Description

Upgrade the CUDA 9.0/9.2/10.0 builds to the latest cuDNN 7.5.1 & NCCL 2.4.2
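
To double-check which versions a given build actually loads at runtime, here is a minimal sketch that calls the libraries' own version APIs via ctypes (cudnnGetVersion and ncclGetVersion are the C entry points cuDNN and NCCL expose; the shared-library file names are assumptions that depend on the install):

```python
# Query the cuDNN and NCCL versions loaded at runtime via their C APIs.
# Library names are assumptions; adjust paths for your install.
import ctypes

cudnn = ctypes.CDLL("libcudnn.so.7")
cudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("cuDNN:", cudnn.cudnnGetVersion())   # 7501 corresponds to 7.5.1

nccl = ctypes.CDLL("libnccl.so.2")
ver = ctypes.c_int()
nccl.ncclGetVersion(ctypes.byref(ver))     # fills ver, e.g. 2402 for 2.4.2
print("NCCL:", ver.value)
```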

Checklist

Ran three models: ResNet50 with ImageNet, LSTM with PTB, and MLP with MNIST
Performance shown below
Environment: P3.16xlarge Deep Learning Base AMI
Codebase: commit 1540a84
I also applied the #14837 PR change
The unit of throughput is samples per second
Each throughput number is the average of 5 runs (see the sketch below)
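
To make the table columns concrete, the numbers below are derived roughly as in this sketch; the per-run values here are made-up placeholders, and only the averaging and the percentage formula reflect how the tables are computed:

```python
# Average the samples/sec over the 5 runs, then compare the two builds.
def mean_throughput(runs):
    return sum(runs) / len(runs)

def perf_diff_pct(new_build, old_build):
    # Relative change of the new build vs. the old one, in percent.
    return (new_build - old_build) / old_build * 100.0

# Placeholder per-run throughputs; real values come from the runs below.
new_avg = mean_throughput([2830.2, 2833.1, 2829.8, 2834.0, 2830.6])
print(f"{perf_diff_pct(new_avg, 2821.9832):+.3f}%")
```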

ResNet

model: Resnet50
dataset: Imagenet
number of gpu: 8
epochs: 3 (only to test throughput)
preprocess command: sudo pip install gluoncv==0.2.0b20180625
command: python mxnet_benchmark/train_imagenet.py --use-rec --batch-size 128 --dtype float32 --num-data-workers 40 --num-epochs 3 --gpus 0,1,2,3,4,5,6,7 --lr 0.05 --last-gamma --mode symbolic --model resnet50_v1b --rec-train /home/ubuntu/data/train-passthrough.rec --rec-train-idx /home/ubuntu/data/train-passthrough.idx --rec-val /home/ubuntu/data/val-passthrough.rec --rec-val-idx /home/ubuntu/data/val-passthrough.idx
github repo: https://github.com/rahul003/deep-learning-benchmark-mirror.git
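
As a side note, the resnet50_v1b network the command above trains can be pulled straight from the GluonCV model zoo; a minimal sketch (the full training loop lives in train_imagenet.py from the repo above):

```python
# Fetch the same network the benchmark script trains and initialize it
# across the 8 GPUs of the P3.16xlarge.
import mxnet as mx
from gluoncv.model_zoo import get_model

net = get_model('resnet50_v1b', classes=1000)
net.initialize(mx.init.Xavier(), ctx=[mx.gpu(i) for i in range(8)])
```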

| Throughput (samples/sec) | cuDNN 7.5.1 / NCCL 2.4.2 | cuDNN 7.3.1 / NCCL 2.3.4 | Performance Difference |
| --- | --- | --- | --- |
| CUDA 10 | 2831.54405 | 2821.9832 | 0.339% |
| CUDA 9.2 | 2832.36803 | 2843.28968 | -0.384% |
| CUDA 9.0 | 2815.83939 | 2851.92915 | -1.265% |

Note: there is another performance regression with --batch-size 256 --dtype float16 --mode hybrid; please find more details in #14838

LSTM

model: LSTM
dataset: PTB(Penn Treebank)
number of gpu: 1
epochs: 10
command:
python2 benchmark_driver.py --framework mxnet --task-name mkl_lstm_ptb_symbolic --num-gpus 1 --epochs 10 --metrics-suffix test --kvstore local
python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 0 --epochs 10 --kv-store local
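
For reference, a minimal Gluon sketch of a comparable word-level LSTM language model; the benchmark itself runs the symbolic lstm_bucketing.py script, and the vocabulary size and layer count here are assumptions:

```python
# Roughly the shape of the benchmarked model: 650-d word embeddings into
# a 650-unit LSTM, projected back onto the vocabulary.
import mxnet as mx
from mxnet import gluon

vocab_size = 10000  # assumption: PTB has roughly 10k word types
net = gluon.nn.Sequential()
net.add(gluon.nn.Embedding(vocab_size, 650),              # --num-embed 650
        gluon.rnn.LSTM(650, num_layers=2, layout='NTC'),  # --num-hidden 650
        gluon.nn.Dense(vocab_size, flatten=False))
net.initialize(ctx=mx.gpu(0))
```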

| Throughput (samples/sec) | cuDNN 7.5.1 / NCCL 2.4.2 | cuDNN 7.3.1 / NCCL 2.3.4 | Performance Difference |
| --- | --- | --- | --- |
| CUDA 10 | 847.98222 | 868.28966 | -2.339% |
| CUDA 9.2 | 1005.25185 | 1051.06692 | -4.359% |
| CUDA 9.0 | 1002.59081 | 1028.46962 | -2.516% |

CUDA 10 has a performance regression; please see #14725 for more details.

MLP

model: 3 dense layers with num_hidden=64 and relu as activation
dataset: MNIST
number of gpu: 1
epochs: 10
command:
python2 benchmark_runner.py --framework mxnet --metrics-policy mlp --task-name mlp --metrics-suffix test --num-gpus 1 --command-to-execute 'python3 mlp.py' --data-set mnist
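
The model description above maps to a very small Gluon network; a minimal sketch, assuming the third dense layer is a 10-way output head for the MNIST classes (the benchmark's mlp.py is the authoritative definition):

```python
# One reading of the description: two 64-unit relu layers plus a dense
# output head; the 10-way output width is an assumption (MNIST classes).
import mxnet as mx
from mxnet import gluon

net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(64, activation='relu'),
        gluon.nn.Dense(64, activation='relu'),
        gluon.nn.Dense(10))
net.initialize(ctx=mx.gpu(0))
```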

| Throughput (samples/sec) | cuDNN 7.5.1 / NCCL 2.4.2 | cuDNN 7.3.1 / NCCL 2.3.4 | Performance Difference |
| --- | --- | --- | --- |
| CUDA 10 | 4192.20685 | 4094.76838 | 2.38% |
| CUDA 9.2 | 4212.68214 | 4280.69164 | -1.589% |
| CUDA 9.0 | 4232.10159 | 4273.43268 | -0.967% |

Comments

@szha @lanking520 @eric-haibin-lin

@stu1130 stu1130 requested a review from szha as a code owner May 5, 2019 23:52
@anirudhacharya (Member) commented:
@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review label May 6, 2019
@szha szha merged commit 0255dd6 into apache:master May 6, 2019
perdasilva added a commit to perdasilva/incubator-mxnet that referenced this pull request May 7, 2019
szha pushed a commit that referenced this pull request May 8, 2019
access2rohit pushed two commits to access2rohit/incubator-mxnet that referenced this pull request May 14, 2019
haohuanw pushed two commits to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019

Commit messages:
* upgrade cuDNN & NCCL
* retrigger CI
@stu1130 stu1130 deleted the upgrade_cudnn_nccl branch January 12, 2020 01:30