This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Training speed in 1.4.0 is slower than 1.3.1 with a large number of classes #14790

Closed
fullfanta opened this issue Apr 25, 2019 · 4 comments


@fullfanta
Contributor


Description

Comparing 1.3.1 and 1.4.0, training in 1.4.0 is slower than in 1.3.1 with a large number of classes (for example, 80000).

Environment info (Required)

----------Python Info----------
Version : 3.5.2
Compiler : GCC 5.4.0 20160609
Build : ('default', 'Nov 12 2018 13:43:14')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 10.0.1
Directory : /hanmail/.local/lib/python3.5/site-packages/pip
----------MXNet Info-----------
Version : 1.4.0
Directory : /usr/local/lib/python3.5/dist-packages/mxnet
Commit Hash : a03d59e
----------System Info----------
Platform : Linux-4.4.0-78-generic-x86_64-with-Ubuntu-16.04-xenial
system : Linux
node : *
release : 4.4.0-78-generic
version : #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2195.156
BogoMIPS: 4392.89
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts

Package used (Python/R/Scala/Julia):
python3.5

Build info (Required if built from source)

I installed through pip3.
pip3 install mxnet-cu92==1.3.1
pip3 install mxnet-cu92==1.4.0

Minimum reproducible example

I used the ImageNet classification benchmark (https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/train_imagenet.py).

The default network is ResNet with 50 layers, and the number of classes is 80000.

The command is simply:

python3 train_imagenet.py --benchmark=1 --gpus=0,1,2,3,4,5,6,7 --batch-size=1024 --num-classes=80000

The following result is from 1.4.0:

INFO:root:start with arguments Namespace(batch_size=1024, benchmark=1, brightness=0, contrast=0, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', fill_value=127, gc_threshold=0.5, gc_type='none', gpus='0,1,2,3,4,5,6,7', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_crop_size=-1, max_random_area=1, max_random_aspect_ratio=0, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1, max_random_shear_ratio=0, min_crop_size=-1, min_random_area=1, min_random_aspect_ratio=None, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=80000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, pca_noise=0, profile_server_suffix='', profile_worker_suffix='', random_crop=0, random_mirror=0, random_resized_crop=0, rgb_mean='123.68,116.779,103.939', rgb_std='1,1,1', saturation=0, save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
[11:11:31] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [0-20]	Speed: 751.94 samples/sec	accuracy=0.002046
INFO:root:Epoch[0] Batch [20-40]	Speed: 747.25 samples/sec	accuracy=0.466602
INFO:root:Epoch[0] Batch [40-60]	Speed: 749.05 samples/sec	accuracy=0.982373
INFO:root:Epoch[0] Batch [60-80]	Speed: 752.68 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [80-100]	Speed: 746.27 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [100-120]	Speed: 747.09 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [120-140]	Speed: 745.22 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [140-160]	Speed: 751.97 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [160-180]	Speed: 741.86 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [180-200]	Speed: 746.97 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [200-220]	Speed: 750.02 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [220-240]	Speed: 747.36 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [240-260]	Speed: 745.85 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [260-280]	Speed: 749.36 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [280-300]	Speed: 751.17 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [300-320]	Speed: 747.06 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [320-340]	Speed: 752.16 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [340-360]	Speed: 752.29 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [360-380]	Speed: 751.47 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [380-400]	Speed: 751.15 samples/sec	accuracy=1.000000

The following result is from 1.3.1:

INFO:root:start with arguments Namespace(batch_size=1024, benchmark=1, brightness=0, contrast=0, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', fill_value=127, gc_threshold=0.5, gc_type='none', gpus='0,1,2,3,4,5,6,7', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_crop_size=-1, max_random_area=1, max_random_aspect_ratio=0, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1, max_random_shear_ratio=0, min_crop_size=-1, min_random_area=1, min_random_aspect_ratio=None, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=80000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, pca_noise=0, profile_server_suffix='', profile_worker_suffix='', random_crop=0, random_mirror=0, random_resized_crop=0, rgb_mean='123.68,116.779,103.939', rgb_std='1,1,1', saturation=0, save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
[11:23:20] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [20]	Speed: 1018.92 samples/sec	accuracy=0.001442
INFO:root:Epoch[0] Batch [40]	Speed: 1019.44 samples/sec	accuracy=0.464893
INFO:root:Epoch[0] Batch [60]	Speed: 1020.81 samples/sec	accuracy=0.997559
INFO:root:Epoch[0] Batch [80]	Speed: 1021.99 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [100]	Speed: 1021.33 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [120]	Speed: 1020.30 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [140]	Speed: 1023.01 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [160]	Speed: 1025.59 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [180]	Speed: 1023.41 samples/sec	accuracy=1.000000
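For a rough comparison, averaging the samples/sec figures copied from the two logs above shows the size of the regression. This is a minimal sanity-check sketch; the throughput lists are transcribed from the logs, and the regression percentage is computed from them, not measured separately:

```python
# Throughputs (samples/sec) transcribed from the 1.4.0 log above.
v140 = [751.94, 747.25, 749.05, 752.68, 746.27, 747.09, 745.22,
        751.97, 741.86, 746.97, 750.02, 747.36, 745.85, 749.36,
        751.17, 747.06, 752.16, 752.29, 751.47, 751.15]
# Throughputs (samples/sec) transcribed from the 1.3.1 log above.
v131 = [1018.92, 1019.44, 1020.81, 1021.99, 1021.33, 1020.30,
        1023.01, 1025.59, 1023.41]

mean_140 = sum(v140) / len(v140)
mean_131 = sum(v131) / len(v131)
slowdown = 1 - mean_140 / mean_131

print(f"1.4.0: {mean_140:.1f} samples/sec")
print(f"1.3.1: {mean_131:.1f} samples/sec")
print(f"regression: {slowdown:.1%}")
```

By these numbers, 1.4.0 runs at roughly 749 samples/sec versus roughly 1022 for 1.3.1, about a 27% slowdown at 80000 classes.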
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

@lanking520
Member

@fullfanta Thanks for reporting that.
@apeforest @samskalicky @szha Please take a look if you have time.

@samskalicky
Contributor

samskalicky commented Apr 25, 2019

@fullfanta we just fixed a related issue in #14570. Can you please try the latest master branch by either:

  • pip install mxnet --pre
  • build from source

and see if the performance degradation still exists?

@fullfanta
Contributor Author

fullfanta commented Apr 29, 2019

@samskalicky

I installed mxnet_cu92-1.5.0b20190428 through pip and noticed the training speed is close to 1.3.1's.
I'm closing this issue because it is solved. :)

INFO:root:Epoch[0] Batch [0-20]	Speed: 1001.33 samples/sec	accuracy=0.001395
INFO:root:Epoch[0] Batch [20-40]	Speed: 992.78 samples/sec	accuracy=0.466797
INFO:root:Epoch[0] Batch [40-60]	Speed: 993.50 samples/sec	accuracy=0.994873
INFO:root:Epoch[0] Batch [60-80]	Speed: 991.49 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [80-100]	Speed: 995.76 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [100-120]	Speed: 995.24 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [120-140]	Speed: 990.84 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [140-160]	Speed: 992.87 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [160-180]	Speed: 994.61 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [180-200]	Speed: 997.63 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [200-220]	Speed: 987.09 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [220-240]	Speed: 994.05 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [240-260]	Speed: 994.24 samples/sec	accuracy=1.000000
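For reference, averaging the samples/sec figures transcribed from the nightly log above and comparing against the 1.3.1 mean shows how much of the throughput is recovered (a minimal sketch; the 1021.64 baseline is the mean of the 1.3.1 figures earlier in the thread):

```python
# Throughputs (samples/sec) transcribed from the 1.5.0 nightly log above.
nightly = [1001.33, 992.78, 993.50, 991.49, 995.76, 995.24, 990.84,
           992.87, 994.61, 997.63, 987.09, 994.05, 994.24]
baseline_131 = 1021.64  # mean samples/sec of the 1.3.1 run above

mean_nightly = sum(nightly) / len(nightly)
recovery = mean_nightly / baseline_131

print(f"nightly: {mean_nightly:.1f} samples/sec ({recovery:.1%} of 1.3.1)")
```

That works out to roughly 994 samples/sec, about 97% of the 1.3.1 baseline, which matches the reporter's conclusion that the regression is essentially resolved.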
