This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Training speed in 1.4.0 is slower than 1.3.1 with a large number of classes #14790

Closed
fullfanta opened this issue Apr 25, 2019 · 4 comments


@fullfanta
Contributor


Description

Comparing 1.3.1 and 1.4.0, training in 1.4.0 is slower than in 1.3.1 with a large number of classes (for example, 80000).

Environment info (Required)

----------Python Info----------
Version : 3.5.2
Compiler : GCC 5.4.0 20160609
Build : ('default', 'Nov 12 2018 13:43:14')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 10.0.1
Directory : /hanmail/.local/lib/python3.5/site-packages/pip
----------MXNet Info-----------
Version : 1.4.0
Directory : /usr/local/lib/python3.5/dist-packages/mxnet
Commit Hash : a03d59e
----------System Info----------
Platform : Linux-4.4.0-78-generic-x86_64-with-Ubuntu-16.04-xenial
system : Linux
node : *
release : 4.4.0-78-generic
version : #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2195.156
BogoMIPS: 4392.89
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts

Package used (Python/R/Scala/Julia):
python3.5

Build info (Required if built from source)

I installed through pip3.
pip3 install mxnet-cu92==1.3.1
pip3 install mxnet-cu92==1.4.0

Minimum reproducible example

I used the ImageNet classification benchmark (https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/train_imagenet.py).

The default network is ResNet with 50 layers, and the number of classes is 80000.

The command is simply:

python3 train_imagenet.py --benchmark=1 --gpus=0,1,2,3,4,5,6,7 --batch-size=1024 --num-classes=80000

The following result is from 1.4.0:

INFO:root:start with arguments Namespace(batch_size=1024, benchmark=1, brightness=0, contrast=0, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', fill_value=127, gc_threshold=0.5, gc_type='none', gpus='0,1,2,3,4,5,6,7', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_crop_size=-1, max_random_area=1, max_random_aspect_ratio=0, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1, max_random_shear_ratio=0, min_crop_size=-1, min_random_area=1, min_random_aspect_ratio=None, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=80000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, pca_noise=0, profile_server_suffix='', profile_worker_suffix='', random_crop=0, random_mirror=0, random_resized_crop=0, rgb_mean='123.68,116.779,103.939', rgb_std='1,1,1', saturation=0, save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
[11:11:31] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [0-20]	Speed: 751.94 samples/sec	accuracy=0.002046
INFO:root:Epoch[0] Batch [20-40]	Speed: 747.25 samples/sec	accuracy=0.466602
INFO:root:Epoch[0] Batch [40-60]	Speed: 749.05 samples/sec	accuracy=0.982373
INFO:root:Epoch[0] Batch [60-80]	Speed: 752.68 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [80-100]	Speed: 746.27 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [100-120]	Speed: 747.09 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [120-140]	Speed: 745.22 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [140-160]	Speed: 751.97 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [160-180]	Speed: 741.86 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [180-200]	Speed: 746.97 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [200-220]	Speed: 750.02 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [220-240]	Speed: 747.36 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [240-260]	Speed: 745.85 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [260-280]	Speed: 749.36 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [280-300]	Speed: 751.17 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [300-320]	Speed: 747.06 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [320-340]	Speed: 752.16 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [340-360]	Speed: 752.29 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [360-380]	Speed: 751.47 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [380-400]	Speed: 751.15 samples/sec	accuracy=1.000000

The following result is from 1.3.1:

INFO:root:start with arguments Namespace(batch_size=1024, benchmark=1, brightness=0, contrast=0, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', fill_value=127, gc_threshold=0.5, gc_type='none', gpus='0,1,2,3,4,5,6,7', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_crop_size=-1, max_random_area=1, max_random_aspect_ratio=0, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1, max_random_shear_ratio=0, min_crop_size=-1, min_random_area=1, min_random_aspect_ratio=None, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=80000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, pca_noise=0, profile_server_suffix='', profile_worker_suffix='', random_crop=0, random_mirror=0, random_resized_crop=0, rgb_mean='123.68,116.779,103.939', rgb_std='1,1,1', saturation=0, save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
[11:23:20] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [20]	Speed: 1018.92 samples/sec	accuracy=0.001442
INFO:root:Epoch[0] Batch [40]	Speed: 1019.44 samples/sec	accuracy=0.464893
INFO:root:Epoch[0] Batch [60]	Speed: 1020.81 samples/sec	accuracy=0.997559
INFO:root:Epoch[0] Batch [80]	Speed: 1021.99 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [100]	Speed: 1021.33 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [120]	Speed: 1020.30 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [140]	Speed: 1023.01 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [160]	Speed: 1025.59 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [180]	Speed: 1023.41 samples/sec	accuracy=1.000000
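For a rough comparison, averaging the samples/sec figures copied from the two logs above shows the size of the regression. This is a minimal sanity-check sketch; the throughput lists are transcribed from the logs, and the regression percentage is computed from them, not measured separately:

```python
# Throughputs (samples/sec) transcribed from the 1.4.0 log above.
v140 = [751.94, 747.25, 749.05, 752.68, 746.27, 747.09, 745.22,
        751.97, 741.86, 746.97, 750.02, 747.36, 745.85, 749.36,
        751.17, 747.06, 752.16, 752.29, 751.47, 751.15]
# Throughputs (samples/sec) transcribed from the 1.3.1 log above.
v131 = [1018.92, 1019.44, 1020.81, 1021.99, 1021.33, 1020.30,
        1023.01, 1025.59, 1023.41]

mean_140 = sum(v140) / len(v140)
mean_131 = sum(v131) / len(v131)
slowdown = 1 - mean_140 / mean_131

print(f"1.4.0: {mean_140:.1f} samples/sec")
print(f"1.3.1: {mean_131:.1f} samples/sec")
print(f"regression: {slowdown:.1%}")
```

By these numbers, 1.4.0 runs at roughly 749 samples/sec versus roughly 1022 for 1.3.1, about a 27% slowdown at 80000 classes.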
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

@lanking520
Member

@fullfanta Thanks for reporting that.
@apeforest @samskalicky @szha Please take a look if you have time.

@samskalicky
Contributor

samskalicky commented Apr 25, 2019

@fullfanta we just fixed a related issue in #14570. Can you please try the latest master branch by either:

  • pip install mxnet --pre
  • build from source

and see if the performance degradation still exists?

@fullfanta
Contributor Author

fullfanta commented Apr 29, 2019

@samskalicky

I installed mxnet_cu92-1.5.0b20190428 through pip and noticed the training speed is close to 1.3.1's.
I'm closing this issue because it is solved. :)

INFO:root:Epoch[0] Batch [0-20]	Speed: 1001.33 samples/sec	accuracy=0.001395
INFO:root:Epoch[0] Batch [20-40]	Speed: 992.78 samples/sec	accuracy=0.466797
INFO:root:Epoch[0] Batch [40-60]	Speed: 993.50 samples/sec	accuracy=0.994873
INFO:root:Epoch[0] Batch [60-80]	Speed: 991.49 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [80-100]	Speed: 995.76 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [100-120]	Speed: 995.24 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [120-140]	Speed: 990.84 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [140-160]	Speed: 992.87 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [160-180]	Speed: 994.61 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [180-200]	Speed: 997.63 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [200-220]	Speed: 987.09 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [220-240]	Speed: 994.05 samples/sec	accuracy=1.000000
INFO:root:Epoch[0] Batch [240-260]	Speed: 994.24 samples/sec	accuracy=1.000000
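For reference, averaging the samples/sec figures transcribed from the nightly log above and comparing against the 1.3.1 mean shows how much of the throughput is recovered (a minimal sketch; the 1021.64 baseline is the mean of the 1.3.1 figures earlier in the thread):

```python
# Throughputs (samples/sec) transcribed from the 1.5.0 nightly log above.
nightly = [1001.33, 992.78, 993.50, 991.49, 995.76, 995.24, 990.84,
           992.87, 994.61, 997.63, 987.09, 994.05, 994.24]
baseline_131 = 1021.64  # mean samples/sec of the 1.3.1 run above

mean_nightly = sum(nightly) / len(nightly)
recovery = mean_nightly / baseline_131

print(f"nightly: {mean_nightly:.1f} samples/sec ({recovery:.1%} of 1.3.1)")
```

That works out to roughly 994 samples/sec, about 97% of the 1.3.1 baseline, which matches the reporter's conclusion that the regression is essentially resolved.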
