
The performance of distributed training is far lower than single-machine training #78

Open
ivanbrave opened this issue Aug 6, 2019 · 5 comments

ivanbrave commented Aug 6, 2019

Describe the bug
I followed the step-by-step tutorial and the best-practice tutorial, but I cannot reproduce the benchmark shown below.
[image: official benchmark chart]
The training speed with 2 workers × 8 GPUs each is far lower than with 1 worker × 8 GPUs.
I use four machines (V100 machines, 25 Gbps TCP uniformly):
11.28.228.135 -> scheduler, server0 (two instances)
11.28.228.136 -> server1 (two instances)
11.28.228.139 -> worker0 (8 GPUs)
11.28.228.140 -> worker1 (8 GPUs)
The total speed of 2 workers with 16 GPUs is 40 * 16 = 640 img/sec, while the speed of a single worker with 8 GPUs is 170 * 8 = 1360 img/sec. I have already tried the methods in the best-practice tutorial.
2 * 8 performance:
Iter #0: 33.5 img/sec per GPU
Iter #1: 34.2 img/sec per GPU
Iter #2: 44.3 img/sec per GPU
Iter #3: 44.0 img/sec per GPU
Iter #4: 30.9 img/sec per GPU
Iter #5: 43.4 img/sec per GPU

1 * 8 performance:
Iter #0: 87.9 img/sec per GPU
Iter #1: 172.0 img/sec per GPU
Iter #2: 165.7 img/sec per GPU
Iter #3: 172.7 img/sec per GPU
Iter #4: 178.4 img/sec per GPU
Iter #5: 175.2 img/sec per GPU

To Reproduce
Steps to reproduce the behavior:
The scheduler and server script (on 11.28.228.135):
docker pull bytepsimage/byteps_server
sudo docker run -it --net=host bytepsimage/byteps_server bash

# scheduler instance
export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=11.28.228.135 # the scheduler IP
export DMLC_PS_ROOT_PORT=7615 # the scheduler port
python /usr/local/byteps/launcher/launch.py

# server instance
export MXNET_OMP_MAX_THREADS=32
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=11.28.228.135 # the scheduler IP
export DMLC_PS_ROOT_PORT=7615 # the scheduler port
python /usr/local/byteps/launcher/launch.py
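
With DMLC_NUM_SERVER=4 and two server instances per machine, the server block above presumably runs in two containers on each of 11.28.228.135 and 11.28.228.136. A minimal sketch for one container on 11.28.228.136, assuming the same environment as above:

# assumed: repeated in each server container on 11.28.228.136
sudo docker run -it --net=host bytepsimage/byteps_server bash
export MXNET_OMP_MAX_THREADS=32
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=11.28.228.135 # the scheduler IP
export DMLC_PS_ROOT_PORT=7615 # the scheduler port
python /usr/local/byteps/launcher/launch.py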

The worker script (worker0 on 11.28.228.139):
sudo nvidia-docker run -it --net=host --shm-size=32768m bytepsimage/worker_tensorflow bash
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # 8 GPUs
export DMLC_WORKER_ID=0 # the first worker
export DMLC_NUM_WORKER=2 # 2 workers
export DMLC_ROLE=worker # your role is worker
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=11.28.228.135 # the scheduler IP
export DMLC_PS_ROOT_PORT=7615 # the scheduler port
export EVAL_TYPE=benchmark
python /usr/local/byteps/launcher/launch.py \
    /usr/local/byteps/example/tensorflow/run_tensorflow_byteps.sh \
    --model ResNet50 --num-iters 1000000
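
The second worker on 11.28.228.140 presumably runs the same script with only the worker ID changed (a minimal sketch, assuming the setup above):

# assumed script for worker1 on 11.28.228.140
sudo nvidia-docker run -it --net=host --shm-size=32768m bytepsimage/worker_tensorflow bash
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # 8 GPUs
export DMLC_WORKER_ID=1 # the second worker
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=11.28.228.135 # the scheduler IP
export DMLC_PS_ROOT_PORT=7615 # the scheduler port
export EVAL_TYPE=benchmark
python /usr/local/byteps/launcher/launch.py \
    /usr/local/byteps/example/tensorflow/run_tensorflow_byteps.sh \
    --model ResNet50 --num-iters 1000000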

Expected behavior
The total speed of distributed training should be faster than single-machine training, reproducing the chart data you have mentioned.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • OS:
  • GCC version:
  • CUDA and NCCL version:
  • Framework (TF, PyTorch, MXNet):

Additional context
Add any other context about the problem here.

bobzhuyb (Member) commented Aug 6, 2019

export MXNET_OMP_MAX_THREADS=32 -- how many CPU cores do you have on the servers? This number should be lower than your physical CPU core count.

You can try the tips in the comment here:
#68 (comment)

In any case, MXNET_OMP_MAX_THREADS=10 should be more than enough for a 25Gbps TCP network. Setting MXNET_OMP_MAX_THREADS too high leads to significant performance degradation.
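
For reference, a quick way to check the physical core count before picking a value (standard Linux tools, not specific to BytePS):

# logical CPUs, including hyper-threads
nproc
# physical topology: sockets, cores per socket, threads per core
lscpu | grep -E 'Socket|Core|Thread'
# pick something well below the physical core count, e.g.
export MXNET_OMP_MAX_THREADS=10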

ivanbrave (Author) commented:

I have 64 cores on each server.
For export MXNET_CPU_WORKER_NTHREADS=32, is this used on the server or the worker? And is 32 a fixed integer?

bobzhuyb (Member) commented Aug 7, 2019

64 physical cores, or hyper-threaded logical cores? I would still suggest using far fewer OMP threads. Since you start two instances per physical server, I think MXNET_OMP_MAX_THREADS=8 is good enough.

MXNET_CPU_WORKER_NTHREADS=32 is used on the server. This may or may not work well.
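
A short sketch of the suggested server-side settings, assuming the rest of the environment stays as in the reproduction script above:

# per server container, before launching
export MXNET_OMP_MAX_THREADS=8      # two server instances per 64-core machine
export MXNET_CPU_WORKER_NTHREADS=32 # server-side; may or may not help
python /usr/local/byteps/launcher/launch.py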

What's the iperf throughput of a single TCP connection in your environment?
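
A minimal way to measure single-connection TCP throughput, assuming iperf3 is installed on both ends (IPs reused from the setup above):

# on a server machine, e.g. 11.28.228.135
iperf3 -s
# on a worker machine, one TCP stream for 10 seconds
iperf3 -c 11.28.228.135 -P 1 -t 10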

ivanbrave (Author) commented:

Do the servers and workers need to be in the same IP address segment?
Do the IPs below work?
11.0.151.230 - server1
11.0.151.232 - server2
11.28.228.139 - worker1
11.28.228.140 - worker2

bobzhuyb (Member) commented Aug 9, 2019

It works, as long as the TCP throughput between them is good.
