
The performance of distributed training is far lower than single-machine training #78

Open
ivanbrave opened this issue Aug 6, 2019 · 5 comments

ivanbrave commented Aug 6, 2019

Describe the bug
I followed the step-by-step tutorial and the best-practice tutorial, but I cannot reproduce the benchmark shown below.
[image: official benchmark chart]
The training speed with 2 workers × 8 GPUs each is far lower than with 1 worker × 8 GPUs.
I use four machines (V100 machines, 25 Gbps TCP uniformly):
11.28.228.135 -> scheduler, server0 (two instances)
11.28.228.136 -> server1 (two instances)
11.28.228.139 -> worker0 (8 GPUs)
11.28.228.140 -> worker1 (8 GPUs)
The total speed of 2 workers with 16 GPUs is 40 * 16 = 640 img/sec, while the speed of a single worker with 8 GPUs is 170 * 8 = 1360 img/sec. I have already tried the methods in the best-practice tutorial.
2 * 8 performance:
Iter #0: 33.5 img/sec per GPU
Iter #1: 34.2 img/sec per GPU
Iter #2: 44.3 img/sec per GPU
Iter #3: 44.0 img/sec per GPU
Iter #4: 30.9 img/sec per GPU
Iter #5: 43.4 img/sec per GPU

1 * 8 performance:
Iter #0: 87.9 img/sec per GPU
Iter #1: 172.0 img/sec per GPU
Iter #2: 165.7 img/sec per GPU
Iter #3: 172.7 img/sec per GPU
Iter #4: 178.4 img/sec per GPU
Iter #5: 175.2 img/sec per GPU

To Reproduce
Steps to reproduce the behavior:
The scheduler and server script (on 11.28.228.135):
docker pull bytepsimage/byteps_server
sudo docker run -it --net=host bytepsimage/byteps_server bash

# scheduler instance
export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=11.28.228.135 # the scheduler IP
export DMLC_PS_ROOT_PORT=7615 # the scheduler port
python /usr/local/byteps/launcher/launch.py

# server instance
export MXNET_OMP_MAX_THREADS=32
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=11.28.228.135 # the scheduler IP
export DMLC_PS_ROOT_PORT=7615 # the scheduler port
python /usr/local/byteps/launcher/launch.py
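
With DMLC_NUM_SERVER=4 and two server instances per machine, the server block above presumably runs in two containers on each of 11.28.228.135 and 11.28.228.136. A minimal sketch for one container on 11.28.228.136, assuming the same environment as above:

# assumed: repeated in each server container on 11.28.228.136
sudo docker run -it --net=host bytepsimage/byteps_server bash
export MXNET_OMP_MAX_THREADS=32
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=11.28.228.135 # the scheduler IP
export DMLC_PS_ROOT_PORT=7615 # the scheduler port
python /usr/local/byteps/launcher/launch.py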

The worker script (worker0 on 11.28.228.139):
sudo nvidia-docker run -it --net=host --shm-size=32768m bytepsimage/worker_tensorflow bash
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # 8 GPUs
export DMLC_WORKER_ID=0 # the first worker
export DMLC_NUM_WORKER=2 # 2 workers
export DMLC_ROLE=worker # your role is worker
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=11.28.228.135 # the scheduler IP
export DMLC_PS_ROOT_PORT=7615 # the scheduler port
export EVAL_TYPE=benchmark
python /usr/local/byteps/launcher/launch.py \
    /usr/local/byteps/example/tensorflow/run_tensorflow_byteps.sh \
    --model ResNet50 --num-iters 1000000
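
The second worker on 11.28.228.140 presumably runs the same script with only the worker ID changed (a minimal sketch, assuming the setup above):

# assumed script for worker1 on 11.28.228.140
sudo nvidia-docker run -it --net=host --shm-size=32768m bytepsimage/worker_tensorflow bash
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # 8 GPUs
export DMLC_WORKER_ID=1 # the second worker
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=11.28.228.135 # the scheduler IP
export DMLC_PS_ROOT_PORT=7615 # the scheduler port
export EVAL_TYPE=benchmark
python /usr/local/byteps/launcher/launch.py \
    /usr/local/byteps/example/tensorflow/run_tensorflow_byteps.sh \
    --model ResNet50 --num-iters 1000000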

Expected behavior
The total speed of distributed training should be faster than single-machine training, reproducing the chart data you have mentioned.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • OS:
  • GCC version:
  • CUDA and NCCL version:
  • Framework (TF, PyTorch, MXNet):

Additional context
Add any other context about the problem here.

bobzhuyb (Member) commented Aug 6, 2019

export MXNET_OMP_MAX_THREADS=32 -- how many CPU cores do you have on the servers? This number should be lower than your physical CPU core count.

You can try the tips in the comment here:
#68 (comment)

In any case, MXNET_OMP_MAX_THREADS=10 should be more than enough for a 25Gbps TCP network. Setting MXNET_OMP_MAX_THREADS too high leads to significant performance degradation.
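
For reference, a quick way to check the physical core count before picking a value (standard Linux tools, not specific to BytePS):

# logical CPUs, including hyper-threads
nproc
# physical topology: sockets, cores per socket, threads per core
lscpu | grep -E 'Socket|Core|Thread'
# pick something well below the physical core count, e.g.
export MXNET_OMP_MAX_THREADS=10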

ivanbrave (Author) commented:

I have 64 cores on each server.
For export MXNET_CPU_WORKER_NTHREADS=32, is this used on the server or the worker? And is 32 a fixed integer?

bobzhuyb (Member) commented Aug 7, 2019

64 physical cores, or hyper-threaded logical cores? I would still suggest using far fewer OMP threads. Since you start two instances per physical server, I think MXNET_OMP_MAX_THREADS=8 is good enough.

MXNET_CPU_WORKER_NTHREADS=32 is used on the server. This may or may not work well.
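
A short sketch of the suggested server-side settings, assuming the rest of the environment stays as in the reproduction script above:

# per server container, before launching
export MXNET_OMP_MAX_THREADS=8      # two server instances per 64-core machine
export MXNET_CPU_WORKER_NTHREADS=32 # server-side; may or may not help
python /usr/local/byteps/launcher/launch.py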

What's the iperf throughput of a single TCP connection in your environment?
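
A minimal way to measure single-connection TCP throughput, assuming iperf3 is installed on both ends (IPs reused from the setup above):

# on a server machine, e.g. 11.28.228.135
iperf3 -s
# on a worker machine, one TCP stream for 10 seconds
iperf3 -c 11.28.228.135 -P 1 -t 10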

ivanbrave (Author) commented:

Do the servers and workers need to be in the same IP address segment?
Do the IPs below work?
11.0.151.230 - server1
11.0.151.232 - server2
11.28.228.139 - worker1
11.28.228.140 - worker2

bobzhuyb (Member) commented Aug 9, 2019

It works, as long as the TCP throughput between them is good.
