
Performance slightly lower than horovod #188

Open
bengineml opened this issue Jan 9, 2020 · 3 comments

@bengineml

Describe the bug
We've been benchmarking Horovod vs. BytePS on AWS. We were hoping to see some performance improvements from using BytePS for 64-GPU jobs, but we've found that BytePS gives about the same performance for us, perhaps even a few percent lower.

To Reproduce
Training a FasterRCNN model from open-mmlab on the open-images dataset on AWS.
Horovod setup: 8x p3.16xlarge workers
BytePS setup: 8x p3.16xlarge workers, 8x c5n.4xlarge servers with the following env vars:

MXNET_OMP_MAX_THREADS=10
MXNET_CPU_WORKER_NTHREADS=32

Expected behavior
We expected to see a performance increase from using BytePS over Horovod. Are there other tweaks we can apply to try to increase the speed of BytePS? Perhaps the model-training code itself is the bottleneck and won't benefit from better distributed-training infrastructure.

Right now we're seeing the following time per step:

Horovod: approx. 0.591 s/step
BytePS: approx. 0.615 s/step
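
As a quick sanity check of that gap, here is a tiny calculation using only the two step times quoted above:

```python
# Relative slowdown implied by the quoted step times.
horovod_step_s = 0.591
byteps_step_s = 0.615

slowdown = (byteps_step_s - horovod_step_s) / horovod_step_s
print(f"BytePS is about {slowdown:.1%} slower per step")  # ~4.1%
```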

Environment (please complete the following information):

  • OS: Ubuntu 18.04
  • GCC version: 4.9
  • CUDA and NCCL version: CUDA 10 and NCCL 2.4.8
  • Framework (TF, PyTorch, MXNet): PyTorch 1.1.0
@bobzhuyb
Member

bobzhuyb commented Jan 9, 2020

Hello, would you tell us which version of BytePS you are using? Are you using the latest master branch / latest Docker image, or an earlier version, like the v0.1 tag (https://github.com/bytedance/byteps/tree/v0.1) or even earlier? If you are using something earlier, we strongly recommend that you move to v0.1 or master, because they have better performance. If you use these recent versions, you can drop the two env vars you mention, since BytePS no longer relies on MXNet.

After that, you may try increasing the number of server instances, e.g., starting two server instances per c5n.4xlarge. The reason is that TCP has difficulty saturating your 25 Gbps network bandwidth, and more server instances mean more TCP connections, so hopefully you'll get better performance.
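
For concreteness, here is a minimal sketch of what "two server instances per machine" could look like. It assumes the DMLC_* environment variables and the bpslaunch entry point from the BytePS tutorials (older versions use launcher/launch.py instead); the scheduler address, port, and counts are placeholders, not values for your cluster.

```python
# Hypothetical sketch: start two BytePS server processes on one c5n.4xlarge.
# Env var names follow BytePS's DMLC_* conventions; values are placeholders.
import os
import subprocess

server_env = dict(
    os.environ,
    DMLC_ROLE="server",            # this process acts as a server
    DMLC_NUM_WORKER="8",           # 8x p3.16xlarge workers
    DMLC_NUM_SERVER="16",          # 8 hosts x 2 server instances each
    DMLC_PS_ROOT_URI="10.0.0.1",   # placeholder: scheduler IP
    DMLC_PS_ROOT_PORT="1234",      # placeholder: scheduler port
)

# Two independent server processes -> more TCP connections to the workers.
servers = [subprocess.Popen(["bpslaunch"], env=server_env) for _ in range(2)]
for p in servers:
    p.wait()
```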

@bengineml
Author

Thanks for the pointers. The numbers in my previous comment were from an older version of BytePS. After upgrading to a more recent version (commit bb91039) and running two server instances per c5n.4xlarge, I'm getting better performance, but still about even with Horovod.

Setup                     Seconds / Step
Horovod                   ≈ 0.591
Old BytePS                ≈ 0.615
New BytePS, 2x servers    ≈ 0.590

@bobzhuyb
Member

bobzhuyb commented Jan 17, 2020

@bengineml It seems that cross-worker communication is not a bottleneck. Have you done any profiling?
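
If not, PyTorch's built-in autograd profiler is a quick way to see where a step's time goes. A minimal sketch, with a toy nn.Linear model standing in for your real training step:

```python
# Minimal profiling sketch using torch.autograd.profiler (available in PyTorch 1.1).
# The Linear layer below is just a stand-in for the real model and optimizer.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
data = torch.randn(64, 1024, device="cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    loss = model(data).sum()
    loss.backward()
    optimizer.step()

# Sort ops by total CUDA time to see which ones dominate the step.
print(prof.key_averages().table(sort_by="cuda_time_total"))
```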

We can also do a quick estimation of the communication overhead. Could you tell us the following numbers:

  1. the training speed on one single V100, using the same per-GPU batch size
  2. the training speed on one p3.16xlarge with 8 V100's, using the same per-GPU batch size
  3. the total size of the model (number of parameters × 2 or 4 bytes, depending on whether you are using fp16 or fp32)

First, we know that you can never run faster than the numbers from items 1 and 2.
With item 3, we can approximate the per-step communication time as (total model size) / (25 Gbps bandwidth of the p3.16xlarge). If this communication time is much less than 100 ms, you may not see much improvement from using a different distributed framework.
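
To make the arithmetic concrete, here is a small sketch of that estimate; the parameter count is a made-up placeholder, so substitute the real model size from item 3:

```python
# Back-of-envelope communication estimate described above.
num_params = 60e6        # placeholder; use your model's actual parameter count
bytes_per_param = 4      # fp32 (use 2 for fp16)
model_bytes = num_params * bytes_per_param

bandwidth_bps = 25e9     # 25 Gbps NIC on a p3.16xlarge
comm_time_ms = model_bytes * 8 / bandwidth_bps * 1e3

print(f"Model size: {model_bytes / 1e6:.0f} MB")                   # 240 MB
print(f"Per-step communication estimate: {comm_time_ms:.1f} ms")   # ~76.8 ms
# If this is much less than 100 ms, a different distributed framework
# is unlikely to change the step time much.
```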
