Performance slightly lower than horovod #188
Comments
Hello, would you tell us which version of BytePS you are using? Are you using the latest master branch / latest Docker image, or an earlier version such as the v0.1 tag (https://github.com/bytedance/byteps/tree/v0.1) or even older? If you are on something earlier, we strongly recommend moving to v0.1 or master, because they have better performance. If you use these recent versions, you can also drop the two env vars you mention, since BytePS no longer relies on MXNet. After that, you may try increasing the number of server instances, e.g., start two server instances per …
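For reference, here is a minimal sketch of how one extra server instance can be started, assuming the usual ps-lite environment variables and the `byteps.server` entry point; the addresses and counts below are placeholders, not values from this issue:

```python
import os

# Placeholder values; set these to match your actual cluster. Each extra
# server process started this way counts as one more server instance.
os.environ["DMLC_ROLE"] = "server"
os.environ["DMLC_NUM_WORKER"] = "8"          # number of worker machines
os.environ["DMLC_NUM_SERVER"] = "16"         # e.g. two server instances per server machine
os.environ["DMLC_PS_ROOT_URI"] = "10.0.0.1"  # scheduler IP (placeholder)
os.environ["DMLC_PS_ROOT_PORT"] = "1234"     # scheduler port (placeholder)

import byteps.server  # importing this module starts the server event loop
```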
Thanks for the pointers. The numbers in my previous comment were from an older version of
@bengineml It seems that cross-worker communication is not a bottleneck. Have you done any profiling? We can also make a quick estimate of the communication overhead. Could you tell us the following numbers:
First, we know that you can never run faster than the numbers in #1 and #2.
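As a sketch of that estimate (every number below is a hypothetical placeholder, not a measurement from this issue; substitute the values asked for above):

```python
# Back-of-envelope lower bound on time/step. All numbers here are hypothetical
# placeholders; replace them with your measured values.
params = 170e6                      # model parameters (hypothetical)
model_size_bytes = params * 4       # fp32 gradients
nic_bandwidth = 25e9 / 8            # assuming a 25 Gbit/s NIC -> bytes/s
compute_time = 0.55                 # per-GPU forward+backward time in seconds (hypothetical)

# In a parameter-server setup each worker pushes its gradients and pulls the
# updated parameters once per step, so roughly 2x the model size crosses the NIC.
comm_time = 2 * model_size_bytes / nic_bandwidth

# With perfect compute/communication overlap, a step can never be faster
# than the slower of the two.
lower_bound = max(compute_time, comm_time)
print(f"comm-bound: {comm_time:.3f}s/step, overall lower bound: {lower_bound:.3f}s/step")
```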
Describe the bug
We've been doing some benchmarking of Horovod vs. BytePS on AWS. We were hoping to see a performance improvement from using BytePS for 64-GPU jobs, but we've noticed that BytePS gives about the same performance for us, perhaps even a few percent lower.
To Reproduce
Training a Faster R-CNN model from open-mmlab on the Open Images dataset on AWS.
Horovod setup:
8x p3.16xlarge workers
BytePS setup:
8x p3.16xlarge workers, 8x c5n.4xlarge servers
with the following env-vars:
Expected behavior
We expected to see a performance increase from using BytePS over Horovod. Are there other tweaks we can apply to try to increase the speed of BytePS? Perhaps the model-training code itself is the bottleneck and won't benefit from better distributed-training infrastructure.
Right now we're seeing the following time/step:
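For context, this is roughly how we measure time/step; it is a simplified sketch, and `model`, `optimizer`, and `data_loader` are stand-ins for our actual training objects rather than real names from our code:

```python
import time
import torch

def measure_step_times(model, optimizer, data_loader, warmup=10, steps=50):
    """Average wall-clock time per training step over `steps` iterations."""
    times = []
    it = iter(data_loader)
    for i in range(warmup + steps):
        batch = next(it)
        torch.cuda.synchronize()
        start = time.perf_counter()
        optimizer.zero_grad()
        loss = model(batch)        # forward pass (returns a scalar loss here)
        loss.backward()            # backward; with Horovod/BytePS this also
                                   # triggers gradient push/pull via hooks
        optimizer.step()
        torch.cuda.synchronize()   # wait for GPU work and communication to finish
        if i >= warmup:            # discard warmup iterations
            times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```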
Environment (please complete the following information):