Implement distributed training using horovod #3533
@@ -203,9 +203,15 @@ If you have a capable compute architecture, it is possible to distribute the tra
Horovod is capable of using MPI and NVIDIA's NCCL for highly optimized inter-process communication.
It also offers `Gloo <https://github.com/facebookincubator/gloo>`_ as an easy-to-setup communication backend.

- For more information about setup or tuning of Horovod please visit `Horovod's Github <https://github.com/horovod/horovod>`_.
+ For more information about the setup or tuning of Horovod, please visit `Horovod's documentation <https://horovod.readthedocs.io/en/stable/summary_include.html>`_.
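Which of these backends a given Horovod installation actually supports can be verified up front. Below is a minimal sketch, assuming Horovod's build-inspection helpers are exposed on the ``horovod.tensorflow`` module; the CLI equivalent is ``horovodrun --check-build``:

.. code-block:: python

   # Query the communication backends compiled into the local Horovod build.
   import horovod.tensorflow as hvd

   print("MPI support: ", hvd.mpi_built())
   print("NCCL support:", hvd.nccl_built())
   print("Gloo support:", hvd.gloo_built())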
- To train on 4 machines using 4 GPUs each:
+ Horovod can in principle run on heterogeneous systems (e.g. a different number or model of GPUs per machine).
+ However, this can cause unpredictable problems and would require changes to the training code.
+ Therefore, we only support homogeneous systems, meaning the same hardware and the same software configuration (OS, drivers, MPI, NCCL, TensorFlow, ...) on each machine.
+ The only exception is a different number of GPUs per machine, since this can be controlled by ``horovodrun -H``.
Reviewer: No risk of improper interactions with the batch size, for example?

Author: I do not get your question. The batch size specified via CLI is treated as the batch size for each worker, not for the machine or the complete system; this is why we do learning rate rescaling. With code changes it would be possible to set a different batch size on each GPU (e.g. for different memory sizes or load balancing), but that would open the door to load-balance problems you do not want to support.

Reviewer: So the batch size applies equally to all GPUs of one machine? Sorry, but the few …

Author: Horovod by itself does nothing with the batch size; see https://github.com/tud-zih-tools/DeepSpeech/blob/329bf876069720cf05b4e4700e6d0dde104b6bac/training/deepspeech_training/train.py#L423 (is it possible to link the code here directly?). So your effective batch size for training, on which the optimizer is applied, is the per-worker batch size multiplied by the number of workers. To prevent network convergence problems because of this bigger effective batch size, we scale the learning rate as recommended by the Horovod developers. In theory, Horovod has no problem if you apply different batch sizes to each GPU. In practice, you want to make sure every process finishes its batch at about the same time (load balance); if one process is much too late, Horovod's error handling will take action.
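To make the arithmetic in this thread concrete, here is a minimal sketch of the rescaling pattern it describes, assuming Horovod's TensorFlow 1.x-style API; the variable names and values are illustrative, not the PR's actual code:

.. code-block:: python

   # Illustrative sketch of the rescaling described above -- not the PR's exact code.
   import horovod.tensorflow as hvd
   import tensorflow.compat.v1 as tf

   hvd.init()

   per_worker_batch_size = 8     # the CLI batch size applies per worker (per GPU)
   base_learning_rate = 0.001

   # The optimizer effectively sees the per-worker batch times the worker count,
   # e.g. 8 * 16 = 128 when training on 4 machines with 4 GPUs each.
   effective_batch_size = per_worker_batch_size * hvd.size()
   if hvd.rank() == 0:
       print("effective batch size:", effective_batch_size)

   # Scale the learning rate with the worker count, as recommended by the
   # Horovod developers, to compensate for the larger effective batch size.
   opt = tf.train.AdamOptimizer(learning_rate=base_learning_rate * hvd.size())

   # Wrap the optimizer so gradients are averaged across all workers each step.
   opt = hvd.DistributedOptimizer(opt)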
+ Detailed documentation on how to run Horovod is provided `here <https://horovod.readthedocs.io/en/stable/running.html>`_.
+ The short command to train on 4 machines using 4 GPUs each:

.. code-block:: bash

   # horovodrun's -np/-H syntax follows Horovod's docs; the script name and flags are illustrative
   horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python3 DeepSpeech.py --train_files [...] --horovod
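Note that ``-np`` is the total number of worker processes and each ``host:count`` entry of ``-H`` is the number of processes started on that machine, so the one permitted heterogeneity above (a different number of GPUs per machine) is expressed the same way, e.g. ``-np 6 -H server1:4,server2:2``.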
Reviewer: 💯