Implement distributed training using horovod #3533

Merged 4 commits on Mar 15, 2021
doc/TRAINING.rst: 10 changes (8 additions, 2 deletions)
@@ -203,9 +203,15 @@ If you have a capable compute architecture, it is possible to distribute the training.
Horovod is capable of using MPI and NVIDIA's NCCL for highly optimized inter-process communication.
It also offers `Gloo <https://github.com/facebookincubator/gloo>`_ as an easy-to-setup communication backend.

For more information about setup or tuning of Horovod please visit `Horovod's Github <https://github.com/horovod/horovod>`_.
For more information about setup or tuning of Horovod please visit `Horovod's documentation <https://horovod.readthedocs.io/en/stable/summary_include.html>`_.

To train on 4 machines using 4 GPUs each:
Horovod is expected to run on heterogeneous systems (e.g. a different number or model of GPUs per machine).
However, this can cause unpredictable problems and requires user intervention in the training code.
Therefore, we only support homogeneous systems, meaning the same hardware and also the same software configuration (OS, drivers, MPI, NCCL, TensorFlow, ...) on each machine.
Collaborator
💯

The only exception is a different number of GPUs per machine, since this can be controlled via ``horovodrun -H``.
Collaborator
No risk of improper interactions with batch size, for example?

Contributor Author
I do not get your question at all.

The batch size specified via the CLI is treated as the batch size for each worker, not for the machine or the complete system. Therefore we do learning rate rescaling.
If all GPUs are equal, I do not see any problems, since they all get the same batch size.

If you changed the code, it would be possible to set a different batch size on each GPU (e.g. for different memory sizes or for load balancing). This would open the door to load-balancing problems that you do not want to support.
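
A minimal sketch of the supported behavior (not code from this PR; it assumes Horovod's TensorFlow API and a made-up batch-size value): every worker keeps the CLI batch size, so the effective batch size grows with the number of processes.

import horovod.tensorflow as hvd

hvd.init()  # horovodrun starts one process per GPU

# Hypothetical stand-in for FLAGS.train_batch_size: every worker
# (each GPU on each machine) uses this same per-step batch size.
per_worker_batch_size = 32

# Batch size the optimizer effectively sees across the whole setup.
effective_batch_size = per_worker_batch_size * hvd.size()

print(f"rank {hvd.rank()} of {hvd.size()}: effective batch size = {effective_batch_size}")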

Collaborator
> If all GPUs are equal, I do not see any problems, since they all get the same batch size.

Does the batch size apply equally to all GPUs of one machine?

Sorry, but the few horovodrun examples make it hard to grasp the meaning of the commands.

Contributor Author
Horovod by itself does nothing with the batch size, and neither does horovodrun.
The batch size used by each Horovod process, and therefore by every GPU on every machine, is the batch size you specify as usual with DeepSpeech.py --train_batch_size, and it is only used when creating the dataset:

train_set = create_dataset(FLAGS.train_files.split(','),
                           batch_size=FLAGS.train_batch_size,
                           epochs=FLAGS.epochs,
                           augmentations=Config.augmentations,
                           cache_path=FLAGS.feature_cache,
                           train_phase=True,
                           exception_box=exception_box,
                           process_ahead=Config.num_devices * FLAGS.train_batch_size * 2,
                           reverse=FLAGS.reverse_train,
                           limit=FLAGS.limit_train,
                           buffering=FLAGS.read_buffer,
                           split_dataset=split_dataset)

https://github.com/tud-zih-tools/DeepSpeech/blob/329bf876069720cf05b4e4700e6d0dde104b6bac/training/deepspeech_training/train.py#L423 (Is it possible to link the code here directly?)

So, the effective batch size for training, which the optimizer is applied to, is FLAGS.train_batch_size * NUM_PROC, where NUM_PROC is the number of Horovod processes used, which equals the number of GPUs in your whole setup. In Horovod terminology, as in HPC/MPI terminology, this is horovod.size().

To prevent network convergence problems caused by this bigger effective batch size, we scale the learning rate, as recommended by the Horovod developers, to
FLAGS.learning_rate * NUM_PROC
https://github.com/tud-zih-tools/DeepSpeech/blob/329bf876069720cf05b4e4700e6d0dde104b6bac/training/deepspeech_training/train.py#L487

In theory, Horovod has no problem if you apply a different batch size to each GPU. In practice, you want to make sure every process finishes its batch at about the same time (load balance). If one process is much too late, Horovod's error handling will take action.
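
A minimal sketch of the learning-rate rescaling and gradient averaging described above (assuming Horovod's TensorFlow API and a TF1-style optimizer; the base learning rate is a made-up stand-in for FLAGS.learning_rate, and this is not the exact code from train.py):

import horovod.tensorflow as hvd
import tensorflow.compat.v1 as tfv1

hvd.init()

base_learning_rate = 0.0001  # hypothetical stand-in for FLAGS.learning_rate

# Compensate for the larger effective batch size by scaling the learning
# rate with the number of Horovod processes.
optimizer = tfv1.train.AdamOptimizer(learning_rate=base_learning_rate * hvd.size())

# Average gradients across all processes before they are applied.
optimizer = hvd.DistributedOptimizer(optimizer)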


Detailed documentation on how to run Horovod is provided `here <https://horovod.readthedocs.io/en/stable/running.html>`_.
The short command to train on 4 machines using 4 GPUs each:

.. code-block:: bash
