Implement distributed training using horovod #3533
Conversation
I do think this makes it very difficult to review. One question: how does this interact with TensorFlow 2? @reuben is working to upgrade the training path to TF2, so if Horovod breaks or requires significant changes it might be problematic?
Horovod works fine with TensorFlow 2. I already took a look at #3485. There should be no problem getting it compatible with reuben's code.
"No problem" as in "it's a piece of cake and can be folded into the TensorFlow upgrade", or as in "it will break badly until it is fixed"? I know @reuben was having perf issues on his PR; it'd be super nice if you have feedback / time to check whether Horovod helps there as well.
@NanoNabla Also, you'll want @reuben to have a look, but he's generally busy and currently not available for at least one full week, so please be patient :-).
Distributed training using Horovod
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a capable compute architecture, it is possible to distribute the training using `Horovod <https://github.com/horovod/horovod>`_. A fast network is recommended.
While I understand that people using Horovod should know the underlying requirements, I think "a fast network" might be troublesome for some users: we have had requests from people who want to use distributed training to leverage several GPUs over only Gigabit Ethernet networks.
We are still doing experiments, but Ethernet might be sufficient for small setups, e.g. using two or three systems.
.. code-block:: bash

   horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python3 DeepSpeech.py --train_files [...] --horovod
Maybe worth linking some official, stable "how to use Horovod" crash course here?
@@ -76,6 +76,10 @@ def main():
        'tensorflow == 1.15.4'
    ]

    horovod_pypi_dep = [
        'horovod[tensorflow] == 0.21.3'
(I have not checked the PyPI repo) How does that work when tensorflow-gpu is needed? Is this horovod[tensorflow] still working as expected?
Maybe you want to take a look here: https://horovod.readthedocs.io/en/stable/install.html
There is only horovod[tensorflow] for the TensorFlow bindings, and it should be the same as HOROVOD_WITH_TENSORFLOW=1 pip install horovod.
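A minimal sketch of how such an optional dependency could be wired into setup.py. This is illustrative only, not this PR's exact code: the DS_WITH_HOROVOD switch and the package metadata are assumptions.

import os
from setuptools import setup

install_requires = [
    'tensorflow == 1.15.4',
]

horovod_pypi_dep = [
    'horovod[tensorflow] == 0.21.3',
]

# Assumption: opt in via an environment variable so the default install
# stays unchanged for users who do not need distributed training.
if os.environ.get('DS_WITH_HOROVOD'):
    install_requires += horovod_pypi_dep

setup(
    name='deepspeech_training',  # illustrative package name
    install_requires=install_requires,
)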
doc/TRAINING.rst
For more information about setup or tuning of Horovod please visit `Horovod's Github <https://github.com/horovod/horovod>`_.

To train on 4 machines using 4 GPUs each:
Also, are there any requirements:
- same GPUs?
- same OS?
- same drivers?
- same number of GPUs on each system?
Horovod is expected to run on highly heterogeneous systems; however, it is not well documented what actually works. I would not support anything other than a different number of GPUs per machine. This can be controlled by the number of processes per host, since every process has exactly one GPU pinned to it:
horovodrun -np SUM_OF_PROCNUM -H server1:NUMPROC_1,server2:NUMPROC_2,server3 [...]
Maybe you can add a link to the Horovod doc covering those aspects? And articulate that we cannot support anything besides homogeneous configurations, and that heterogeneous cases will have to deal with their own mess? :)
Thanks, it's much easier to review. I only had a quick look though; I still need to go over the details of the training loop as well as the dataset handling.
@NanoNabla @AndreasGocht Dumb question, but are there limitations that might make Horovod not work within a Docker context?
Hey, I have never tried, but there is no limitation in principle that I am aware of. However, using MPI, there might be some gotchas.
I am not sure if Gloo is an alternative here. BTW: do you know Singularity? It's an HPC container environment, so it focuses on performance. It might be worth a look. Best, Andreas
As stated in the docs, Horovod should run within Docker. Other people on our HPC system use Horovod within Singularity, which is an HPC container environment similar to Docker and mostly compatible with it, as far as I know.
I guess I should have been clearer in my question: it's not in the context of leveraging distributed training, but on a single machine. I was thinking of giving a test run comparing non-Horovod training and Horovod-enabled training here on my 2x RTX 2080 Ti. So I would not care about MPI and the others.
If you run both processes in the same container, I do not see any problems, since the communication and so on stays inside it. Finally, you just run horovodrun as usual.
To train on 4 machines using 4 GPUs each:
Horovod is expected to run on heterogeneous systems (e.g. different number and model type of GPUs per machine).
However, this can cause unpredictable problems and requires user intervention in the training code.
Therefore, we only support homogeneous systems, meaning the same hardware and also the same software configuration (OS, drivers, MPI, NCCL, TensorFlow, ...) on each machine.
💯
The only exception is a different number of GPUs per machine, since this can be controlled by ``horovodrun -H``.
No risk of improper interactions with batch size for example?
I do not quite get your question.
The batch size specified via the CLI is treated as the batch size for each worker, not for the machine or the complete system; therefore we do learning-rate rescaling.
If all GPUs are equal, I do not see any problems, since all get the same batch size.
If you changed the code, it would be possible to set different batch sizes on each GPU (e.g. for different memory sizes or load balancing). This would open the door to load-balancing problems, which you do not want to support.
If all GPUs are equal, I do not see any problems, since all get the same batch size.
So the batch size applies equally to all GPUs of one machine? Sorry, but the few horovodrun examples make it hard to grasp the meaning of the commands.
Horovod by itself does nothing with the batch size, and therefore neither does horovodrun.
The batch size used by each Horovod process, and so by every GPU on every machine, is the batch size you specify as usual with DeepSpeech.py --train_batch_size, and it is only used when creating the dataset:
train_set = create_dataset(FLAGS.train_files.split(','),
batch_size=FLAGS.train_batch_size,
epochs=FLAGS.epochs,
augmentations=Config.augmentations,
cache_path=FLAGS.feature_cache,
train_phase=True,
exception_box=exception_box,
process_ahead=Config.num_devices * FLAGS.train_batch_size * 2,
reverse=FLAGS.reverse_train,
limit=FLAGS.limit_train,
buffering=FLAGS.read_buffer,
split_dataset=split_dataset)
https://github.com/tud-zih-tools/DeepSpeech/blob/329bf876069720cf05b4e4700e6d0dde104b6bac/training/deepspeech_training/train.py#L423 (Is it possible to link the code here directly?)
So, your effective batch size for training, to which the optimizer is applied, is FLAGS.train_batch_size * NUM_PROC, where NUM_PROC is the number of Horovod processes used, which equals the number of GPUs in your whole setup (in Horovod terminology, as in HPC/MPI terminology: horovod.size()).
To prevent network convergence problems caused by this bigger effective batch size, we scale the learning rate, as recommended by the Horovod devs, to FLAGS.learning_rate * NUM_PROC:
https://github.com/tud-zih-tools/DeepSpeech/blob/329bf876069720cf05b4e4700e6d0dde104b6bac/training/deepspeech_training/train.py#L487
In theory Horovod has no problem if you apply different batch sizes to different GPUs. In practice, you want to make sure every process finishes its batch at about the same time (load balancing). If one process is much too late, Horovod's error handling will take action.
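A minimal sketch of the scaling scheme described above, using Horovod's public TF1 API. The constants stand in for DeepSpeech's FLAGS values and are assumptions; this is not the PR's exact code.

import horovod.tensorflow as hvd
import tensorflow.compat.v1 as tf

hvd.init()

per_worker_batch_size = 32    # stands in for --train_batch_size (assumed value)
base_learning_rate = 0.0001   # stands in for --learning_rate (assumed value)

# Every worker feeds per_worker_batch_size samples per step, so the
# effective batch size seen by the optimizer grows with hvd.size().
effective_batch_size = per_worker_batch_size * hvd.size()

# Scale the learning rate by the same factor, as recommended by Horovod.
scaled_lr = base_learning_rate * hvd.size()

optimizer = tf.train.AdamOptimizer(learning_rate=scaled_lr)
# DistributedOptimizer averages gradients across all workers via allreduce.
optimizer = hvd.DistributedOptimizer(optimizer)

print('workers:', hvd.size(), 'effective batch size:', effective_batch_size)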
We know you only have sparse time, but what is the current state of reviewing this PR? Can we support it, e.g. by providing further information and explanations? Are there preconditions for you, like TF2 integration? We already looked at the problem in #3485 because @lissyx mentioned it here, but we don't have any clue how to fix it. Horovod does not change performance on a single worker (single GPU). You could just blow up your effective batch size to its previous value by using multiple Horovod workers (and so GPUs), but that's not what you want in terms of efficiency.
Well, I asked about TF2 specifically because I know it will take me some time to properly test this, and I wanted to avoid you wasting your time on something that would require a complete rewrite. Unfortunately, I have not yet been able to dedicate time, but if the planets get aligned properly tonight, this might change quickly.
Just say the coordinates and I'll kick the planets for you. :)
I think if this is passing training tests and doesn't break compatibility with loading existing checkpoints, then it's OK to land as-is. The TF2 path is looking like an almost complete training code rewrite anyway, so I don't think it makes sense to block this PR on that.
Overall LGTM. The only question I have is about checkpoint compatibility. It doesn't look like it's broken by this, but I haven't tested. If it is broken, then we should document it at least in the --horovod flag doc, similar to --train_cudnn.
What was the result of your CI tests? I checked checkpoint compatibility: I ran 2 nodes with 2 GPUs each using Horovod and then successfully used these checkpoints on 1 GPU on one node without Horovod.
Good for me.
Does that mean "yes" to reuben's question?
None yet, but I just triggered a TC run (maybe one of the last, see #3317)
It ain't breaking badly, let's merge.
As already mentioned on Discourse, we implemented distributed training using Horovod.
We tried to keep the changes as minimal as possible. It is still possible to run your undistributed code version. However, we also noticed a slight improvement by using Horovod on one machine.
We copied your train function to train_with_horovod, but I can also provide a merged version using some if FLAGS.horovod checks. That would reduce code duplication; the separate function is just to improve readability at the moment.
While our code scales well on locally stored datasets, we experienced a lack of performance on datasets stored on our globally distributed filesystem (BeeGFS). This could be caused by tf.data.Dataset.shard. We will investigate this.
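For reference, a minimal sketch of the sharding mechanism mentioned above, on a generic tf.data pipeline rather than the PR's actual input pipeline: each Horovod worker keeps a disjoint slice of the dataset via tf.data.Dataset.shard, which means every worker still walks the full input stream and discards most of it, one possible explanation for the filesystem bottleneck.

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Stand-in for the real feature pipeline; in DeepSpeech the dataset would
# come from create_dataset(...) instead.
dataset = tf.data.Dataset.range(1000)

# Each worker keeps only every hvd.size()-th element, offset by its rank,
# so the workers see disjoint slices of the data.
dataset = dataset.shard(num_shards=hvd.size(), index=hvd.rank())
dataset = dataset.batch(32)  # per-worker batch size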