Implement distributed training using horovod #3533
Conversation
I do think this makes it very difficult to review. One question: how does this interact with TensorFlow 2? @reuben is working to upgrade the training path to TF2, so if Horovod breaks or requires significant changes it might be problematic?
Horovod works fine with TensorFlow 2. I already took a look at #3485. There should be no problem getting it compatible with reuben's code.
"No problem" as in "it's a piece of cake and can be folded into the TensorFlow upgrade", or as in "it will break badly until it is fixed"? I know @reuben was having perf issues on his PR; it'd be super nice if you have feedback / time to check whether Horovod helps there as well.
@NanoNabla Also, you'll want @reuben to have a look, but he's generally busy and currently not available for at least one full week, so please be patient :-).
Distributed training using Horovod
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a capable compute architecture, it is possible to distribute the training using `Horovod <https://github.com/horovod/horovod>`_. A fast network is recommended.
While I understand that people using Horovod should know the underlying requirements, I think "a fast network" might be troublesome for some users: we have had requests from people who want to use distributed training to leverage several GPUs over only Gigabit Ethernet networks.
We are still doing experiments, but Ethernet might be sufficient for small setups, e.g. using two or three systems.
.. code-block:: bash

   horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python3 DeepSpeech.py --train_files [...] --horovod
Maybe worth linking some official, stable "how to use Horovod" crash course here?
@@ -76,6 +76,10 @@ def main():
        'tensorflow == 1.15.4'
    ]

    horovod_pypi_dep = [
        'horovod[tensorflow] == 0.21.3'
(I have not checked the PyPI repo) How does that work when tensorflow-gpu is needed? Is this horovod[tensorflow] still working as expected?
Maybe you want to take a look here: https://horovod.readthedocs.io/en/stable/install.html
There is only horovod[tensorflow] for the TensorFlow bindings, and it should be the same as HOROVOD_WITH_TENSORFLOW=1 pip install horovod.
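A minimal sketch of how such an optional dependency could be wired into setup.py. This is illustrative only, not this PR's exact code: the DS_WITH_HOROVOD switch and the package metadata are assumptions.

import os
from setuptools import setup

install_requires = [
    'tensorflow == 1.15.4',
]

horovod_pypi_dep = [
    'horovod[tensorflow] == 0.21.3',
]

# Assumption: opt in via an environment variable so the default install
# stays unchanged for users who do not need distributed training.
if os.environ.get('DS_WITH_HOROVOD'):
    install_requires += horovod_pypi_dep

setup(
    name='deepspeech_training',  # illustrative package name
    install_requires=install_requires,
)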
doc/TRAINING.rst
For more information about setup or tuning of Horovod please visit `Horovod's Github <https://github.com/horovod/horovod>`_.

To train on 4 machines using 4 GPUs each:
Also, are there any requirements:
- same GPUs?
- same OS?
- same drivers?
- same number of GPUs on each system?
Horovod is expected to run on highly heterogeneous systems; however, it is not well documented what actually works. I would not support anything other than a different number of GPUs per machine. This can be controlled by the number of processes per host, since every process has exactly one GPU pinned to it:
horovodrun -np SUM_OF_PROCNUM -H server1:NUMPROC_1,server2:NUMPROC_2,server3 [...]
Maybe you can add a link to the Horovod doc covering those aspects? And articulate that we cannot support anything besides homogeneous configurations, and that heterogeneous cases will have to deal with their own mess? :)
Thanks, it's much easier to review. I only had a quick look though; I still need to go over the details of the training loop as well as the dataset handling.
@NanoNabla @AndreasGocht Dumb question, but are there limitations that might make Horovod not work within a Docker context?
Hey, I have never tried, but there is no limitation in principle that I am aware of. However, using MPI, there might be some gotchas.
I am not sure if Gloo is an alternative here. BTW: do you know Singularity? It's an HPC container environment, so it focuses on performance. It might be worth a look. Best, Andreas
As stated in the docs, Horovod should run within Docker. Other people on our HPC system use Horovod within Singularity, which is an HPC container environment similar to Docker and mostly compatible with it, as far as I know.
I guess I should have been clearer in my question: it's not in the context of leveraging distributed training, but on a single machine. I was thinking of giving a test run comparing non-Horovod training and Horovod-enabled training here on my 2x RTX 2080 Ti. So I would not care about MPI and the others.
If you run both processes in the same container, I do not see any problems, since the communication and so on stays inside it. Finally, you just run horovodrun as usual.
To train on 4 machines using 4 GPUs each:
Horovod is expected to run on heterogeneous systems (e.g. different number and model type of GPUs per machine).
However, this can cause unpredictable problems and requires user intervention in the training code.
Therefore, we only support homogeneous systems, meaning the same hardware and also the same software configuration (OS, drivers, MPI, NCCL, TensorFlow, ...) on each machine.
💯
The only exception is a different number of GPUs per machine, since this can be controlled by ``horovodrun -H``.
No risk of improper interactions with batch size for example?
I do not quite get your question.
The batch size specified via the CLI is treated as the batch size for each worker, not for the machine or the complete system; therefore we do learning-rate rescaling.
If all GPUs are equal, I do not see any problems, since all get the same batch size.
If you changed the code, it would be possible to set different batch sizes on each GPU (e.g. for different memory sizes or load balancing). This would open the door to load-balancing problems, which you do not want to support.
If all GPUs are equal, I do not see any problems, since all get the same batch size.
So the batch size applies equally to all GPUs of one machine? Sorry, but the few horovodrun examples make it hard to grasp the meaning of the commands.
Horovod by itself does nothing with the batch size, and therefore neither does horovodrun.
The batch size used by each Horovod process, and so by every GPU on every machine, is the batch size you specify as usual with DeepSpeech.py --train_batch_size, and it is only used when creating the dataset:
train_set = create_dataset(FLAGS.train_files.split(','),
batch_size=FLAGS.train_batch_size,
epochs=FLAGS.epochs,
augmentations=Config.augmentations,
cache_path=FLAGS.feature_cache,
train_phase=True,
exception_box=exception_box,
process_ahead=Config.num_devices * FLAGS.train_batch_size * 2,
reverse=FLAGS.reverse_train,
limit=FLAGS.limit_train,
buffering=FLAGS.read_buffer,
split_dataset=split_dataset)
https://github.com/tud-zih-tools/DeepSpeech/blob/329bf876069720cf05b4e4700e6d0dde104b6bac/training/deepspeech_training/train.py#L423 (Is it possible to link the code here directly?)
So, your effective batch size for training, to which the optimizer is applied, is FLAGS.train_batch_size * NUM_PROC, where NUM_PROC is the number of Horovod processes used, which equals the number of GPUs in your whole setup (in Horovod terminology, as in HPC/MPI terminology: horovod.size()).
To prevent network convergence problems caused by this bigger effective batch size, we scale the learning rate, as recommended by the Horovod devs, to FLAGS.learning_rate * NUM_PROC:
https://github.com/tud-zih-tools/DeepSpeech/blob/329bf876069720cf05b4e4700e6d0dde104b6bac/training/deepspeech_training/train.py#L487
In theory Horovod has no problem if you apply different batch sizes to different GPUs. In practice, you want to make sure every process finishes its batch at about the same time (load balancing). If one process is much too late, Horovod's error handling will take action.
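A minimal sketch of the scaling scheme described above, using Horovod's public TF1 API. The constants stand in for DeepSpeech's FLAGS values and are assumptions; this is not the PR's exact code.

import horovod.tensorflow as hvd
import tensorflow.compat.v1 as tf

hvd.init()

per_worker_batch_size = 32    # stands in for --train_batch_size (assumed value)
base_learning_rate = 0.0001   # stands in for --learning_rate (assumed value)

# Every worker feeds per_worker_batch_size samples per step, so the
# effective batch size seen by the optimizer grows with hvd.size().
effective_batch_size = per_worker_batch_size * hvd.size()

# Scale the learning rate by the same factor, as recommended by Horovod.
scaled_lr = base_learning_rate * hvd.size()

optimizer = tf.train.AdamOptimizer(learning_rate=scaled_lr)
# DistributedOptimizer averages gradients across all workers via allreduce.
optimizer = hvd.DistributedOptimizer(optimizer)

print('workers:', hvd.size(), 'effective batch size:', effective_batch_size)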
We know you only have sparse time, but what is the current state of reviewing this PR? Can we support it, e.g. by providing further information and explanations? Are there preconditions for you, like TF2 integration? We already looked at the problem in #3485 because @lissyx mentioned it here, but we don't have any clue how to fix it. Horovod does not change performance on a single worker (single GPU). You could just blow up your effective batch size to its previous value by using multiple Horovod workers (and so GPUs), but that's not what you want in terms of efficiency.
Well, I asked about TF2 specifically because I know it will take me some time to properly test this, and I wanted to avoid you wasting your time on something that would require a complete rewrite. Unfortunately, I have not yet been able to dedicate time, but if the planets get aligned properly tonight, this might change quickly.
Just say the coordinates and I'll kick the planets for you. :)
I think if this is passing training tests and doesn't break compatibility with loading existing checkpoints, then it's OK to land as-is. The TF2 path is looking like an almost complete training code rewrite anyway, so I don't think it makes sense to block this PR on that.
Overall LGTM. The only question I have is about checkpoint compatibility. It doesn't look like it's broken by this, but I haven't tested. If it is broken, then we should document it at least in the --horovod flag doc, similar to --train_cudnn.
What was the result of your CI tests? I checked checkpoint compatibility: I ran 2 nodes with 2 GPUs each using Horovod and then successfully used these checkpoints on 1 GPU on one node without Horovod.
Good for me.
Does that mean "yes" to reuben's question?
None yet, but I just triggered a TC run (maybe one of the last, see #3317)
It ain't breaking badly, let's merge.
As already mentioned on Discourse, we implemented distributed training using Horovod.
We tried to keep the changes as minimal as possible. It is still possible to run your undistributed code version. However, we also noticed a slight improvement by using Horovod on one machine.
We copied your train function to train_with_horovod, but I can also provide a merged version using some if FLAGS.horovod checks. That would reduce code duplication; the separate function is just to improve readability at the moment.
While our code scales well on locally stored datasets, we experienced a lack of performance on datasets stored on our globally distributed filesystem (BeeGFS). This could be caused by tf.data.Dataset.shard. We will investigate this.
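For reference, a minimal sketch of the sharding mechanism mentioned above, on a generic tf.data pipeline rather than the PR's actual input pipeline: each Horovod worker keeps a disjoint slice of the dataset via tf.data.Dataset.shard, which means every worker still walks the full input stream and discards most of it, one possible explanation for the filesystem bottleneck.

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Stand-in for the real feature pipeline; in DeepSpeech the dataset would
# come from create_dataset(...) instead.
dataset = tf.data.Dataset.range(1000)

# Each worker keeps only every hvd.size()-th element, offset by its rank,
# so the workers see disjoint slices of the data.
dataset = dataset.shard(num_shards=hvd.size(), index=hvd.rank())
dataset = dataset.batch(32)  # per-worker batch size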