>>> carlfm01
[May 20, 2019, 9:42pm]
Hello, this will be a quick guide to deploying the NVIDIA Docker container and taking advantage of the NVIDIA Tensor Cores without changing any code.
Before deploying the optimized container, you should read this NVIDIA Developer Blog post (18 Mar 2019):
### Automatic Mixed Precision for NVIDIA Tensor Core Architecture in TensorFlow
NVIDIA's Automatic Mixed Precision for TensorFlow automates mixed
precision math in TensorFlow, training models up to 3x faster with
minimal code changes.
The requirements are almost the same as for a normal DeepSpeech training
deployment; we need three extra things:
Requirements:
1. A GPU that contains Tensor Cores; you should check whether your GPU
model has them.
2. The NVIDIA NGC TensorFlow 19.04
container
3. Docker; installation depends on your platform, so it is not covered
in this guide.
Installing the container:
Before installing the container, you only need to clone the DeepSpeech
repo and remove the tensorflow entry from requirements.txt.
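The tensorflow entry can be removed by hand, or with a one-liner; here is a sketch, assuming the pin starts with `tensorflow` at the beginning of a line (the sample file contents below are made up for illustration):

```shell
# Stand-in for DeepSpeech's requirements.txt, for demonstration only:
printf 'numpy\ntensorflow==1.13.1\npandas\n' > requirements.txt
# Drop any tensorflow pin; -i.bak keeps a backup copy of the file:
sed -i.bak '/^tensorflow/d' requirements.txt
cat requirements.txt
```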
Next, we pull the container by running:
> docker pull nvcr.io/nvidia/tensorflow:19.04-py3
Now we should set up our workspace inside the downloaded container. To
start the container, run:
sudo nvidia-docker run -it --rm -v $HOME:$HOME nvcr.io/nvidia/tensorflow:19.04-py3
Notice how I mounted my home directory; this lets me use my existing
paths inside the container. If you don't want to mount your home, you
can set it to any other directory:
> sudo nvidia-docker run -it --rm -v /user/home/deepspeech:/deepspeech nvcr.io/nvidia/tensorflow:19.04-py3
In my case I was using a cloud instance with an extra mounted disk; if
you need to expose another path to the container, as I did, just add an
extra -v flag.
Now we need to install the requirements.
Again, make sure you removed the TensorFlow dependency from
requirements.txt; the container already ships an optimized version of
TensorFlow that is fully compatible with DeepSpeech.
Inside the container, run from your cloned DeepSpeech repo:
> pip3 install -r requirements.txt
We need the decoder too:
> pip3 install $(python3 util/taskcluster.py --decoder)
You will probably hit an issue related to pandas and Python 3.5.
To fix it, run:
> python3 -m pip install --upgrade pandas
Notice that we don't need to use a virtual environment.
Finally, we enable automatic mixed precision with:
> export TF_ENABLE_AUTO_MIXED_PRECISION=1
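Since the variable must be inherited by the training process, a quick sanity check (just a sketch) is to echo it from a child shell:

```shell
# Enable TensorFlow automatic mixed precision for subsequent runs:
export TF_ENABLE_AUTO_MIXED_PRECISION=1
# Sanity check: child processes (like the training script) must see it:
sh -c 'echo "TF_ENABLE_AUTO_MIXED_PRECISION=$TF_ENABLE_AUTO_MIXED_PRECISION"'
# prints: TF_ENABLE_AUTO_MIXED_PRECISION=1
```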
To check whether your GPU is using Tensor Cores, you can prefix your
command with nvprof, something like:
> nvprof python DeepSpeech.py --the rest of your params
You will then get a log of the kernels used; to know whether the Tensor
Cores were used, search for: s884gemm
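Scanning the nvprof output by hand is tedious; a sketch of filtering it with grep follows. The log file name and its contents are fabricated here for illustration, and the `s884gemm` substring is the Tensor Core GEMM kernel marker mentioned above:

```shell
# Fake excerpt of an nvprof kernel summary, standing in for the real log:
cat > nvprof.log <<'EOF'
 12.5%  1.2s  volta_fp16_s884gemm_fp16_128x128_ldg8_f2f_nn
  8.1%  0.8s  volta_sgemm_128x64_nn
EOF
# Count kernels whose names contain the Tensor Core marker:
grep -c s884gemm nvprof.log
# prints: 1
```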
My result from a small test with 300 h of audio on a single V100 GPU:

| Type | Time | WER | Epochs |
|------|------|-----|--------|
| Normal training (fp32) | 2:27:54 | 0.084087 | 10 |
| Auto mixed precision training (fp16) | 1:39:03 | 0.091663 | 10 |
Unfortunately, I couldn't run a larger test.
This is a potential PR; please feel free to suggest any changes and
share your insights if you use the container.
[This is an archived TTS discussion thread from discourse.mozilla.org/t/guide-using-automatic-mixed-precision-for-nvidia-tensor-cores]