Slow Training GPU RTX 2080 #700
Try without Anaconda and see if the training speed improves. Although this repo works with Anaconda, we don't have enough developer interest to support it.
Will reopen the issue if slow training speed is confirmed with a standard Python installation.
Do you have cuDNN installed?
I have version 10.0 of cuDNN installed, yes.
cuDNN 8.1 for CUDA v10.2.
Use cuDNN 7.5.
Same result with r=2, ~42 steps.
@MGSousa did it work out for you?
@Rainer2465 I changed outputs/step to four (r=4, ~91) and got somewhat decent results for now.
@MGSousa you're Brazilian, right? Can we talk a little more about it?
Nvidia provides a benchmark for Tacotron 2. Does it make sense to run it and compare the results with what we get with Real-Time-Voice-Cloning during training, or would that be comparing apples with oranges? The goal is to check whether the poor performance comes from the drivers/CUDA, from the code, or from Anaconda.
Experiment / expected behaviour:
The command I would like to use:
The repo includes a profiler which is used for encoder training. You can try something similar to find the bottleneck for synthesizer training.
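The exact interface of that profiler may differ from this, but the idea is a small timer that attributes wall-clock time to each phase of a training step. A minimal hand-rolled sketch (not the repo's actual profiler; the phase names are just placeholders):

```python
import time
import torch

class StepProfiler:
    """Tiny stand-in for the repo's profiler: accumulates wall-clock time per named phase."""

    def __init__(self):
        self.totals = {}
        self._last = None

    def tick(self, name=None):
        # Synchronize so pending CUDA kernels are charged to the phase that launched them.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        now = time.perf_counter()
        if name is not None and self._last is not None:
            self.totals[name] = self.totals.get(name, 0.0) + (now - self._last)
        self._last = now

    def report(self, steps):
        for name, total in self.totals.items():
            print(f"{name}: {1000 * total / steps:.1f} ms/step")
```

In a training loop you would call tick() once before the step, then tick("data"), tick("forward"), tick("backward") and tick("optimizer") after each phase, and report(num_steps) at the end; the phase with the largest ms/step is the bottleneck.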
Originally posted by @Ca-ressemble-a-du-fake in #711 (comment)

With the whole repo in an unmodified state (including synthesizer hparams), I get 0.72-0.74 steps/s. Switching to a Tacotron 2 implementation on TensorFlow 1.x, training was faster. Because I wanted to understand whether this difference was caused by Taco1 vs. Taco2, or PyTorch vs. TensorFlow, I also ran a training experiment with a PyTorch Taco2 implementation. Results below.
We can conclude that PyTorch trains more slowly than TensorFlow 1.x, and that some other unknown issue is making your training even slower.
This is a nice benchmark. Which command did you use to measure GPU utilization? And which dataset did you use? Should an RTX 3070 be faster than a GTX 1660S on the same dataset, or is the rate independent of the dataset (mine was measured on a French dataset)? So can we say that around 1 step/s is OK for r = 2 and batch_size = 12? Should the results be the same when training vanilla Tacotron 2 on the same dataset? I only know basic Python (e.g. basic string manipulation, basic external program calls, ...). How do you use the profiler you mentioned earlier? I could not find any reference to it.
I made a branch for this: https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/700_slow_training

Example training output with profiler:
Thanks a lot for these precise instructions! The forward and backward passes take longer than what you showed in your comment two days ago. Otherwise, if your comment just above refers to a GTX 1660S, then my profiler results look good, since an RTX 3070 should be faster.

{| Epoch: 1/61 (30/748) | Loss: 0.3490 | 1.2 steps/s | Step: 115k | }

Average execution time over 10 steps: (sorry, I did not manage to format the table as you did!)
I thought your training rate was 0.52-0.54 steps/s; did something change? Your profiler results look reasonable to me.
It was indeed (for r = 2), but then I applied gradual training: r = 16 until 20k steps, then 8 until 40k, then 7 until 80k, then 5 until 160k, and finally 2 until the end. Batch size is kept constant at 12. Does it still look reasonable to you even though r has increased? In 10k steps, r will switch back to 2, so I'll post the profiler results then for comparison. GPU utilization shows roughly 40% (range 20-60%) and memory usage 5.7/7.8 GB (mainly used by python3 with 5.3 GB), so there may be room for improvement. Maybe by increasing the batch size?
Now that r has decreased to 2 with a batch size of 12, the training rate is back to 0.52-0.54 steps/s. The backward pass takes around 1 s.

{| Epoch: 2/208 (272/748) | Loss: 0.3196 | 0.56 steps/s | Step: 166k | }

Average execution time over 10 steps:
Blocking, waiting for batch (threaded) (10/10): mean: 0ms std: 0ms

GPU utilization is still around 40% (range 20-60%) and memory usage has increased to 6.7 GB (for training only).
What is the GPU performance state reported by nvidia-smi?
If the GPU is running at P2, it doesn't seem like the GPU is the bottleneck. I am running out of ideas.
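For reference, the performance state and utilization that nvidia-smi reports can also be logged from Python with the NVML bindings. A minimal sketch, assuming the pynvml package is installed (it is not part of this repo):

```python
# Query the first GPU's performance state, utilization and memory via NVML.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
pstate = pynvml.nvmlDeviceGetPerformanceState(handle)    # 0 = P0 (max performance), higher = lower clocks
util = pynvml.nvmlDeviceGetUtilizationRates(handle)      # percentages over the last sampling period
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)             # bytes
print(f"P{pstate} | GPU {util.gpu}% | VRAM {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
pynvml.nvmlShutdown()
```

Calling this periodically from a separate terminal while training gives the same numbers as watching nvidia-smi.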
The Nvidia driver is 470 (I tried several; this one is the proprietary one). CUDA is now 11.5 (I also tried 11.3).
I tried to run the PyTorch bottleneck profiler as advised on the PyTorch forum, but I could not modify the code correctly.
I also tried to increase batch_size to 20 and quickly got a CUDA out of memory error.
Increasing batch_size to 16 did not cause the CUDA out of memory error, but it did not noticeably increase GPU utilization either. VRAM usage increased to 6.9 GB.
**Training schedule update**

First, update the training schedule in synthesizer/hparams.py so it only runs 20 steps from scratch.
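For illustration only: assuming the schedule is a list of (r, learning rate, stop step, batch size) tuples, as the r/step/batch-size discussion above suggests (the learning-rate field and the exact values are my assumptions), a 20-step run could look like this; check the real defaults in synthesizer/hparams.py before editing.

```python
# synthesizer/hparams.py -- illustrative sketch, not the repo's actual defaults.
# Assumed tuple layout: (reduction factor r, learning rate, train until this step, batch size).
tts_schedule = [
    (7, 1e-3, 20, 12),   # stop at step 20 so the profiling run stays short
]
```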
**Dataloader update**

Next, you will need to change this line in synthesizer/train.py (line 150 at commit 7432046):
**Command**

Train a new model from scratch. You can call it anything; I called mine "test".
**Output**

Click here to display profiler output
Thank you @blue-fish for your guide! Unfortunately, the computer runs out of RAM after completing the second stage of the profiling (the one that involves autograd). Neither the GPU nor the CPU was more loaded than usual, but after the second stage completed, RAM (and then swap) usage skyrocketed and the computer became unusable. I tried with 20, 10, and even 5 steps; all failed because the 12 GB of RAM were depleted. When the computer becomes usable again, I will try to profile only 2 steps.
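If the full torch.utils.bottleneck run exhausts RAM, one alternative (my suggestion, not something from this repo) is to use torch.profiler directly and record only a couple of steps. A self-contained sketch with a toy model standing in for the real training step:

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, schedule

# Toy model and step so the sketch runs on its own; the repo's Tacotron step would go here.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(80, 80).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_one_step():
    x = torch.randn(12, 80, device=device)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Wait 1 step, warm up 1 step, record only 2 steps: keeps the profiler's memory use small.
activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])
with profile(activities=activities, schedule=schedule(wait=1, warmup=1, active=2, repeat=1)) as prof:
    for _ in range(5):
        train_one_step()
        prof.step()   # advance the profiler schedule after every step

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))
```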
So I could only profile the training for a single step. The main differences I see compared with your results are:
But the profiling covers only one step. I don't know how you manage to format comments so well; I was not successful in doing so. The detailed results are barely legible, so I cannot post them.

**Environment summary**

PyTorch 1.10.0+cu113 DEBUG compiled w/ CUDA 11.3
**cProfile output**
What should I look for / watch out for in the results?
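One way to start (standard library only, nothing specific to this repo) is to load the dump with pstats and sort it, first by cumulative time to see which high-level calls dominate a step, then by self time to see where the work is actually done. A sketch, assuming the profile was saved to a hypothetical profile.out:

```python
import pstats

# "profile.out" is a placeholder; use whatever path the cProfile run actually wrote.
stats = pstats.Stats("profile.out")
stats.strip_dirs().sort_stats("cumulative").print_stats(20)  # top 20 callers by cumulative time
stats.sort_stats("tottime").print_stats(20)                  # top 20 functions by time spent in themselves
```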
Does your motherboard support PCI Express 3.0? If it only supports PCIe 2.0, then that is likely the bottleneck. Some motherboards are manufactured with slots that fit an x16 GPU but don't include all 16 PCIe lanes for communication.
@MGSousa @StElysse @Ca-ressemble-a-du-fake For now, I am going to assume that "slow training" is caused by hardware being old or having limitations that prevent the GPU from operating at its full potential. If this is not the case, please provide hardware details including:
@blue-fish thanks for your support. You are right according to what I could check, so old hardware may indeed be the bottleneck for the GPU, although CPU usage stays around 12% while training and RAM usage is around 65%.
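As a cross-check (my suggestion, not a tool mentioned above), the PCIe link the GPU actually negotiates can be read with the NVML Python bindings. A minimal sketch, assuming pynvml is installed:

```python
# Report the current vs. maximum PCIe link generation/width of the first GPU.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
cur_gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
cur_width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
max_gen = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(handle)
max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)
print(f"PCIe link: gen {cur_gen} x{cur_width} (GPU supports up to gen {max_gen} x{max_width})")
pynvml.nvmlShutdown()
```

If the current generation or width is well below what the GPU supports, the motherboard or slot is likely limiting transfer bandwidth.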
Just a follow-up on this topic. I tried to train the model with the SIWIS dataset (French) on Google Colab. While training with a batch size of 32 and r = 7, I get roughly 3 times the training speed of my old setup (which is old except for the GPU). I am quite surprised it is not way faster on such a high-end GPU. Is there a way to know whether the code is doing 16-bit or 32-bit floating point operations? I read somewhere that one of the modes brings a performance boost, but for now that is way beyond my current understanding of the topic.
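On the fp16/fp32 question: I don't believe the synthesizer enables mixed precision by default (that is an assumption on my part), but the usual way to try it in PyTorch is torch.cuda.amp. A self-contained sketch of the pattern, with a toy linear model standing in for the repo's Tacotron model:

```python
import torch
from torch import nn

# Toy stand-ins so the sketch runs on its own; the real model, dataloader and loss go here.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(80, 80).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # loss scaling so fp16 gradients don't underflow

for _ in range(10):                                   # a few dummy steps
    x = torch.randn(12, 80, device=device)            # fake batch
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = loss_fn(model(x), x)                   # forward pass runs in fp16 where it is safe
    scaler.scale(loss).backward()                     # backward on the scaled loss
    scaler.step(optimizer)                            # unscales gradients, then steps the optimizer
    scaler.update()                                   # adjust the scale factor for the next step
```

Whether this actually speeds up this particular model would have to be measured; on recent GPUs the gain usually comes from tensor cores handling the fp16 matrix multiplies.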
@MGSousa @Rainer2465 do you have a trained model today? Can we talk?
Hi,

I am training a dataset in Portuguese, but the process is very slow using CUDA with the default hparams. In this test I'm using batch_size = 8.

Expected behaviour: the training should run 2~3 times faster.

Has anyone had this problem on Windows with Anaconda?