Slow Training GPU RTX 2080 #700

Closed
MGSousa opened this issue Mar 11, 2021 · 38 comments

@MGSousa

MGSousa commented Mar 11, 2021

Hi,
I am training a dataset in Portuguese but the process is very slow using CUDA with default hparams.

In this test I'm using batch_size = 8.

img1

Expected:
Training should run 2~3 times faster.

Has anyone else had this problem on Windows with Anaconda?

@ghost

ghost commented Mar 24, 2021

Try without Anaconda and see if the training speed improves.

Although this repo works with Anaconda, we don't have enough developer interest to support it.

@ghost ghost closed this as completed Mar 24, 2021
@ghost

ghost commented Mar 24, 2021

I will reopen this issue if the slow training speed is confirmed with a normal Python installation.

@ghost

ghost commented Apr 1, 2021

@MGSousa @StElysse Thanks for your observations in #711, which confirm the issue. Reopening this, though I have no idea where to start since I don't have Windows to test on.

@ghost ghost reopened this Apr 1, 2021
@ghost

ghost commented Apr 1, 2021

Do you have cuDNN installed?

@ghost ghost added the bug Something isn't working label Apr 1, 2021
@StElysse

StElysse commented Apr 1, 2021

I have version 10.0 of cuDNN installed, yes.

@MGSousa
Author

MGSousa commented Apr 1, 2021

cuDNN 8.1 for CUDA v10.2.

@justinjohn0306

> cuDNN 8.1 for CUDA v10.2.

Use cuDNN 7.5.

@MGSousa
Author

MGSousa commented Apr 5, 2021

> cuDNN 8.1 for CUDA v10.2.

> Use cuDNN 7.5.

Same result with r=2, ~42 steps.

@ghost ghost mentioned this issue May 30, 2021
@Rainer2465

@MGSousa did it work out for you?

@MGSousa
Author

MGSousa commented Jul 21, 2021

@Rainer2465 I changed outputs/step to four (r=4 ~91) and got somewhat decent results for now.
But with r=2 the results remain the same (42 ~ 43 steps/s).

@Rainer2465

@MGSousa you're Brazilian, right? Can we talk a little more about it?
If you're Brazilian, maybe we can find a way to talk about this? I'd like to know more about the voice dataset used for training.

@Ca-ressemble-a-du-fake
Contributor

Ca-ressemble-a-du-fake commented Nov 5, 2021

Nvidia provides a benchmark for Tacotron 2. Does it make sense to run it and compare the results with what we get with Real Time Voice Cloning during training, or would that be comparing apples with oranges? The goal is to check whether the poor performance comes from the drivers / CUDA, from the code, or from Anaconda.

@Ca-ressemble-a-du-fake
Contributor

Ca-ressemble-a-du-fake commented Nov 5, 2021

Experiment
So far (without running the benchmark) I have noticed that the GPU power consumption is very low during training. On my RTX 3070, the power draw (nvidia-smi -q -d POWER | grep Draw) shows around 60W (compared to 130W while mining or 100W while playing a game). So I doubt the GPU is being used to its full capability.

Expected behaviour
If the power limit is set to 130W, I would expect the power draw to reach this limit while training, since many users report 100% utilization of their GPU during training. The behaviour should be similar to mining.

@Ca-ressemble-a-du-fake
Contributor

The command nvidia-smi -q -g 0 -d UTILIZATION -l 1 shows GPU usage varying between 20 and 50% during training. So the GPU is far from fully utilized when it should be, which suggests there is a large performance margin to recover somewhere.

I would like to use nvtop to get a graph, but it is not compatible with the latest NVIDIA drivers. If I force the install, there is a driver / library mismatch and the training runs even slower (less than 0.1 steps/s).

@ghost

ghost commented Nov 5, 2021

The repo includes a profiler which is used for encoder training. You can try something similar to find the bottleneck for synthesizer training.
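
For illustration only, here is a minimal timing helper in the same spirit. This is a sketch, not the repo's actual Profiler class or API; the section names only mirror the output shown further down in this thread.

    import time
    from collections import defaultdict

    import numpy as np


    class SimpleProfiler:
        """Times named sections of a training step and prints averages periodically."""

        def __init__(self, summarize_every=10):
            self.summarize_every = summarize_every
            self.samples = defaultdict(list)
            self.last = time.perf_counter()
            self.steps = 0

        def tick(self, name):
            # Record the time elapsed since the previous tick under `name`
            now = time.perf_counter()
            self.samples[name].append(now - self.last)
            self.last = now

        def end_step(self):
            self.steps += 1
            if self.steps % self.summarize_every == 0:
                print("Average execution time over %d steps:" % self.summarize_every)
                for name, values in self.samples.items():
                    mean_ms = 1000 * np.mean(values)
                    std_ms = 1000 * np.std(values)
                    print("  %-45s mean: %5.0fms   std: %5.0fms" % (name + ":", mean_ms, std_ms))
                self.samples.clear()

    # Hypothetical placement in the synthesizer training loop:
    #   profiler = SimpleProfiler()
    #   for texts, mels, embeds, idx in data_loader:
    #       profiler.tick("Blocking, waiting for batch (threaded)")
    #       texts, mels, embeds = texts.cuda(), mels.cuda(), embeds.cuda()
    #       profiler.tick("Data to cuda")
    #       # ...forward pass, loss, backward pass, optimizer step, each followed by a tick...
    #       profiler.end_step()
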

@ghost

ghost commented Nov 5, 2021

> Same problem here too, on Ubuntu 20.04 without Anaconda and on RTX 3070 under Python 3.8.8. Around 0.52-0.54 steps/s.
>
> Originally posted by @Ca-ressemble-a-du-fake in #711 (comment)

With the whole repo in an unmodified state (including synthesizer hparams), I get 0.72-0.74 steps/s.

Switching to Tacotron2 on Tensorflow 1.x (5425557, installed with these instructions), I get 1.10-1.14 steps/s. Tensorflow training speed is variable, with the training getting faster as the model gets better. This experiment was done with the #538 model as a starting point, finetuning on a large single-speaker dataset.

Because I wanted to understand whether this difference was caused by using Taco1 vs Taco2, or PT vs TF, I also ran a training experiment with a PyTorch Taco2 implementation. Results below.

Model        PT/TF             Model Parameters   Training Speed
Tacotron 1   PyTorch 1.3.1     30.87M             0.72-0.74 steps/s
Tacotron 2   PyTorch 1.3.1     28.44M             0.65-0.66 steps/s
Tacotron 2   Tensorflow 1.15   28.44M             1.10-1.14 steps/s

We can conclude that PyTorch trains more slowly than Tensorflow 1.x, and that some other unknown issue is making your training even slower than that.

* OS: Ubuntu 20.04
    * NVIDIA Driver Version: 460.91.03
    * CUDA Version: 11.2 

* GPU: GTX 1660S (desktop)
    * Performance state: P2
    * Power draw: 90W (range: 70-110W)
    * Utilization: 80% (range: 65-95%)
    * VRAM: 5.5/6.0 GB

* Python 3.7.9 (with Anaconda)
* Same training speed obtained for Python 3.8.10 (without Anaconda) with torch==1.7.1

* Dataset storage: HDD

@Ca-ressemble-a-du-fake
Contributor

This is a nice benchmark. Which command did you use for GPU utilization? And which dataset did you use? Should an RTX 3070 be faster than a GTX 1660S on the same dataset, or is the rate independent of the dataset (mine was measured on a French dataset)?

So can we say that around 1 step/s should be OK for r = 2 and batch_size = 12? Should the results be the same using vanilla Tacotron 2 training on the same dataset?

I only know basic Python (e.g. basic string manipulation, basic external program calls, ...). How do you use the profiler you mentioned earlier? I did not find any reference to it in encoder_train.py; should I add a call in synthesizer_train.py?

@ghost

ghost commented Nov 6, 2021

> Which command did you use for GPU utilization?

watch -n 0.5 nvidia-smi

> How do you use the profiler you talked about earlier?

I made a branch for this: https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/700_slow_training
If you don't want to get the branch, make these modifications: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/989a3e43a834100ae76ac6ff7edd3e5d532ced80

Example training output with profiler:

{| Epoch: 1/8 (20/2564) | Loss: 6.107 | 0.75 steps/s | Step: 0k | }
Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:    0ms   std:    0ms
  Data to cuda (10/10):                            mean:    1ms   std:    0ms
  Forward pass (10/10):                            mean:  466ms   std:   98ms
  Loss (10/10):                                    mean:   12ms   std:    2ms
  Backward pass (10/10):                           mean:  780ms   std:  153ms
  Parameter update (10/10):                        mean:   16ms   std:    0ms
  Extras (visualizations, saving) (10/10):         mean:    4ms   std:    0ms

@Ca-ressemble-a-du-fake
Contributor

Thanks a lot for these precise instructions! My forward and backward pass times are higher than what you showed in your comment two days ago. On the other hand, if your comment just above refers to a GTX 1660S, then my profiler results look good, since an RTX 3070 should be faster.

{| Epoch: 1/61 (30/748) | Loss: 0.3490 | 1.2 steps/s | Step: 115k | }
Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:    0ms   std:    0ms
  Data to cuda (10/10):                            mean:    1ms   std:    0ms
  Forward pass (10/10):                            mean:  292ms   std:   64ms
  Loss (10/10):                                    mean:    5ms   std:    1ms
  Backward pass (10/10):                           mean:  454ms   std:  101ms
  Parameter update (10/10):                        mean:   28ms   std:    1ms
  Extras (visualizations, saving) (10/10):         mean:    0ms   std:    0ms

(Sorry, I did not succeed in formatting the table as you did!)

@ghost

ghost commented Nov 7, 2021

I thought your training rate was 0.52-0.54 steps/s, did something change? Your profiler results look reasonable to me.

@Ca-ressemble-a-du-fake
Contributor

It was indeed (for r = 2), but then I applied gradual training: r = 16 until 20k steps, then 8 until 40k, then 7 until 80k, then 5 until 160k, and finally 2 until the end. The batch size is kept constant at 12. Does it still look reasonable to you even though r has increased? In 10k steps r will switch back to 2, so I'll post the profiler results then for comparison.

GPU utilization shows roughly 40% (range 20-60%) and memory usage 5.7/7.8 GB (mainly used by python3, with 5.3 GB), so there may be room for improvement. Maybe by increasing the batch size?
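
For reference, the gradual schedule described above would look roughly like this in the tts_schedule setting of synthesizer/hparams.py. This is only a sketch: the (r, lr, max_step, batch_size) tuple layout matches the snippet quoted later in this thread, but the learning rates and the final step count here are placeholders, not the values actually used.

    tts_schedule = [(16, 1e-3,    20_000, 12),   # r = 16 until 20k steps
                    (8,  5e-4,    40_000, 12),   # r = 8 until 40k
                    (7,  2e-4,    80_000, 12),   # r = 7 until 80k
                    (5,  1e-4,   160_000, 12),   # r = 5 until 160k
                    (2,  1e-4, 2_000_000, 12)]   # r = 2 until the end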

@Ca-ressemble-a-du-fake
Contributor

Ca-ressemble-a-du-fake commented Nov 7, 2021

Now that r has decreased to 2 with a batch size of 12, the training rate is back to 0.52-0.54 steps/s.

Backward pass takes around 1 s.

{| Epoch: 2/208 (272/748) | Loss: 0.3196 | 0.56 steps/s | Step: 166k | }
Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:    0ms   std:    0ms
  Data to cuda (10/10):                            mean:    1ms   std:    0ms
  Forward pass (10/10):                            mean:  682ms   std:  169ms
  Loss (10/10):                                    mean:   13ms   std:    4ms
  Backward pass (10/10):                           mean: 1060ms   std:  259ms
  Parameter update (10/10):                        mean:   28ms   std:    1ms
  Extras (visualizations, saving) (10/10):         mean:    0ms   std:    0ms

GPU utilization is still around 40% (range 20-60%), and memory usage has increased to 6.7GB (for training alone).

@ghost

ghost commented Nov 7, 2021

What is the GPU performance state reported by nvidia-smi?

@Ca-ressemble-a-du-fake
Contributor

Ca-ressemble-a-du-fake commented Nov 8, 2021

The nvidia-smi command reports performance state P2. GPU utilization dropped a little, sometimes falling as low as 0%, but staying between 20-50% most of the time.

@ghost

ghost commented Nov 8, 2021

If the GPU is running at P2, it doesn't seem to be the bottleneck. I am running out of ideas.

  • Which NVIDIA driver and CUDA version is installed?
  • Is there any possibility of a CPU bottleneck? For example a low-performance or obsolete CPU.

@Ca-ressemble-a-du-fake
Contributor

Ca-ressemble-a-du-fake commented Nov 9, 2021

The Nvidia driver is 470 (I tried several; this one is the proprietary one). CUDA is now 11.5 (I also tried 11.3).
The CPU does not seem to be the bottleneck, since its utilization stays around 50%; it's an old i7 2600.

@Ca-ressemble-a-du-fake
Contributor

I tried to run the PyTorch bottleneck profiler as advised on the PT forum, but could not correctly modify train.py so that the profiler terminates in a given amount of time without messing up the code. I tried manually setting max_step to 254010, because training was at step 254k and I wanted it to train for 10 more steps before exiting, but it made the loss increase, so I reverted my changes.

@Ca-ressemble-a-du-fake
Contributor

I also tried to increase batch_size to 20 and quickly got a CUDA out of memory error that gives the following advice:

RuntimeError: CUDA out of memory. Tried to allocate 210.00 MiB (GPU 0; 7.79 GiB total capacity; 5.01 GiB already allocated; 166.25 MiB free; 5.29 GiB reserved in total by PyTorch) **If reserved memory is >> allocated memory** try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It looks like reserved memory > allocated memory (but not >>), so should I apply their advice?
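
For what it's worth, the allocator option mentioned in that message can be set from Python before CUDA is initialized (or exported as an environment variable in the shell). A minimal sketch, with an arbitrary example value rather than a recommendation:

    import os

    # Must be set before torch initializes CUDA; 128 MiB is just an example value.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch  # imported afterwards so the allocator picks up the setting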

@Ca-ressemble-a-du-fake
Contributor

Increasing batch_size to 16 did not cause the CUDA out of memory error, but it did not noticeably increase GPU utilization either. VRAM usage has increased up to 6.9GB.

@ghost

ghost commented Nov 9, 2021

> I tried to run the PyTorch bottleneck profiler as advised on the PT forum, but could not correctly modify train.py so that the profiler terminates in a given amount of time without messing up the code.

Training schedule update

First, update the training schedule in synthesizer/hparams.py so it only runs 20 steps from scratch.

        ### Tacotron Training
        tts_schedule = [(2,  1e-3,  20,  12)],  # Train only 20 steps for benchmarking

Dataloader update

Next, you will need to change this line so num_workers=0. The profiler doesn't work with multi-worker dataloaders.

num_workers=2 if platform.system() != "Windows" else 0,
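
With that change applied, the argument would simply read (illustrative only):

    num_workers=0,  # single-process loading so torch.utils.bottleneck can profile it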

Command

Train a new model from scratch. You can call it anything, I called mine "test". The --force_restart option prevents the saved checkpoints from stopping training prematurely.

python -m torch.utils.bottleneck synthesizer_train.py --force_restart test datasets_root/SV2TTS/synthesizer/

Output

--------------------------------------------------------------------------------
  Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.7.1 DEBUG compiled w/ CUDA 10.2
Running with Python 3.8 and 

`pip3 list` truncated output:
numpy==1.19.4
torch==1.7.1
torchfile==0.1.0
--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         6085394 function calls (5892177 primitive calls) in 43.148 seconds

   Ordered by: internal time
   List reduced from 7179 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20   15.886    0.794   15.886    0.794 {method 'run_backward' of 'torch._C._EngineBase' objects}
     2992    8.177    0.003    8.177    0.003 {method 'read' of '_io.BufferedReader' objects}
      552    3.517    0.006    3.518    0.006 {built-in method io.open}
     5250    1.717    0.000    1.717    0.000 {method 'to' of 'torch._C._TensorBase' objects}
    31000    1.018    0.000    1.018    0.000 {built-in method addmm}
    12400    0.934    0.000    0.934    0.000 {built-in method lstm_cell}
     6200    0.803    0.000    9.254    0.001 synthesizer/models/tacotron.py:270(forward)
      480    0.709    0.001    0.710    0.001 {built-in method numpy.fromfile}
       40    0.626    0.016    0.626    0.016 {built-in method gru}
    19020    0.600    0.000    0.600    0.000 {method 'matmul' of 'torch._C._TensorBase' objects}
    12400    0.494    0.000    1.292    0.000 synthesizer/models/tacotron.py:265(zoneout)
     6200    0.491    0.000    2.612    0.000 synthesizer/models/tacotron.py:221(forward)
 94620/20    0.463    0.000   10.396    0.520 venv/lib/python3.8/site-packages/torch/nn/modules/module.py:715(_call_impl)
     6480    0.448    0.000    0.448    0.000 {built-in method conv1d}
    18720    0.419    0.000    0.419    0.000 {built-in method cat}


--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
              TBackward        18.46%     556.886ms        18.46%     556.886ms     556.886ms       0.000us           NaN       0.000us       0.000us             1  
              TBackward         9.39%     283.453ms         9.39%     283.453ms     283.453ms       0.000us           NaN       0.000us       0.000us             1  
                aten::t         9.39%     283.451ms         9.39%     283.451ms     283.451ms       0.000us           NaN       0.000us       0.000us             1  
           aten::conv1d         9.36%     282.482ms         9.36%     282.482ms     282.482ms       0.000us           NaN       0.000us       0.000us             1  
      aten::convolution         9.36%     282.480ms         9.36%     282.480ms     282.480ms       0.000us           NaN       0.000us       0.000us             1  
     aten::_convolution         9.36%     282.478ms         9.36%     282.478ms     282.478ms       0.000us           NaN       0.000us       0.000us             1  
             aten::add_         9.36%     282.389ms         9.36%     282.389ms     282.389ms       0.000us           NaN       0.000us       0.000us             1  
              TBackward         4.73%     142.740ms         4.73%     142.740ms     142.740ms       0.000us           NaN       0.000us       0.000us             1  
                aten::t         4.73%     142.737ms         4.73%     142.737ms     142.737ms       0.000us           NaN       0.000us       0.000us             1  
        aten::transpose         4.65%     140.276ms         4.65%     140.276ms     140.276ms       0.000us           NaN       0.000us       0.000us             1  
           BmmBackward0         2.35%      70.930ms         2.35%      70.930ms      70.930ms       0.000us           NaN       0.000us       0.000us             1  
              aten::bmm         2.35%      70.904ms         2.35%      70.904ms      70.904ms       0.000us           NaN       0.000us       0.000us             1  
               aten::to         2.33%      70.235ms         2.33%      70.235ms      70.235ms       0.000us           NaN       0.000us       0.000us             1  
    aten::empty_strided         2.33%      70.190ms         2.33%      70.190ms      70.190ms       0.000us           NaN       0.000us       0.000us             1  
              aten::gru         1.84%      55.546ms         1.84%      55.546ms      55.546ms       0.000us           NaN       0.000us       0.000us             1  
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 3.017s
CUDA time total: 0.000us

--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

	Because the autograd profiler uses the CUDA event API,
	the CUDA time column reports approximately max(cuda_time, cpu_time).
	Please ignore this output if your code does not use CUDA.

----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
               AddmmBackward        21.06%     553.945ms        21.06%     553.945ms     553.945ms     553.600ms        27.44%     553.600ms     553.600ms             1  
                    aten::mm        21.06%     553.915ms        21.06%     553.915ms     553.915ms     553.572ms        27.44%     553.572ms     553.572ms             1  
          UnsqueezeBackward0        10.92%     287.145ms        10.92%     287.145ms     287.145ms     286.460ms        14.20%     286.460ms     286.460ms             1  
               aten::squeeze        10.92%     287.135ms        10.92%     287.135ms     287.135ms       0.000us         0.00%       0.000us       0.000us             1  
                 aten::addmm         6.06%     159.393ms         6.06%     159.393ms     159.393ms     159.392ms         7.90%     159.392ms     159.392ms             1  
                MulBackward0         5.76%     151.387ms         5.76%     151.387ms     151.387ms     150.768ms         7.47%     150.768ms     150.768ms             1  
               aten::dropout         3.20%      84.220ms         3.20%      84.220ms      84.220ms      84.218ms         4.17%      84.218ms      84.218ms             1  
        aten::_fused_dropout         3.20%      84.213ms         3.20%      84.213ms      84.213ms      84.214ms         4.17%      84.214ms      84.214ms             1  
                aten::stride         3.20%      84.148ms         3.20%      84.148ms      84.148ms       0.000us         0.00%       0.000us       0.000us             1  
    CudnnConvolutionBackward         2.90%      76.257ms         2.90%      76.257ms      76.257ms      70.970ms         3.52%      70.970ms      70.970ms             1  
                AddBackward0         2.59%      68.060ms         2.59%      68.060ms      68.060ms      68.057ms         3.37%      68.057ms      68.057ms             1  
                 CatBackward         2.34%      61.523ms         2.34%      61.523ms      61.523ms       1.808ms         0.09%       1.808ms       1.808ms             1  
                 CatBackward         2.31%      60.804ms         2.31%      60.804ms      60.804ms       1.012ms         0.05%       1.012ms       1.012ms             1  
                 CatBackward         2.31%      60.761ms         2.31%      60.761ms      60.761ms       1.804ms         0.09%       1.804ms       1.804ms             1  
                 CatBackward         2.19%      57.595ms         2.19%      57.595ms      57.595ms       1.712ms         0.08%       1.712ms       1.712ms             1  
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.631s
CUDA time total: 2.018s

@Ca-ressemble-a-du-fake
Contributor

Thank you @blue-fish for your guide! Unfortunately the computer runs out of RAM after completing the second stage of the profiling (the one that involves autograd). Neither the GPU nor the CPU was more loaded than usual, but after the second stage completed, RAM (and then swap) usage skyrocketed and the computer became unusable.

I tried with 20, 10, and even 5 steps; all failed because the 12 GB of RAM were depleted. When the computer becomes usable again I will try to profile only 2 steps.

@Ca-ressemble-a-du-fake
Contributor

Ca-ressemble-a-du-fake commented Nov 10, 2021

So I could only profile the training for a single step. The main differences I see with your results are:

  • a different CUDA version (11.3, although 11.5 is actually installed, vs. 10.2)
  • CUDA total time > CPU total time (in your case it is the opposite)

But the profile covers only one step.

I don't know how you manage to format comments so well; I was not successful in doing so. The detailed results are barely legible, so I cannot post them.


Environment Summary

PyTorch 1.10.0+cu113 DEBUG compiled w/ CUDA 11.3
Running with Python 3.8 and

pip3 list truncated output:
numpy==1.19.4
torch==1.10.0+cu113
torchaudio==0.10.0+cu113
torchfile==0.1.0
torchvision==0.11.1+cu113


cProfile output

     1846962 function calls (1806034 primitive calls) in 9.077 seconds

What should I look for / watch out for in the results?

@ghost

ghost commented Nov 10, 2021

Does your motherboard support PCI Express 3.0? If it only supports PCIe 2.0 then that is likely the bottleneck.

Some motherboards are manufactured with slots that physically fit an x16 GPU but don't provide all 16 PCIe lanes for communication.
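
One way to check this is to query the current and maximum PCIe link generation and width through NVML. This is a sketch, not part of this repo, and assumes the pynvml package is installed:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Current vs. maximum PCIe generation and link width reported by the driver
    cur_gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
    max_gen = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(handle)
    cur_width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
    max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)

    print(f"PCIe link: gen {cur_gen} (max {max_gen}), width x{cur_width} (max x{max_width})")
    pynvml.nvmlShutdown()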

@ghost

ghost commented Nov 10, 2021

@MGSousa @StElysse @Ca-ressemble-a-du-fake

For now, I am going to assume that "slow training" is caused by hardware being old or having limitations that prevent the GPU from operating at its full potential.

If this is not the case, please provide hardware details including:

  • Motherboard manufacturer and model (confirm PCIe 3.0/4.0 support and that the PCIe x16 slot supports x16 bandwidth)
  • CPU model
  • RAM speed and amount

@Ca-ressemble-a-du-fake
Contributor

@blue-fish thanks for your support. You are right: according to the NVIDIA X Server Settings panel, the PCIe generation is only Gen2 (the i7 2600 CPU does not support PCIe 3.0). Yet the maximum PCIe link width is correctly reported as x16 (5.0 GT/s).

Consequently you may be right that old hardware is the bottleneck for the GPU, although CPU usage stays around 12% while training, RAM usage is around 65%, and PCIe bandwidth utilization is reported as low as 1% by the NVIDIA X Server Settings. So no piece of hardware seems to be working at its full potential.

@ghost ghost closed this as completed Nov 12, 2021
@ghost ghost removed the bug Something isn't working label Nov 12, 2021
@Ca-ressemble-a-du-fake
Contributor

Ca-ressemble-a-du-fake commented Nov 17, 2021

Just a follow-up on this topic. I tried to train the model with the SIWIS dataset (French) on Google Colab.
The environment is as follows:

  • Drivers : NVIDIA-SMI 495.44 Driver Version: 460.32.03 CUDA Version: 11.2
  • CPU is Intel(R) Xeon(R) CPU @ 2.30GHz with 2 processors and around 13GB of RAM.

While training with a batch size of 32 and r = 7, I get:

  • K80 (around 11GB VRAM - Kepler arch from 2014): 0.40-0.45 steps/s (I tried both PyTorch installs, with CUDA 10.2 and CUDA 11.3, the latter being a little slower).
  • V100-SXM2 (around 16GB VRAM - Volta arch from 2017): 1.4 steps/s (default install with CUDA 10.2)

I am quite surprised it is not much faster on such a high-end GPU (only around 3 times faster than on my old setup [where everything but the GPU is old]).

Is there a way to know whether the code runs floating point 16 or 32 operations? I read somewhere that one of the modes brings a performance boost. For now this is way beyond my current understanding of the topic!
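
As a generic PyTorch illustration (not code from this repo): parameters are float32 by default, and mixed fp16/fp32 training is typically enabled with torch.cuda.amp, roughly like this:

    import torch

    model = torch.nn.Linear(128, 128).cuda()      # stand-in for the real synthesizer model
    optimizer = torch.optim.Adam(model.parameters())
    print(next(model.parameters()).dtype)         # torch.float32 unless explicitly changed

    scaler = torch.cuda.amp.GradScaler()
    x = torch.randn(12, 128, device="cuda")
    target = torch.randn(12, 128, device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # ops inside run in mixed fp16/fp32
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()                 # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()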

@jxnxts

jxnxts commented Jan 20, 2022

> @Rainer2465 I changed outputs/step to four (r=4 ~91) and got somewhat decent results for now. But with r=2 the results remain the same (42 ~ 43 steps/s).

@MGSousa @Rainer2465 do you have a trained model these days? Can we talk?

This issue was closed.