
Can't train on two GPUs #664

Open
zxmanxz opened this issue Feb 16, 2021 · 21 comments
Labels
bug Something isn't working

Comments

zxmanxz commented Feb 16, 2021

Hi, when I trained the synthesizer model on my laptop with a single Nvidia GTX 1650 GPU, everything worked fine. But when I ran the training process on my server with two Nvidia GeForce GTX 1080 Ti GPUs, I got this error:
$ python synthesizer_train.py pretrained_new datasets/SV2TTS/synthesizer -s 50 -b 50

Arguments:
run_id: pretrained_new
syn_dir: datasets/SV2TTS/synthesizer
models_dir: synthesizer/saved_models/
save_every: 50
backup_every: 50
force_restart: False
hparams:

Checkpoint path: synthesizer/saved_models/pretrained_new/pretrained_new.pt
Loading training data from: datasets/SV2TTS/synthesizer/train.txt
Using model: Tacotron
Using device: cuda

Initialising Tacotron Model...

Trainable Parameters: 30.870M

Loading weights at synthesizer/saved_models/pretrained_new/pretrained_new.pt
Tacotron weights loaded from step 0
Using inputs from:
datasets/SV2TTS/synthesizer/train.txt
datasets/SV2TTS/synthesizer/mels
datasets/SV2TTS/synthesizer/embeds
Found 259 samples
+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
| 20k Steps | 12 | 0.001 | 2 |
+----------------+------------+---------------+------------------+

Traceback (most recent call last):
File "synthesizer_train.py", line 35, in
train(**vars(args))
File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/train.py", line 175, in train
mels, embeds)
File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/utils/init.py", line 17, in data_parallel_workaround
outputs = torch.nn.parallel.parallel_apply(replicas, inputs)
File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/models/tacotron.py", line 362, in forward
device = next(self.parameters()).device # use same device as parameters
StopIteration



ghost commented Feb 16, 2021

My ability to help with this is limited, since I don't have a server with multiple GPUs to test.

Let's see whether the data_parallel_workaround is actually needed. In synthesizer/train.py, try copying the code from line 177 over to line 174, so the model is called directly; a sketch of the result is below the snippet.

# Parallelize model onto GPUS using workaround due to python bug
if device.type == "cuda" and torch.cuda.device_count() > 1:
    m1_hat, m2_hat, attention, stop_pred = data_parallel_workaround(model, texts,
                                                                    mels, embeds)
else:
    m1_hat, m2_hat, attention, stop_pred = model(texts, mels, embeds)
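After that edit, the block would reduce to the direct call, something like this (untested on my end, since I don't have multiple GPUs):

# Always call the model directly; data_parallel_workaround is bypassed
# even when multiple GPUs are visible, so the forward pass stays on one device
m1_hat, m2_hat, attention, stop_pred = model(texts, mels, embeds)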


zxmanxz commented Feb 16, 2021

And paste it where?
Also, I tried setting CUDA_VISIBLE_DEVICES=0 (to expose only one GPU), but the problem was the same...
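I.e. I launched it with something like this (bash-style shell):

CUDA_VISIBLE_DEVICES=0 python synthesizer_train.py pretrained_new datasets/SV2TTS/synthesizer -s 50 -b 50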


ghost commented Feb 16, 2021

If you get an identical message on a single GPU, then something is wrong because it shouldn't be executing the multi-GPU code.

Why don't you try setting CUDA_VISIBLE_DEVICES inside synthesizer_train.py? (That file is in the root of the repo, unlike train.py.) See #489 (comment). Then run the original, unmodified code and paste the error message if you get one.
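Something like this at the very top of synthesizer_train.py (before torch is imported), to expose only the first GPU:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # hide all but the first GPU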


zxmanxz commented Feb 16, 2021

It works fine with a single GPU. Maybe you can give me advice on how to get full GPU usage? Right now it is only using 4 GB, and the other 7 GB are free.


ghost commented Feb 16, 2021

To increase VRAM usage, adjust the batch size parameter (the far-right number in each tuple) in synthesizer/hparams.py:

tts_schedule = [(2,  1e-3,  20_000, 12),   # Progressive training schedule
                (2,  5e-4,  40_000, 12),   # (r, lr, step, batch_size)
                (2,  2e-4,  80_000, 12),   #
                (2,  1e-4, 160_000, 12),   # r = reduction factor (# of mel frames
                (2,  3e-5, 320_000, 12),   #     synthesized for each decoder iteration)
                (2,  1e-5, 640_000, 12)],  # lr = learning rate
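For example, to roughly double VRAM usage you could raise it from 12 to 24 (a starting guess, not a recommendation; drop it back down if you hit CUDA out-of-memory errors):

tts_schedule = [(2,  1e-3,  20_000, 24),   # Progressive training schedule
                (2,  5e-4,  40_000, 24),   # batch_size raised from 12 to 24
                (2,  2e-4,  80_000, 24),   # to use more of the available VRAM
                (2,  1e-4, 160_000, 24),
                (2,  3e-5, 320_000, 24),
                (2,  1e-5, 640_000, 24)],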


zxmanxz commented Feb 16, 2021

Thank you. If there were a way to parallelize computation across multiple GPUs, that would be great.


ghost commented Feb 17, 2021

@zxmanxz Try this branch for multi-GPU training. If it works I will submit a pull request.
https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/664_multi_gpu_training


ghost commented Feb 18, 2021

@zxmanxz Can you let me know if the multi-GPU branch above works for you?


zxmanxz commented Feb 19, 2021

Yes, I'll try multi-GPU training later.


ghost commented Feb 28, 2021

@zxmanxz When will you be able to test the multi-GPU training code? https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/664_multi_gpu_training

@chayan-agrawal

@blue-fish In the above-mentioned code, we get another error at line 110 in synthesizer/train.py:
torch.nn.modules.module.ModuleAttributeError: 'DataParallel' object has no attribute 'load'


ghost commented Mar 2, 2021

@chayan-agrawal Which version of torch are you using? I'm using torch==1.7.1 and don't get that error.


chayan-agrawal commented Mar 2, 2021

@blue-fish I am also using torch==1.7.1. If model.module.load is used instead of model.load, it works, but only on a single GPU. The other GPUs are not in use.
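A minimal sketch of that change, for anyone following along (variable names follow synthesizer/train.py but are illustrative here):

import torch

model = torch.nn.DataParallel(model)  # wrap the Tacotron instance

# DataParallel does not expose custom methods of the wrapped model, so the
# repo's checkpoint-loading load() has to be reached through .module:
model.module.load(weights_fpath, optimizer)   # works
# model.load(weights_fpath, optimizer)        # raises ModuleAttributeError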


ghost commented Mar 2, 2021

@chayan-agrawal Thanks for suggesting that change. I have updated the code with your suggestion: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/a90d2340c0d0c416bcec4da089a2c9ce3e4ed7d4

DataParallel also works in CPU and single-GPU environments, so it is not necessary to check for multiple GPUs. It would be nice to get feedback on whether it works with multiple GPUs.
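In other words, the guard can go away; roughly this (a sketch, not the exact commit):

# torch.nn.DataParallel falls back to a plain forward call when it sees
# zero or one device, so the model can be wrapped unconditionally,
# without a torch.cuda.device_count() > 1 check:
model = torch.nn.DataParallel(model)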


ghost commented Mar 2, 2021

Before we even think of merging this code, we'll need to consider a few related issues first.

@chayan-agrawal

@blue-fish I have multiple GPUs on my system, but it is only using a single GPU. Any help on how I can use multiple GPUs?


ghost commented Mar 4, 2021

@chayan-agrawal I don't have a multiple-GPU environment to troubleshoot in. All I can suggest is to ensure that Python sees both of your GPUs. For example, add this to the beginning of synthesizer_train.py to have it use the first and second GPUs:

import os
# Must be set before torch initializes CUDA, so keep it at the top of the file
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"


ghost commented Mar 7, 2021

You might have to downgrade to torch==1.4.0 to get DataParallel to work.
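If downgrading isn't an option: the StopIteration in the original traceback is a known DataParallel problem on newer torch, where model replicas don't expose parameters, so next(self.parameters()) finds nothing. The usual patch (untested here) is to take the device from an input tensor in Tacotron.forward() instead:

# In synthesizer/models/tacotron.py, forward() currently does:
#     device = next(self.parameters()).device  # raises StopIteration in replicas
# Reading the device from an input tensor avoids the empty parameter
# generator (the argument name below is illustrative; use whatever the
# first tensor passed to forward() is called):
device = x.device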

ghost added the bug label Mar 7, 2021
@Synergyst

> You might have to downgrade to torch==1.4.0 to get DataParallel to work.

Hey, I'm having issues with this as well. I feel it's something stupid-simple I'm overlooking, and an easy fix if you are able to get it working. :)

The original repo worked fine with one GPU on torch 1.6.0, before I installed a second GPU to speed up training. I am now using torch 1.7.1, which you said was working for you. torch 1.4.0 does not actually run.

I have tried both the CorentinJ repo and your blue-fish fork (the branch you suggested for multi-GPU support). The main repo does not run with torch 1.4.0, 1.6.0, or 1.7.1 unless I remove the second GPU from the system. Your branch does work with the environment variable override added and torch 1.7.1; however, it does not actually utilize the second GPU.

Is there a requirements.txt you can provide for testing? Perhaps I have some other library installed which breaks this functionality? I'm grasping at straws at this point. I have been working at it for days, to no avail, and didn't want to post here until I felt I needed assistance.

Kind regards.

@fede-astolfi

I am having the exact same problem. Has anyone solved it somehow?

@linan06kuaishou

> You might have to downgrade to torch==1.4.0 to get DataParallel to work.

As Synergyst mentioned, torch 1.4 doesn't work. The error I got is:
AttributeError: 'PosixPath' object has no attribute 'tell'
I googled it and found that to solve it I have to use a torch version above 1.6.
Awkward face...
