
Can't train on two GPUs #664

Open
zxmanxz opened this issue Feb 16, 2021 · 21 comments
Labels
bug Something isn't working

Comments

zxmanxz commented Feb 16, 2021

Hi, when I trained the synthesizer model on my laptop with a single Nvidia GTX 1650 GPU, everything worked fine. But when I ran the training process on my server with two Nvidia GeForce GTX 1080 Ti GPUs, I got this error:
$ python synthesizer_train.py pretrained_new datasets/SV2TTS/synthesizer -s 50 -b 50

Arguments:
run_id: pretrained_new
syn_dir: datasets/SV2TTS/synthesizer
models_dir: synthesizer/saved_models/
save_every: 50
backup_every: 50
force_restart: False
hparams:

Checkpoint path: synthesizer/saved_models/pretrained_new/pretrained_new.pt
Loading training data from: datasets/SV2TTS/synthesizer/train.txt
Using model: Tacotron
Using device: cuda

Initialising Tacotron Model...

Trainable Parameters: 30.870M

Loading weights at synthesizer/saved_models/pretrained_new/pretrained_new.pt
Tacotron weights loaded from step 0
Using inputs from:
datasets/SV2TTS/synthesizer/train.txt
datasets/SV2TTS/synthesizer/mels
datasets/SV2TTS/synthesizer/embeds
Found 259 samples
+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
| 20k Steps | 12 | 0.001 | 2 |
+----------------+------------+---------------+------------------+

Traceback (most recent call last):
File "synthesizer_train.py", line 35, in
train(**vars(args))
File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/train.py", line 175, in train
mels, embeds)
File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/utils/init.py", line 17, in data_parallel_workaround
outputs = torch.nn.parallel.parallel_apply(replicas, inputs)
File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/models/tacotron.py", line 362, in forward
device = next(self.parameters()).device # use same device as parameters
StopIteration



ghost commented Feb 16, 2021

My ability to help with this is limited, since I don't have a server with multiple GPUs to test.

Let's see whether the data_parallel_workaround is actually needed. In synthesizer/train.py, try copying the code from line 177 over to line 174, so the model is called directly; a sketch of the result is below the snippet.

# Parallelize model onto GPUS using workaround due to python bug
if device.type == "cuda" and torch.cuda.device_count() > 1:
    m1_hat, m2_hat, attention, stop_pred = data_parallel_workaround(model, texts,
                                                                    mels, embeds)
else:
    m1_hat, m2_hat, attention, stop_pred = model(texts, mels, embeds)
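After that edit, the block would reduce to the direct call, something like this (untested on my end, since I don't have multiple GPUs):

# Always call the model directly; data_parallel_workaround is bypassed
# even when multiple GPUs are visible, so the forward pass stays on one device
m1_hat, m2_hat, attention, stop_pred = model(texts, mels, embeds)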


zxmanxz commented Feb 16, 2021

And paste it where?
Also, I tried setting CUDA_VISIBLE_DEVICES=0 (to expose only one GPU), but the problem was the same...
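I.e. I launched it with something like this (bash-style shell):

CUDA_VISIBLE_DEVICES=0 python synthesizer_train.py pretrained_new datasets/SV2TTS/synthesizer -s 50 -b 50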


ghost commented Feb 16, 2021

If you get an identical message on a single GPU, then something is wrong because it shouldn't be executing the multi-GPU code.

Why don't you try setting CUDA_VISIBLE_DEVICES inside synthesizer_train.py? (That file is in the root of the repo, unlike train.py.) See #489 (comment). Then run the original, unmodified code and paste the error message if you get one.
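Something like this at the very top of synthesizer_train.py (before torch is imported), to expose only the first GPU:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # hide all but the first GPU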


zxmanxz commented Feb 16, 2021

It works fine with a single GPU. Maybe you can give me advice on how to get full GPU usage? Right now it is only using 4 GB, and the other 7 GB are free.


ghost commented Feb 16, 2021

To increase VRAM usage, adjust the batch size parameter (the far-right number in each tuple) in synthesizer/hparams.py:

tts_schedule = [(2,  1e-3,  20_000, 12),   # Progressive training schedule
                (2,  5e-4,  40_000, 12),   # (r, lr, step, batch_size)
                (2,  2e-4,  80_000, 12),   #
                (2,  1e-4, 160_000, 12),   # r = reduction factor (# of mel frames
                (2,  3e-5, 320_000, 12),   #     synthesized for each decoder iteration)
                (2,  1e-5, 640_000, 12)],  # lr = learning rate
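For example, to roughly double VRAM usage you could raise it from 12 to 24 (a starting guess, not a recommendation; drop it back down if you hit CUDA out-of-memory errors):

tts_schedule = [(2,  1e-3,  20_000, 24),   # Progressive training schedule
                (2,  5e-4,  40_000, 24),   # batch_size raised from 12 to 24
                (2,  2e-4,  80_000, 24),   # to use more of the available VRAM
                (2,  1e-4, 160_000, 24),
                (2,  3e-5, 320_000, 24),
                (2,  1e-5, 640_000, 24)],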


zxmanxz commented Feb 16, 2021

Thank you. If there were a way to parallelize computation across multiple GPUs, that would be great.


ghost commented Feb 17, 2021

@zxmanxz Try this branch for multi-GPU training. If it works I will submit a pull request.
https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/664_multi_gpu_training


ghost commented Feb 18, 2021

@zxmanxz Can you let me know if the multi-GPU branch above works for you?


zxmanxz commented Feb 19, 2021

Yes, I'll try multi-GPU training later.


ghost commented Feb 28, 2021

@zxmanxz When will you be able to test the multi-GPU training code? https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/664_multi_gpu_training

@chayan-agrawal

@blue-fish In the above-mentioned code, we get another error at line 110 in synthesizer/train.py:
torch.nn.modules.module.ModuleAttributeError: 'DataParallel' object has no attribute 'load'


ghost commented Mar 2, 2021

@chayan-agrawal Which version of torch are you using? I'm using torch==1.7.1 and don't get that error.


chayan-agrawal commented Mar 2, 2021

@blue-fish I am also using torch==1.7.1. If model.module.load is used instead of model.load, it works, but only on a single GPU. The other GPUs are not in use.
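A minimal sketch of that change, for anyone following along (variable names follow synthesizer/train.py but are illustrative here):

import torch

model = torch.nn.DataParallel(model)  # wrap the Tacotron instance

# DataParallel does not expose custom methods of the wrapped model, so the
# repo's checkpoint-loading load() has to be reached through .module:
model.module.load(weights_fpath, optimizer)   # works
# model.load(weights_fpath, optimizer)        # raises ModuleAttributeError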


ghost commented Mar 2, 2021

@chayan-agrawal Thanks for suggesting that change. I have updated the code with your suggestion: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/a90d2340c0d0c416bcec4da089a2c9ce3e4ed7d4

DataParallel also works in CPU and single-GPU environments, so it is not necessary to check for multiple GPUs. It would be nice to get feedback on whether it works with multiple GPUs.
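In other words, the guard can go away; roughly this (a sketch, not the exact commit):

# torch.nn.DataParallel falls back to a plain forward call when it sees
# zero or one device, so the model can be wrapped unconditionally,
# without a torch.cuda.device_count() > 1 check:
model = torch.nn.DataParallel(model)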


ghost commented Mar 2, 2021

Before we even think of merging this code, we'll need to consider a few related issues first.

@chayan-agrawal

@blue-fish I have multiple GPUs on my system, but it is only using a single GPU. Any help on how I can use multiple GPUs?


ghost commented Mar 4, 2021

@chayan-agrawal I don't have a multiple-GPU environment to troubleshoot in. All I can suggest is to ensure that Python sees both of your GPUs. For example, add this to the beginning of synthesizer_train.py to have it use the first and second GPUs:

import os
# Must be set before torch initializes CUDA, so keep it at the top of the file
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"


ghost commented Mar 7, 2021

You might have to downgrade to torch==1.4.0 to get DataParallel to work.
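If downgrading isn't an option: the StopIteration in the original traceback is a known DataParallel problem on newer torch, where model replicas don't expose parameters, so next(self.parameters()) finds nothing. The usual patch (untested here) is to take the device from an input tensor in Tacotron.forward() instead:

# In synthesizer/models/tacotron.py, forward() currently does:
#     device = next(self.parameters()).device  # raises StopIteration in replicas
# Reading the device from an input tensor avoids the empty parameter
# generator (the argument name below is illustrative; use whatever the
# first tensor passed to forward() is called):
device = x.device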

ghost added the bug label Mar 7, 2021
@Synergyst

> You might have to downgrade to torch==1.4.0 to get DataParallel to work.

Hey, I'm having issues with this as well. I feel it's something stupid-simple I'm overlooking, and an easy fix if you are able to get it working. :)

The original repo worked fine with one GPU on torch 1.6.0, before I installed a second GPU to speed up training. I am now using torch 1.7.1, which you said was working for you. torch 1.4.0 does not actually run.

I have tried both the CorentinJ repo and your blue-fish fork (the branch you suggested for multi-GPU support). The main repo does not run with torch 1.4.0, 1.6.0, or 1.7.1 unless I remove the second GPU from the system. Your branch does work with the environment variable override added and torch 1.7.1; however, it does not actually utilize the second GPU.

Is there a requirements.txt you can provide for testing? Perhaps I have some other library installed which breaks this functionality? I'm grasping at straws at this point. I have been working at it for days, to no avail, and didn't want to post here until I felt I needed assistance.

Kind regards.

@fede-astolfi

I am having the exact same problem. Has anyone solved it somehow?

@linan06kuaishou

> You might have to downgrade to torch==1.4.0 to get DataParallel to work.

As Synergyst mentioned, torch 1.4 doesn't work. The error I got is:
AttributeError: 'PosixPath' object has no attribute 'tell'
I googled it and found that to solve it I have to use a torch version above 1.6.
Awkward face...
