Can't train on two GPUs #664
My ability to help with this is limited, since I don't have a server with multiple GPUs to test on. Let's see whether the data_parallel_workaround is actually required. In synthesizer/train.py, try copying the code from line 177 over to line 174. (See Real-Time-Voice-Cloning/synthesizer/train.py, lines 172 to 177 at commit 10ca8f7.)
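The embedded snippet from train.py does not survive in this transcript. As a rough sketch of what the suggestion amounts to (the code below is an assumption based on the surrounding discussion, not a quote of the file), the forward pass branches on the GPU count, and the idea is to copy the direct call over the data_parallel_workaround call:

```python
# Assumed shape of synthesizer/train.py around lines 172-177 (not a verbatim quote):
if device.type == "cuda" and torch.cuda.device_count() > 1:
    # multi-GPU path (around line 174)
    m1_hat, m2_hat, attention, stop_pred = data_parallel_workaround(model, texts, mels, embeds)
else:
    # single-GPU / CPU path (around line 177)
    m1_hat, m2_hat, attention, stop_pred = model(texts, mels, embeds)

# Suggested experiment: make both branches use the direct call, i.e. replace the
# data_parallel_workaround(...) line with the plain model(...) call from the else branch.
```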
And paste where?
If you get an identical message on a single GPU, then something is wrong, because it shouldn't be executing the multi-GPU code. Why don't you try setting CUDA_VISIBLE_DEVICES inside
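For reference, one way to run that check is to hide all but one GPU before torch is imported; this snippet illustrates the general technique and is not a quote from the thread:

```python
# Restrict PyTorch to a single GPU to confirm whether the multi-GPU code path
# is the source of the error. This must run before torch initializes CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU

import torch
print(torch.cuda.device_count())  # should report 1
```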
It works fine with a single GPU. Maybe you can give me advice on how to get full GPU usage (e.g. right now it is only using 4 GB while the other 7 GB are available).
To increase VRAM usage, adjust the batch size parameter (the far-right number in each schedule entry) in hparams. (See Real-Time-Voice-Cloning/synthesizer/hparams.py, lines 52 to 57 at commit 10ca8f7.)
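The embedded hparams snippet is also missing here. For orientation, the Tacotron training schedule in synthesizer/hparams.py is a list of tuples with the batch size as the last element; the values below are illustrative and should be checked against the file at commit 10ca8f7:

```python
# Approximate shape of the schedule in synthesizer/hparams.py (illustrative values).
# Each tuple is (r, lr, step, batch_size); raising batch_size uses more VRAM.
tts_schedule = [(2, 1e-3,  20_000, 12),
                (2, 5e-4,  40_000, 12),
                (2, 2e-4,  80_000, 12),
                (2, 1e-4, 160_000, 12),
                (2, 3e-5, 320_000, 12),
                (2, 1e-5, 640_000, 12)]
```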
Thank you. If there were a way to parallelize the computation across multiple GPUs, that would be great.
@zxmanxz Try this branch for multi-GPU training. If it works I will submit a pull request.
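For context, the standard way to get single-process multi-GPU data parallelism in PyTorch is to wrap the model in torch.nn.DataParallel, which is presumably what the branch does (an assumption; the toy model below is a stand-in, not the repo's Tacotron):

```python
import torch
from torch import nn

# Minimal DataParallel sketch: the batch is split across all visible GPUs
# during forward() and the outputs are gathered back onto the default GPU.
model = nn.Linear(128, 128)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

x = torch.randn(32, 128, device=device)
y = model(x)
print(y.shape)  # torch.Size([32, 128])
```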
@zxmanxz Can you let me know if the multi-GPU branch above works for you?
Yes, I'll try to use multiple GPUs later.
@zxmanxz When will you be able to test the multi-GPU training code? https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/664_multi_gpu_training
@blue-fish In the above-mentioned code, we get another error at line 110 in synthesizer/train.py.
@chayan-agrawal Which version of torch are you using? I'm using torch==1.7.1 and don't get that error.
@blue-fish I am also using torch==1.7.1. If model.module.load is used instead of model.load, it works on a single GPU. The other GPUs are not in use.
@chayan-agrawal Thanks for suggesting that change. I have updated the code with your suggestion: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/a90d2340c0d0c416bcec4da089a2c9ce3e4ed7d4 DataParallel also works in CPU and single-GPU environments, so it is not necessary to check for multiple GPUs. It would be nice to get feedback on whether it works with multiple GPUs.
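For context on the model.load vs model.module.load change: DataParallel wraps the original model, so methods defined on the underlying class have to be reached through the .module attribute. A minimal sketch with a hypothetical load method (not the repo's actual Tacotron API):

```python
import torch
from torch import nn

class ToyModel(nn.Module):
    """Stand-in for a model with a custom load method (hypothetical, not Tacotron)."""
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)

    def load(self, path):
        checkpoint = torch.load(path, map_location="cpu")
        self.load_state_dict(checkpoint["model_state"])

model = ToyModel()
torch.save({"model_state": model.state_dict()}, "toy.pt")

wrapped = nn.DataParallel(model)
# wrapped.load("toy.pt")        # AttributeError: 'DataParallel' object has no attribute 'load'
wrapped.module.load("toy.pt")   # reach through .module to call methods of the wrapped model
```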
Before we even think of merging this code, we'll need to consider these issues:
@blue-fish I have multiple GPUs on my system but it is only working on a single GPU. Any help on how I can use multiple GPUs?
@chayan-agrawal I don't have a multi-GPU environment to troubleshoot with. All I can suggest is to ensure that Python sees both of your GPUs. For example, add this to the beginning of
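The code block from that comment is not preserved in this transcript. Presumably it set CUDA_VISIBLE_DEVICES before torch is imported, along these lines (the exact snippet is an assumption):

```python
# Assumed reconstruction of the missing snippet: make both GPUs visible to PyTorch.
# It needs to run before torch initializes CUDA, i.e. at the very top of the
# training entry script.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
print("Visible GPUs:", torch.cuda.device_count())  # expect 2
```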
You might have to downgrade to
Hey, I'm having issues with this as well. The original repo worked fine with one GPU using torch 1.6.0, before I installed a second GPU to speed up training. I have tried both the CorentinJ repo and your blue-fish fork (the branch you suggested for multi-GPU support). Is there a requirements.txt you can provide for testing? Perhaps I have some other library installed which breaks this functionality? I'm grasping at straws at this point. Kind regards.
I am having the exact same problem. Has anyone solved it somehow?
As Synergyst mentioned, using torch version 1.4 doesn't work. The error I got is:
Hi, when I tried to train the synthesizer model on my laptop with one Nvidia GTX 1650 GPU everything worked, but when I tried to run the training process on my server with two Nvidia GeForce GTX 1080 Ti GPUs I got this error:
```
python synthesizer_train.py pretrained_new datasets/SV2TTS/synthesizer -s 50 -b 50
Arguments:
run_id: pretrained_new
syn_dir: datasets/SV2TTS/synthesizer
models_dir: synthesizer/saved_models/
save_every: 50
backup_every: 50
force_restart: False
hparams:
Checkpoint path: synthesizer/saved_models/pretrained_new/pretrained_new.pt
Loading training data from: datasets/SV2TTS/synthesizer/train.txt
Using model: Tacotron
Using device: cuda
Initialising Tacotron Model...
Trainable Parameters: 30.870M
Loading weights at synthesizer/saved_models/pretrained_new/pretrained_new.pt
Tacotron weights loaded from step 0
Using inputs from:
datasets/SV2TTS/synthesizer/train.txt
datasets/SV2TTS/synthesizer/mels
datasets/SV2TTS/synthesizer/embeds
Found 259 samples
+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
| 20k Steps | 12 | 0.001 | 2 |
+----------------+------------+---------------+------------------+
Traceback (most recent call last):
File "synthesizer_train.py", line 35, in <module>
train(**vars(args))
File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/train.py", line 175, in train
mels, embeds)
File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/utils/__init__.py", line 17, in data_parallel_workaround
outputs = torch.nn.parallel.parallel_apply(replicas, inputs)
File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/roma/anaconda3/envs/work/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/roma/new_Real-Time-Voice-Cloning/Real-Time-Voice-Cloning/synthesizer/models/tacotron.py", line 362, in forward
device = next(self.parameters()).device # use same device as parameters
StopIteration
```
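For what it's worth, this StopIteration appears to be a known interaction between nn.DataParallel and PyTorch 1.5+: inside a replica's forward pass, self.parameters() can be empty, so next(self.parameters()) raises StopIteration. One possible workaround, shown below as a sketch of the general technique (not the fix used in any branch discussed here), is to take the device from a tensor that was scattered to the replica, such as the input:

```python
import torch
from torch import nn

class ToyNet(nn.Module):
    """Illustrative module (not the repo's Tacotron) showing a DataParallel-safe
    way to determine the current device inside forward()."""
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 8)

    def forward(self, x):
        # Fragile under nn.DataParallel with torch >= 1.5, because replicas may
        # expose no parameters:
        #     device = next(self.parameters()).device
        # More robust: use the device of an input tensor scattered to this replica.
        device = x.device
        positions = torch.arange(x.size(1), device=device, dtype=x.dtype)
        return self.layer(x) + positions

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ToyNet().to(device)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

out = model(torch.randn(4, 8, device=device))
print(out.shape)  # torch.Size([4, 8])
```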