Getting error for training with Tacotron #8

Open
rishikksh20 opened this issue Apr 2, 2019 · 14 comments

@rishikksh20

rishikksh20 commented Apr 2, 2019

I used the Rayhane-mamah Tacotron-2 implementation (https://github.com/Rayhane-mamah/Tacotron-2) for pre-processing. When I run the training command
python3 train.py --dataset Tacotron training_data
I get this error:

x = torch.cat([x.unsqueeze(-1), mels, a1[:,:,:-1]], dim=2)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 2. Got 1000 and 1280 in dimension 1 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:83

The error seems straightforward: the tensor sizes along a non-concatenated dimension don't match. I debugged it and found the following sizes for the three tensors:

print(x.unsqueeze(-1).size())  # -> torch.Size([64, 1280, 1])
print(mels.size())             # -> torch.Size([64, 1000, 80])
print(a1[:,:,:-1].size())      # -> torch.Size([64, 1000, 31])

Dimension 1 of x and mels clearly differs. @geneing How do I resolve this? Do I need to do some kind of reshaping, or something else?

@rishikksh20
Author

rishikksh20 commented Apr 2, 2019

OK, changing `hop_size` to 200 resolves the issue, but I want to train at a sample rate of 22050 with the following settings:

num_mels = 80, 
num_freq = 513,
fft_size = 1024,
hop_size = 256,
sample_rate = 22050

Is there a solution for that? Can I train the --dataset Tacotron model with these settings?

@G-Wang

G-Wang commented Apr 2, 2019

In hparams.py, make sure the upsample factors multiply out to the hop size used when computing the mel spectrogram. For example, if Tacotron 2 uses a hop size of 256, you can use either (4, 8, 8) or (4, 4, 16) as the upsample factors.
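
The rule above can be checked before starting a training run. This is a minimal sketch, not code from this repository: the helper name `check_upsample_factors` is hypothetical, and only `hop_size` and `upsample_factors` correspond to hparams mentioned in this thread. The sizes reported in the first comment appear consistent with this rule being violated: 5 mel frames (an assumed count) upsampled by a total factor of 200 give 1000 steps, while audio cut at a hop size of 256 gives 1280 samples.

```python
from math import prod  # Python 3.8+


def check_upsample_factors(upsample_factors, hop_size):
    """Return True when the upsample factors multiply out to hop_size."""
    return prod(upsample_factors) == hop_size


# Both suggestions from the comment above multiply out to 256.
print(check_upsample_factors((4, 8, 8), 256))
print(check_upsample_factors((4, 4, 16), 256))

# Illustration of the reported mismatch, assuming 5 mel frames per window:
# an upsample product of 200 against audio prepared with hop size 256.
n_frames = 5
print(n_frames * 200, n_frames * 256)
```

Running such a check at startup turns a shape error deep inside `torch.cat` into an immediate, readable configuration failure.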

@rishikksh20
Author

@G-Wang Yeah, I resolved that, but I get another error:

Traceback (most recent call last):
  File "train.py", line 444, in <module>
    train_loop(device, model, data_loader, optimizer, checkpoint_dir)
  File "train.py", line 305, in train_loop
    for i, (x, m, y) in enumerate(tqdm(data_loader)):
  File "/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py", line 1022, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 623, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ubuntu/Dev/rishikesh/speech_synthensis/symon/WaveRNN-Pytorch/dataset.py", line 132, in discrete_collate
    mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]
  File "/home/ubuntu/Dev/rishikesh/speech_synthensis/symon/WaveRNN-Pytorch/dataset.py", line 132, in <listcomp>
    mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]
  File "mtrand.pyx", line 992, in mtrand.RandomState.randint
ValueError: Range cannot be empty (low >= high) unless no samples are taken

I know the issue is with the pre-processed dataset, so I am working to resolve it. If you have any ideas about this, please let me know.

@geneing
Owner

geneing commented Apr 3, 2019

@rishikksh20 Could you please set a breakpoint at dataset.py line 132 and check what the max_offsets list contains? If it contains negative offsets, could you then check the "batch" list and the shapes of its entries?

Basically, max_offsets contains the last column in the input mels that can be used (due to required padding). If the mel input is too short, then max_offset will be negative and the next line will fail.

I think I had one dataset with sentences of just one or two words, which resulted in very short wav files and very few mel frames.
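
One way to act on this is to drop utterances that are too short before they reach the collate function. The sketch below is an assumption-laden illustration, not the code in dataset.py: the names `max_offset`, `split_by_length`, `mel_win`, and `pad` are hypothetical, and the real offset formula in the repository may differ. It uses `random.randrange`, which, like `np.random.randint(0, offset)`, excludes the upper bound and fails when the bound is zero or negative.

```python
import random


def max_offset(n_frames, mel_win, pad):
    """Last usable start frame for a window of mel_win frames
    with `pad` context frames on each side (assumed formula)."""
    return n_frames - (mel_win + 2 * pad)


def split_by_length(frame_counts, mel_win, pad):
    """Return (keep, drop) index lists; an item is kept only if
    its max offset is strictly positive."""
    keep, drop = [], []
    for idx, n_frames in enumerate(frame_counts):
        if max_offset(n_frames, mel_win, pad) > 0:
            keep.append(idx)
        else:
            drop.append(idx)
    return keep, drop


# Made-up frame counts: two normal utterances and two very short ones.
frame_counts = [120, 9, 64, 3]
keep, drop = split_by_length(frame_counts, mel_win=5, pad=2)
print("keep:", keep, "drop:", drop)

# Sampling a start offset is only safe for the kept items.
offsets = [random.randrange(0, max_offset(frame_counts[i], 5, 2)) for i in keep]
print(offsets)
```

Filtering once at dataset construction time keeps the per-batch code unchanged and makes it easy to log which files were skipped.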

@rishikksh20
Author

@geneing How many hours of data are required to generate a good voice?
For example, WaveNet generates a good voice from 2 hours of data.

@geneing
Owner

geneing commented Apr 11, 2019

@rishikksh20 I've been using two datasets, LJSpeech and M-AILABS (Mary Ann reader). Both are ~24 hours of speech. I haven't tried smaller datasets because I use the same dataset for Tacotron training. In the end, what matters to me is the voice quality from mel specs produced from text by Tacotron. Besides, voice quality evaluation is highly subjective :).

@echelon

echelon commented Apr 18, 2019

@geneing Sorry for dog piling in this issue, but since you mentioned it, what were the hparams you used with LJSpeech?

I tried the following and trained for 5000 epochs (505000 steps), but the results sound like gibberish. (This is one of the mels in the dataset.)

hop_size=256,
sample_rate=22050,
upsample_factors=(4, 4, 16),

 # shouldn't have any impact, but including for posterity:
save_every_step=5000,
evaluate_every_step=5000,

Like rishikksh20, I'm also using Rayhane-mamah/Tacotron-2 for preprocessing, but I've made the following hparams adjustments there:

tacotron_batch_size = 8,
wavenet_batch_size = 2,

Could it be that by trying to reduce my GPU memory footprint in Tacotron-2 that I've affected my WaveRNN training? Or do I just have bad hparams for WaveRNN that don't account for the 22050 Hz sample rate? Or maybe I'm simply not training long enough?

@rishikksh20
Author

I am getting a very large loss with the mixture input type:

using noam learning rate decay
no checkpoint specified as --checkpoint argument, creating new model...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.98it/s]
epoch:0, running loss:323150838.5625, average loss:1369283.2142478814, current lr:1.475e-05, num_pruned:0 (0%)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.99it/s]
epoch:1, running loss:220346417.3125, average loss:933671.2597987289, current lr:2.95e-05, num_pruned:0 (0%)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.98it/s]
epoch:2, running loss:194686806.8125, average loss:824944.0966631356, current lr:4.425e-05, num_pruned:0 (0%)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.98it/s]
epoch:3, running loss:193793209.1875, average loss:821157.6660487289, current lr:5.9e-05, num_pruned:0 (0%)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.97it/s]
epoch:4, running loss:196869593.96875, average loss:834193.194782839, current lr:7.375e-05, num_pruned:0 (0%)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.98it/s]
epoch:5, running loss:191893224.75, average loss:813106.8845338983, current lr:8.85e-05, num_pruned:0 (0%)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.97it/s]
epoch:6, running loss:185908305.6875, average loss:787747.0579978813, current lr:0.00010325, num_pruned:0 (0%)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.99it/s]
epoch:7, running loss:181063116.21875, average loss:767216.5941472457, current lr:0.000118, num_pruned:0 (0%)

Is the mixture input type working?

@rishikksh20
Author

OK, the issue is resolved:

using noam learning rate decay
no checkpoint specified as --checkpoint argument, creating new model...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.98it/s]
epoch:0, running loss:1375.3774342536926, average loss:5.827870484125817, current lr:1.475e-05, num_pruned:0 (0%)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.97it/s]
epoch:1, running loss:914.5991067886353, average loss:3.875419944019641, current lr:2.95e-05, num_pruned:0 (0%)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.97it/s]
epoch:2, running loss:847.640928030014, average loss:3.591698847584805, current lr:4.425e-05, num_pruned:0 (0%)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.98it/s]
epoch:3, running loss:834.7676503658295, average loss:3.5371510608721586, current lr:5.9e-05, num_pruned:0 (0%)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.97it/s]
epoch:4, running loss:836.0202960968018, average loss:3.5424588817661093, current lr:7.375e-05, num_pruned:0 (0%)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 236/236 [00:53<00:00,  4.97it/s]
epoch:5, running loss:839.1563384532928, average loss:3.555747196835987, current lr:8.85e-05, num_pruned:0 (0%)

@geneing please change

return -torch.sum(log_sum_exp(log_probs))

to

return -torch.mean(log_sum_exp(log_probs))

One question remains, @G-Wang @geneing; please answer from your experience: does mixture perform better than bits or gaussian? In my case I am using Tacotron 2 as the TTS front end and training WaveRNN with GTA mels.
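
On the requested change from a sum to a mean: a summed negative log-likelihood grows with the number of elements in the batch, while the mean stays on a per-sample scale, which is the kind of gap visible between the two logs above. The sketch below uses made-up values and plain Python lists instead of torch tensors so that it stands alone; `nll_sum` and `nll_mean` are hypothetical helper names.

```python
def nll_sum(log_likelihoods):
    """Negative log-likelihood reduced with a sum."""
    return -sum(log_likelihoods)


def nll_mean(log_likelihoods):
    """Negative log-likelihood reduced with a mean."""
    return -sum(log_likelihoods) / len(log_likelihoods)


# Made-up per-sample log-likelihoods, all equal for clarity.
small_batch = [-3.5] * 1000
large_batch = [-3.5] * 2000

# The sum doubles when the batch doubles; the mean does not change.
print(nll_sum(small_batch), nll_sum(large_batch))
print(nll_mean(small_batch), nll_mean(large_batch))
```

With the mean, reported loss values stay comparable across batch sizes and sequence lengths.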

@G-Wang

G-Wang commented Jun 26, 2019

@rishikksh20 Using my own TTS front end, I've found that 10-bit mu-law works well enough for me.
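
For readers unfamiliar with the term, 10-bit mu-law means the waveform is companded and quantized into 2^10 = 1024 classes. The following is a sketch of the standard mu-law companding formula in plain Python; it is not taken from this repository, and the function names `mulaw_encode` and `mulaw_decode` are hypothetical.

```python
import math


def mulaw_encode(x, bits=10):
    """Map a sample in [-1, 1] to an integer class in [0, 2**bits - 1]."""
    mu = 2 ** bits - 1
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1) / 2 * mu + 0.5)  # round to the nearest class


def mulaw_decode(label, bits=10):
    """Map an integer class back to an approximate sample in [-1, 1]."""
    mu = 2 ** bits - 1
    y = 2 * label / mu - 1
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)


for sample in (-1.0, -0.5, 0.0, 0.5, 1.0):
    label = mulaw_encode(sample)
    print(sample, label, round(mulaw_decode(label), 4))
```

The logarithmic spacing gives quiet samples finer resolution than loud ones, which is why a modest number of classes can be enough for speech.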

@rishikksh20
Author

@G-Wang If you don't mind, could you tell me which Tacotron implementation you are using and how many hours of data work well for you? In my case, I wrote my own Tacotron, but its structure is similar to this Tacotron-2, and I have 36 hours of male voice. I am struggling to train WaveRNN with GTA; even after 1 min steps I still get lots of loud noise.
My hparams are as follows:

num_mels = 80,
rescale = True,
rescaling_max = 0.999,
trim_silence = True,

fft_size = 1024,
hop_size = 256,
sample_rate = 22050,
frame_shift_ms = None,

signal_normalization = True,
allow_clipping_in_normalization = True,
symmetric_mels = True,
max_abs_value = 4.,

#Limits
min_level_db = -100,
ref_level_db = 20,
fmin = 125,
fmax = 7600,

It would be a great pleasure if you could help me a bit.

@G-Wang

G-Wang commented Jun 26, 2019

I'm using Tacotron 2 variants, training on audiobook datasets as well as LJSpeech. Have you looked into where the loud noises are coming from? Do you get these loud noises if you invert your TTS linear spectrogram with Griffin-Lim or LWS, or can you see them just by inspecting the generated spectrograms? If not, then perhaps the mel features are not matched exactly between the TTS and WaveRNN?

@rishikksh20
Author

rishikksh20 commented Jun 26, 2019

@G-Wang Tacotron 2 was trained on input with signal normalization in [0, 1]. By the way, thanks for your help.

@G-Wang

G-Wang commented Jun 26, 2019

@rishikksh20 Another thing to look into, if you haven't already, is exactly how preprocessing is done in your setup. Note that in my vocoder repo (I'm not sure whether geneing has changed it in his fork) I use lws to compute mel features in audio.py, because I like to use lws as a vocoder over Griffin-Lim. But I see that other repos, like NVIDIA's Tacotron 2, use librosa to compute mel features. So if you want the TTS front end to match the vocoder, make sure both are trained on the same mel/linear spectrograms, generated by either lws or librosa in audio preprocessing. Also note other steps, such as preemphasis, that occur in some audio preprocessing scripts but not in others.
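
A quick first check for such mismatches is to compare the audio-related hparams of the two projects side by side. The sketch below is illustrative only: `diff_audio_hparams` and `AUDIO_KEYS` are hypothetical names, the `tts` values are copied from the hparams posted earlier in this thread, and the `vocoder` values are made up to show a mismatch.

```python
AUDIO_KEYS = (
    "num_mels", "fft_size", "hop_size", "sample_rate",
    "fmin", "fmax", "min_level_db", "ref_level_db",
    "symmetric_mels", "max_abs_value", "preemphasis",
)


def diff_audio_hparams(tts, vocoder, keys=AUDIO_KEYS):
    """Return {key: (tts_value, vocoder_value)} for every key that differs.
    A key missing on one side shows up as None."""
    mismatches = {}
    for key in keys:
        a, b = tts.get(key), vocoder.get(key)
        if a != b:
            mismatches[key] = (a, b)
    return mismatches


tts = {
    "num_mels": 80, "fft_size": 1024, "hop_size": 256, "sample_rate": 22050,
    "fmin": 125, "fmax": 7600, "min_level_db": -100, "ref_level_db": 20,
    "symmetric_mels": True, "max_abs_value": 4.0,
}
# Made-up vocoder settings that disagree on hop_size and normalization.
vocoder = dict(tts, hop_size=200, symmetric_mels=False)

for key, (a, b) in diff_audio_hparams(tts, vocoder).items():
    print(f"{key}: tts={a} vocoder={b}")
```

A comparison like this only catches differences in settings. It cannot detect that two pipelines compute features with different libraries (lws versus librosa), so comparing the actual mel arrays for a few files remains the more reliable test.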
