Getting error for training with Tacotron #8
OK, by changing `hop_size` to 200 I am able to resolve the issue, but I want to train at a 22050 Hz sample rate with the following settings:
So do you have any solution for that? Am I able to train?
In hparams.py, you need to make sure your upsample factors multiply out to equal the hop size used when processing the mel spectrogram. E.g., if Tacotron 2 has a hop size of 256, you can use either (4, 8, 8) or (4, 4, 16) as upsample factors.
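A minimal sketch of that constraint (the variable names here are illustrative, not the repo's actual hparams):

```python
import numpy as np

# The product of the upsample factors must equal the STFT hop size used
# when extracting mel spectrograms, or the upsampled conditioning features
# will not align with the audio samples.
hop_size = 256                  # hop size from mel preprocessing
upsample_factors = (4, 8, 8)    # (4, 8, 8) and (4, 4, 16) both multiply to 256

assert int(np.prod(upsample_factors)) == hop_size, (
    f"upsample factors multiply to {np.prod(upsample_factors)}, "
    f"expected hop_size={hop_size}"
)
```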
@G-Wang Yeah, I resolved that but got another error:
I know the issue is with the pre-processed dataset, so I am working to resolve it. In case you have any idea regarding this, please let me know.
@rishikksh20 Could you please set a breakpoint in dataset.py line 132 and check what the max_offsets list contains? If it contains negative offsets, could you please check the "batch" list and the shapes of its entries? Basically, max_offsets contains the last column in the input mels that can be used (due to the required padding). If a mel input is too short, then its max_offset will be negative and the next line will fail. I think I had one dataset with sentences of just one or two words, which resulted in very short wav files and very few mel frames.
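Here is a hedged sketch of the offset computation described above; `seq_len` and `pad` are hypothetical stand-ins for the repo's actual hyperparameters, not its exact code:

```python
import numpy as np

# Hypothetical window and padding sizes, in mel frames.
seq_len, pad = 5, 2

# Two dummy (n_mels, T) entries; the second clip is much too short.
batch = [np.zeros((80, 120)), np.zeros((80, 7))]

max_offsets = [m.shape[-1] - (seq_len + 2 * pad) for m in batch]
print(max_offsets)  # [111, -2] -> the negative entry breaks random offset sampling
# Clips shorter than seq_len + 2 * pad frames should be filtered out (or padded)
# during preprocessing.
```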
@geneing How many hours of data are required to generate a good voice?
@rishikksh20 I've been using two datasets, LJSpeech and M-AILABS (Mary Ann reader). Both are ~24 hours of speech. I haven't tried smaller datasets because I use the same dataset for Tacotron training; in the end, what matters to me is the voice quality from mel specs produced from text by Tacotron. Besides, voice quality evaluation is highly subjective :).
@geneing Sorry for dog-piling on this issue, but since you mentioned it, what were the hparams you used with LJSpeech? I tried the following and trained for 5000 epochs (505,000 steps), but the results sound like gibberish. (This is one of the mels in the dataset.)
Like rishikksh20, I'm also using Rayhane-mamah/Tacotron-2 for preprocessing, but I've made the following hparams adjustments there:
Could it be that by trying to reduce my GPU memory footprint in Tacotron-2 I've affected my WaveRNN training? Or do I just have bad hparams for WaveRNN that don't account for the 22050 Hz sample rate? Or maybe I'm simply not training long enough?
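For reference, a hedged sketch of the consistency check implied by the earlier upsample-factor advice, using Rayhane-mamah/Tacotron-2-style defaults for 22050 Hz (the 12.5 ms frame shift and the (5, 5, 11) factorization are assumptions, not settings confirmed in this thread):

```python
# Derive the hop size from the frame shift, Rayhane-mamah/Tacotron-2 style.
sample_rate = 22050
frame_shift_ms = 12.5
hop_size = int(frame_shift_ms / 1000 * sample_rate)  # 275 at 22050 Hz

# WaveRNN's upsample factors must then multiply to 275, e.g. (5, 5, 11).
factors = (5, 5, 11)
product = 1
for f in factors:
    product *= f
assert product == hop_size
```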
Getting very large error for
Is input type
OK, the issue has been resolved:
@geneing please change WaveRNN-Pytorch/distributions.py, line 136 in 7b317c4: `return -torch.mean(log_sum_exp(log_probs))`. But one question remains; please respond as per your experience, @G-Wang @geneing: does mixture perform better than bits or gaussian? In my case, I am using Tacotron 2 as the TTS front end and train WaveRNN with GTA mels.
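For context, a hedged sketch of the reduction pattern in that line; `log_sum_exp` below is the standard numerically stable form, not necessarily the repo's exact implementation:

```python
import torch

def log_sum_exp(x):
    # Stable logsumexp over the mixture-component axis:
    # subtract the max before exponentiating to avoid overflow.
    m, _ = torch.max(x, dim=-1, keepdim=True)
    return m.squeeze(-1) + torch.log(torch.sum(torch.exp(x - m), dim=-1))

log_probs = torch.randn(4, 100, 10)  # dummy (batch, timesteps, n_mixtures)
loss = -torch.mean(log_sum_exp(log_probs))  # mean negative log-likelihood
```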
@rishikksh20 Using my own TTS front end, I've found that 10-bit mu-law does well enough for me.
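A hedged sketch of 10-bit mu-law companding as mentioned above, using the standard formula (not necessarily this repo's exact implementation):

```python
import numpy as np

def mulaw_encode(x, bits=10):
    # Compand a [-1, 1] waveform, then quantize to 2**bits classes.
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(q, bits=10):
    # Invert the quantization and the companding.
    mu = 2 ** bits - 1
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

wav = np.clip(0.1 * np.random.randn(16000), -1, 1)  # dummy waveform
labels = mulaw_encode(wav)    # 1024 discrete classes for a 10-bit model
recon = mulaw_decode(labels)  # approximate reconstruction
```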
@G-Wang If you don't mind, could you tell me which Tacotron implementation you are using and how many hours of data work fine for you? In my case, I coded my own Tacotron, with a structure similar to this Tacotron-2, and have 36 hours of a male voice, but I am struggling to train WaveRNN with GTA; even after 1 million steps I still get lots of loud noise.
It would be a great pleasure if you could help me a bit.
I'm using Tacotron 2 variants, training on audiobook datasets as well as LJSpeech. Have you looked into where the loud noises are coming from? Do you get these loud noises if you invert your TTS linear spectrogram with Griffin-Lim or LWS, or just by inspecting the generated spectrograms? If not, then perhaps you haven't matched up the mel features exactly between the TTS and WaveRNN?
@G-Wang Tacotron 2 trained on input with
@rishikksh20 Another thing to look into, if you haven't already, is exactly how preprocessing is done for your setup. Note that in my vocoder repo (not sure if geneing has changed it in his fork) I use lws to compute mel features in audio.py, because I like to use lws as a vocoder over Griffin-Lim. But I see other repos, like NVIDIA's Tacotron 2, use librosa to compute mel features, etc. So if you want the TTS front end to match the vocoder, make sure both are trained on the same mel/linear spectrograms, generated by either lws or librosa in audio preprocessing. Also note other things, like preemphasis, that occur in some audio preprocessing scripts but not in others.
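A hedged illustration of that mismatch risk: the same wav run through lws- and librosa-based STFTs with nominally identical parameters will not generally produce identical mel features. The parameter values here are illustrative, not this repo's audio.py settings:

```python
import librosa
import lws
import numpy as np

sr, n_fft, hop, n_mels = 22050, 1024, 256, 80
wav = 0.1 * np.random.randn(sr).astype(np.float64)  # dummy audio

# librosa path
S_librosa = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop))

# lws path (lws returns frames x freq_bins, so transpose)
S_lws = np.abs(lws.lws(n_fft, hop, mode="speech").stft(wav)).T

mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
mel_librosa = mel_basis @ S_librosa
mel_lws = mel_basis @ S_lws
# Windowing and centering conventions differ between the two libraries, so
# these features will not match exactly; train the TTS front end and the
# vocoder on spectrograms from the same pipeline.
```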
I have used this implementation (https://github.com/Rayhane-mamah/Tacotron-2) for pre-processing. And when I run the training command

```
python3 train.py --dataset Tacotron training_data
```

I get this error:
The error seems to be straightforward: the concatenated dimension sizes don't match. So I debugged it and found the following sizes for these three tensors:
You can clearly see that dimension 1 of `x` and `mels` is not equal. @geneing So how do I resolve this? Do I need to do some kind of reshaping, or something else?
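For what it's worth, a hedged sketch of that kind of length mismatch and one common workaround, trimming both tensors to a shared length along dimension 1 (this is an assumption about a reasonable fix, not the repo's official one):

```python
import torch

# Dummy shapes standing in for the mismatched tensors from the traceback.
x = torch.randn(32, 1103, 1)      # per-sample input
mels = torch.randn(32, 1100, 80)  # upsampled mel features

# Trim both tensors to the shorter length along dimension 1 before concat.
if x.shape[1] != mels.shape[1]:
    T = min(x.shape[1], mels.shape[1])
    x, mels = x[:, :T], mels[:, :T]

out = torch.cat([x, mels], dim=-1)  # concatenation now succeeds
```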