about alignment #22
Well, it is strange for both of us. If you enabled options such as amp or distributed_run, that might be the cause.
I didn't activate any options such as amp or distributed_run. I'm going to compare the two mechanisms of calculating the mel spectrogram. Meanwhile, if you find any clues, please share them as soon as possible. Thanks in advance!
I forgot to tell you that you need to change mel_pad_val back to -4 or -5, which should be the lowest mel value.
I noticed this param, but I have a question. This param is just used as the value of silence when padding. To my knowledge, a value that doesn't appear in the numerical range of the mel spectrogram is usually chosen. So why would -11.5129, which is far smaller than -4 or -5, cause an alignment issue? (Btw, I will immediately retrain this repo with this param set to -4 to check.)
After carefully checking your code, it seems I simply need to change mel_pad_val to -4 (or the lowest mel value). In your repo, when load_mel_from_disk=False, the mel spectrogram is normalized by the function dynamic_range_compression, after which the minimum mel value is -11.5129 (equal to torch.log(1e-5)). Thus, the final mel spectrograms fed into Tacotron2 lie in the numerical range [-11.5129, torch.log(mel.max())], which is not a symmetric interval. I'm not sure whether that is what ruins the training. When training this Tacotron2, is a symmetric interval of mel values needed?
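For reference, here is a minimal sketch of that compression, written to match the usual NVIDIA-Tacotron2 dynamic_range_compression (the -11.5129 floor mentioned above follows from the clip value); the padding demonstration at the end is illustrative, not this repo's exact collate code:

```python
import torch

def dynamic_range_compression(x, C=1, clip_val=1e-5):
    # clamp avoids log(0); clip_val fixes the lowest reachable mel value
    return torch.log(torch.clamp(x, min=clip_val) * C)

mel = dynamic_range_compression(torch.rand(80, 120))
print(float(mel.min()))   # >= log(1e-5) ≈ -11.5129; the max is log(mel.max()), so the range is asymmetric

# A pad value matching the compression floor keeps padded frames looking
# like digital silence under this normalization
padded = torch.nn.functional.pad(mel, (0, 30), value=-11.5129)
print(padded.shape)       # torch.Size([80, 150])
```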
I have not tested your idea closely, but the implementation in …
I retrained the repo with mel spectrograms calculated in the mode load_mel_from_disk=False, which calls that line. The alignment is then much more normal. However, most of the alignments look like the first picture below, while some are perfect like the second. I can't understand the reason for the first kind of alignment. I'd really appreciate your suggestions!
It confused me too. But the numpy-file loading approach leads to an uncertain lowest value in the GTA mel spectrograms, due to the ref_min_db bias in mel extraction. It would therefore also make the vocoder's mel padding uncertain across different corpora.
The lowest value of the GTA mel spectrograms is -4.0, determined by the code here. Thus, I set mel_pad_val=-4.
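For context, a -4.0 floor typically comes from a symmetric dB normalization like the sketch below; the hparam names and values (max_abs_value, min_level_db) are assumptions modeled on common Tacotron2 audio.py implementations, not verified against this repo's exact code:

```python
import numpy as np

max_abs_value = 4.0     # assumed hparam: target half-range of the mels
min_level_db = -100.0   # assumed hparam: dB floor before normalization

def normalize(db):
    # map [min_level_db, 0] dB linearly onto [-4, 4], then clip, so the
    # lowest reachable value in the saved/GTA mels is exactly -4.0
    return np.clip(
        (2 * max_abs_value) * ((db - min_level_db) / (-min_level_db)) - max_abs_value,
        -max_abs_value, max_abs_value)

print(normalize(np.array([-120.0, -100.0, -50.0, 0.0])))  # [-4. -4.  0.  4.]
```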
You will know that it is not correct when you generate GTA mel spectrograms.
I've figured out why the alignment appears so strange. When plotting the alignment, a random index is first generated to decide which one of the batch's alignments to plot. In the training stage, one batch of data, including texts and mels, is padded to the corresponding max lengths. So a random alignment (e.g. with shape [128, 151]) from a batch (of size len(alignments)) has (128 - seq_len) rows of zeros. Meanwhile, its second dimension, corresponding to decoder steps, also contains meaningless values because of the padding operation here.
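If it helps, here is a minimal sketch of cropping the padded alignment to the true lengths before plotting; the tensors are fake stand-ins, and the names (alignments, text_lengths, mel_lengths) assume the batch lengths are available, which may differ from this repo's plotting code:

```python
import random
import torch
import matplotlib.pyplot as plt

# fake stand-ins; in training these come from the model and the dataloader
alignments = torch.rand(16, 128, 151)       # [batch, max_text_len, max_dec_steps]
text_lengths = torch.randint(40, 128, (16,))
mel_lengths = torch.randint(80, 151, (16,))

idx = random.randrange(len(alignments))     # the random index mentioned above
t_len, m_len = int(text_lengths[idx]), int(mel_lengths[idx])
# crop away the padded rows/columns so only real text positions and real
# decoder steps are drawn
align = alignments[idx, :t_len, :m_len].cpu().numpy()
plt.imshow(align, aspect="auto", origin="lower")
plt.xlabel("decoder step")
plt.ylabel("encoder step")
plt.savefig("alignment.png")
```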
Good job! I have an idea: you could try replacing spectral_normalize with amp_to_db to check whether the alignment improves.
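A hedged sketch of where that swap would happen in the mel front end; amp_to_db here is a stand-in modeled on common audio.py implementations, not a verified import from this repo:

```python
import torch

def amp_to_db(x, clip_val=1e-5):
    # dB-scale compression (20 * log10), vs. the natural log in spectral_normalize
    return 20.0 * torch.log10(torch.clamp(x, min=clip_val))

mel_linear = torch.rand(80, 120)
# original front end:  mel = torch.log(torch.clamp(mel_linear, min=1e-5))
mel = amp_to_db(mel_linear)   # the swap being suggested for the experiment
```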
Actually, in my project, I have completely replaced the data processing with the functions in common/audio.py and set the option load_mel_from_disk=True. It seems easier and more stable for training, at least for me.
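For anyone following along, a rough sketch of what the load_mel_from_disk=True path amounts to, assuming the mels were extracted ahead of time with the common/audio.py-style functions and saved as .npy; the file-naming convention is an assumption for illustration:

```python
import numpy as np
import torch

def get_mel(path, load_mel_from_disk=True):
    if load_mel_from_disk:
        # mel precomputed offline with shape [n_mels, T];
        # the .wav -> .npy naming convention here is illustrative
        mel = np.load(path.replace(".wav", ".npy"))
        return torch.from_numpy(mel).float()
    # otherwise the mel would be computed on the fly from the wav (omitted)
    raise NotImplementedError
```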
I trained this repo with the BiaoBei dataset and no parameters modified. However, it didn't reach a good alignment even after 278 epochs. Compared with your training runs, could you please give me some suggestions for improving the alignment?