about alignment #22

Closed · MorganCZY opened this issue May 8, 2020 · 14 comments

@MorganCZY

I trained this repo on the BiaoBei dataset with no parameters modified. However, it still didn't reach a good alignment even after 278 epochs. Compared with your training runs, could you please give me some suggestions for improving the alignment?
[alignment plot attached]

@begeekmyfriend (Owner)

Well, that is strange for both of us. If you enabled amp_run in the parsed arguments, please set the amp opt level to O0 to raise the training precision. If not, you might try the old Mel spectrogram extraction method, which loads Mel spectrograms from numpy files as this code shows. Let us make some comparisons.
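(For reference, setting the amp opt level to O0 typically looks like the sketch below when using NVIDIA apex; the model here is a stand-in and the exact flag wiring in this repo's train script may differ.)

```python
import torch
from apex import amp  # NVIDIA apex, installed separately from PyTorch

model = torch.nn.Linear(80, 80).cuda()   # stand-in for the Tacotron2 model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# "O0" keeps everything in FP32 (full precision); "O1"/"O2" enable mixed
# precision and can reduce the numerical precision of training.
model, optimizer = amp.initialize(model, optimizer, opt_level="O0")
```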

@MorganCZY (Author)

I didn't activate any options related to amp or distributed_run. I'm going to compare the two mechanisms for calculating mel spectrograms. Meanwhile, if you find any clues, please share them as soon as possible. Thanks in advance!

@MorganCZY (Author)

I retrained this repo with the "load_mel_from_disk" option, but the alignment is still not correct after 30 epochs.
[alignment plot attached]

@begeekmyfriend (Owner)

I forgot to tell you that you need to change mel_pad_val back to -4 or -5, which should be the lowest Mel value.
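(A minimal sketch of what mel_pad_val controls, assuming the usual collate behaviour of padding every mel in a batch to the longest one; the repo's actual collate code may differ in details.)

```python
import torch

def pad_mel_batch(mels, mel_pad_val=-4.0):
    """Pad variable-length mels [n_mels, T_i] to the longest T in the batch,
    filling the tail with mel_pad_val (the lowest mel value, i.e. silence)."""
    n_mels = mels[0].size(0)
    max_len = max(m.size(1) for m in mels)
    out = torch.full((len(mels), n_mels, max_len), mel_pad_val)
    for i, m in enumerate(mels):
        out[i, :, : m.size(1)] = m
    return out

batch = pad_mel_batch([torch.randn(80, 120), torch.randn(80, 151)])
print(batch.shape)  # torch.Size([2, 80, 151])
```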

@MorganCZY (Author)

I noticed this param, but I have a question. This param is only used as the silence value when padding. To my knowledge, a value that doesn't appear in the numerical range of the mel is usually chosen. So why would -11.5129, which is far smaller than -4 or -5, cause an alignment issue? (Btw, I will immediately retrain this repo with this param set to -4 to check.)

@MorganCZY (Author)

MorganCZY commented May 9, 2020

After carefully checking your code, it seems I should simply change mel_pad_val to -4 (or to the lowest mel value). In your repo, with "load_mel_from_disk=False" the mel-spec is normalized by the function "dynamic_range_compression", after which the minimum mel value is -11.5129 (equal to torch.log(1e-5)). Thus, the final mel-specs fed into Tacotron2 lie in the range [-11.5129, torch.log(mel.max())], which is not a symmetric interval. I'm not sure whether that is what ruins the training. When training this Tacotron2, is a symmetric range of mel values needed?
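(For reference, a sketch of that compression step as it appears in the common NVIDIA-style implementation; the -11.5129 floor comes from the 1e-5 clamp applied before the log.)

```python
import torch

def dynamic_range_compression(x, C=1, clip_val=1e-5):
    # Log compression of the linear mel magnitudes: values are clamped to
    # clip_val before the log, so the floor is log(1e-5) = -11.5129.
    return torch.log(torch.clamp(x, min=clip_val) * C)

mel = torch.zeros(80, 100)                    # pure silence
print(dynamic_range_compression(mel).min())   # tensor(-11.5129)
```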

@begeekmyfriend (Owner)

I have not tested your idea closely, but the implementation in stft.py is used in many other TTS projects such as MelGAN and WaveGlow, where the Mel values are not symmetric and the lowest value is -11.5129. I have no idea whether it ruins the alignment in TTS training.

@MorganCZY (Author)

I retrained the repo with mel-specs calculated in the "load_mel_from_disk=False" mode, which calls this line. The alignments are now much more normal. However, most of them look like the first picture below, and only some are as good as the second one. I can't figure out the reason for the first kind of alignment. I'd really appreciate your suggestions!
[two alignment plots attached: the common imperfect case and a good case]

@begeekmyfriend (Owner)

begeekmyfriend commented May 11, 2020

It confuses me too. However, the numpy file loading approach leads to an uncertain lowest value for the GTA Mel spectrograms, due to the ref_min_db bias in Mel extraction, and therefore also to uncertainty in the Mel padding for the vocoder, because different corpora give different anchors.
There is still something worth trying with the stft.py approach: you can comment out the preemphasis call in wav loading as well as the deemphasis call in the griffin-lim function, both of which I added myself. I am not sure whether it works. By the way, I did not modify any code in stft.py.
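(For context, pre-emphasis and de-emphasis are usually implemented as the following pair of inverse filters; the repo's own helpers may use different names or coefficients.)

```python
import numpy as np
from scipy import signal

def preemphasis(wav, k=0.97):
    # Applied to the raw waveform before the STFT (boosts high frequencies).
    return signal.lfilter([1, -k], [1], wav)

def deemphasis(wav, k=0.97):
    # The inverse filter, applied after Griffin-Lim reconstruction.
    return signal.lfilter([1], [1, -k], wav)

wav = np.random.randn(16000)
print(np.allclose(wav, deemphasis(preemphasis(wav))))  # True: exact inverses
```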

@MorganCZY (Author)

MorganCZY commented May 11, 2020

However, the numpy file loading approach leads to an uncertain lowest value for the GTA Mel spectrograms, due to the ref_min_db bias in Mel extraction, and therefore also to uncertainty in the Mel padding for the vocoder, because different corpora give different anchors.

The lowest value of the GTA mel specs is -4.0, determined by the code here. Thus, I set "mel_pad_val=-4".
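(For context, a floor of -4.0 is what the symmetric dB normalization commonly used in Tacotron-2 style audio code produces; the constants below are typical defaults and assumptions, not necessarily this repo's exact values.)

```python
import numpy as np

max_abs_value = 4.0     # typical default; gives the [-4, 4] range
min_level_db = -100.0   # typical default

def normalize(S_db):
    # Maps [min_level_db, 0] dB onto [-4, 4] and clips, so -4 is the floor.
    return np.clip(
        (2 * max_abs_value) * ((S_db - min_level_db) / (-min_level_db)) - max_abs_value,
        -max_abs_value, max_abs_value)

print(normalize(np.array([-120.0, -100.0, -50.0, 0.0])))  # [-4. -4.  0.  4.]
```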

@begeekmyfriend (Owner)

You will know that it is not correct when you generate GTA Mel spectrograms.

@MorganCZY (Author)

I've figured out why the alignment is displayed in such a strange way. When plotting the alignment, a random index is first generated to decide which alignment out of the batch to plot. During training, the data in one batch, including texts and mels, are padded to their respective max lengths. So a randomly chosen alignment (e.g. with shape [128, 151]) out of a batch of len(alignments) has (128 - seq_len) rows of zeros, and its second dim, corresponding to decoding steps, also contains meaningless values because of the padding operation here.
I modified the relevant code as follows, and now the alignment is displayed correctly.
[screenshot of the modified plotting code attached]
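(Since the modified code is only shown as a screenshot, here is a sketch of the kind of trimming described, assuming the alignment tensor is laid out as [batch, text_steps, decoder_steps]; all names here are placeholders, not the repo's actual ones.)

```python
import random

def pick_alignment_to_plot(alignments, input_lengths, output_lengths):
    # alignments: [batch, max_text_len, max_decoder_steps], padded within the batch
    idx = random.randint(0, alignments.size(0) - 1)
    text_len = input_lengths[idx].item()   # real number of input tokens
    dec_len = output_lengths[idx].item()   # real number of decoder frames
    # Drop the padded rows/columns so only meaningful attention weights are plotted.
    return alignments[idx, :text_len, :dec_len].detach().cpu().numpy()
```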

@begeekmyfriend (Owner)

Good job! One idea: you could try replacing spectral_normalize with amp_to_db to see whether the alignment improves.
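(Roughly, the difference between the two is the scale and the floor of the compressed values; this is a sketch under the usual definitions, and the repo's exact code may differ slightly.)

```python
import numpy as np
import torch

def spectral_normalize(mag):
    # stft.py-style natural-log compression; floor is log(1e-5) = -11.5129
    return torch.log(torch.clamp(mag, min=1e-5))

def amp_to_db(mag):
    # common/audio.py-style decibel scale; floor is 20*log10(1e-5) = -100 dB,
    # usually followed by normalization into a fixed range such as [-4, 4]
    return 20.0 * np.log10(np.maximum(1e-5, mag))

x = np.full((80, 10), 1e-6, dtype=np.float32)          # near-silence
print(spectral_normalize(torch.from_numpy(x)).min())   # tensor(-11.5129)
print(amp_to_db(x).min())                              # ~ -100.0
```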

@MorganCZY (Author)

Actually, in my project, I have completely replaced the data processing with the functions in common/audio.py and set the option "load_mel_from_disk=True". It seems easier and more stable for training, at least for me.
