
ref_level_db = 20 and min_level_db = -100 Where did these values come from? Statistics? #17

Closed
WithoutDoubt opened this issue Mar 18, 2020 · 10 comments
Labels
question Further information is requested

Comments

@WithoutDoubt

No description provided.

@WithoutDoubt changed the title from "ref_level_db = 20 and min_level_db Where did these values come from? Statistics?" to "ref_level_db = 20 and min_level_db = -100 Where did these values come from? Statistics?" on Mar 18, 2020
@begeekmyfriend
Owner

Well, these hyperparameters have been abandoned in the latest version, where a new convolution-based STFT is applied in the mel-spectrogram preprocessing. Those hyperparameters made the range of the mel-spectrogram values indeterminate, whereas with the new STFT the lowest value is fixed at -11.5129, also known as the mel padding value. You might print mel.min() and mel.max() to verify it.
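
(Editorial note: a minimal sketch of the log compression this refers to, assuming the common Tacotron2/melgan-style dynamic_range_compression; the exact implementation in this repo's data_function.py may differ.)

import torch

def dynamic_range_compression(x, clip_val=1e-5):
    # Clamp before the log so silent/padded frames cannot fall below
    # log(1e-5) ≈ -11.5129, the fixed floor mentioned above.
    return torch.log(torch.clamp(x, min=clip_val))

print(torch.log(torch.tensor(1e-5)).item())  # ≈ -11.5129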

@begeekmyfriend added the question (Further information is requested) label on Mar 19, 2020
@lukewys

lukewys commented Mar 23, 2020

Dear @begeekmyfriend, I have been wondering about these parameters too (from an earlier version). Basically you apply:
D = _stft(preemphasis(y))
S = _amp_to_db(np.abs(D)) - hparams.ref_level_db
np.clip((S - hparams.min_level_db) / -hparams.min_level_db, 0, 1)  # this is to normalize
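
(Editorial note: for readers following along, here is a minimal NumPy sketch of what those helpers typically look like in the keithito/Kyubyong-style Tacotron preprocessing, using the values from the issue title; the helper bodies below are an assumption, not copied from this repo.)

import numpy as np

ref_level_db = 20     # values from the issue title; the repo's hparams may differ
min_level_db = -100

def _amp_to_db(x):
    # 20*log10 with a floor so log(0) never occurs; the floor corresponds
    # to min_level_db (-100 dB maps to an amplitude of 1e-5)
    min_level = np.exp(min_level_db / 20 * np.log(10))
    return 20 * np.log10(np.maximum(min_level, x))

def _normalize(S):
    # map [min_level_db, 0] dB onto [0, 1] and clip everything outside that range
    return np.clip((S - min_level_db) / -min_level_db, 0, 1)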

So could you tell me the reasoning behind this? And why did you choose these values?
Thanks in advance!

@begeekmyfriend
Owner

As far as I know, this approach was derived early on from Kyubyong's tacotron. It leads to a negative bias between GTA mel spectrograms and ground-truth ones because of ref_level_db, which is indispensable for the log calculation. Therefore I am currently using the convolutional STFT; you can see it in data_function.py.

@lukewys

lukewys commented Mar 23, 2020

Thanks very much! I am also working on audio synthesis and processing. I am using the old preprocessing method and am now considering switching to yours. Thanks again!

@begeekmyfriend
Owner

You can print mel.min() to see that the lowest value stays fixed, whether the mels are ground truth or inference output.

@lukewys

lukewys commented Mar 30, 2020

Dear @begeekmyfriend, I took a deeper look into the code you wrote. Can I understand your current mel-spectrogram generation in the following way?

mel = dynamic_range_compression(_linear_to_mel(np.abs(_stft(preemphasis(y)))))

So I ran the code on my data and found that the range of the mel spectrogram now becomes:
Max: 1.6808715
Min: -10.579149 (I think the lowest possible value is np.log(1e-5), which is about -11.5)

Am I running the code correctly? I am also wondering why you now use log instead of 20*log10. Is there a disadvantage compared to 20*log10?

Thanks very much in advance!
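
(Editorial note on the log vs. 20*log10 question, not from the thread: the two differ only by a constant scale factor, since 20*log10(x) = (20 / ln 10) * ln(x) ≈ 8.6859 * ln(x), so the choice mainly changes the numeric range the model sees rather than the information content. A quick check:)

import numpy as np

x = np.abs(np.random.randn(4)) + 1e-3
print(np.allclose(20 * np.log10(x), 20 / np.log(10) * np.log(x)))  # True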

@begeekmyfriend
Owner

The lowest value must be -11.5129, since the STFT is the same as in other PyTorch TTS projects such as melgan. Here is a piece of Python script to verify it.

import glob
import os
import sys

import numpy as np

# Scan every mel .npy file under the given dataset directory and
# report the global minimum and maximum values.
mins = []
maxs = []
for f in glob.glob(os.path.join(sys.argv[1], '**', 'mels', '*.npy'), recursive=True):
    mel = np.load(f)
    mins.append(mel.min())
    maxs.append(mel.max())

# The minimum should be the padding value, log(1e-5) ≈ -11.5129.
print(sorted(mins)[0], sorted(maxs)[-1])
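
(Editorial note: assuming the script above is saved as, say, check_mel_range.py, a hypothetical name, it can be run against the root of the preprocessed dataset:)

python check_mel_range.py /path/to/training_data

The first number printed should be approximately -11.5129 if the clamping/padding is in effect.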

@lukewys

lukewys commented Mar 31, 2020

Dear @begeekmyfriend, thanks very much.

@begeekmyfriend
Owner

begeekmyfriend commented Aug 2, 2020

Hi all, I think I have found the reasoning behind this conversion. Please look at this online PPT, which illustrates the fp32-to-8-bit quantization done by the TensorRT library. It is much like the conversion from amplitude to decibels. Without saturation there is generally significant accuracy loss, because samples near the maximum edges are noisy and are easily amplified by the scaling. Therefore we need to set a threshold near the maximum edges and truncate the values beyond it. In Tacotron we set a min_level_db as well as a ref_level_db and then clip during normalization. The quantization illustrated in the PPT explains this rationale.
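
(Editorial note: to make the analogy concrete, a small self-contained sketch of saturated vs. unsaturated symmetric int8 quantization in the spirit of those TensorRT slides; the threshold value here is arbitrary and only for illustration.)

import numpy as np

def quantize_int8(x, threshold):
    # Symmetric 8-bit quantization: values beyond the threshold saturate
    # (are clipped), so rare outliers do not stretch the scale and destroy
    # resolution for the bulk of the distribution.
    scale = threshold / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 10000), [40.0]])  # one noisy outlier

no_sat = quantize_int8(x, threshold=np.abs(x).max())  # scale dictated by the outlier
with_sat = quantize_int8(x, threshold=4.0)            # clip near the distribution's edge

# The saturated version uses far more of the int8 range for typical samples,
# which mirrors clipping at min_level_db/ref_level_db before normalization.
print(np.unique(no_sat).size, np.unique(with_sat).size)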

@lukewys

lukewys commented Aug 3, 2020

Hi, @begeekmyfriend, thanks very much for sharing this slide.
