ref_level_db = 20 and min_level_db = -100: where did these values come from? Statistics?
#17
Comments
Well, these hyperparameters have been abandoned in the latest version, where a new convolution-based STFT has been applied in the mel spectrogram preprocessing. Those hyperparameters leave the lowest mel spectrogram values indeterminate, whereas with the new STFT the lowest value is fixed to -11.5129.
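A minimal sketch of why the floor becomes fixed, assuming the compression is a natural log with a clip value of 1e-5 (the convention used in the PyTorch Tacotron 2 / MelGAN family); the function body and the exact clip value here are my assumptions, not quotes from this repository:

```python
import numpy as np

def dynamic_range_compression(x, clip_val=1e-5):
    # Natural-log compression with a hard floor: any magnitude below clip_val
    # maps to log(clip_val) ~= -11.5129, so the minimum value is deterministic.
    return np.log(np.clip(x, clip_val, None))

print(dynamic_range_compression(np.array([0.0, 1e-7, 1e-5, 1.0])))
# -> [-11.5129..., -11.5129..., -11.5129..., 0.0]
```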
Dear @begeekmyfriend, I have been wondering about these parameters too (from the earlier version). Basically you apply an amplitude-to-decibel conversion and normalization with these two offsets (sketched below). So could you tell me the reasoning behind this, and why you chose such values?
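For context, this is roughly what the earlier preprocessing inherited from Kyubyong-style tacotron repositories does with these two values; treat it as a sketch of that convention, not a quote of this repository's exact code:

```python
import numpy as np

ref_level_db = 20
min_level_db = -100

def _amp_to_db(x):
    # Amplitude to decibels, with a small floor to avoid log(0).
    return 20 * np.log10(np.maximum(1e-5, x))

def _normalize(S):
    # Squash the [min_level_db, 0] dB range into [0, 1], clipping values outside it.
    return np.clip((S - min_level_db) / -min_level_db, 0, 1)

# S_linear = np.abs(_stft(preemphasis(y)))                 # magnitude spectrogram (not shown here)
# mel_db   = _normalize(_amp_to_db(S_linear) - ref_level_db)
```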
As far as I know, this approach was derived early on from Kyubyong's tacotron. It would lead to a negative bias between GTA mel spectrograms and the ground-truth ones due to that normalization.
Thanks very much! I am also working on audio synthesis and processing. I am using the old preprocessing method and am now considering switching to yours. Thanks again!
You can print the minimum and maximum values of the generated mel spectrograms to check the range yourself.
Dear @begeekmyfriend, I took a deeper look into the code you wrote. Can I understand your current mel spectrogram generation in the following way?

`mel = dynamic_range_compression(_linear_to_mel(np.abs(_stft(preemphasis(y)))))`

I ran the code on my data and checked the range of the resulting mel spectrograms. Am I running the code correctly? I am also wondering why you now use log instead of 20*log10. Is there any disadvantage compared to 20*log10? Thanks very much in advance!
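On the log vs. 20*log10 question: a quick check (plain numpy, not this repository's code) that the two differ only by a constant scale factor, so no information is lost either way; the practical differences come from how each convention clips and normalizes, not from the logarithm itself.

```python
import numpy as np

x = np.array([1e-5, 1e-3, 0.1, 0.5])

natural = np.log(x)         # convention used by the newer preprocessing
decibel = 20 * np.log10(x)  # convention used by the older ref_level_db/min_level_db code

# 20*log10(x) = (20 / ln(10)) * ln(x), so the ratio is a constant ~8.6859.
print(decibel / natural)    # -> [8.6859 8.6859 8.6859 8.6859]
```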
The lowest value must be -11.5129, as the STFT is the same as in other TTS projects on PyTorch such as melgan. Here is a Python script to verify it:

```python
import glob
import os
import sys

import numpy as np

mins = []
maxs = []
# Scan every mels/*.npy file under the dataset directory passed as argv[1].
for f in glob.glob(os.path.join(sys.argv[1], '**', 'mels', '*.npy'), recursive=True):
    mel = np.load(f)
    mins.append(mel.min())
    maxs.append(mel.max())

# Global minimum and maximum across all mel spectrograms.
print(min(mins), max(maxs))
```
Dear @begeekmyfriend, thanks very much.
Hi all, I think I have found the rationale behind this conversion. Please take a look at this online PPT, which illustrates the fp32-to-8-bit quantization performed by the TensorRT library. It is much like the conversion from amplitude to decibels. In general there is significant accuracy loss without saturation, because the samples near the maximum edge are noisy and can easily be amplified by the scaling. Therefore we need to set a threshold near the maximum edge and truncate the values beyond it. In Tacotron we set a similar threshold with these hyperparameters.
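A toy illustration of the saturation idea above, using simple symmetric 8-bit quantization (my own example, not TensorRT's actual algorithm): clipping at a threshold below the true maximum keeps much finer resolution for the bulk of the values, at the cost of saturating the rare outliers.

```python
import numpy as np

def quantize_int8(x, threshold):
    # Saturate: clip to [-threshold, threshold], then scale into the int8 range.
    scale = 127.0 / threshold
    return np.clip(np.round(np.clip(x, -threshold, threshold) * scale), -127, 127)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(size=1000), [50.0]])  # mostly small values plus one outlier

# Without saturation (threshold = max|x|) the outlier stretches the scale and
# most samples collapse into a handful of integer bins; with a tighter
# threshold the bulk of the distribution uses far more of the 8-bit range.
print(len(np.unique(quantize_int8(x, np.abs(x).max()))))  # few distinct levels
print(len(np.unique(quantize_int8(x, 3.0))))              # many more distinct levels
```

Here the threshold plays the same role as the decibel clipping range in the mel preprocessing: sacrifice the noisy extremes to preserve precision where most of the signal lives.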
Hi, @begeekmyfriend, thanks very much for sharing this slide. |