[TTS] Make mel spectrogram norm configurable #6155
Signed-off-by: Ryan <[email protected]>
I think there is a misunderstanding. Without normalisation we would get bigger numbers in bands where the mel-scale 'compresses' more (higher frequencies). Think sum instead of averaging (roughly). I think this is what you interpreted as 'extended' range. This could be clarified by visualising distributions of values for different mel bands. Could you please do that? I would expect both higher mean and variance for upper bands.
Note that the above is orthogonal to
True, the point of slaney normalization is definitely not to reduce the range of values. But it is a side effect, which does not work well with log compression. It could be that using the linear mels after normalization is better than using log mels without normalization (I suspect both would produce similar results). If people are interested, I could create visuals showing the distribution of features using the different preprocessing configurations.
My main point is that this side effect most likely materialises in the upper bands, with a lesser effect on bands where the mel-scale is more linear. When used as input to a neural network, this effectively puts a bigger weight on higher frequencies from the start. For your goal, I would rather suggest just scaling the input before the logarithm ( I mean, this is all theory and experiments might show otherwise, but a priori I have the concerns stated above.
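To make the "sum instead of averaging" point concrete, here is a minimal sketch in plain Python. It is a toy triangular filterbank, not the librosa or NeMo implementation: the helper names (`triangle`, `area_normalise`, `apply_filter`), the bin counts, and the filter widths are all assumptions chosen for illustration, and the normalisation is done over bins rather than over Hz as Slaney normalisation does.

```python
# Toy illustration only: hypothetical helpers, not the librosa/NeMo filterbank.

def triangle(center, half_width, n_bins):
    """Triangular filter weights that peak at 1.0 on bin `center`."""
    return [max(0.0, 1.0 - abs(k - center) / half_width) for k in range(n_bins)]

def area_normalise(weights):
    """Scale a filter so its weights sum to 1 (averaging rather than summing)."""
    total = sum(weights)
    return [w / total for w in weights]

def apply_filter(weights, spectrum):
    """Weighted sum of spectrum bins under one filter."""
    return sum(w * s for w, s in zip(weights, spectrum))

n_bins = 256
flat_spectrum = [1.0] * n_bins  # same energy in every linear-frequency bin

# A narrow filter (low band) and a wide filter (high band, where mel compresses more).
narrow = triangle(center=20, half_width=4, n_bins=n_bins)
wide = triangle(center=200, half_width=32, n_bins=n_bins)

raw_narrow = apply_filter(narrow, flat_spectrum)
raw_wide = apply_filter(wide, flat_spectrum)
norm_narrow = apply_filter(area_normalise(narrow), flat_spectrum)
norm_wide = apply_filter(area_normalise(wide), flat_spectrum)

print(f"unnormalised: narrow={raw_narrow:.2f} wide={raw_wide:.2f}")
print(f"normalised:   narrow={norm_narrow:.2f} wide={norm_wide:.2f}")
```

On a flat spectrum the unnormalised wide band comes out larger than the narrow one simply because it covers more bins, while the normalised bands are equal. Whether this matters for a trained model is the open question in this thread.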
Below is 1 audio sample with different mel configs. There are limits to what we can infer by just looking at 1 piece of audio, but at a high level we see:
From this I guess that a function like log(1.0 + 50 * slaney(mel)) would be optimal, but is likely to produce the same behavior as
Would be happy to hear anyone else's thoughts on this.
[Plots of the audio sample under each mel config: mel; log(1.0 + mel); slaney(mel); log(max(1E-5, slaney(mel))) - Our current TTS config; log(max(1E-5, 50 * slaney(mel))); log(1.0 + slaney(mel)); log(1.0 + 50 * slaney(mel))]
A little bit off-topic: this PR only makes the functions configurable, and it is the user's responsibility to decide which trick is more applicable to their needs. The discussion is very valuable. Shall we move it elsewhere for further discussion?
This is fine with me, since it's backward compatible. Please ask for final review when the PR is ready.
Side note - while this is a TTS change, you are modifying ASR code, so please make sure to notify someone from ASR for reviews (usually me or @VahidooX) before merging.
Added a resource showing the diff between
Looks good ! Thanks
@XuesongYang Yes, it's a good idea to open a discussion on the topic with a summary of this thread plus the plots. Better still, if experiments show a notable improvement, then simply updating the configs should be useful.
* [TTS] Make mel spectrogram norm configurable
  Signed-off-by: Ryan <[email protected]>
* [TTS] Add mel norm param to FilterbankFeaturesTA
  Signed-off-by: Ryan <[email protected]>
---------
Signed-off-by: Ryan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
What does this PR do ?
Add optional configuration to AudioToMelSpectrogramPreprocessor to allow disabling area normalization in the librosa mel filter.
Applying area normalization to the mel spectrogram reduces its range of values from approximately [0, 100] to [0, 1]. This is detrimental once you take the log, because the log of values in [0, 1] produces a poorly distributed set of negative features.
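As a rough sketch of why the range matters, the snippet below compares two log compressions over the approximate ranges quoted above. It uses only the Python standard library. The function names (`log_floor`, `log_one_plus`, `value_range`) are hypothetical, and the 1E-5 floor and the `log(1.0 + x)` form are taken from the candidate functions listed in the conversation, not from the NeMo code itself.

```python
import math

def log_floor(x, floor=1e-5):
    """log(max(floor, x)): the clamped log form listed in the discussion."""
    return math.log(max(floor, x))

def log_one_plus(x):
    """log(1.0 + x): an alternative that maps 0 to 0 and stays non-negative."""
    return math.log(1.0 + x)

def value_range(fn, lo, hi):
    """Output range of a monotonically increasing function on [lo, hi]."""
    return fn(lo), fn(hi)

# Approximate input ranges quoted in the PR description (assumed, not measured here).
normalised = (0.0, 1.0)      # with area normalization
unnormalised = (0.0, 100.0)  # without area normalization

floor_norm = value_range(log_floor, *normalised)
floor_raw = value_range(log_floor, *unnormalised)
plus_norm = value_range(log_one_plus, *normalised)
plus_raw = value_range(log_one_plus, *unnormalised)

print("log(max(1e-5, x)) on [0, 1]:   (%.2f, %.2f)" % floor_norm)
print("log(max(1e-5, x)) on [0, 100]: (%.2f, %.2f)" % floor_raw)
print("log(1 + x) on [0, 1]:          (%.2f, %.2f)" % plus_norm)
print("log(1 + x) on [0, 100]:        (%.2f, %.2f)" % plus_raw)
```

With the clamped log, inputs in [0, 1] can only produce outputs that are zero or negative, down to log(1E-5) at the floor. This sketch only shows the output ranges; the actual feature distributions depend on the audio and would need the plots discussed in the thread.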
The intention is to experiment with disabling the area normalization and replacing the current mel spectrogram used in TTS which is:
With a more accurate representation like:
This will also enable us to compute energy features using the mel spectrogram instead of the linear spectrogram.
Collection: [TTS], [ASR]
Changelog
Usage
.yaml config: