[TTS] Make mel spectrogram norm configurable #6155

rlangman · 2023-03-09T00:14:51Z

What does this PR do ?

Add optional configuration to AudioToMelSpectrogramPreprocessor to allow disabling area normalization in the librosa mel filter.

Applying area normalization to the mel spectrogram reduces its range of values from approximately [0, 100] to [0, 1], which is detrimental when you take the log of the [0, 1] values producing a poor distribution of negative features.

The intention is to experiment with disabling the area normalization and replacing the current mel spectrogram used in TTS which is:

log(max(1E-5, area_norm(mel)))

With a more accurate representation like:

log(1.0 + mel)

This will also enable us to compute energy features using the mel spectrogram instead of the linear spectrogram.
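To make the range argument concrete, here is a minimal sketch comparing the two compressions on scalar values. The helper names `clamp_log` and `add_one_log` are made up for illustration and are not part of NeMo; the sample values are arbitrary.

```python
import math

def clamp_log(x, guard=1e-5):
    # Current TTS-style compression: log(max(guard, x)).
    return math.log(max(guard, x))

def add_one_log(x):
    # Proposed compression: log(1 + x), which maps 0 to 0.
    return math.log(1.0 + x)

# Arbitrary sample values covering the normalized [0, 1] range.
values = [0.0, 1e-4, 0.01, 0.5, 1.0]
for v in values:
    print(f"{v:8.4f}  clamp: {clamp_log(v):8.3f}  add-one: {add_one_log(v):6.3f}")
```

With inputs in [0, 1], the clamped log spreads silence and near-silence over a wide band of negative values, while log(1 + x) keeps them at or just above 0.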

Collection: [TTS], [ASR]

Changelog

Add an optional parameter to AudioToMelSpectrogramPreprocessor to configure the mel filter normalization, defaulting to the librosa default of 'slaney'

Usage

.yaml config:

preprocessor:
  _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
  log: true
  log_zero_guard_type: add
  log_zero_guard_value: 1.0
  mag_power: 1.0
  mel_norm: null
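As a rough guide to what the `log_zero_guard_*` settings above select, the sketch below mirrors the two guard modes on a single value. It assumes that "add" computes log(x + guard) and "clamp" computes log(max(x, guard)); `log_compress` is a hypothetical helper, not a NeMo function.

```python
import math

def log_compress(x, guard_type="add", guard_value=1.0):
    """Hypothetical scalar version of the preprocessor's log guard (assumed behavior)."""
    if guard_type == "add":
        # Assumed meaning of log_zero_guard_type: add
        return math.log(x + guard_value)
    if guard_type == "clamp":
        # Assumed meaning of log_zero_guard_type: clamp
        return math.log(max(x, guard_value))
    raise ValueError(f"unknown guard type: {guard_type}")

# The config above (add, 1.0) corresponds to log(1.0 + mel).
print(log_compress(0.0, "add", 1.0))
# A clamp guard of 1e-5 corresponds to log(max(1E-5, mel)).
print(log_compress(0.0, "clamp", 1e-5))
```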

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

Signed-off-by: Ryan <[email protected]>

racoiaws · 2023-03-09T19:01:57Z

Applying area normalization to the mel spectrogram reduces its range of values from approximately [0, 100] to [0, 1], which is detrimental when you take the log of the [0, 1] values producing a poor distribution of negative features.

I think there is a misunderstanding.
Let's say we have a white noise signal with uniform energy across the spectrum (STFT amplitude). With the standard "slaney" normalisation, the mel-spectrogram is supposed to also have uniform values in all bands. Reducing the range of values is not the primary goal, I guess it's more about interpretability.

Without normalisation we would get bigger numbers in bands where the mel-scale 'compresses' more (higher frequencies). Think sum instead of averaging (roughly). I think this is what you interpreted as 'extended' range.

This could be clarified by visualising distributions of values for different mel bands. Could you please do that? I would expect both higher mean and variance for upper bands.
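The white-noise argument can be checked with a small sketch: build triangular mel filters, feed them a flat spectrum of ones, and compare band outputs with and without area normalization. This is a simplified stand-in, not librosa's implementation: it assumes the HTK mel formula for brevity, and the function name and default sizes are illustrative.

```python
import math

def hz_to_mel(f):
    # HTK-style mel formula (an assumption made for brevity).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def flat_spectrum_response(sr=22050, n_fft=2048, n_mels=20, area_norm=True):
    """Output of each triangular mel band for a spectrum that is 1.0 in every bin."""
    bin_hz = sr / n_fft
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sr / 2)
    # Band edges are equally spaced on the mel scale.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    response = []
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        total = 0.0
        for k in range(n_bins):
            f = k * bin_hz
            # Triangular weight: rises from lo to mid, falls from mid to hi.
            total += max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
        if area_norm:
            # Slaney-style area normalization: divide by half the band width in Hz.
            total *= 2.0 / (hi - lo)
        response.append(total)
    return response

resp_norm = flat_spectrum_response(area_norm=True)
resp_raw = flat_spectrum_response(area_norm=False)
print("normalized max/min:", max(resp_norm) / min(resp_norm))
print("unnormalized last/first:", resp_raw[-1] / resp_raw[0])
```

With normalization the bands respond almost identically to the flat input; without it, the wide upper bands sum over many more bins and produce much larger values.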

racoiaws · 2023-03-09T19:16:00Z

Note that the above is orthogonal to log(1+mel) suggestion. Wonder why people don't do that. log(mel) is almost decibels so makes sense in theory, but log(1+mel) would make much more sense in practice

rlangman · 2023-03-09T19:17:44Z

Applying area normalization to the mel spectrogram reduces its range of values from approximately [0, 100] to [0, 1], which is detrimental when you take the log of the [0, 1] values producing a poor distribution of negative features.

I think there is a misunderstanding. Let's say we have a white noise signal with uniform energy across the spectrum (STFT amplitude). With the standard "slaney" normalisation, the mel-spectrogram is supposed to also have uniform values in all bands. Reducing the range of values is not the goal.

Without normalisation we would get bigger numbers in bands where the mel-scale 'compresses' (higher frequencies). Think sum instead of averaging (roughly). I think this is what you interpreted as 'extended' range.

This could be clarified by visualising distributions of values for different mel bands. Could you please do that? I would expect both higher mean and variance for upper bands.

True, the point of slaney normalization is definitely not to reduce the range of values. But it is a side effect, which does not work well with log compression. It could be that using the linear mels after normalization is better than using log mels without normalization (I suspect both would produce similar results).

If people are interested, I could create visuals showing the distribution of features using the different preprocessing configurations.

racoiaws · 2023-03-09T19:30:20Z

True, the point of slaney normalization is definitely not to reduce the range of values. But it is a side effect, which does not work well with log compression

My main point is, this side effect most likely materialises in the upper bands, with lesser effect on bands where the mel-scale is more linear. When used as input to a neural network, this effectively puts a bigger weight on higher frequencies from the start.

For your goal, I would rather suggest just scaling input before the logarithm (log(alpha * mel)), with alpha somewhere from 1 to 100 (default 1). This way, the value range would be widened the same for all bands.

I mean, this is all theory, experiments might show otherwise, but a priori I have concerns stated above
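A small sketch of the scaling suggestion: multiplying by alpha before the log shifts every value above the guard by the same constant, log(alpha), regardless of band. The helper `scaled_log` and the sample band values are hypothetical.

```python
import math

def scaled_log(x, alpha=1.0, guard=1e-5):
    # log(max(guard, alpha * x)); alpha=1 reproduces the unscaled guarded log.
    return math.log(max(guard, alpha * x))

# Made-up normalized mel values from three different bands.
band_values = [0.002, 0.05, 0.8]
alpha = 50.0

# Difference between scaled and unscaled log features for each band.
shifts = [scaled_log(v, alpha) - scaled_log(v) for v in band_values]
print(shifts, math.log(alpha))
```

Values that fall below the guard are clamped instead of shifted, so the uniform offset only holds above it.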

rlangman · 2023-03-09T21:27:53Z

Below is 1 audio sample with different mel configs. There are limits to what we can infer by just looking at 1 piece of audio, but at a high level we see:

The spectrograms which add 1.0 to the log guard look a lot cleaner (most values in the spectrogram are very close to 0).
Without log compression, there are a large number of small values and small number of very large values/outliers. Taking the log results in a more even distribution.
Area normalization reduces the scale of higher frequency bands relative to lower frequency bands, because the higher mel bands span a wider frequency range.

From this I guess that a function like log(1.0 + 50 * slaney(mel)) would be optimal, but is likely to produce the same behavior as log(1.0 + mel). Without running experiments, I would be OK with either but would slightly favor the latter by default due to its simplicity.

Would be happy to hear anyone else's thoughts on this.

Spectrogram plots (images not preserved), one per configuration:

mel
log(1.0 + mel)
slaney(mel)
log(max(1E-5, slaney(mel))) - Our current TTS config
log(max(1E-5, 50 * slaney(mel)))
log(1.0 + slaney(mel))
log(1.0 + 50 * slaney(mel))

nemo/collections/asr/modules/audio_preprocessing.py

XuesongYang · 2023-03-10T22:46:01Z

A little bit off-topic: this PR only makes the functions configurable, while it is the user's responsibility to decide which trick is more applicable to their needs. The discussion is very valuable; shall we move it elsewhere to continue?

titu1994

This is fine with me, since it's backward compatible. Please ask for final review when PR is ready

nemo/collections/asr/modules/audio_preprocessing.py

titu1994 · 2023-03-10T22:49:35Z

Side note - while this is a TTS change, you are modifying ASR code, so please make sure to notify someone from ASR for reviews (usually me or @VahidooX) before merging.


XuesongYang · 2023-03-14T07:47:08Z

Added a resource showing the difference between the Slaney and HTK mel scales: https://groups.google.com/g/librosa/c/JDzUZggitYM

titu1994

Looks good! Thanks

titu1994 · 2023-03-14T07:56:08Z

@XuesongYang Yes, it's a good idea to open a discussion on the topic with a summary of this thread plus plots. Better still, if experiments show a notable improvement, then simply updating the configs should be useful.

* [TTS] Make mel spectrogram norm configurable
  Signed-off-by: Ryan <[email protected]>
* [TTS] Add mel norm param to FilterbankFeaturesTA
  Signed-off-by: Ryan <[email protected]>
---------
Signed-off-by: Ryan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>


github-actions bot added the ASR label Mar 9, 2023

rlangman requested review from XuesongYang, redoctopus, blisc and subhankar-ghosh March 9, 2023 00:15

[TTS] Make mel spectrogram norm configurable

0a0f56a

Signed-off-by: Ryan <[email protected]>

rlangman force-pushed the mel_norm branch from a562f2a to 0a0f56a Compare March 9, 2023 00:18

rlangman added the TTS label Mar 9, 2023

XuesongYang requested review from racoiaws and titu1994 March 9, 2023 17:52

XuesongYang requested changes Mar 10, 2023


titu1994 reviewed Mar 10, 2023


[TTS] Add mel norm param to FilterbankFeaturesTA

cc0e6f5

Signed-off-by: Ryan <[email protected]>

github-actions bot removed the TTS label Mar 11, 2023

rlangman and others added 2 commits March 10, 2023 17:02

[TTS] Fix config typo

0970b12

Signed-off-by: Ryan <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

103c7e9

for more information, see https://pre-commit.ci

XuesongYang approved these changes Mar 14, 2023


XuesongYang requested a review from titu1994 March 14, 2023 07:45

titu1994 approved these changes Mar 14, 2023


rlangman merged commit e322dd0 into main Mar 14, 2023

rlangman deleted the mel_norm branch March 14, 2023 16:03
