[TTS] Make mel spectrogram norm configurable #6155

Merged
merged 4 commits into from
Mar 14, 2023

Conversation

rlangman
Collaborator

@rlangman rlangman commented Mar 9, 2023

What does this PR do ?

Add optional configuration to AudioToMelSpectrogramPreprocessor to allow disabling area normalization in the librosa mel filter.

Applying area normalization to the mel spectrogram reduces its range of values from approximately [0, 100] to [0, 1]. This is detrimental once you take the log, because values in [0, 1] map to a poorly distributed set of negative features.

The intention is to experiment with disabling the area normalization and replacing the current mel spectrogram used in TTS which is:

log(max(1E-5, area_norm(mel)))

With a more accurate representation like:

log(1.0 + mel)

This will also enable us to compute energy features using the mel spectrogram instead of the linear spectrogram.
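
To make the difference between the two formulas above concrete, here is a minimal sketch that applies each log guard to a few scalar values. The function name `log_compress` is hypothetical; the guard names mirror the `log_zero_guard_type` config key shown under Usage, and treating "clamp" as the counterpart of "add" is an assumption about the preprocessor, not something stated in this PR.

```python
import math


def log_compress(value, guard_type, guard_value):
    """Log-compress one non-negative mel value using a zero guard (illustrative only)."""
    if guard_type == "add":
        # log(guard + mel): zero input maps to log(guard)
        return math.log(value + guard_value)
    if guard_type == "clamp":
        # log(max(guard, mel)): zero input maps to log(guard)
        return math.log(max(value, guard_value))
    raise ValueError(f"unknown guard_type: {guard_type!r}")


# Area-normalized mels fall roughly in [0, 1]; clamping at 1e-5 sends silence to about -11.5.
current = [log_compress(v, "clamp", 1e-5) for v in (0.0, 0.01, 1.0)]

# Unnormalized mels fall roughly in [0, 100]; adding 1.0 sends silence to exactly 0.
proposed = [log_compress(v, "add", 1.0) for v in (0.0, 1.0, 100.0)]

print(current)
print(proposed)
```

With the clamp guard every feature is zero or negative and silence sits far from the rest of the values, while the add-one guard keeps every feature non-negative with silence at zero.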

Collection: [TTS], [ASR]

Changelog

  • Add an optional parameter to AudioToMelSpectrogramPreprocessor to configure the mel filter normalization, defaulting to the librosa default of 'slaney'

Usage

.yaml config:

preprocessor:
  _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
  log: true
  log_zero_guard_type: add
  log_zero_guard_value: 1.0
  mag_power: 1.0
  mel_norm: null

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

@racoiaws
Collaborator

racoiaws commented Mar 9, 2023

Applying area normalization to the mel spectrogram reduces its range of values from approximately [0, 100] to [0, 1], which is detrimental when you take the log of the [0, 1] values producing a poor distribution of negative features.

I think there is a misunderstanding.
Let's say we have a white noise signal with uniform energy across the spectrum (STFT amplitude). With the standard "slaney" normalisation, the mel-spectrogram is supposed to also have uniform values in all bands. Reducing the range of values is not the primary goal, I guess it's more about interpretability.

Without normalisation we would get bigger numbers in bands where the mel-scale 'compresses' more (higher frequencies). Think sum instead of averaging (roughly). I think this is what you interpreted as 'extended' range.

This could be clarified by visualising distributions of values for different mel bands. Could you please do that? I would expect both higher mean and variance for upper bands.
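
The white-noise argument above can be checked with a small pure-Python sketch: build triangular mel filters with and without area normalization and apply them to a flat spectrum. This is not the librosa or NeMo implementation. The helper names are hypothetical, the HTK mel formula is used only for brevity (an assumption; librosa defaults to the Slaney mel scale), and the sample rate, FFT size, and band count are arbitrary illustrative choices.

```python
import math


def hz_to_mel(hz):
    # HTK-style mel formula, chosen here for brevity (assumption)
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(sample_rate, n_fft, n_mels, area_norm):
    """Triangular mel filters over the n_fft // 2 + 1 FFT bins."""
    n_bins = n_fft // 2 + 1
    fft_freqs = [i * sample_rate / n_fft for i in range(n_bins)]
    max_mel = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 band edges, evenly spaced on the mel scale
    edges = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    filters = []
    for m in range(n_mels):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        # Area normalization scales each triangle by 2 / (bandwidth in Hz)
        scale = 2.0 / (hi - lo) if area_norm else 1.0
        row = []
        for f in fft_freqs:
            rising = (f - lo) / (center - lo)
            falling = (hi - f) / (hi - center)
            row.append(scale * max(0.0, min(rising, falling)))
        filters.append(row)
    return filters


def apply_filters(filters, spectrum):
    return [sum(w * s for w, s in zip(row, spectrum)) for row in filters]


# Flat magnitude spectrum standing in for white noise
flat = [1.0] * (1024 // 2 + 1)
raw = apply_filters(mel_filterbank(22050, 1024, 80, area_norm=False), flat)
normed = apply_filters(mel_filterbank(22050, 1024, 80, area_norm=True), flat)

print("unnormalized, lowest vs highest band:", raw[0], raw[-1])
print("area-normalized, lowest vs highest band:", normed[0], normed[-1])
```

In this sketch the unnormalized band values grow with the width of each band (roughly a sum over more FFT bins), while the area-normalized values stay close to constant across bands, which matches the "sum instead of averaging" description.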

@racoiaws
Collaborator

racoiaws commented Mar 9, 2023

Note that the above is orthogonal to the log(1+mel) suggestion. I wonder why people don't do that. log(mel) is almost decibels, so it makes sense in theory, but log(1+mel) would make much more sense in practice.

@rlangman
Collaborator Author

rlangman commented Mar 9, 2023

Applying area normalization to the mel spectrogram reduces its range of values from approximately [0, 100] to [0, 1], which is detrimental when you take the log of the [0, 1] values producing a poor distribution of negative features.

I think there is a misunderstanding. Let's say we have a white noise signal with uniform energy across the spectrum (STFT amplitude). With the standard "slaney" normalisation, the mel-spectrogram is supposed to also have uniform values in all bands. Reducing the range of values is not the goal.

Without normalisation we would get bigger numbers in bands where the mel-scale 'compresses' (higher frequencies). Think sum instead of averaging (roughly). I think this is what you interpreted as 'extended' range.

This could be clarified by visualising distributions of values for different mel bands. Could you please do that? I would expect both higher mean and variance for upper bands.

True, the point of slaney normalization is definitely not to reduce the range of values. But it is a side effect, which does not work well with log compression. It could be that using the linear mels after normalization is better than using log mels without normalization (I suspect both would produce similar results).

If people are interested, I could create visuals showing the distribution of features using the different preprocessing configurations.

@racoiaws
Collaborator

racoiaws commented Mar 9, 2023

True, the point of slaney normalization is definitely not to reduce the range of values. But it is a side effect, which does not work well with log compression

My main point is, this side effect most likely materialises in the upper bands, with lesser effect on bands where the mel-scale is more linear. When used as input to a neural network, this effectively puts a bigger weight on higher frequencies from the start.

For your goal, I would rather suggest just scaling input before the logarithm (log(alpha * mel)), with alpha somewhere from 1 to 100 (default 1). This way, the value range would be widened the same for all bands.

I mean, this is all theory, experiments might show otherwise, but a priori I have concerns stated above
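
As a quick numeric check of how a scale factor interacts with the different log guards discussed in this thread, the sketch below compares the three forms on a few scalar values. The value of `alpha` and the sample inputs are arbitrary illustrative choices, not values taken from any experiment.

```python
import math

alpha = 50.0
values = [1e-4, 1e-2, 1.0]

# Plain log: scaling by alpha shifts every feature by the same constant, log(alpha).
shifts = [math.log(alpha * v) - math.log(v) for v in values]

# Clamp guard: the floor stays at log(1e-5) while unclamped values move up by log(alpha),
# so the distance between the floor and each value grows by log(alpha).
floor = math.log(1e-5)
clamped = [math.log(max(1e-5, alpha * v)) - floor for v in values]

# Add-one guard: the gain from scaling is not constant; it grows with the input value.
gains = [math.log(1.0 + alpha * v) - math.log(1.0 + v) for v in values]

print(shifts)
print(clamped)
print(gains)
```

Since the scale factor is applied identically to every band, none of these forms changes the relative weighting between bands; only the position of the features relative to the guard changes.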

@rlangman
Collaborator Author

rlangman commented Mar 9, 2023

Below is 1 audio sample with different mel configs. There are limits to what we can infer by just looking at 1 piece of audio, but at a high level we see:

  1. The spectrograms which add 1.0 to the log guard look a lot cleaner (most values in the spectrogram are very close to 0).
  2. Without log compression, there are a large number of small values and a small number of very large values/outliers. Taking the log results in a more even distribution.
  3. Area normalization reduces the scale of higher frequency bands relative to lower frequency bands, because the higher mel bands span a wider frequency range.

From this I guess that a function like log(1.0 + 50 * slaney(mel)) would be optimal, but is likely to produce the same behavior as log(1.0 + mel). Without running experiments, I would be OK with either but would slightly favor the latter by default due to its simplicity.

Would be happy to hear anyone else's thoughts on this.

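Figures (images not reproduced here), one spectrogram per configuration:

  • mel (figure: mel)
  • log(1.0 + mel) (figure: mel_log)
  • slaney(mel) (figure: mel_slaney)
  • log(max(1E-5, slaney(mel))) - Our current TTS config (figure: mel_slaney_log_guard)
  • log(max(1E-5, 50 * slaney(mel))) (figure: mel_slaney_50x_log_guard)
  • log(1.0 + slaney(mel)) (figure: mel_slaney_log)
  • log(1.0 + 50 * slaney(mel)) (figure: mel_slaney_50x_log)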

@XuesongYang
Collaborator

A little bit off-topic: this PR only makes the functions configurable, and it is the user's responsibility to decide which trick is more applicable to their needs. The discussion is very valuable; shall we move it somewhere else to continue?

Collaborator

@titu1994 titu1994 left a comment

This is fine with me, since it's backward compatible. Please ask for a final review when the PR is ready.

Review comment on nemo/collections/asr/modules/audio_preprocessing.py (outdated, resolved)
@titu1994
Collaborator

Side note - while this is a TTS change, you are modifying ASR code, so please make sure to notify someone from ASR for reviews (usually me or @VahidooX) before merging.

@github-actions github-actions bot removed the TTS label Mar 11, 2023
@XuesongYang
Collaborator

Added a resource showing the difference between the slaney and htk mel scales: https://groups.google.com/g/librosa/c/JDzUZggitYM

Collaborator

@titu1994 titu1994 left a comment

Looks good ! Thanks

@titu1994
Collaborator

@XuesongYang Yes, it's a good idea to open a discussion on the topic with a summary of this thread plus the plots. Better still, if experiments show a notable improvement, then simply updating the configs should be useful.

@rlangman rlangman merged commit e322dd0 into main Mar 14, 2023
@rlangman rlangman deleted the mel_norm branch March 14, 2023 16:03
titu1994 pushed a commit to titu1994/NeMo that referenced this pull request Mar 24, 2023
* [TTS] Make mel spectrogram norm configurable

Signed-off-by: Ryan <[email protected]>

* [TTS] Add mel norm param to FilterbankFeaturesTA

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023
* [TTS] Make mel spectrogram norm configurable

Signed-off-by: Ryan <[email protected]>

* [TTS] Add mel norm param to FilterbankFeaturesTA

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: hsiehjackson <[email protected]>
4 participants