Loudness normalization with pyloudnorm #1016

Merged: 5 commits, Apr 6, 2023

Collaborator

@desh2608 desh2608 commented Apr 5, 2023

Related to #966.

This PR adds a method normalize_loudness() for recordings. It takes an argument target, which specifies the desired loudness (around -23 dB is usually a good value). The implementation uses pyloudnorm.

Also, we move ReverbWithImpulseResponse out of torchaudio.py and into its own file, since it is not a torchaudio-based transform.

# clipping the audio.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    loudness_normalized_audio = pyln.normalize.loudness(audio.T, loudness, target)
Collaborator

If audio clipping is an issue here you can add a limiter as a post-processing step, e.g. https://github.com/pzelasko/cylimiter

Collaborator Author

Sure, I can add it. But I don't have enough familiarity with limiters (or "loudness" for that matter) to know what to do exactly.

Collaborator

@pzelasko pzelasko Apr 5, 2023

The defaults should "just work" with pretty much anything. It basically keeps track of the signal's loudness with a small lookahead and reduces the gain if it crosses some threshold. Think of it as soft clipping that doesn't introduce as much distortion as hard clipping.
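To make the idea concrete, here is a minimal sketch of a lookahead peak limiter in plain Python. It is a toy illustration of "look ahead, then reduce the gain when the signal crosses a threshold", not cylimiter's actual algorithm or API. The function name and the `threshold`, `lookahead`, and `release` parameters are assumptions made for this example.

```python
from typing import List


def lookahead_limiter(
    samples: List[float],
    threshold: float = 0.9,
    lookahead: int = 32,
    release: float = 0.01,
) -> List[float]:
    """Toy peak limiter (hypothetical helper, not a library function)."""
    out: List[float] = []
    gain = 1.0
    for i, sample in enumerate(samples):
        # Peak over the current sample and the next `lookahead` samples.
        peak = max(abs(s) for s in samples[i : i + lookahead + 1])
        # Largest gain that keeps this window at or below the threshold.
        needed = 1.0 if peak <= threshold else threshold / peak
        if needed < gain:
            # Attack: lower the gain right away, before the peak arrives.
            gain = needed
        else:
            # Release: recover gradually towards the allowed gain.
            gain += release * (needed - gain)
        out.append(sample * gain)
    return out


# A quiet signal with a single loud spike.
loud = [0.1] * 50 + [2.0] + [0.1] * 50
limited = lookahead_limiter(loud)
print(max(abs(s) for s in limited))
```

Unlike hard clipping, the gain changes gradually after the spike, so the waveform shape around the peak is scaled rather than flattened.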

Collaborator

.. but again I don't know if that's a real problem with this approach and worth the extra dependency, so it's your call :)

Collaborator Author

So far I have only used it to make the LibriCSS distant mic audio louder (it is at around -52 dB originally), and it sounds okay even with the clipping. I suppose we can let it be for now and add the limiter later if someone needs it?
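For a sense of scale, the arithmetic behind that clipping can be sketched as follows. The -52 dB figure comes from the comment above. The -23 dB target is an assumption taken from the PR description, since the comment does not state the target that was actually used. The helper name is hypothetical, and the dB-to-linear conversion shown is the standard amplitude formula.

```python
def db_to_linear_gain(measured_db: float, target_db: float) -> float:
    """Linear amplitude factor needed to move from measured_db to target_db."""
    return 10.0 ** ((target_db - measured_db) / 20.0)


measured_db = -52.0  # LibriCSS distant-mic level quoted in the comment
target_db = -23.0    # assumed target, taken from the PR description

gain = db_to_linear_gain(measured_db, target_db)
# Any input sample whose magnitude exceeds this level ends up above 1.0.
clip_level = 1.0 / gain

print(f"gain: {gain:.1f}x, inputs above {clip_level:.4f} would exceed full scale")
```

A 29 dB boost multiplies every sample by roughly 28, so even modest peaks in the original recording can exceed full scale after normalization.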

Collaborator

@pzelasko pzelasko left a comment

Looks good to me, I left a single comment, up to you :)

@pzelasko pzelasko added this to the v1.14 milestone Apr 5, 2023
@desh2608 desh2608 merged commit 3a3ed61 into lhotse-speech:master Apr 6, 2023
@desh2608 desh2608 deleted the loud_norm branch April 6, 2023 01:54
"""

target: float
sampling_rate: int = 16000
Contributor

@pzelasko should we make sampling_rate a member?

class AudioTransform:
    def __call__(self, samples: np.ndarray, sampling_rate: int) -> np.ndarray:
        """
        Apply transform.

        To be implemented in derived classes.
        """
        raise NotImplementedError

:return: a modified copy of the current ``Recording``.
"""
transforms = self.transforms.copy() if self.transforms is not None else []
transforms.append(LoudnessNormalization(target=target).to_dict())
Contributor

sampling_rate is missing here.
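The concern can be illustrated with a minimal sketch. If the transform stores sampling_rate as a field with a default of 16000 and the call site passes only target, the serialized transform records 16000 regardless of the recording's real rate. The class below is a hypothetical stand-in, not lhotse's implementation. The layout returned by to_dict() and the 44100 Hz rate are assumptions made for this example.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class LoudnessNormalizationSketch:
    """Hypothetical stand-in for the transform under review."""

    target: float
    sampling_rate: int = 16000

    def to_dict(self) -> Dict[str, Any]:
        # Assumed serialization layout, for illustration only.
        return {"name": type(self).__name__, "kwargs": asdict(self)}


recording_sampling_rate = 44100  # assumed rate of some recording

# Mirrors the reviewed line: only `target` is passed, so the default is stored.
implicit = LoudnessNormalizationSketch(target=-23.0).to_dict()

# Passing the recording's rate explicitly keeps the two in sync.
explicit = LoudnessNormalizationSketch(
    target=-23.0, sampling_rate=recording_sampling_rate
).to_dict()

print(implicit["kwargs"]["sampling_rate"], explicit["kwargs"]["sampling_rate"])
```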

@lifeiteng
Contributor

@desh2608 @pzelasko

I used set.normalize_loudness as below: https://github.com/lifeiteng/vall-e/blob/main/valle/bin/tokenizer.py#L173

if args.prefix == "aishell":
    # NOTE: the loudness of aishell audio files is around -33
    # The best way is datamodule --on-the-fly-feats --enable-audio-aug
    cut_set = cut_set.normalize_loudness(
        target=-20.0, affix_id=True
    )

cut_set = cut_set.resample(24000)

But the model's accuracy drops a lot.

[Screenshot 2023-04-14 23 44 31]

@desh2608
Collaborator Author

@lifeiteng what metric is this curve? If it is validation accuracy, did you normalize loudness for both train and validation sets? I would think that for the training set, it may be better to "perturb" the loudness (within some range) rather than normalize it, similar to how we perturb volume.
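As a rough sketch of what "perturbing" could look like, one option is to draw a separate loudness target for each training item from a fixed range instead of using a single value. The helper name, the range of -30 to -20, and the seed are assumptions made for this example, not values proposed in this thread.

```python
import random
from typing import List


def sample_loudness_targets(
    num_items: int,
    low: float = -30.0,
    high: float = -20.0,
    seed: int = 0,
) -> List[float]:
    """Draw one loudness target per item, uniformly from [low, high]."""
    rng = random.Random(seed)  # seeded so the targets are reproducible
    return [rng.uniform(low, high) for _ in range(num_items)]


targets = sample_loudness_targets(5)
# Each training cut would then be normalized to its own target, e.g. via a
# per-cut call analogous to the CutSet.normalize_loudness(target=...) call
# used in this thread (assumed here, not verified).
print(targets)
```

The validation set would still use one fixed target, so that the evaluation conditions stay constant across runs.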

@lifeiteng
Contributor

lifeiteng commented Apr 14, 2023

@desh2608 It is the training CrossEntropy Top10Accuracy.
Yes, I normalized loudness for both the train and validation sets.
This is for text-to-speech. Yes, "perturb" is better, but I want to keep it simple at the current stage.
I don't understand why normalization leads to a significant drop in accuracy. The bug (see the comments above) is not triggered.

@pzelasko
Collaborator

@lifeiteng If it's for TTS, can you listen to the train and dev examples before and after normalization, as well as to the model predictions? Maybe that could reveal if there's something funky going on.
