[ASR] AudioToAudio datasets and related test #5196

anteju · 2022-10-18T23:11:06Z

What does this PR do ?

This is a draft PR of datasets for different audio-to-audio tasks.

Datasets

BaseAudioDataset: Abstract base class.
AudioToTargetDataset: A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal.
AudioToTargetWithReferenceDataset: A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal and an additional reference signal is available.
AudioToTargetWithEmbeddingDataset: A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal and an additional embedding signal is available. The embedding is assumed to be a vector.
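To make the difference between the three concrete datasets easier to see, here is a minimal sketch of what one example could contain in each case. Everything in it is a hypothetical illustration, not the code added in this PR: the names `ToyAudioDataset`, `make_dataset`, and the keys `input_signal`, `target_signal`, `reference_signal`, `embedding` are assumptions, and plain Python lists stand in for audio tensors.

```python
from typing import Dict, List

Signal = List[float]  # stand-in for an audio tensor (illustration only)

# Hypothetical mapping from dataset flavor to the signals each example must provide.
REQUIRED_KEYS = {
    "target": ["input_signal", "target_signal"],
    "target_with_reference": ["input_signal", "target_signal", "reference_signal"],
    "target_with_embedding": ["input_signal", "target_signal", "embedding"],
}


class ToyAudioDataset:
    """Hypothetical dataset: each item is a dict of named signals."""

    def __init__(self, examples: List[Dict[str, Signal]], required_keys: List[str]):
        # Fail early if an example lacks a signal this dataset flavor needs.
        for idx, example in enumerate(examples):
            missing = [key for key in required_keys if key not in example]
            if missing:
                raise KeyError(f"example {idx} is missing {missing}")
        self.examples = examples
        self.required_keys = required_keys

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, index: int) -> Dict[str, Signal]:
        # Return only the signals relevant for this dataset flavor.
        example = self.examples[index]
        return {key: example[key] for key in self.required_keys}


def make_dataset(kind: str, examples: List[Dict[str, Signal]]) -> ToyAudioDataset:
    """Build a toy dataset of the requested flavor."""
    return ToyAudioDataset(examples, REQUIRED_KEYS[kind])
```

The only difference between the flavors in this sketch is which named signals an item carries; the actual classes in the PR also handle loading and batching.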

Tests

Multiple tests are implemented in test_asr_datasets.py in class TestAudioDatasets. These tests include

test_audio_to_target_dataset: multiple tests for AudioToTargetDataset
test_audio_to_target_dataset_with_target_list: tests specifically a scenario where the target is provided as a list of files
test_audio_to_target_with_reference_dataset: tests for AudioToTargetWithReferenceDataset
test_audio_to_target_with_embedding_dataset: tests for AudioToTargetWithEmbeddingDataset

Tests can be started using the following command

pytest tests/collections/asr/test_asr_datasets.py::TestAudioDatasets

Collection: ASR

Changelog

Added datasets in audio_to_audio.py
Added unit tests in test_asr_datasets.py and test_audio_utils.py
Added an option in AudioSegment.segment_from_file to use a fixed offset

Usage

Usage is illustrated in unit tests, which can be executed using

pytest tests/collections/asr/test_asr_datasets.py::TestAudioDatasets

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

lgtm-com · 2022-10-18T23:28:44Z

This pull request introduces 1 alert when merging da6a76d into acb5073 - view on LGTM.com

new alerts:

1 for Unused local variable

lgtm-com · 2022-10-18T23:57:14Z

This pull request introduces 1 alert when merging 16a363e into acb5073 - view on LGTM.com

new alerts:

1 for Unused local variable

nemo/collections/asr/data/audio_to_audio.py

yzhang123 · 2022-10-19T15:40:45Z

nemo/collections/asr/data/audio_to_audio.py

+        sample_rate: desired sample rate for output samples
+        duration: Optional desired duration of output samples.
+                  If `None`, complete file will be loaded.
+                  If set, a random segment of `duration` seconds will be


does it make sense to load random segment of duration, or is there a way to load a file with fixed_offset and duration?

Usually audio-to-audio models are trained with a fixed duration, selecting a random segment from the audio. Testing is usually performed on the whole audio file. That's why I added support for these two essential use cases (random fixed length or whole audio).
I will add a fixed-duration, fixed-offset (non-random) option.

It would be great if the function docstring explained more about the optional multichannel aspect.
I think the diarization/buffered-ASR code should use this function to remove duplicated code.

Added loading fixed offset & fixed duration + more docstrings.
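The three loading modes discussed in this thread (whole file, random fixed-length segment, fixed offset with fixed duration) can be sketched as a small index-selection helper. This is only an illustration under stated assumptions: `select_segment` and its parameters are hypothetical names, not the API added in the PR, and clamping an out-of-range offset to the end of the file is a choice made for the sketch.

```python
import random
from typing import Optional, Tuple


def select_segment(
    num_samples: int,
    sample_rate: int,
    duration: Optional[float] = None,
    fixed_offset: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> Tuple[int, int]:
    """Return (start, end) sample indices of the segment to load."""
    if duration is None:
        # Whole file, e.g. for evaluation.
        return 0, num_samples

    # Never request more samples than the file contains.
    segment_len = min(int(duration * sample_rate), num_samples)
    max_start = num_samples - segment_len

    if fixed_offset is not None:
        # Deterministic start; clamped so the segment stays inside the file.
        start = min(max(0, int(fixed_offset * sample_rate)), max_start)
    else:
        # Random start, e.g. for training on fixed-length crops.
        rng = rng or random.Random()
        start = rng.randint(0, max_start)
    return start, start + segment_len
```

Passing an explicit `rng` keeps the random mode reproducible in tests, while `fixed_offset` removes the randomness entirely.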

nemo/collections/common/parts/preprocessing/collections.py

yzhang123 · 2022-10-19T17:33:36Z

Thanks for this draft @anteju ! This is very helpful.
I wonder what the pipeline will look like if you add all audio-based augmentations/perturbations. Could you add an example?
Also, I am not sure if this is out of scope, but it would be great to also see an example where we have (multiple) audios and (multiple) texts.

anteju · 2022-10-26T23:52:17Z

Thanks for this draft @anteju ! This is very helpful.
I wonder how the pipeline will look like if you add all audio based augmentations/perturbations. could you add an example?

The most straightforward way would be to add augmentor object to datasets, similarly to existing datasets.

Also, i am not sure if this is out of scope, but it would be great to also see an example where we have (multiple) audios and (multiple) text.

Could you please clarify multiple audios and multiple text?

yzhang123 · 2022-10-27T14:22:05Z

The most straightforward way would be to add augmentor object to datasets, similarly to existing datasets.
Yes, good idea!

Also, i am not sure if this is out of scope, but it would be great to also see an example where we have (multiple) audios and (multiple) text.

Could you please clarify multiple audios and multiple text?

Maybe have an example dataset class whose get_item can return two or more audio files and two or more transcripts? This is the most general class that comes to mind, and it would be a POC for the general design you worked on. This would allow us to then easily adapt the class for TSASR, which returns two audios and one transcript.

titu1994 · 2022-10-27T16:37:40Z

Augmentor should be a detached operation, even if it's part of the config. I.e., don't use the way we do it right now, where we have no control over augmentation and it's all random inside of WaveformFeaturizer. Let's separate it and make the call to augmentation more controllable inside of the new data loaders.

anteju · 2022-10-27T19:28:16Z

The most straightforward way would be to add augmentor object to datasets, similarly to existing datasets.
Yes, good idea!

Also, i am not sure if this is out of scope, but it would be great to also see an example where we have (multiple) audios and (multiple) text.

Could you please clarify multiple audios and multiple text?

Maybe have an example dataset class whose get_item can return two or more audio files and two or more transcripts? This is the most general class that comes to mind, and it would be a POC for the general design you worked on. This would allow us to then easily adapt the class for TSASR, which returns two audios and one transcript.

Re: audio signals
AudioToTargetDataset returns multiple audio signals: input and target. The target can, for example, be provided as a list of signals, which are concatenated along the channel dimension. This is also tested in test_audio_to_target_dataset_with_target_list.
AudioToTargetWithReferenceDataset returns multiple audio signals: input, target, and reference. This one should be exactly what's necessary for TSASR (reference signal = enrollment audio for TS).

Re: text
As discussed earlier, I'll take a look at that.
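The "list of target signals concatenated along the channel dimension" mentioned under "Re: audio signals" can be illustrated with a short sketch. It assumes each signal is stored as a list of channels, each channel being a list of samples; the function name `concatenate_along_channels` is hypothetical and is not taken from the PR.

```python
from typing import List

Channel = List[float]  # samples of one channel


def concatenate_along_channels(signals: List[List[Channel]]) -> List[Channel]:
    """Stack several (channels x samples) signals into one multichannel signal."""
    # All channels must have the same number of samples to form one signal.
    lengths = {len(channel) for signal in signals for channel in signal}
    if len(lengths) > 1:
        raise ValueError(f"all channels must have the same length, got {sorted(lengths)}")
    # Channels of the first signal come first, then those of the second, and so on.
    return [channel for signal in signals for channel in signal]
```

For example, two single-channel target files of equal length would become one two-channel target in this sketch.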

anteju · 2022-10-27T19:31:11Z

Augmentor should be a detached operation, even if it's part of the config. I.e., don't use the way we do it right now, where we have no control over augmentation and it's all random inside of WaveformFeaturizer. Let's separate it and make the call to augmentation more controllable inside of the new data loaders.

That is exactly the plan: to have augmentation inside the data loader (and not when loading audio, as in WaveformFeaturizer).
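A rough sketch of what "augmentation inside the data loader, detached from audio loading" could look like is given below. It is an assumption-laden illustration, not the design implemented in NeMo: `ToyDataset`, `random_gain`, and the `seed` argument are hypothetical, and lists of floats stand in for audio.

```python
import random
from typing import Callable, List, Optional

Signal = List[float]
Augmentor = Callable[[Signal, random.Random], Signal]


def random_gain(signal: Signal, rng: random.Random) -> Signal:
    """Hypothetical perturbation: scale the signal by a random gain in [0.5, 1.5]."""
    gain = rng.uniform(0.5, 1.5)
    return [gain * sample for sample in signal]


class ToyDataset:
    """Loading is deterministic; augmentation is a separate, explicit step."""

    def __init__(self, signals: List[Signal], augmentor: Optional[Augmentor] = None, seed: int = 0):
        self.signals = signals
        self.augmentor = augmentor
        # The dataset owns the random state, so augmentation is reproducible.
        self.rng = random.Random(seed)

    def __getitem__(self, index: int) -> Signal:
        signal = list(self.signals[index])  # "load" the audio, no randomness here
        if self.augmentor is not None:
            # Augmentation is applied after loading and can be swapped or disabled.
            signal = self.augmentor(signal, self.rng)
        return signal
```

Because the augmentor is a plain callable passed in from outside, it can still be built from the config while the dataset controls when and with which random state it is called.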

anteju · 2022-11-17T00:28:23Z

nemo/collections/asr/parts/preprocessing/segment.py

@@ -257,12 +257,22 @@ def from_file(

    @classmethod
    def segment_from_file(
-        cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None,
+        cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None, offset=None


@XuesongYang, I've added an option to specify a fixed (non-randomized) offset.
The new parameter offset defaults to None and as before results in a random offset.

lgtm-com · 2022-11-17T00:46:04Z

This pull request introduces 1 alert when merging 1bff667 into 563cc2f - view on LGTM.com

new alerts:

1 for Unused import

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine ⚙️ that powers LGTM.com. For more information, please check out our post on the GitHub blog.

XuesongYang

LGTM for segment_from_file() func. Added some comments accordingly. Thanks.

nemo/collections/asr/parts/preprocessing/segment.py

XuesongYang · 2022-11-22T00:22:42Z

nemo/collections/asr/parts/preprocessing/segment.py

@@ -257,12 +257,22 @@ def from_file(

    @classmethod
    def segment_from_file(
-        cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None,
+        cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None, offset=None


nit: offset could be enforced as a float. All negative values would mean no offset, so we could make it default to -1.0 to specify no offset. The benefit is that we can have a cleaner type hint.

XuesongYang · 2022-11-22T00:25:35Z

nemo/collections/asr/parts/preprocessing/segment.py

@@ -257,12 +257,22 @@ def from_file(

    @classmethod
    def segment_from_file(
-        cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None,
+        cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None, offset=None


It seems an easy add-on to include type hints for this function. Do you mind enforcing that? Thanks.
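The two conventions under discussion can be compared side by side. The sketch below is not the signature of `segment_from_file`; `pick_offset_optional` and `pick_offset_sentinel` are hypothetical helpers, and it assumes that "no offset" keeps the previous behavior of choosing a random offset.

```python
import random
from typing import Optional


def pick_offset_optional(max_offset: float, offset: Optional[float] = None) -> float:
    """Variant from the diff: None means 'choose a random offset'."""
    if offset is None:
        return random.uniform(0.0, max_offset)
    return offset


def pick_offset_sentinel(max_offset: float, offset: float = -1.0) -> float:
    """Variant suggested in review: any negative value means 'no fixed offset'."""
    if offset < 0:
        return random.uniform(0.0, max_offset)
    return offset
```

The sentinel variant keeps the type hint a plain `float`, while the `Optional` variant makes the "not set" case explicit; in both, `0.0` remains a valid fixed offset.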

XuesongYang · 2022-11-22T00:43:43Z

Discussed offline. Approved for my parts; please hold off merging until other folks approve.

nemo/collections/common/parts/preprocessing/manifest.py

tango4j · 2022-11-18T02:32:29Z

nemo/collections/asr/data/audio_to_audio.py

+]
+
+
+def load_samples_synchronized(


+1, adding one vote for this comment. This function feels like it is doing more than one thing.

nemo/collections/asr/data/audio_to_audio.py

titu1994 · 2022-11-23T01:52:31Z

nemo/collections/asr/data/audio_to_audio.py

+]
+
+
+def load_samples_synchronized(


The idea was to split the function itself into separate components, each a private method inside of the class, which can be overridden. Having a class which calls a monolithic function defeats the purpose of extensible code.
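The structure requested here, a class whose processing is split into small private methods that subclasses can override, could look roughly like the sketch below. All names (`ToyAudioProcessor`, `_load`, `_synchronize`, `_finalize`, `PeakNormalizingProcessor`) are hypothetical and chosen for illustration; they are not the methods of the class introduced in the PR.

```python
from typing import Dict, List

Signal = List[float]


class ToyAudioProcessor:
    """Hypothetical processor: each step is a small method that can be overridden."""

    def process(self, example: Dict[str, Signal]) -> Dict[str, Signal]:
        # The public entry point only orchestrates the individual steps.
        signals = self._load(example)
        signals = self._synchronize(signals)
        return self._finalize(signals)

    def _load(self, example: Dict[str, Signal]) -> Dict[str, Signal]:
        # Stand-in for reading audio files: copy the provided samples.
        return {name: list(signal) for name, signal in example.items()}

    def _synchronize(self, signals: Dict[str, Signal]) -> Dict[str, Signal]:
        # Keep all signals time-aligned by truncating to the shortest one.
        length = min(len(signal) for signal in signals.values())
        return {name: signal[:length] for name, signal in signals.items()}

    def _finalize(self, signals: Dict[str, Signal]) -> Dict[str, Signal]:
        # Default: return the signals unchanged.
        return signals


class PeakNormalizingProcessor(ToyAudioProcessor):
    """Overrides a single step without touching loading or synchronization."""

    def _finalize(self, signals: Dict[str, Signal]) -> Dict[str, Signal]:
        normalized = {}
        for name, signal in signals.items():
            peak = max((abs(sample) for sample in signal), default=0.0)
            normalized[name] = [sample / peak for sample in signal] if peak > 0 else signal
        return normalized
```

A subclass changes one step and inherits the rest, which is the extensibility a single monolithic function cannot offer.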

tango4j

The changes look good to me

* AudioToAudio datasets and related test
  Signed-off-by: Ante Jukić <[email protected]>
* Updated doc, created utility function in manifest to avoid code duplication
  Signed-off-by: Ante Jukić <[email protected]>
* Remove unused import
  Signed-off-by: Ante Jukić <[email protected]>
* Moved functionality to ASRAudioProcessor
  Signed-off-by: Ante Jukić <[email protected]>
* Addressed review comments
  Signed-off-by: Ante Jukić <[email protected]>
* Removed unused local variable
  Signed-off-by: Ante Jukić <[email protected]>

Signed-off-by: Ante Jukić <[email protected]>
Signed-off-by: Hainan Xu <[email protected]>

anteju requested review from titu1994, yzhang123 and krishnacpuvvada October 18, 2022 23:11

anteju force-pushed the dev/audio-datasets branch from da6a76d to 16a363e Compare October 18, 2022 23:43

anteju force-pushed the dev/audio-datasets branch 2 times, most recently from 88561ac to 0b127ad Compare October 19, 2022 03:39

yzhang123 reviewed Oct 19, 2022

anteju force-pushed the dev/audio-datasets branch 2 times, most recently from c73994d to 47ded05 Compare October 19, 2022 17:28

anteju force-pushed the dev/audio-datasets branch 2 times, most recently from ca69595 to 2dd5343 Compare October 27, 2022 00:31

anteju force-pushed the dev/audio-datasets branch from 2dd5343 to c42d9f1 Compare October 27, 2022 19:47

anteju mentioned this pull request Nov 8, 2022

[ASR] Audio processing base, multi-channel enhancement models #5356

Merged

8 tasks

anteju force-pushed the dev/audio-datasets branch from d989568 to 5c05957 Compare November 10, 2022 04:35

anteju requested review from tango4j and yzhang123 November 11, 2022 00:51

anteju force-pushed the dev/audio-datasets branch from 5c05957 to 7ea9b7b Compare November 15, 2022 22:54

anteju marked this pull request as ready for review November 16, 2022 01:52

anteju force-pushed the dev/audio-datasets branch from 7ea9b7b to 1bff667 Compare November 17, 2022 00:23

anteju commented Nov 17, 2022

anteju requested a review from XuesongYang November 17, 2022 00:28

anteju requested a review from titu1994 November 21, 2022 23:36

XuesongYang reviewed Nov 22, 2022

anteju requested a review from XuesongYang November 22, 2022 00:43

XuesongYang previously approved these changes Nov 22, 2022

tango4j requested changes Nov 22, 2022

anteju dismissed XuesongYang’s stale review via e0f166d November 22, 2022 01:33

anteju requested review from XuesongYang and tango4j November 22, 2022 01:34

titu1994 requested changes Nov 23, 2022

anteju added 5 commits November 29, 2022 15:04

AudioToAudio datasets and related test

e45c661

Signed-off-by: Ante Jukić <[email protected]>

Updated doc, created utility function in manifest to avoid code duplication

02e124c

Signed-off-by: Ante Jukić <[email protected]>

Remove unused import

98a2cc1

Signed-off-by: Ante Jukić <[email protected]>

Moved functionality to ASRAudioProcessor

8c4f3e7

Signed-off-by: Ante Jukić <[email protected]>

Addressed review comments

835b4e6

Signed-off-by: Ante Jukić <[email protected]>

anteju force-pushed the dev/audio-datasets branch from e0f166d to 835b4e6 Compare November 29, 2022 23:11

github-actions bot added the common label Nov 29, 2022

titu1994 previously approved these changes Nov 29, 2022

tango4j previously approved these changes Nov 30, 2022

Removed unused local variable

0567aa5

Signed-off-by: Ante Jukić <[email protected]>

anteju dismissed stale reviews from tango4j and titu1994 via 0567aa5 November 30, 2022 01:26

anteju requested review from tango4j and titu1994 November 30, 2022 01:31

tango4j approved these changes Nov 30, 2022

titu1994 merged commit 5c1d59e into NVIDIA:main Nov 30, 2022
