Add HuggingFace Datasets to NeMo ASR Dataset script #3513

Merged (10 commits) on Jan 28, 2022

titu1994
Collaborator

Changelog

  • Add a script to convert HuggingFace ASR datasets into preprocessed, NeMo-compatible datasets.
  • The converted datasets can then be turned into tarred datasets as usual.

NOTE:

  • The conversion DOES NOT PERFORM TEXT PREPROCESSING. That should be handled by the user at the manifest level.
  • The offline-mode dataset transformation is fast, but it requires 3 full copies of the dataset on disk at any given time, so it needs a lot of disk space.
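
Since text preprocessing is left to the user at the manifest level, a minimal sketch of such a pass might look like the following. It assumes the usual NeMo manifest layout (one JSON object per line with `audio_filepath`, `duration`, and `text` keys). The names `normalize_text` and `normalize_manifest` and the normalization rules (lowercasing, stripping punctuation) are illustrative assumptions and are not part of this PR.

```python
import json
import re
from pathlib import Path

# Hypothetical rule set: keep lowercase letters, digits, apostrophes, and spaces.
_DISALLOWED = re.compile(r"[^a-z0-9' ]+")
_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Lowercase, replace disallowed characters with spaces, collapse whitespace."""
    text = text.lower()
    text = _DISALLOWED.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


def normalize_manifest(src: str, dst: str) -> int:
    """Copy a JSON-lines manifest to `dst` with normalized `text`; return entry count."""
    count = 0
    with Path(src).open(encoding="utf-8") as fin, Path(dst).open("w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            entry = json.loads(line)
            entry["text"] = normalize_text(entry.get("text", ""))
            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Writing to a separate output file keeps the original manifest intact, which makes it easy to compare transcripts before and after normalization. The rule set would need adjusting for languages or tokenizers that rely on casing, punctuation, or non-ASCII characters.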

@titu1994 titu1994 requested a review from vsl9 January 25, 2022 23:25
lgtm-com bot commented Jan 25, 2022

This pull request introduces 2 alerts when merging 229a8b3 into 3146fca - view on LGTM.com

new alerts:

  • 1 for Unused local variable
  • 1 for Unused import

lgtm-com bot commented Jan 27, 2022

This pull request introduces 2 alerts when merging 10821e2 into d8354a2 - view on LGTM.com

new alerts:

  • 1 for Unused local variable
  • 1 for Unused import

## Usage - Offline Mode

python convert_hf_dataset_to_nemo.py \
output_dir=<Path to some storage drive that will holde preprocessed audio files> \
Collaborator

Typo: holde => hold

Collaborator Author

Fixed

if not os.path.exists(cfg.split_output_dir):
    os.makedirs(cfg.split_output_dir, exist_ok=True)

cfg.split = split
Collaborator

The line should be moved out of else scope (delete extra tab).

Collaborator Author

Done
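
For illustration, the corrected shape might look roughly like the sketch below. It is only a sketch: `prepare_split_dir` and the plain-dict `cfg` are hypothetical stand-ins for the script's config handling, and the branch structure is assumed from the review comment rather than taken from the diff. Note that `os.makedirs(..., exist_ok=True)` already tolerates an existing directory, so a separate `os.path.exists` guard is optional.

```python
import os


def prepare_split_dir(cfg: dict, split: str) -> dict:
    """Create the per-split output directory and record the active split."""
    if cfg.get("split_output_dir") is None:
        # Assumed default: one sub-directory per split under output_dir.
        cfg["split_output_dir"] = os.path.join(cfg["output_dir"], split)
    # exist_ok=True makes this a no-op when the directory already exists.
    os.makedirs(cfg["split_output_dir"], exist_ok=True)
    # Assigned outside any branch so that it is set on every code path.
    cfg["split"] = split
    return cfg
```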

import tqdm
from datasets import Audio, IterableDataset, load_dataset
from hydra.core.config_store import ConfigStore
from omegaconf import OmegaConf, open_dict
Collaborator

open_dict is unused (according to LGTM check)

Collaborator Author

Ah, thanks for catching that. I ended up not needing it.

Signed-off-by: smajumdar <[email protected]>
lgtm-com bot commented Jan 27, 2022

This pull request introduces 1 alert when merging 5d59252 into 101977e - view on LGTM.com

new alerts:

  • 1 for Unused local variable

Signed-off-by: smajumdar <[email protected]>
Collaborator

@vsl9 vsl9 left a comment

Thank you!

@titu1994 titu1994 merged commit 31e528c into NVIDIA:main Jan 28, 2022
@titu1994 titu1994 deleted the asr_hf_datasets branch January 28, 2022 17:08
nithinraok pushed a commit that referenced this pull request Feb 2, 2022
* First draft

Signed-off-by: smajumdar <[email protected]>

* Temp commit

Signed-off-by: smajumdar <[email protected]>

* Prepare HF to NeMo dataset preparation script

Signed-off-by: smajumdar <[email protected]>

* Improve conversion framework

Signed-off-by: smajumdar <[email protected]>

* Finalize HF dataset to NeMo ASR

Signed-off-by: smajumdar <[email protected]>

* Finalize HF dataset to NeMo ASR

Signed-off-by: smajumdar <[email protected]>

* Fix style

Signed-off-by: smajumdar <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

* Fixed dangling variable

Signed-off-by: smajumdar <[email protected]>

Co-authored-by: Vitaly Lavrukhin <[email protected]>
fayejf pushed a commit that referenced this pull request Mar 2, 2022