Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Text memmap dataset #4068

Merged
merged 65 commits into from
May 11, 2022
Merged

Conversation

michalivne
Copy link
Collaborator

@michalivne michalivne commented Apr 26, 2022

Signed-off-by: Micha Livne [email protected]

What does this PR do ?

Has mechanism to retire older ind files by updating internal idx version.

Indexing speed of 1443990774 samples in 147 files using 6 workers

Loading speed

[NeMo I 2022-04-29 00:23:22 text_memmap_dataset:85] Time loading 147 mem-mapped files: 0:00:04.395558
In [9]: len(ds)
Out[9]: 1443990774
# Timing without tokenizer
In [10]: %timeit -n 1000  ds[np.random.randint(len(ds))]
555 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# TIming with 'byte-level' tokenizer
In [20]: %timeit -n 1000  ds[np.random.randint(len(ds))]
724 µs ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Added TextMemMapDataset
  • Added CSVMemMapDataset
  • Retired MegatronDataset
  • Added nemo/collections/nlp/data/language_modeling/text_memmap_dataset.py to preprocess indices (else happs on the fly at first run)
  • Added nemo/collections/nlp/data/machine_translation/sequence_to_sequence_dataset.py
  • Added scripts/nlp_language_modeling/build_index_memmap_data.py

Usage

Example for caching index files:

NeMo/scripts/nlp_language_modeling/build_index_memmap_data.py *.txt

Index files will be created when instantiating a memory mapped dataset if missing.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Micha Livne <[email protected]>
Signed-off-by: Micha Livne <[email protected]>
Signed-off-by: Micha Livne <[email protected]>
@lgtm-com
Copy link

lgtm-com bot commented Apr 26, 2022

This pull request fixes 1 alert when merging 61daef3 into d97e0d3 - view on LGTM.com

fixed alerts:

  • 1 for `__init__` method calls overridden method

Signed-off-by: Micha Livne <[email protected]>
Signed-off-by: Micha Livne <[email protected]>
@lgtm-com
Copy link

lgtm-com bot commented Apr 27, 2022

This pull request introduces 1 alert and fixes 1 when merging 77cbd9b into d97e0d3 - view on LGTM.com

new alerts:

  • 1 for Syntax error

fixed alerts:

  • 1 for `__init__` method calls overridden method

@lgtm-com
Copy link

lgtm-com bot commented Apr 27, 2022

This pull request introduces 1 alert and fixes 1 when merging 061beeb into da1b56c - view on LGTM.com

new alerts:

  • 1 for Syntax error

fixed alerts:

  • 1 for `__init__` method calls overridden method

Signed-off-by: Micha Livne <[email protected]>
@lgtm-com
Copy link

lgtm-com bot commented Apr 27, 2022

This pull request introduces 4 alerts and fixes 1 when merging 535a2f9 into 0d052c8 - view on LGTM.com

new alerts:

  • 3 for Unused import
  • 1 for Syntax error

fixed alerts:

  • 1 for `__init__` method calls overridden method

@lgtm-com
Copy link

lgtm-com bot commented Apr 27, 2022

This pull request introduces 7 alerts and fixes 1 when merging 0b09a96 into 0d052c8 - view on LGTM.com

new alerts:

  • 5 for Unused import
  • 2 for Unused local variable

fixed alerts:

  • 1 for `__init__` method calls overridden method

Signed-off-by: Micha Livne <[email protected]>
@lgtm-com
Copy link

lgtm-com bot commented Apr 28, 2022

This pull request introduces 7 alerts and fixes 1 when merging 0f3810c into 70d9687 - view on LGTM.com

new alerts:

  • 5 for Unused import
  • 2 for Unused local variable

fixed alerts:

  • 1 for `__init__` method calls overridden method

@lgtm-com
Copy link

lgtm-com bot commented Apr 28, 2022

This pull request introduces 7 alerts and fixes 1 when merging 865d8c8 into 70d9687 - view on LGTM.com

new alerts:

  • 5 for Unused import
  • 2 for Unused local variable

fixed alerts:

  • 1 for `__init__` method calls overridden method

Signed-off-by: Micha Livne <[email protected]>
@lgtm-com
Copy link

lgtm-com bot commented Apr 28, 2022

This pull request introduces 7 alerts and fixes 1 when merging fbc977f into 70d9687 - view on LGTM.com

new alerts:

  • 5 for Unused import
  • 2 for Unused local variable

fixed alerts:

  • 1 for `__init__` method calls overridden method

Signed-off-by: Micha Livne <[email protected]>
@lgtm-com
Copy link

lgtm-com bot commented May 9, 2022

This pull request introduces 2 alerts when merging fe85c7b into 1d64497 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for `__init__` method calls overridden method

Signed-off-by: Micha Livne <[email protected]>
@lgtm-com
Copy link

lgtm-com bot commented May 9, 2022

This pull request introduces 1 alert when merging a014a2f into 1d64497 - view on LGTM.com

new alerts:

  • 1 for `__init__` method calls overridden method

@lgtm-com
Copy link

lgtm-com bot commented May 10, 2022

This pull request introduces 1 alert when merging f0b0d0e into 3816254 - view on LGTM.com

new alerts:

  • 1 for `__init__` method calls overridden method

michalivne and others added 5 commits May 11, 2022 01:21
Signed-off-by: Micha Livne <[email protected]>
Signed-off-by: Micha Livne <[email protected]>
Signed-off-by: Micha Livne <[email protected]>
Signed-off-by: Micha Livne <[email protected]>
@lgtm-com
Copy link

lgtm-com bot commented May 10, 2022

This pull request introduces 3 alerts when merging f4cba69 into 3816254 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Explicit export is not defined
  • 1 for `__init__` method calls overridden method

@lgtm-com
Copy link

lgtm-com bot commented May 10, 2022

This pull request introduces 2 alerts when merging f207758 into 3816254 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for `__init__` method calls overridden method

Signed-off-by: Micha Livne <[email protected]>
Signed-off-by: Micha Livne <[email protected]>
Signed-off-by: Micha Livne <[email protected]>
Signed-off-by: Micha Livne <[email protected]>
MaximumEntropy
MaximumEntropy previously approved these changes May 10, 2022
Copy link
Contributor

@MaximumEntropy MaximumEntropy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great PR!

Signed-off-by: Micha Livne <[email protected]>
Signed-off-by: Micha Livne <[email protected]>
@lgtm-com
Copy link

lgtm-com bot commented May 10, 2022

This pull request introduces 1 alert when merging c5b4c61 into 3816254 - view on LGTM.com

new alerts:

  • 1 for `__init__` method calls overridden method

@MaximumEntropy MaximumEntropy self-requested a review May 11, 2022 00:09
@MaximumEntropy MaximumEntropy merged commit 2c06602 into NVIDIA:main May 11, 2022
@MaximumEntropy MaximumEntropy deleted the dataset-memmap-text branch May 11, 2022 02:50
yidong72 pushed a commit that referenced this pull request May 12, 2022
* 1. Initial import.

Signed-off-by: Micha Livne <[email protected]>

* 1. Renamed file.

Signed-off-by: Micha Livne <[email protected]>

* 1. Removed MegatronDataset.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added CSVMemMapDataset.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added support in tokenizer.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added index building script.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added supported in glob when building indices.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added format version to .idx files.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added sanity checks.
2. Improved speed.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Consolidated MemmapSequenceToSequenceDataset and SequenceToSequenceDataset

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed usage of "SequenceToSequenceDataset"

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Style fix.

Signed-off-by: Micha Livne <[email protected]>

* 1. added control over ndex suffix files.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
titu1994 pushed a commit to titu1994/NeMo that referenced this pull request May 13, 2022
* 1. Initial import.

Signed-off-by: Micha Livne <[email protected]>

* 1. Renamed file.

Signed-off-by: Micha Livne <[email protected]>

* 1. Removed MegatronDataset.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added CSVMemMapDataset.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added support in tokenizer.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added index building script.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added supported in glob when building indices.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added format version to .idx files.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added sanity checks.
2. Improved speed.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Consolidated MemmapSequenceToSequenceDataset and SequenceToSequenceDataset

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed usage of "SequenceToSequenceDataset"

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Style fix.

Signed-off-by: Micha Livne <[email protected]>

* 1. added control over ndex suffix files.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants