Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only import PII constants during Curator import #61

Merged
merged 2 commits into from
May 13, 2024

Conversation

ayushdg
Copy link
Collaborator

@ayushdg ayushdg commented May 10, 2024

One approach to fixing #59.
The root cause seems to come from the fact that importing space (which imports and calls a cupy cuda function) somehow impacts the state of the system and prevent a cluster starting up using all available GPUs on the machine.

Currently the only imports in Curator that lead to this situation is importing DEFAULT_LANGUAGE from the pii modules which transitively ends up importing presidio->spacy.

This pr moves these constants to a separate file so that we don't end up importing all other dependencies during Curator import.

…pacy and other GPU dependencies

Signed-off-by: Ayush Dattagupta <[email protected]>
@ayushdg ayushdg requested a review from VibhuJawa May 10, 2024 20:16
Copy link
Collaborator

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me, nitpicks around comments and code structure.

Thanks a ton for fixing this for us.

nemo_curator/modifiers/pii_modifier.py Outdated Show resolved Hide resolved
nemo_curator/pii/algorithm.py Show resolved Hide resolved
@ayushdg ayushdg marked this pull request as ready for review May 13, 2024 18:10
@VibhuJawa
Copy link
Collaborator

Tested locally, works, please go ahead and merge.

@ayushdg ayushdg merged commit 38d8ce7 into NVIDIA:main May 13, 2024
3 checks passed
rjzamora pushed a commit to rjzamora/NeMo-Curator that referenced this pull request May 14, 2024
* Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add comment around import, move constant import to global scope

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
ayushdg added a commit that referenced this pull request May 15, 2024
* update extract_partitioning_index with compat code

Signed-off-by: rjzamora <[email protected]>

* [Tutorials] Add a tutorial for PEFT data curation (#45)

This PR adds a new tutorial to demonstrate data curation for PEFT
use-cases.

Signed-off-by: Mehran Maghoumi <[email protected]>
Signed-off-by: rjzamora <[email protected]>

* move compat code to _compat file

Signed-off-by: rjzamora <[email protected]>

* Only import PII constants during Curator import (#61)

* Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add comment around import, move constant import to global scope

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>

* add unit test

Signed-off-by: rjzamora <[email protected]>

* add pytest.mark.gpu

Signed-off-by: rjzamora <[email protected]>

* move extract_partitioning_index import for now

Signed-off-by: rjzamora <[email protected]>

* test both cudf and pandas

Signed-off-by: rjzamora <[email protected]>

* spelling

Signed-off-by: rjzamora <[email protected]>

---------

Signed-off-by: rjzamora <[email protected]>
Signed-off-by: Mehran Maghoumi <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Mehran Maghoumi <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
VibhuJawa pushed a commit to VibhuJawa/NeMo-Curator that referenced this pull request May 16, 2024
* Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add comment around import, move constant import to global scope

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
VibhuJawa pushed a commit to VibhuJawa/NeMo-Curator that referenced this pull request May 16, 2024
…DIA#60)

* update extract_partitioning_index with compat code

Signed-off-by: rjzamora <[email protected]>

* [Tutorials] Add a tutorial for PEFT data curation (NVIDIA#45)

This PR adds a new tutorial to demonstrate data curation for PEFT
use-cases.

Signed-off-by: Mehran Maghoumi <[email protected]>
Signed-off-by: rjzamora <[email protected]>

* move compat code to _compat file

Signed-off-by: rjzamora <[email protected]>

* Only import PII constants during Curator import (NVIDIA#61)

* Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add comment around import, move constant import to global scope

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>

* add unit test

Signed-off-by: rjzamora <[email protected]>

* add pytest.mark.gpu

Signed-off-by: rjzamora <[email protected]>

* move extract_partitioning_index import for now

Signed-off-by: rjzamora <[email protected]>

* test both cudf and pandas

Signed-off-by: rjzamora <[email protected]>

* spelling

Signed-off-by: rjzamora <[email protected]>

---------

Signed-off-by: rjzamora <[email protected]>
Signed-off-by: Mehran Maghoumi <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Mehran Maghoumi <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
nicoleeeluo pushed a commit to nicoleeeluo/NeMo-Curator that referenced this pull request May 20, 2024
* Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add comment around import, move constant import to global scope

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>
ryantwolf added a commit that referenced this pull request May 24, 2024
* Init commit for tutorial notebook

Signed-off-by: Nicole Luo <[email protected]>

* Fix metadata inference with pandas and dask (#35)

* Fix metadata inference with pandas and dask

Signed-off-by: Ryan Wolf <[email protected]>

* Fix datatypes for task decontamination

Signed-off-by: Ryan Wolf <[email protected]>

* Use targetted import

Signed-off-by: Ryan Wolf <[email protected]>

---------

Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Disable PyTorch Compile Multiprocessing (#34)

* Move tokenizer import

Signed-off-by: Ryan Wolf <[email protected]>

* Reduce inductor threads

Signed-off-by: Ryan Wolf <[email protected]>

* Change env int to string

Signed-off-by: Ryan Wolf <[email protected]>

* Change location of env var

Signed-off-by: Ryan Wolf <[email protected]>

* Add comment linking issue

Signed-off-by: Ryan Wolf <[email protected]>

---------

Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Improve speed of AddId module (#36)

* Add fast id method

Signed-off-by: Ryan Wolf <[email protected]>

* Add type conversion

Signed-off-by: Ryan Wolf <[email protected]>

* Fix off by one errors in tests

Signed-off-by: Ryan Wolf <[email protected]>

---------

Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Make GPU dependencies optional (#27)

* Move GPU imports and make them optional

Signed-off-by: Ayush Dattagupta <[email protected]>

* Move gpu dependencies to a seperate install

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove unused import

Signed-off-by: Ayush Dattagupta <[email protected]>

* Switch to placeholder import that raises on usage

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove deprecated utils usage

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add cuML attribution

Signed-off-by: Ayush Dattagupta <[email protected]>

* Safe import tests, improve install instruction, update gha workflow

Signed-off-by: Ayush Dattagupta <[email protected]>

* Fix pytests due to loc bug

Signed-off-by: Ayush Dattagupta <[email protected]>

* update install instructions

Signed-off-by: Ayush Dattagupta <[email protected]>

* Raise on non module-not-found errors, update logging

Signed-off-by: Ayush Dattagupta <[email protected]>

* Update logging to not change root logger

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Fix failing GPU tests with latest pandas bump (#41)

Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Adds Nemo Curator K8s example (#40)

* [K8s]: Adds a helper script to create a dask cluster on k8s and includes
instructions for how to a Curator workload on k8s

Signed-off-by: Terry Kong <[email protected]>

* black formatting

Signed-off-by: Terry Kong <[email protected]>

* big_english -> my_dataset

Signed-off-by: Terry Kong <[email protected]>

* 24.01 -> 24.03 default container

Signed-off-by: Terry Kong <[email protected]>

* Add help kwarg to all flags

Signed-off-by: Terry Kong <[email protected]>

* Clarify why venv is needed

Signed-off-by: Terry Kong <[email protected]>

* fix precommit failures

Signed-off-by: Terry Kong <[email protected]>

---------

Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Move common dedup utils and remove unused code (#42)

* Refactor common utils and remove unused code

Signed-off-by: Ayush Dattagupta <[email protected]>

* More cleanup

Signed-off-by: Ayush Dattagupta <[email protected]>

* More updates/shuffling

Signed-off-by: Ayush Dattagupta <[email protected]>

* Move gpu_dedup scripts into subfolder

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove gpu_deduplication subfolder

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add readme to fuzzy dedup scripts section

Signed-off-by: Ayush Dattagupta <[email protected]>

* Fix typo and relative links

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove legacy script entrypoints

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove legacy scripts and add init file

Signed-off-by: Ayush Dattagupta <[email protected]>

* Update GpuDeduplication.rst

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Fix lang id example (#37)

* Fix lang id example

Signed-off-by: Ryan Wolf <[email protected]>

* Add classifier unit tests

Signed-off-by: Ryan Wolf <[email protected]>

* Add test for failure

Signed-off-by: Ryan Wolf <[email protected]>

* Remove failure test

Signed-off-by: Ryan Wolf <[email protected]>

---------

Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Add dataset blending tool (#32)

* Add initial dataset blending function

Signed-off-by: Ryan Wolf <[email protected]>

* Add blend unit tests

Signed-off-by: Ryan Wolf <[email protected]>

* Add self parameter

Signed-off-by: Ryan Wolf <[email protected]>

* Fix return type of blend dataset

Signed-off-by: Ryan Wolf <[email protected]>

* Fix blending tests

Signed-off-by: Ryan Wolf <[email protected]>

* Change assert statement for very uneven blend

Signed-off-by: Ryan Wolf <[email protected]>

* Fix key error

Signed-off-by: Ryan Wolf <[email protected]>

* Add proper proportion blending test

Signed-off-by: Ryan Wolf <[email protected]>

* Add four dataset blend and clarify docs

Signed-off-by: Ryan Wolf <[email protected]>

* Add shuffle module

Signed-off-by: Ryan Wolf <[email protected]>

* Add blend example and tests

Signed-off-by: Ryan Wolf <[email protected]>

* Fix random method name

Signed-off-by: Ryan Wolf <[email protected]>

* Wrap return type in DocumentDataset

Signed-off-by: Ryan Wolf <[email protected]>

* Save result of column drop

Signed-off-by: Ryan Wolf <[email protected]>

* Change equality check for shuffle tests

Signed-off-by: Ryan Wolf <[email protected]>

* Fix expected order after shuffle

Signed-off-by: Ryan Wolf <[email protected]>

* Add more documents to shuffle test

Signed-off-by: Ryan Wolf <[email protected]>

* Add assert statement

Signed-off-by: Ryan Wolf <[email protected]>

* Add within partition shuffle

Signed-off-by: Ryan Wolf <[email protected]>

* Refactor add rand column for shuffle

Signed-off-by: Ryan Wolf <[email protected]>

* Fix filename tests

Signed-off-by: Ryan Wolf <[email protected]>

* Add determinism handling for shuffle

Signed-off-by: Ryan Wolf <[email protected]>

* Change numpy random function

Signed-off-by: Ryan Wolf <[email protected]>

* Fix tests with new random method

Signed-off-by: Ryan Wolf <[email protected]>

* Remove length call from blending

Signed-off-by: Ryan Wolf <[email protected]>

* Improve scaling of blending function

Signed-off-by: Ryan Wolf <[email protected]>

* Fix blend tests

Signed-off-by: Ryan Wolf <[email protected]>

* Add blending script

Signed-off-by: Ryan Wolf <[email protected]>

* Add additional file paths call

Signed-off-by: Ryan Wolf <[email protected]>

* Add documentation

Signed-off-by: Ryan Wolf <[email protected]>

* Reformat docs

Signed-off-by: Ryan Wolf <[email protected]>

* Remove backticks

Signed-off-by: Ryan Wolf <[email protected]>

* Add context manager for shuffle tests

Signed-off-by: Ryan Wolf <[email protected]>

* Add better deterministic shuffle path

Signed-off-by: Ryan Wolf <[email protected]>

* Update documentation and reset index

Signed-off-by: Ryan Wolf <[email protected]>

---------

Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* High level fuzzy duplicates module (#46)

* Initial pass at fuzzy dedup api

Signed-off-by: Ayush Dattagupta <[email protected]>

* Update deprecated shuffle arg

Signed-off-by: Ayush Dattagupta <[email protected]>

* dask_cuda gpu only import

Signed-off-by: Ayush Dattagupta <[email protected]>

* Move fuzzy_dedup imports to optional

Signed-off-by: Ayush Dattagupta <[email protected]>

* more tests

Signed-off-by: Ayush Dattagupta <[email protected]>

* Move FuzzyDeDupConfig to it's own class

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add example script and config file, fix typo

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove slurm examples for gpu dedup

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add config module

Signed-off-by: Ayush Dattagupta <[email protected]>

* Rename FuzzyDeDupConfig and minhash_length to  FuzzyDuplicatesConfig, num_hashes

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add comments and update example

Signed-off-by: Ayush Dattagupta <[email protected]>

* Write to same format as input in fuzzy dedup example

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Fix indexing in PII Modifier (#55)

* Fix pii index issue

Signed-off-by: Ryan Wolf <[email protected]>

* Add sequential wrapper

Signed-off-by: Ryan Wolf <[email protected]>

* Fix pii tests

Signed-off-by: Ryan Wolf <[email protected]>

---------

Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Disable string conversion globally (#56)

Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Fix issue #43 (empty files creation) and improve reading/writing speed (#57)

This commit fixes issue #43 (empty files created when invoking reshard_jsonl method at nemo_curator.utils.file_utils.py) by double-checking the files size after being generated, and deleting them with size zero.

In addition to that, I have noticed there is no need to parse to JSON object the content of the different lines, which should be already in json format. By removing that extra-parsing, there is a significant speed up in the execution of this method.

Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* [Tutorials] Add a tutorial for PEFT data curation (#45)

This PR adds a new tutorial to demonstrate data curation for PEFT
use-cases.

Signed-off-by: Mehran Maghoumi <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Only import PII constants during Curator import (#61)

* Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add comment around import, move constant import to global scope

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Deleting links

Signed-off-by: Nicoel Luo <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: nicoleeeluo <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: nicoleeeluo <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: nicoleeeluo <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: nicoleeeluo <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: nicoleeeluo <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: nicoleeeluo <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: nicoleeeluo <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: nicoleeeluo <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: nicoleeeluo <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: nicoleeeluo <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: nicoleeeluo <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <[email protected]>
Signed-off-by: nicoleeeluo <[email protected]>
Signed-off-by: Nicole Luo <[email protected]>

* Fixed typo. Update content to lastest NeMo Curator version. Added fuzzy deduplication wrapper example

Signed-off-by: Nicole Luo <[email protected]>

* Fixing Style

Signed-off-by: Nicole Luo <[email protected]>

* Updating container version

Signed-off-by: Nicole Luo <[email protected]>

* Fixing style

Signed-off-by: Nicole Luo <[email protected]>

* Update get_client() according to latest version; Update log path for map_bucket section

Signed-off-by: Nicole Luo <[email protected]>

---------

Signed-off-by: Nicole Luo <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Mehran Maghoumi <[email protected]>
Signed-off-by: Nicoel Luo <[email protected]>
Signed-off-by: nicoleeeluo <[email protected]>
Co-authored-by: Ryan Wolf <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Terry Kong <[email protected]>
Co-authored-by: Miguel Martínez <[email protected]>
Co-authored-by: Mehran Maghoumi <[email protected]>
Co-authored-by: Ryan Wolf <[email protected]>
@ayushdg ayushdg deleted the gpu-spacy-import-oom branch June 3, 2024 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants