Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Sem-dedup #130

Merged
merged 101 commits into from
Jul 5, 2024
Merged

Enable Sem-dedup #130

merged 101 commits into from
Jul 5, 2024

Conversation

VibhuJawa
Copy link
Collaborator

@VibhuJawa VibhuJawa commented Jun 27, 2024

Description

This PR builds on top #118 and adds the following features on top of it:

  1. End to End Bash script
  2. Improved Readme
  3. Efficient and cleaned up compute_embeddings.py, clustering.py
  4. Dask Accelerated sort_clusters.py and semdedup.py

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

aschilling-nv and others added 21 commits June 27, 2024 03:18
* Rename CPUvsGPU.rst to cpuvsgpu.rst

Signed-off-by: Andrew Schilling <[email protected]>

* Rename DataCuration.rsts to datacuration.rsts

Signed-off-by: Andrew Schilling <[email protected]>

* Rename DistributedDataClassification.rst to distributeddataclassification.rst

Signed-off-by: Andrew Schilling <[email protected]>

* Rename DocumentDataset.rst to documentdataset.rst

Signed-off-by: Andrew Schilling <[email protected]>

* Rename Download.rst to download.rst

Signed-off-by: Andrew Schilling <[email protected]>

* Rename GpuDeduplication.rst to gpudeduplication.rst

Signed-off-by: Andrew Schilling <[email protected]>

* Rename KubernetesCurator.rst to kubernetescurator.rst

Signed-off-by: Andrew Schilling <[email protected]>

* Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst

Signed-off-by: Andrew Schilling <[email protected]>

* Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst

Signed-off-by: Andrew Schilling <[email protected]>

* Rename QualityFiltering.rst to qualityfiltering.rst

Signed-off-by: Andrew Schilling <[email protected]>

* Rename TaskDecontamination.rst to taskdecontamination.rst

Signed-off-by: Andrew Schilling <[email protected]>

* Update index.rst

Setting all RST files to lowercase names.

Signed-off-by: Andrew Schilling <[email protected]>

* Ignore docs for EOF fixer hook

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Andrew Schilling <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Added links to tutorials

Signed-off-by: jgerh <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: avinashvem <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: avinashvem <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: avinashvem <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: avinashvem <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
@VibhuJawa VibhuJawa force-pushed the vjawa/dev_semdedup branch 2 times, most recently from 0524727 to 3179e24 Compare June 27, 2024 10:27
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
@VibhuJawa
Copy link
Collaborator Author

@ryantwolf , Thanks for the careful review, I think I have addressed all your comments, please feel free to take another look.

Signed-off-by: Vibhu Jawa <[email protected]>
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a ton @VibhuJawa, just a couple of nits and things I think you missed the first time around.

@faywang123
Copy link

faywang123 commented Jul 4, 2024

I have tested the most recent PR (using 10 data files with 12 clusters). The result is consistent with our original result. Thanks, Vibhu! This is the command:

python semdedup_example.py --input-data-dir /ads_ds3/data/SemDeDup_BenchMark/datasets/c4/realnewslike/modified --config-file configs_cf.yml

The content of configs_cf.yml:

cache_dir: "/ads_ds3/data/SemDeDup_BenchMark"
num_files: 10
id_col_name: 'id' 
id_col_type: 'int' 
input_column: 'text'
input_file_type: 'json'
embeddings_save_loc: "/ads_ds3/data/SemDeDup_BenchMark/embeddings_fbopt_c4_10_pr130"
embedding_model_name_or_path: 'facebook/opt-125m' 
embedding_batch_size: 32
embedding_max_mem_gb: 10

clustering_save_loc: "/ads_ds3/data/SemDeDup_BenchMark/results_fbopt_c4_10_pr130"
n_clusters: 12 # -- number of clusters
seed: 1234
max_iter: 100
# Kmeans can only be done with L2 using cuML.
Kmeans_with_cos_dist: False # Only False allowed

# -- which example to keep from each group of duplicates
which_to_keep: "hard"
# largest cluster size the memory is large enough to process. If the
# cluster size is larger than it, we will devide the cluster into small
# clusters and process each one separately.
largest_cluster_size_to_process: 100000
sim_metric: "cosine" # Only cosine is allowed.
eps_thresholds:  0.001 0.01
eps_to_extract: 0.001

@VibhuJawa
Copy link
Collaborator Author

Thanks so much for this @faywang123 . Appreciate all the help.

@ayushdg , Can we use @faywang123 test above and put in our testing.

Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two nits then we're set.

config/sem_dedup_config.yaml Outdated Show resolved Hide resolved
config/sem_dedup_config.yaml Outdated Show resolved Hide resolved
@faywang123
Copy link

After a final review and test, the PR looks good to me. Thanks, @VibhuJawa for all the hard work!

@VibhuJawa
Copy link
Collaborator Author

@ryantwolf , Addressed the nits, let me know

Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incredible work, so excited to have this be a part of NeMo Curator

Signed-off-by: Vibhu Jawa <[email protected]>
Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking but a couple of other suggestions:

  1. Adding an optional SemDedup import to the top level modules/__init__.py file for gpu only environments. Allowing users to do something like from nemo_curator import SemDedup
  2. Adding semantic deduplication in the list of features both in the README.md as well as a page in docs/user-guide

examples/semdedup_example.py Show resolved Hide resolved
nemo_curator/modules/semantic_dedup.py Show resolved Hide resolved
@VibhuJawa
Copy link
Collaborator Author

Not blocking but a couple of other suggestions:

  1. Adding an optional SemDedup import to the top level modules/__init__.py file for gpu only environments. Allowing users to do something like from nemo_curator import SemDedup

Done .

  1. Adding semantic deduplication in the list of features both in the README.md as well as a page in docs/user-guide

Added readme. docs/user-guide will be a followup.

Signed-off-by: Vibhu Jawa <[email protected]>
nemo_curator/modules/__init__.py Outdated Show resolved Hide resolved
nemo_curator/modules/__init__.py Outdated Show resolved Hide resolved
@VibhuJawa VibhuJawa merged commit e557ee3 into NVIDIA:main Jul 5, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants