Enable Sem-dedup #130
* Rename CPUvsGPU.rst to cpuvsgpu.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename DataCuration.rsts to datacuration.rsts Signed-off-by: Andrew Schilling <[email protected]>
* Rename DistributedDataClassification.rst to distributeddataclassification.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename DocumentDataset.rst to documentdataset.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename Download.rst to download.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename GpuDeduplication.rst to gpudeduplication.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename KubernetesCurator.rst to kubernetescurator.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename QualityFiltering.rst to qualityfiltering.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename TaskDecontamination.rst to taskdecontamination.rst Signed-off-by: Andrew Schilling <[email protected]>
* Update index.rst: Setting all RST files to lowercase names. Signed-off-by: Andrew Schilling <[email protected]>
* Ignore docs for EOF fixer hook Signed-off-by: Ayush Dattagupta <[email protected]>
---------
Signed-off-by: Andrew Schilling <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
Added links to tutorials Signed-off-by: jgerh <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
@ryantwolf, thanks for the careful review. I think I have addressed all your comments; please feel free to take another look.
Thanks a ton @VibhuJawa, just a couple of nits and things I think you missed the first time around.
I have tested the most recent PR (using 10 data files with 12 clusters). The result is consistent with our original result. Thanks, Vibhu! This is the command:
python semdedup_example.py --input-data-dir /ads_ds3/data/SemDeDup_BenchMark/datasets/c4/realnewslike/modified --config-file configs_cf.yml
The content of configs_cf.yml:
Thanks so much for this, @faywang123. Appreciate all the help. @ayushdg, can we take @faywang123's test above and add it to our testing?
Two nits then we're set.
After a final review and test, the PR looks good to me. Thanks, @VibhuJawa, for all the hard work!
@ryantwolf, addressed the nits. Let me know.
Incredible work, so excited to have this be a part of NeMo Curator
Not blocking, but a couple of other suggestions:
- Adding an optional SemDedup import to the top-level modules/__init__.py file for GPU-only environments, allowing users to do something like from nemo_curator import SemDedup
- Adding semantic deduplication to the list of features both in the README.md as well as a page in docs/user-guide
Done.
Added readme.
Description
This PR builds on top of #118 and adds the following features:
sort_clusters.py and semdedup.py
Checklist