Enable Sem-dedup #130

VibhuJawa · 2024-06-27T08:36:30Z

Description

This PR builds on #118 and adds the following features:

End-to-end Bash script
Improved README
More efficient, cleaned-up compute_embeddings.py and clustering.py
Dask-accelerated sort_clusters.py and semdedup.py

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

nemo_curator/scripts/semdedup/clustering.py

nemo_curator/utils/script_utils.py

ryantwolf

Thanks a ton @VibhuJawa, just a couple of nits and things I think you missed the first time around.

nemo_curator/modules/config.py

faywang123 · 2024-07-04T06:15:57Z

I have tested the most recent PR (using 10 data files with 12 clusters). The result is consistent with our original result. Thanks, Vibhu! This is the command:

python semdedup_example.py --input-data-dir /ads_ds3/data/SemDeDup_BenchMark/datasets/c4/realnewslike/modified --config-file configs_cf.yml

The content of configs_cf.yml:

cache_dir: "/ads_ds3/data/SemDeDup_BenchMark"
num_files: 10
id_col_name: 'id' 
id_col_type: 'int' 
input_column: 'text'
input_file_type: 'json'
embeddings_save_loc: "/ads_ds3/data/SemDeDup_BenchMark/embeddings_fbopt_c4_10_pr130"
embedding_model_name_or_path: 'facebook/opt-125m' 
embedding_batch_size: 32
embedding_max_mem_gb: 10

clustering_save_loc: "/ads_ds3/data/SemDeDup_BenchMark/results_fbopt_c4_10_pr130"
n_clusters: 12 # -- number of clusters
seed: 1234
max_iter: 100
# Kmeans can only be done with L2 using cuML.
Kmeans_with_cos_dist: False # Only False allowed

# -- which example to keep from each group of duplicates
which_to_keep: "hard"
# Largest cluster size that fits in memory. If a cluster is larger than
# this, we will divide it into smaller clusters and process each one
# separately.
largest_cluster_size_to_process: 100000
sim_metric: "cosine" # Only cosine is allowed.
eps_thresholds:  0.001 0.01
eps_to_extract: 0.001
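
For readers unfamiliar with how an eps threshold is usually applied in semantic deduplication, here is a minimal sketch. It assumes the common formulation in which items inside one cluster are compared by cosine similarity and an item is dropped when its similarity to any earlier item exceeds 1 - eps. The function names (cosine_similarity, keep_indices) and the toy embeddings are hypothetical and are not taken from this PR; the code in nemo_curator may differ in ordering and in details.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keep_indices(cluster_embeddings, eps):
    """Return the indices of items kept within a single cluster.

    Assumption: an item is dropped when its highest cosine similarity
    to any earlier item in the cluster is greater than 1 - eps.
    """
    keep = []
    for i, emb in enumerate(cluster_embeddings):
        # The first item has no earlier neighbours, so fall back to -1.0,
        # the lowest possible cosine similarity.
        max_sim = max(
            (cosine_similarity(emb, earlier) for earlier in cluster_embeddings[:i]),
            default=-1.0,
        )
        if max_sim <= 1.0 - eps:
            keep.append(i)
    return keep


# Toy cluster: the second vector is a near-duplicate of the first
# (cosine similarity is about 0.995).
cluster = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]

for eps in (0.001, 0.01):
    print(eps, keep_indices(cluster, eps))
```

With these toy vectors, eps = 0.001 keeps all three items, while eps = 0.01 drops the second one. A larger eps is the more aggressive setting, which is why a config can list several values in eps_thresholds and pick one with eps_to_extract.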

VibhuJawa · 2024-07-04T07:07:16Z

Thanks so much for this, @faywang123. Appreciate all the help.

@ayushdg, can we use @faywang123's test above in our testing?

ryantwolf

Two nits then we're set.

config/sem_dedup_config.yaml

faywang123 · 2024-07-05T19:47:05Z

After a final review and test, the PR looks good to me. Thanks, @VibhuJawa for all the hard work!

VibhuJawa · 2024-07-05T19:56:04Z

@ryantwolf, addressed the nits. Let me know.

ryantwolf

Incredible work, so excited to have this be a part of NeMo Curator

ayushdg

Not blocking but a couple of other suggestions:

Adding an optional SemDedup import to the top-level modules/__init__.py file for GPU-only environments, allowing users to do something like from nemo_curator import SemDedup
Adding semantic deduplication to the list of features in the README.md, as well as a page in docs/user-guide
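
The optional-import suggestion above can be sketched as follows. This is an illustration of the general pattern only, not the code that was merged: the helper name optional_import is hypothetical, and the module path is taken from the file list in this PR. On a machine without the package or its GPU dependencies, the name is simply set to None instead of raising at import time.

```python
import importlib


def optional_import(module_name, attr):
    """Return getattr(module, attr), or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError too, e.g. a missing GPU-only dependency.
        return None
    return getattr(module, attr, None)


# Assumed module path and class name, based on the files touched in this PR.
SemDedup = optional_import("nemo_curator.modules.semantic_dedup", "SemDedup")

if SemDedup is None:
    print("SemDedup is unavailable in this environment")
else:
    print("SemDedup is available")
```

A package __init__.py that uses this pattern can still be imported in CPU-only environments, and callers can check for None before using the GPU-only class.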

examples/semdedup_example.py

nemo_curator/modules/semantic_dedup.py

VibhuJawa · 2024-07-05T22:55:40Z

> Not blocking but a couple of other suggestions:
>
> Adding an optional SemDedup import to the top level modules/__init__.py file for gpu only environments. Allowing users to do something like from nemo_curator import SemDedup

Done.

> Adding semantic deduplication in the list of features both in the README.md as well as a page in docs/user-guide

Added to the README. The docs/user-guide page will be a follow-up.

nemo_curator/modules/__init__.py

* Applying SEO Best Pratices (NVIDIA#104)
* Rename CPUvsGPU.rst to cpuvsgpu.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename DataCuration.rsts to datacuration.rsts Signed-off-by: Andrew Schilling <[email protected]>
* Rename DistributedDataClassification.rst to distributeddataclassification.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename DocumentDataset.rst to documentdataset.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename Download.rst to download.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename GpuDeduplication.rst to gpudeduplication.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename KubernetesCurator.rst to kubernetescurator.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename QualityFiltering.rst to qualityfiltering.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename TaskDecontamination.rst to taskdecontamination.rst Signed-off-by: Andrew Schilling <[email protected]>
* Update index.rst Setting all RST files to lowercase names. Signed-off-by: Andrew Schilling <[email protected]>
* Ignore docs for EOF fixer hook Signed-off-by: Ayush Dattagupta <[email protected]>
--------- Signed-off-by: Andrew Schilling <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]> Co-authored-by: Ayush Dattagupta <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
* Shuffle CC result on group before writing out (NVIDIA#110) Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst (NVIDIA#113) Added links to tutorials Signed-off-by: jgerh <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
* first commit Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
* mv under modules dir Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
* first commit Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
* mv under modules dir Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
* first commit Signed-off-by: Vibhu Jawa <[email protected]>
* mv under modules dir Signed-off-by: Vibhu Jawa <[email protected]>
* embed by cluster saved Signed-off-by: Vibhu Jawa <[email protected]>
* id map script Signed-off-by: Vibhu Jawa <[email protected]>
* test commit Signed-off-by: Vibhu Jawa <[email protected]>
* add id map script Signed-off-by: Vibhu Jawa <[email protected]>
* Cleanup compute_embeddings_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]>
* Cleanup compute_embeddings_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]>
* Pre-commit style fixes Signed-off-by: Vibhu Jawa <[email protected]>
* clustering_dask_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]>
* Minor clean up to sort_clusters_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]>
* cleanup semdedup_crossfit Signed-off-by: Vibhu Jawa <[email protected]>
* Remove undo changes Signed-off-by: Vibhu Jawa <[email protected]>
* Remove rename changes Signed-off-by: Vibhu Jawa <[email protected]>
* Fix rename Signed-off-by: Vibhu Jawa <[email protected]>
* Readme formatting Signed-off-by: Vibhu Jawa <[email protected]>
* add dask to semdedup_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates Signed-off-by: Vibhu Jawa <[email protected]>
* configure max memory using a cli Signed-off-by: Vibhu Jawa <[email protected]>
* Dumb id results to parquet Signed-off-by: Vibhu Jawa <[email protected]>
* Embedding fixes Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates Signed-off-by: Vibhu Jawa <[email protected]>
* Working end to end Signed-off-by: Vibhu Jawa <[email protected]>
* Minor yaml fixes Signed-off-by: Vibhu Jawa <[email protected]>
* Undo changes to index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update .pre-commit-config.yaml Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Undo changes to docs/personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update fuzzy_dedup.py Signed-off-by: Vibhu Jawa <[email protected]>
* Undo changes to docs/personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Add end to end script in readme.md Signed-off-by: Vibhu Jawa <[email protected]>
* Add type hints Signed-off-by: Vibhu Jawa <[email protected]>
* Use dask for sort_clusters Signed-off-by: Vibhu Jawa <[email protected]>
* Make sort_clusters work on MNMG scales Signed-off-by: Vibhu Jawa <[email protected]>
* Cleaned up dask shutdown Signed-off-by: Vibhu Jawa <[email protected]>
* Decrease noise in E2E scripts Signed-off-by: Vibhu Jawa <[email protected]>
* Clean up scripts Signed-off-by: Vibhu Jawa <[email protected]>
* Fix scripts/end_to_end_script.sh Signed-off-by: Vibhu Jawa <[email protected]>
* Some more cleanup Signed-off-by: Vibhu Jawa <[email protected]>
* Add copyright Signed-off-by: Vibhu Jawa <[email protected]>
* Fix README.md Signed-off-by: Vibhu Jawa <[email protected]>
* Address reviews Signed-off-by: Vibhu Jawa <[email protected]>
* Make work with a SemDedupConfig Signed-off-by: Vibhu Jawa <[email protected]>
* Make work with SemDedupConfig Signed-off-by: Vibhu Jawa <[email protected]>
* Move to nemo-curator's logger Signed-off-by: Vibhu Jawa <[email protected]>
* Semdedup-extract_dedup_data.py Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Applying SEO Best Pratices (NVIDIA#104)
* Rename CPUvsGPU.rst to cpuvsgpu.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename DataCuration.rsts to datacuration.rsts Signed-off-by: Andrew Schilling <[email protected]>
* Rename DistributedDataClassification.rst to distributeddataclassification.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename DocumentDataset.rst to documentdataset.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename Download.rst to download.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename GpuDeduplication.rst to gpudeduplication.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename KubernetesCurator.rst to kubernetescurator.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename QualityFiltering.rst to qualityfiltering.rst Signed-off-by: Andrew Schilling <[email protected]>
* Rename TaskDecontamination.rst to taskdecontamination.rst Signed-off-by: Andrew Schilling <[email protected]>
* Update index.rst Setting all RST files to lowercase names. Signed-off-by: Andrew Schilling <[email protected]>
* Ignore docs for EOF fixer hook Signed-off-by: Ayush Dattagupta <[email protected]>
--------- Signed-off-by: Andrew Schilling <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]> Co-authored-by: Ayush Dattagupta <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Fix bad merge Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Add Module for embedding+clustering Signed-off-by: Vibhu Jawa <[email protected]>
* Add sorting to clustering Signed-off-by: Vibhu Jawa <[email protected]>
* Refactor Semdup modules Signed-off-by: Vibhu Jawa <[email protected]>
* Refactor Semdup modules Signed-off-by: Vibhu Jawa <[email protected]>
* Refactor Semdup modules Signed-off-by: Vibhu Jawa <[email protected]>
* Fix Readme.md Signed-off-by: Vibhu Jawa <[email protected]>
* Add a environment variable to silence HF warnings Signed-off-by: Vibhu Jawa <[email protected]>
* dask-cudf fix Signed-off-by: Vibhu Jawa <[email protected]>
* dask-cudf fix Signed-off-by: Vibhu Jawa <[email protected]>
* dask-cudf fix Signed-off-by: Vibhu Jawa <[email protected]>
* Make config a flat file based on reviews Signed-off-by: Vibhu Jawa <[email protected]>
* Add docstrings Signed-off-by: Vibhu Jawa <[email protected]>
* Fix argparse and seed function Signed-off-by: Vibhu Jawa <[email protected]>
* Use argparse to read config Signed-off-by: Vibhu Jawa <[email protected]>
* Move around config files Signed-off-by: Vibhu Jawa <[email protected]>
* Move around config files Signed-off-by: Vibhu Jawa <[email protected]>
* Move around config files Signed-off-by: Vibhu Jawa <[email protected]>
* Remove end_to_end_script.sh Signed-off-by: Vibhu Jawa <[email protected]>
* Append Readme Signed-off-by: Vibhu Jawa <[email protected]>
* Address Reviews Signed-off-by: Vibhu Jawa <[email protected]>
* Change config Signed-off-by: Vibhu Jawa <[email protected]>
* Make embedding creation optionally lazy Signed-off-by: Vibhu Jawa <[email protected]>
* fix docstring Signed-off-by: Vibhu Jawa <[email protected]>
* Address Reviews and docstrings Signed-off-by: Vibhu Jawa <[email protected]>
* Address Reviews and make eps_thresholds a list of values Signed-off-by: Vibhu Jawa <[email protected]>
* Minor import fix Signed-off-by: Vibhu Jawa <[email protected]>
* Empty Commit Signed-off-by: Vibhu Jawa <[email protected]>
* Add modules to __init__ and README.md Signed-off-by: Vibhu Jawa <[email protected]>
* Fix init Signed-off-by: Vibhu Jawa <[email protected]>
* Move comment Signed-off-by: Vibhu Jawa <[email protected]>
* Empty commit to restart CI (which failed due to a download issue) Signed-off-by: Vibhu Jawa <[email protected]>
* Empty commit to restart CI (which failed due to a download issue) Signed-off-by: Vibhu Jawa <[email protected]>
--------- Signed-off-by: Andrew Schilling <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> Signed-off-by: jgerh <[email protected]> Signed-off-by: avinashvem <[email protected]> Co-authored-by: Andrew Schilling <[email protected]> Co-authored-by: Ayush Dattagupta <[email protected]> Co-authored-by: jgerh <[email protected]> Co-authored-by: avinashvem <[email protected]>

VibhuJawa force-pushed the vjawa/dev_semdedup branch from d7c7b74 to 5d6a695 Compare June 27, 2024 08:38

aschilling-nv and others added 21 commits June 27, 2024 03:18

Shuffle CC result on group before writing out (NVIDIA#110)

f19df32

Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>

Update index.rst (NVIDIA#113)

42309e6

Added links to tutorials Signed-off-by: jgerh <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>

first commit

33332a8

Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>

mv under modules dir

c633677

Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>

first commit

d9b8545

Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>

mv under modules dir

dc135c4

Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>

first commit

968a3eb

Signed-off-by: Vibhu Jawa <[email protected]>

mv under modules dir

f5c51bb

Signed-off-by: Vibhu Jawa <[email protected]>

embed by cluster saved

f286678

Signed-off-by: Vibhu Jawa <[email protected]>

id map script

103c366

Signed-off-by: Vibhu Jawa <[email protected]>

test commit

451fa2d

Signed-off-by: Vibhu Jawa <[email protected]>

add id map script

dec4913

Signed-off-by: Vibhu Jawa <[email protected]>

Cleanup compute_embeddings_crossfit.py

bbbe400

Signed-off-by: Vibhu Jawa <[email protected]>

Cleanup compute_embeddings_crossfit.py

5d56cd0

Signed-off-by: Vibhu Jawa <[email protected]>

Pre-commit style fixes

9ddf558

Signed-off-by: Vibhu Jawa <[email protected]>

clustering_dask_crossfit.py

4ebab04

Signed-off-by: Vibhu Jawa <[email protected]>

Minor clean up to sort_clusters_crossfit.py

eeee758

Signed-off-by: Vibhu Jawa <[email protected]>

cleanup semdedup_crossfit

79beb61

Signed-off-by: Vibhu Jawa <[email protected]>

Remove undo changes

e11bbd5

Signed-off-by: Vibhu Jawa <[email protected]>

Remove rename changes

3179e24

Signed-off-by: Vibhu Jawa <[email protected]>

VibhuJawa force-pushed the vjawa/dev_semdedup branch 2 times, most recently from 0524727 to 3179e24 Compare June 27, 2024 10:27

VibhuJawa added 6 commits June 27, 2024 03:29

Fix rename

cbc9960

Signed-off-by: Vibhu Jawa <[email protected]>

Readme formatting

57469cb

Signed-off-by: Vibhu Jawa <[email protected]>

add dask to semdedup_crossfit.py

f60fc01

Signed-off-by: Vibhu Jawa <[email protected]>

README.md updates

c0e36f2

Signed-off-by: Vibhu Jawa <[email protected]>

README.md updates

61b21fd

Signed-off-by: Vibhu Jawa <[email protected]>

README.md updates

94b70f0

Signed-off-by: Vibhu Jawa <[email protected]>

ryantwolf reviewed Jul 3, 2024

nemo_curator/scripts/semdedup/clustering.py (resolved)

nemo_curator/utils/script_utils.py (resolved)

ryantwolf requested changes Jul 4, 2024

faywang123 reviewed Jul 4, 2024

nemo_curator/modules/config.py (outdated, resolved)

faywang123 reviewed Jul 4, 2024

nemo_curator/modules/config.py (outdated, resolved)

faywang123 reviewed Jul 4, 2024

nemo_curator/modules/config.py (outdated, resolved)

faywang123 reviewed Jul 4, 2024

nemo_curator/modules/config.py (outdated, resolved)

Address Reviews and docstrings

52480aa

Signed-off-by: Vibhu Jawa <[email protected]>

VibhuJawa force-pushed the vjawa/dev_semdedup branch from daf4b67 to 52480aa Compare July 5, 2024 18:07

ryantwolf requested changes Jul 5, 2024

config/sem_dedup_config.yaml (outdated, resolved)

config/sem_dedup_config.yaml (outdated, resolved)

VibhuJawa added 2 commits July 5, 2024 12:51

Address Reviews and make eps_thresholds a list of values

16ad760

Signed-off-by: Vibhu Jawa <[email protected]>

Minor import fix

584340a

Signed-off-by: Vibhu Jawa <[email protected]>

ryantwolf approved these changes Jul 5, 2024

Empty Commit

01affbb

Signed-off-by: Vibhu Jawa <[email protected]>

ayushdg approved these changes Jul 5, 2024

examples/semdedup_example.py (resolved)

nemo_curator/modules/semantic_dedup.py (resolved)

Add modules to __init__ and README.md

eaee1e5

Signed-off-by: Vibhu Jawa <[email protected]>

Fix init

1c0f706

Signed-off-by: Vibhu Jawa <[email protected]>

ayushdg approved these changes Jul 5, 2024

nemo_curator/modules/__init__.py (outdated, resolved)

nemo_curator/modules/__init__.py (outdated, resolved)

ayushdg mentioned this pull request Jul 5, 2024

Add pytests for semantic dedup #141

Open

VibhuJawa added 2 commits July 5, 2024 16:19

Move comment

12373a7

Signed-off-by: Vibhu Jawa <[email protected]>

Empty commit to restart CI (which failed due to a download issue)

da909f3

Signed-off-by: Vibhu Jawa <[email protected]>

ayushdg approved these changes Jul 5, 2024

Empty commit to restart CI (which failed due to a download issue)

c2cd97c

Signed-off-by: Vibhu Jawa <[email protected]>

VibhuJawa merged commit e557ee3 into NVIDIA:main Jul 5, 2024
3 checks passed

sarahyurick mentioned this pull request Sep 9, 2024

Better mimic DocumentDataset's read_* functions to Dask's read_* functions #50

Open
