Add codepath for computing buckets without int conversion #326

ayushdg · 2024-10-25T15:22:19Z

Description

PR has 2 enhancements:

Improves performance when users skip the false positive (FP) check by not converting bucket_ids to integers; that conversion is only needed by map_buckets and the subsequent steps in the FP check path.
Improves error messages and handling for cases where the data contains no duplicates. Fixes [BUG] Fuzzy deduplication fails on datasets with no duplicates #67.

Usage

        lsh = LSH(
            ...,  # same params as earlier
            buckets_as_int=False,  # or True if planning to go via the FP check
        )

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Ayush Dattagupta <[email protected]>

praateekmahajan · 2024-11-04T19:10:21Z

nemo_curator/modules/fuzzy_dedup.py

+            import shutil
+
+            shutil.rmtree(write_path)


Not for this PR, but just a highlight from our Google Docs conversation: this is a good place to leverage fsspec.

Agreed. I decided to go this route for now, since other places also use shutil. We're aligned that the refactor to be more remote-friendly should leverage fsspec utilities where possible.
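For reference, a minimal sketch of what an fsspec-based replacement for shutil.rmtree could look like. The helper name remove_tree is hypothetical and not part of this PR; the sketch assumes fsspec is an optional dependency and falls back to shutil (local paths only) when it is not installed.

```python
import shutil

try:
    # Optional dependency: resolves a path or URL to a filesystem object.
    from fsspec.core import url_to_fs
except ImportError:
    url_to_fs = None


def remove_tree(path: str) -> None:
    """Recursively delete `path` (hypothetical helper, not in the PR)."""
    if url_to_fs is None:
        # fsspec is unavailable: behave like the current code, local paths only.
        shutil.rmtree(path)
        return
    # With fsspec, the same call works for local paths and remote URLs,
    # provided the matching filesystem implementation is installed.
    fs, fs_path = url_to_fs(path)
    fs.rm(fs_path, recursive=True)
```

With something like this, the call site would change from shutil.rmtree(write_path) to remove_tree(write_path), and remote write paths would go through the same code.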

ayushdg · 2024-11-13T19:49:33Z

nemo_curator/modules/fuzzy_dedup.py

+
+                shutil.rmtree(write_path)
+
+        return are_buckets_empty


Variable for tracking if all the buckets were empty

ayushdg · 2024-11-13T19:49:57Z

nemo_curator/modules/fuzzy_dedup.py

+
+        return are_buckets_empty
+
+    def _write_bucket_parquet(


Reviewers, PTAL at this logic. I've tried to cover most edge cases.

The only case I could think of is whether we ever have to worry about scalability here?

There is a non-zero cost to checking whether the buckets are empty. I've tried to write check_empty_buckets so that it returns at the first file in which it finds non-empty data, but this might be slow on large network-based filesystems. It should, however, be faster than the current approach of persisting the data first and converting to int.

Once a non-empty bucket is detected, that setting is carried through the remaining iterations, so the check is skipped from then on.
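The early exit and the sticky flag described above can be sketched roughly as follows. This is a simplified stand-in, not the code in the PR: check_empty_buckets here takes a list of per-file row counts instead of a path, and write_batches is a hypothetical name for the write loop.

```python
from typing import Iterable, List, Tuple


def check_empty_buckets(row_counts: Iterable[int]) -> bool:
    """Return True only if every file written so far has zero rows.

    Stand-in for the real helper: it receives per-file row counts
    instead of reading parquet metadata from a path.
    """
    for num_rows in row_counts:
        if num_rows > 0:
            # Stop at the first non-empty file; later files are not inspected.
            return False
    return True


def write_batches(batches: List[List[int]]) -> Tuple[bool, int]:
    """Simulate the write loop; returns (are_buckets_empty, checks_run)."""
    are_buckets_empty = True
    written: List[int] = []  # row counts of all files written so far
    checks_run = 0
    for batch in batches:
        written.extend(batch)  # "write" this batch's files
        # Only check while everything written so far has been empty.
        if are_buckets_empty:
            checks_run += 1
            are_buckets_empty = check_empty_buckets(written)
    return are_buckets_empty, checks_run
```

In this sketch the cost of the check is paid only until the first non-empty batch; a dataset with duplicates in an early batch pays for it once or twice, while a dataset with no duplicates at all pays for it on every iteration.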

ayushdg · 2024-11-13T19:50:44Z

nemo_curator/modules/fuzzy_dedup.py

+            )
+        # Only check if buckets written so far are empty
+        if are_buckets_empty:
+            are_buckets_empty = check_empty_buckets(write_path)


We need this check in the first place because there's no way to know whether we're writing out an empty dataframe unless we either persist it, or write it out, check the metadata, and then overwrite it on the next iteration.

ayushdg · 2024-11-13T19:52:39Z

nemo_curator/utils/fuzzy_dedup_utils/io_utils.py

+    ds = dataset(bucket_path, format="parquet")
+    for fragment in ds.get_fragments():
+        if fragment.metadata.num_rows > 0:
+            return False


This logic can probably be simplified by using a global metadata file when writing out the parquet dataset (write_metadata_file=True). However, this had some issues in 24.10 (rapidsai/cudf#17177) and is only fixed in 24.12. I will open an issue to simplify this method once that's merged in.

praateekmahajan · 2024-11-14T14:24:21Z

nemo_curator/modules/fuzzy_dedup.py

+            print(
+                f"Stage{stage_num}: No potential duplicate documents found during LSH"
+            )
+            return None


Should this return None or an empty DocumentDataset with no IDs?

I prefer returning None. Empty DocumentDatasets might lead to unexplained errors downstream that might be tougher to debug/understand. Happy to hear counter points.
One thing that comes up from this is that I might update the examples/FuzzyDedup.py to handle the case where the result returned was None

Makes sense, but then for Sequential I think we might want to handle that behavior too?

I haven't seen Sequential being used directly with FuzzyDuplicates, since the results cannot be processed downstream by any of the other modules without first using them to filter out the duplicates. I'm not sure how to handle this use case. Longer term, though, we would probably want to add a FuzzyDeduplicate class that calls FuzzyDuplicates and also handles removal.
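To illustrate the caller-side handling discussed in this thread, here is a minimal sketch of a removal step that tolerates a None result. The function name drop_duplicates and the use of plain ID collections are assumptions for illustration only; the real result is a DocumentDataset, and the updated example script may handle this differently.

```python
from typing import Iterable, List, Optional, Set


def drop_duplicates(
    doc_ids: Iterable[str], duplicate_ids: Optional[Set[str]]
) -> List[str]:
    """Filter out duplicate IDs, treating None as "no duplicates found".

    Hypothetical helper: `duplicate_ids` stands in for the IDs that the
    fuzzy duplicate stage marks for removal, or None when the LSH stage
    found no potential duplicates.
    """
    if duplicate_ids is None:
        # Nothing to remove; return the input unchanged.
        return list(doc_ids)
    return [doc_id for doc_id in doc_ids if doc_id not in duplicate_ids]
```

An explicit None check keeps the "no duplicates" case visible at the call site, which matches the preference stated above for failing early rather than passing an empty dataset downstream.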

sarahyurick · 2024-11-22T23:18:06Z

nemo_curator/modules/fuzzy_dedup.py

@@ -261,6 +262,7 @@ def __init__(
        num_hashes: int,
        num_buckets: int,
        buckets_per_shuffle: int = 1,
+        buckets_as_int: bool = False,


What do you think about calling this false_positive_check on the user facing side? I'm fine with then doing something like self.buckets_as_int = false_positive_check and referring to it as self.buckets_as_int everywhere else, but from a user perspective I think it might make it a little clearer about how to set this parameter.

I think it's a good suggestion. We can update the docstrings to indicate that it writes out data in the format required by the false positive check if set to True.
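A sketch of the renaming suggested here, assuming the user-facing parameter is called false_positive_check and the internal attribute keeps the name buckets_as_int. The class name LSHConfig is a hypothetical stand-in for the real constructor, and only the parameters shown in the diff above are included.

```python
class LSHConfig:
    """Hypothetical stand-in for the constructor arguments under discussion."""

    def __init__(
        self,
        num_hashes: int,
        num_buckets: int,
        buckets_per_shuffle: int = 1,
        false_positive_check: bool = False,
    ) -> None:
        """
        false_positive_check: if True, write buckets in the integer format
            required by the false positive check; if False, skip the
            conversion.
        """
        self.num_hashes = num_hashes
        self.num_buckets = num_buckets
        self.buckets_per_shuffle = buckets_per_shuffle
        # Internally the flag keeps the name that describes what it does.
        self.buckets_as_int = false_positive_check
```

Users would then set the flag according to what they plan to run next, while the rest of the module continues to refer to self.buckets_as_int.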

sarahyurick · 2024-11-22T23:33:30Z

nemo_curator/modules/fuzzy_dedup.py

+
+        return are_buckets_empty
+
+    def _write_bucket_parquet(


The only case I could think of is whether we ever have to worry about scalability here?

Add codepath for computing buckets without int conversion

ccb1e31

Signed-off-by: Ayush Dattagupta <[email protected]>

praateekmahajan reviewed Nov 4, 2024

ayushdg added 7 commits November 8, 2024 15:43

Merge branch 'main' into enh-lsh-noint

f2b1888

Signed-off-by: Ayush Dattagupta <[email protected]>

Merge branch 'main' into enh-lsh-noint

816940b

Signed-off-by: Ayush Dattagupta <[email protected]>

Refactor write logic into its own method

30f383c

Signed-off-by: Ayush Dattagupta <[email protected]>

Update cli script

d7a2617

Signed-off-by: Ayush Dattagupta <[email protected]>

Add tests

954a043

Signed-off-by: Ayush Dattagupta <[email protected]>

Update docs

3b51aad

Signed-off-by: Ayush Dattagupta <[email protected]>

Merge branch 'main' into enh-lsh-noint

d119740

ayushdg marked this pull request as ready for review November 13, 2024 18:21

ayushdg requested a review from VibhuJawa November 13, 2024 18:22

ayushdg commented Nov 13, 2024

praateekmahajan reviewed Nov 14, 2024

Update fuzzy_deduplication example

8dbc48a

Signed-off-by: Ayush Dattagupta <[email protected]>

ayushdg mentioned this pull request Nov 19, 2024

Graceful handling when no LSH duplicates found. #381

Open

ayushdg requested a review from sarahyurick November 22, 2024 19:04

Merge branch 'main' of github.com:NVIDIA/NeMo-Curator into enh-lsh-noint

dccd964

Signed-off-by: Ayush Dattagupta <[email protected]>

ayushdg added the enhancement (New feature or request) and gpuci (Run GPU CI/CD on PR) labels Nov 22, 2024

sarahyurick reviewed Nov 22, 2024
