Add codepath for computing buckets without int conversion #326

Open
wants to merge 10 commits into
base: main

ayushdg
Collaborator

@ayushdg ayushdg commented Oct 25, 2024

Description

This PR has two enhancements:

  1. Improves performance for users who skip the false positive (FP) check by skipping the conversion of bucket IDs to integers, which is only needed by map_buckets and the subsequent steps in the FP check path.
  2. Improves error messages and handling for cases where the data contains no duplicates. Fixes [BUG] Fuzzy deduplication fails on datasets with no duplicates #67.

Usage

        lsh = LSH(
            ...,  # same params as earlier
            buckets_as_int=False,  # or True if planning to go via the FP check
        )

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Comment on lines 391 to 393
import shutil

shutil.rmtree(write_path)
Collaborator

Not for this PR, but as a highlight from our Google Docs conversation: this is a good place to leverage fsspec.

Collaborator Author

Agreed. I decided to go this route for now, since other places also use shutil. We are aligned that the refactor to be more remote-friendly should leverage fsspec utilities where possible.

Signed-off-by: Ayush Dattagupta <[email protected]>
@ayushdg ayushdg marked this pull request as ready for review November 13, 2024 18:21

shutil.rmtree(write_path)

return are_buckets_empty
Collaborator Author

Variable for tracking whether all the buckets were empty.


return are_buckets_empty

def _write_bucket_parquet(
Collaborator Author

Reviewers, please take a look at this logic. I've tried to cover most edge cases.

Collaborator

The only concern I could think of is whether we will ever have to worry about scalability here.

Collaborator Author

There is a non-zero cost to checking whether the buckets are empty. I've tried to write check_empty_buckets so that it returns at the first file in which it finds non-empty data, but this might still be slow on large network-based filesystems. It should, however, be faster than the current approach of persisting the data first and converting to int.

Once a non-empty bucket is detected, that state is carried through the remaining iterations, so the check is skipped from then on.
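
For readers following the thread, here is a minimal sketch of the short-circuit check and the sticky flag described above. It is a stand-in, not the PR's code: plain files replace parquet files, file size replaces the row count in the parquet metadata, and write_batches and the ".part" suffix are hypothetical names used only for illustration.

```python
import tempfile
from pathlib import Path


def check_empty_buckets(bucket_path: str) -> bool:
    """Return True only if every file under bucket_path is empty.

    Stand-in for the parquet metadata check: file size replaces num_rows.
    """
    for file in sorted(Path(bucket_path).glob("*.part")):
        if file.stat().st_size > 0:
            return False  # stop at the first non-empty file
    return True


def write_batches(batches: list[list[str]], write_path: str) -> bool:
    """Write each batch to its own file; report whether all were empty."""
    are_buckets_empty = True
    for i, batch in enumerate(batches):
        Path(write_path, f"{i}.part").write_text("\n".join(batch))
        # Only scan while everything written so far has been empty.
        # Once data is seen, the flag stays False and the scan is skipped.
        if are_buckets_empty:
            are_buckets_empty = check_empty_buckets(write_path)
    return are_buckets_empty


with tempfile.TemporaryDirectory() as tmp:
    all_empty = write_batches([[], ["bucket_a"], []], tmp)
```

In this sketch the scan runs after the first two batches only; the third batch is written without any check because the flag has already flipped to False.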

)
# Only check if buckets written so far are empty
if are_buckets_empty:
    are_buckets_empty = check_empty_buckets(write_path)
Collaborator Author

We need to do this in the first place because there is no way to know whether we're writing out an empty dataframe unless we persist it, or write it out, check the metadata, and then overwrite it on the next iteration.

Comment on lines +212 to +215
ds = dataset(bucket_path, format="parquet")
for fragment in ds.get_fragments():
    if fragment.metadata.num_rows > 0:
        return False
Collaborator Author

This logic can probably be simplified by using a global metadata file when writing out the parquet dataset (write_metadata_file=True). However, this had some issues in 24.10 (rapidsai/cudf#17177) that are only fixed in 24.12. I will open an issue to simplify this method once that's merged in.

print(
    f"Stage{stage_num}: No potential duplicate documents found during LSH"
)
return None
Collaborator

Should this return None or an empty DocumentDataset with no IDs?

Collaborator Author

I prefer returning None. Empty DocumentDatasets might lead to unexplained errors downstream that are tougher to debug and understand. Happy to hear counterpoints.
One consequence is that I might update examples/FuzzyDedup.py to handle the case where the returned result is None.
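
A caller-side guard for the None case could look roughly like the sketch below. Everything in it is hypothetical: find_duplicates is an exact-match stand-in for the fuzzy duplicate step (it is not the FuzzyDuplicates API), and plain lists of strings stand in for DocumentDataset.

```python
from collections import Counter
from typing import Optional


def find_duplicates(docs: list[str]) -> Optional[set[str]]:
    """Hypothetical stand-in: return the duplicated docs, or None if none exist."""
    counts = Counter(docs)
    duplicated = {doc for doc, n in counts.items() if n > 1}
    return duplicated or None


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first copy of every duplicated doc and drop the rest."""
    duplicates = find_duplicates(docs)
    if duplicates is None:
        # No duplicates were found: return the input unchanged instead of
        # operating on a missing result.
        return docs
    kept, seen = [], set()
    for doc in docs:
        if doc in duplicates and doc in seen:
            continue  # a later copy of a duplicated doc
        seen.add(doc)
        kept.append(doc)
    return kept
```

The explicit `is None` check keeps the "no duplicates" path visible at the call site, which is the debugging benefit argued for above.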

Collaborator

Makes sense, but then for Sequential I think we might want to handle that behavior too?

Collaborator Author

I haven't seen Sequential being used directly with FuzzyDuplicates, since the results cannot be processed downstream by any of the other modules without first using them to filter out the duplicates. I'm not sure how to handle this use case. Longer term, we would probably want to add a FuzzyDeduplicate class that calls FuzzyDuplicates and also handles removal.

Signed-off-by: Ayush Dattagupta <[email protected]>
@ayushdg added the enhancement (New feature or request) and gpuci (Run GPU CI/CD on PR) labels on Nov 22, 2024
@@ -261,6 +262,7 @@ def __init__(
num_hashes: int,
num_buckets: int,
buckets_per_shuffle: int = 1,
buckets_as_int: bool = False,
Collaborator

What do you think about calling this false_positive_check on the user-facing side? I'm fine with then doing something like self.buckets_as_int = false_positive_check and referring to it as self.buckets_as_int everywhere else, but from a user's perspective I think it would make it a little clearer how to set this parameter.

Collaborator Author

I think it's a good suggestion. We can update the docstrings to indicate that, if set to True, it writes out data in the format required by the false positive check.
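
To make the suggestion concrete, a rough sketch of the renamed parameter is below. LSHSketch is a hypothetical, trimmed-down class and not the real LSH constructor; only num_hashes, num_buckets, buckets_per_shuffle, and buckets_as_int come from the diff above, and the docstring wording is an assumption.

```python
class LSHSketch:
    """Hypothetical, trimmed-down stand-in for the LSH constructor."""

    def __init__(
        self,
        num_hashes: int,
        num_buckets: int,
        buckets_per_shuffle: int = 1,
        false_positive_check: bool = False,
    ):
        """
        false_positive_check: if True, bucket IDs are written out as
            integers, the format the false positive check requires.
            If False, the integer conversion is skipped.
        """
        self.num_hashes = num_hashes
        self.num_buckets = num_buckets
        self.buckets_per_shuffle = buckets_per_shuffle
        # User-facing name on the outside, internal name everywhere else.
        self.buckets_as_int = false_positive_check


lsh = LSHSketch(num_hashes=260, num_buckets=20, false_positive_check=True)
```

The values 260 and 20 are arbitrary illustration values, not defaults taken from the project.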


Development

Successfully merging this pull request may close these issues.

[BUG] Fuzzy deduplication fails on datasets with no duplicates
3 participants