Is there an example of removing duplicate docs using MinHash? #188

Open
zyh3826 opened this issue Jun 17, 2022 · 4 comments

Comments

@zyh3826

zyh3826 commented Jun 17, 2022

Is there an example of removing duplicate docs using MinHash?

@ekzhu
Owner

ekzhu commented Jun 20, 2022

Not yet. But maybe you can create one and add it as a pull request :)

I would start by creating a MinHash for each normalized (tokenized, lowercased, truncated, etc.) document. Once you have N MinHashes for N documents, you have two choices:

  1. Use brute force: compute the Jaccard similarity of all pairs of MinHashes to find documents that have very high similarity (e.g., >0.95 Jaccard).
  2. Use a MinHashLSH index: insert all the MinHashes into the index, then query each MinHash to find highly similar candidates (excluding itself), compute their Jaccard similarities (exactly, or estimated from the MinHashes), and keep the ones with very high similarity.

The second option is faster; the first option is more accurate, since the LSH index can miss some similar pairs.

@rupeshkumaar
Contributor

rupeshkumaar commented Oct 27, 2023

Hi @ekzhu, I would like to work on this. I have already built something similar for my use case, where I have to deduplicate a huge corpus of almost 100M documents. I am using the first approach. I tried the second one, but I was using multiprocessing for parallelism, and I was not able to merge the MinHashLSH objects created in different processes into one. So I would like to know which approach we should move ahead with for this one.

@ekzhu
Owner

ekzhu commented Dec 1, 2023

Sounds good. I believe this also addresses #205. You can submit a PR and we can go from there.

@rupeshkumaar
Contributor

rupeshkumaar commented Mar 12, 2024

I am planning to work on this in my free time, so I have a few questions:

  1. Do we need to add it as a class method on the MinHash class? Since it will be a utility-style method (nice to have), could we keep it separate?
  2. Which approach do you think we should go with? Since "Merging (Identically Specified) MinHashLSH objects" (#205) is implemented, we can choose either one; it is going to be a tradeoff between speed and accuracy.

If you have any other suggestions, please let me know. @ekzhu
