Is there any example of removing duplicate docs using MinHash? #188
Comments
Not yet. But maybe you can create one and add it as a pull request :) I would start by creating a MinHash of each normalized (tokenized, lowercased, truncated, etc.) document. Once you have N MinHash for N documents, you have two choices:

1. Compare all pairs of MinHash directly and keep the pairs whose estimated Jaccard similarity exceeds a threshold.
2. Index the MinHash in a MinHashLSH and query it for candidate duplicates.

The second option is faster, the first option is more accurate.
Hi @ekzhu, I would like to work on this. I have already built something similar for my use case, where I had to deduplicate a huge corpus of almost 100M documents. I am using the first approach. I had tried the second one with multiprocessing to achieve parallelism, but I was not able to merge the MinHashLSH objects created in different processes into one. So I would like to know which approach we should move ahead with for this one.
Sounds good. I believe this also addresses #205. You can submit a PR and we can go from there.
I am planning to work on this project in my free time, so a few questions:

If you have any other suggestions, please let me know. @ekzhu