Choice of np.uint64? #212
If I recall correctly, the reason to use np.uint64 is to avoid overflow when calculating the permutation. Currently the prime and hash size are not customizable. Would you be interested in making them customizable, to enable different primes and hash range sizes?
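For context, the overflow in question comes from the `(a * x + b) % p` form of the permutation: the product of two 32-bit values can need up to 64 bits. A minimal sketch, with illustrative constants rather than the library's actual random parameters:

```python
import numpy as np

# Illustrative 32-bit multiplier and 32-bit hash value (not the library's
# actual parameters, which are drawn at random).
a = np.uint64(2_654_435_761)
x = np.uint64(4_294_967_231)

print(a * x)                        # fits in 64 bits here, so the product is exact
print(np.uint32(a) * np.uint32(x))  # wraps modulo 2**32 (numpy may emit a warning)
```

Whether 64 bits of headroom is always enough for multipliers drawn from the full Mersenne-prime range is part of what the rest of this thread picks apart.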
Yes, I am. I'll prepare a PR for that. I've been testing with a related repo, text-dedup, though most of my testing was done on per-language slices of OSCAR-2301. Would you like me to test on other datasets as well? If you have a good dataset in mind, and things to look for in it post-dedup, I'll test that too; preferably datasets under 10 GiB, since my system is limited. Once testing on my end shows good results, I can bring it to you for wider testing/merging. To add: there's a lot of space to explore here. I had some more ideas, but I think they'll come back to me when I start hacking.
This sounds like a great plan. I think we can start with a simple option to let the user choose their hash function, hash value type, hash range size, and prime. This will require a validation function that checks the customized parameters for any issues. It also requires updates to serialization and deserialization, plus storage-layer considerations.
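A hypothetical sketch of what such a validation function might check; the name, signature, and specific checks here are invented for illustration, not an existing or planned API:

```python
import warnings
import numpy as np

def validate_minhash_params(hashfunc, hash_bits, prime, dtype):
    """Hypothetical sanity checks for customizable MinHash parameters."""
    # The storage dtype must hold every value the hash range can produce.
    if np.iinfo(dtype).max < (1 << hash_bits) - 1:
        raise ValueError(f"{np.dtype(dtype)} cannot hold {hash_bits}-bit hash values")
    # The permutation (a*x + b) % prime emits values in [0, prime); a prime
    # smaller than the hash range silently compresses the output range.
    if prime < (1 << hash_bits):
        warnings.warn("prime is smaller than the hash range; permuted values "
                      "will not cover the full range")
    # The user-supplied hash function must map bytes to a plain integer.
    if not isinstance(hashfunc(b"probe"), (int, np.integer)):
        raise TypeError("hashfunc must return an integer")
```

A real implementation would also need to round-trip these parameters through serialization and deserialization, as noted above.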
So I'll tell you my findings as of now. (…)

So, we should either change the code to take advantage of this (…)
Re 1: Thanks for the investigation. I think a good case for performance can be made here by using (…)

Regarding using different primes: is it possible to make this customizable as part of the MinHash parameters?

Re 2 and 3: It is obvious that the built-in SHA hash functions are not the fastest. That's why hash functions can be user-defined. The reason for not implementing xxhash functions in this library was to minimize dependencies: we didn't want more dependencies than numpy and scipy. I think it may be better to showcase examples of different hash functions in the documentation. See some existing ones here: https://ekzhu.com/datasketch/minhash.html#use-different-hash-functions

How does using 256 different hash functions compare to using 1 hash function and 256 permutations? I am surprised the former would be faster.
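The pattern on that documentation page boils down to passing any bytes-to-integer function as `hashfunc`. For instance (assuming the third-party xxhash package, which is not a datasketch dependency):

```python
import xxhash
from datasketch import MinHash

def xxh32_func(data: bytes) -> int:
    # Any function mapping bytes to an integer can serve as hashfunc.
    return xxhash.xxh32(data).intdigest()

m = MinHash(num_perm=128, hashfunc=xxh32_func)
m.update("hello world".encode("utf8"))
```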
RE: Primes (…)

RE: Hash Functions (…)
Looking at the Gist you linked earlier, I noticed you didn't specify the data type for the numbers. Using uint32 with multiplication leads to overflow, right? Correct me if I am wrong:

```python
import numpy as np

biggest_prime = np.uint32((1 << 32) - 5)
a = np.uint32((1 << 32) - 6)   # draw from the range [1, biggest_prime]
b = np.uint32((1 << 32) - 7)   # draw from [0, biggest_prime]
hv = np.uint32((1 << 32) - 8)  # draw from [0, 2^32]
(a * hv + b)                   # wraps modulo 2**32 in uint32 arithmetic
```
You are correct.
Do you mean that the integer overflow won't lead to a correctness issue?
It depends on what you mean by correctness. Will it be backwards compatible? No; in my implementation the multiplier (…). If you are asking whether they will also result in good-quality hashes, the answer is yes*, but the explanation is more involved.
Sorry for my last "Fermat's-last-theorem-style" response.

The addition is important, as the simplified version (it is unclear if my …). That is, while the former can replicate a 32-bit-quality hash from a 32-bit-quality hash, the latter can, in the worst case, recover only a 31-bit-quality hash from a 32-bit-quality hash. (This is not general, but concerns input hashes of a specific form.) Each permutation is not meant to produce hashes, but to serve as a hash function (turning one hash into another). Of course, my understanding or logic might be flawed!

Also, I would like to reiterate that I'm not (only) advocating for my version (ax mod b). At least the main point I had for opening this issue was that, since we reduce with max_hash (which makes any value fit in np.uint32), we can use np.uint32 for all later operations. This brings significant space and speed benefits with theoretically the same quality as before (and, at least in my PR, the results are identical as well).
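A minimal sketch of that last point, with an illustrative value: once a permuted value has been reduced into the 32-bit hash range via max_hash, the downcast to uint32 loses nothing.

```python
import numpy as np

max_hash = np.uint64((1 << 32) - 1)
phv = np.uint64(123_456_789_012_345) & max_hash  # reduce into the 32-bit range
assert phv <= max_hash
assert np.uint32(phv) == phv  # lossless downcast: uint32 suffices downstream
```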
(Referring to datasketch/datasketch/minhash.py, line 12 at ebe4ca4.)

In this implementation of MinHash, it seems like the hasher is using 32 bits (`sha1_hash32`), so why is `_max_hash = np.uint64((1 << 32) - 1)` using `np.uint64`? I tried experiments with `np.uint32` and the Mersenne prime `np.uint64((1 << 31) - 1)`, and there isn't much difference in the results. If I understand correctly, this would automatically halve memory consumption as well. Is there a reason to insist on `np.uint64`?
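The memory-halving claim is easy to check in isolation. A minimal sketch for the per-MinHash array of hash values, assuming 128 permutations for illustration:

```python
import numpy as np

# Hash value storage for 128 permutations, compared in isolation.
print(np.empty(128, dtype=np.uint64).nbytes)  # 1024 bytes
print(np.empty(128, dtype=np.uint32).nbytes)  # 512 bytes: half the memory
```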