Choice of np.uint64? #212

Open

chris-ha458 opened this issue Aug 1, 2023 · 11 comments
@chris-ha458
Contributor

_max_hash = np.uint64((1 << 32) - 1)

In this implementation of MinHash, the hasher uses 32 bits (sha1_hash32),
so why is _max_hash = np.uint64((1 << 32) - 1) using np.uint64?
I tried experiments with np.uint32 and the Mersenne prime np.uint64((1 << 31) - 1), and there doesn't seem to be much difference in the results.
If I understand correctly, switching to np.uint32 would also automatically halve memory consumption.
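For a rough sense of scale (illustrative only, assuming 256 hash values per signature):

    import numpy as np

    num_perm = 256  # illustrative signature length

    sig64 = np.full(num_perm, np.iinfo(np.uint64).max, dtype=np.uint64)
    sig32 = np.full(num_perm, np.iinfo(np.uint32).max, dtype=np.uint32)

    print(sig64.nbytes)  # 2048 bytes per signature
    print(sig32.nbytes)  # 1024 bytes per signature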

Is there a reason to insist on np.uint64?

@ekzhu
Owner

ekzhu commented Aug 1, 2023

If I recall correctly, the reason to use np.uint64 is to avoid overflow when calculating the permutation. Currently the settings of the prime and hash size are not customizable. Would you be interested in making them customizable, to enable different primes and hash range sizes?
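For illustration (made-up numbers, not the actual coefficients): the product of two 32-bit values needs up to 64 bits, so uint32 intermediates wrap before any modulo is applied.

    import numpy as np

    a, hv = 0x9E3779B1, 0xFFFFFFFE               # illustrative 32-bit values

    exact = a * hv                               # Python ints: the full product
    in32 = int(np.uint32(a) * np.uint32(hv))     # wraps mod 2**32 (numpy may warn)
    in64 = int(np.uint64(a) * np.uint64(hv))     # exact here, fits in 64 bits

    print(exact == in64)  # True
    print(exact == in32)  # False -- the uint32 product already overflowed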

ekzhu added the question label Aug 1, 2023
@chris-ha458
Contributor Author

Yes, I am. I'll prepare a PR for that.
I think I will change the dtype to 32 bits and use the 2^31 - 1 Mersenne prime (the hash is already 32 bits).

I've been testing with a related repo, text-dedup. There, np.uint32 doesn't seem to have a significant downstream performance impact while greatly reducing the memory footprint. In that repo or this one, I'd likely leave the current behavior as the default while documenting this option.

However, most of my testing was done on OSCAR-2301 per-language slices. Would you like me to test on other datasets as well? If you have a good dataset in mind, and know what to look for in it post-dedup, I'll test on it too, preferably datasets under 10 GiB since my system is limited.

Once testing on my end shows good results, I can bring it to you for wider testing and merging.

To add, there's a lot of space to explore here.

  • If we decide to keep np.uint64, we could keep 64 bits from SHA-1 as well.
  • Modulo behavior to prevent clipping while using np.uint32.
  • Alternative hashes (revive "Use Murmur3 hash function as default, remove hashobj parameter." #73). Python xxhash is very well maintained and can output an int directly. For dtype u32 I'll definitely use xxhash where possible, since we are throwing backwards compatibility out the window anyway (see the sketch below).

I had some more ideas, but I think they'll come back to me once I start hacking.
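As a rough sketch of what I mean (assuming the python xxhash package and the existing hashfunc parameter; not final code):

    import xxhash
    from datasketch import MinHash

    def xxh32_hashfunc(data: bytes) -> int:
        # Low 32 bits of the 64-bit xxh3 digest, returned as a plain int.
        return xxhash.xxh3_64(data).intdigest() & 0xFFFFFFFF

    m = MinHash(num_perm=256, hashfunc=xxh32_hashfunc)
    m.update(b"hello world")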

@ekzhu
Owner

ekzhu commented Aug 1, 2023

This sounds like a great plan. I think we can start with a simple option to let the user choose their hash function, hash value type, hash range size, and prime. This will require a validation function that checks for any issues with the customized parameters. It also requires updates to serialization and deserialization, plus some storage layer considerations.
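For instance, such a validator might look roughly like this (purely a hypothetical sketch; the function name and exact checks are made up, not an existing API):

    import warnings
    import numpy as np

    def validate_minhash_params(hash_bits, dtype, prime):
        # Hypothetical sanity checks for user-supplied MinHash parameters.
        dtype = np.dtype(dtype)
        if not np.issubdtype(dtype, np.unsignedinteger):
            raise ValueError("hash values need an unsigned integer dtype")
        if hash_bits > 8 * dtype.itemsize:
            raise ValueError(f"{dtype} cannot hold {hash_bits}-bit hash values")
        if prime > np.iinfo(dtype).max:
            raise ValueError(f"prime {prime} does not fit in {dtype}")
        if prime < (1 << hash_bits) - 1:
            # A smaller prime compresses the output range; warn, don't forbid.
            warnings.warn("prime is smaller than the hash range")

With checks like these, hash_bits=32, dtype=np.uint32, prime=(1 << 32) - 5 would pass with a warning, while a uint32 dtype combined with the 61-bit Mersenne prime would be rejected.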

@chris-ha458
Contributor Author

So, here are my findings as of now.

  1. It seems that the choice of np.uint64 was made to allow building a universal hash function with modulo math involving the Mersenne prime 2^61 - 1. Although it might be possible to conduct faster modulo math with Mersenne primes, this does not hold true for Python. The only case where there is a meaningful difference is when using the 2^61 - 1 prime with arrays.

Gist that shows this

python math
3.974414e-02
4.002404e-02
3.886199e-02
3.884387e-02
numpy single value math
4.995418e-01
8.717644e-01
4.897993e-01
4.919927e-01
numpy array math
2.367258e-03
3.345251e-03
1.433372e-03
1.444340e-03

So, we should either change the code to take advantage of this (np.mod, then np.bitwise_and) or do away with the Mersenne prime and replace it with the largest prime that fits in np.uint32 (2**32 - 5). I support the latter, and it does not seem to harm downstream dedup performance (again, I'm testing on another repo that closely follows this one, so take this with some salt). A rough sketch of the second option follows this list.

  2. My investigation of hash functions tells me that, at least the way it is applied here, hashing is not the bottleneck. xxh3 is really fast, and with an optimized multithreaded implementation it could theoretically be as fast as or even faster than the permutation, but that would complicate the codebase and introduce a great deal of complexity. Compared to the current universal hashing method, actually using 256 different hash functions is only of marginal benefit considering the cost. I would still like to implement xxh here, since it would turn this ugliness:
    struct.unpack('<I', hashlib.sha1(data).digest()[:4])[0]
    into:
    xxhash.xxh3_64(data).intdigest() & 0xFFFFFFFF

  3. I think I could copy minhash.py (into minXhash.py, perhaps?), implement the changes there, and then we think about how to pull the changes back into minhash.py. How does this sound to you?
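A rough sketch of the second option from point 1 (illustrative values; not a final implementation):

    import numpy as np

    num_perm = 256
    rng = np.random.RandomState(1)

    prime = np.uint64((1 << 32) - 5)   # largest prime below 2**32
    a = rng.randint(1, prime, num_perm, dtype=np.uint64)
    b = rng.randint(0, prime, num_perm, dtype=np.uint64)

    hv = np.uint64(0x12345678)         # one 32-bit hash value
    # 32-bit * 32-bit + 32-bit stays below 2**64, so the vectorized np.mod
    # sees the exact value; the result is < prime and fits in uint32.
    phv = np.mod(a * hv + b, prime).astype(np.uint32)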

@ekzhu
Owner

ekzhu commented Aug 7, 2023

Re 1. Thanks for the investigation. I think a good case for performance can be made here by using np.mod instead of Python's math mod. Hopefully it doesn't change the hash values and doesn't affect backward compatibility with previously saved MinHash objects.

Regarding using different primes: is it possible to make this customizable as part of the MinHash parameters?

Re 2 and 3. It is obvious that the built-in SHA hash functions are not the fastest. That's why hash functions can be user-defined. The reason for not implementing xxhash functions in this library was to minimize dependencies -- we didn't want to have more dependencies than numpy and scipy. I think it may be better to showcase different hash function examples in the documentation. See some existing ones here: https://ekzhu.com/datasketch/minhash.html#use-different-hash-functions

How does using 256 different hash functions compare to using 1 hash function and 256 permutations? I am surprised the former would be faster.

@chris-ha458
Contributor Author

RE: Primes

  • Unless we are dealing with some bitwise-efficient division math (this is what the wiki page is referring to when it says dividing 32-bit integers by the 2^61 Mersenne prime can be made fast), I cannot see the benefit of using any prime other than the largest prime that the datatype can support. Any smaller prime will result in a compressed output range and inevitable collisions due to that suboptimal choice.
  • I can see the benefit of linking different hash choices to different datatypes, though. In another implementation I have been working on, using a 16-bit datatype with the largest 16-bit hash resulted in a 0.75-percentage-point degradation of accuracy (92.4% -> 91.75%) for a 75% reduction in memory consumption (np.uint64 -> np.uint16).
  • In each config I default to either the legacy option or the largest prime the datatype can fit, since other choices did not make sense to me.
  • That being said, I see no reason why it would be difficult or impossible to implement.

RE: Hash Functions

  • Reading your code more, I understand that decision better. When I'm done with the actual implementation work, I could address this issue on the documentation side as you suggest.
  • I seem to have failed to communicate properly. The assumption was that using 256 different "perfect" hash functions would be better than 256 "suboptimal" hash functions. What I was trying to do was replace the 256 hash functions we get by XOR with 256 "real" hash functions.
  • What I got wrong was, first, that the np.array matmul to permute one hash into 256 hashes is REALLY FAST; it's almost exactly what numpy was built for (see the sketch after this list).
  • Secondly, as long as the first hash is a "perfect" hash and we have a proper permutation, the 256 permuted hashes are "perfect" as well, so there is no need to try to replace them in the first place.
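Roughly what I mean (a toy sketch, with coefficients kept in a 32-bit range so the uint64 arithmetic stays exact):

    import numpy as np

    num_perm = 256
    prime = np.uint64((1 << 32) - 5)
    rng = np.random.RandomState(42)
    a = rng.randint(1, prime, num_perm, dtype=np.uint64)
    b = rng.randint(0, prime, num_perm, dtype=np.uint64)

    def signature(token_hashes):
        # One broadcasted expression turns each input hash into 256
        # permuted hashes; the signature is the running minimum of those.
        sig = np.full(num_perm, np.iinfo(np.uint64).max, dtype=np.uint64)
        for hv in token_hashes:
            sig = np.minimum(sig, (a * np.uint64(hv) + b) % prime)
        return sig

    sig = signature([0x12345678, 0x9ABCDEF0, 0x0BADF00D])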

@ekzhu
Owner

ekzhu commented Aug 7, 2023

Looking at the Gist you linked earlier, I noticed you didn't specify the data type for the numbers. Using uint32 with multiplication leads to overflow, right? Correct me if I am wrong:

biggest_prime = np.uint32((1 << 32) - 5)
a = np.uint32((1 << 32) - 6)   # draw from the range [1, biggest_prime]
b = np.uint32((1 << 32) - 7)   # draw from [0, biggest_prime]
hv = np.uint32((1 << 32) - 8)  # draw from [0, 2^32]
(a * hv + b)
<ipython-input-9-306e01da66f1>:1: RuntimeWarning: overflow encountered in scalar multiply
  a * hv + b
<ipython-input-9-306e01da66f1>:1: RuntimeWarning: overflow encountered in scalar add
  a * hv + b

@chris-ha458
Contributor Author

chris-ha458 commented Aug 7, 2023

You are correct.
In case of integer overflow (especially with unsigned ints such as np.uint32), the behavior is wraparound, which is basically modulo math. (Many approaches, including the Mersenne division trick, revolve around trying to avoid the performance cost of an explicit modulo.)
I think experiments show that it can be just as fast as, or even faster than, division.
I think my approach approximates the original proposal of Carter and Wegman described in the universal hashing wiki page:
[image: screenshot of the hash family definition from the Wikipedia article on universal hashing]

@ekzhu
Owner

ekzhu commented Aug 7, 2023

In case of integer overflow (especially with unsigned ints such as np.uint32), the behavior is wraparound, which is basically modulo math. (Many approaches, including the Mersenne division trick, revolve around trying to avoid the performance cost of an explicit modulo.)

Do you mean that the integer overflow won't lead to correctness issues?

@chris-ha458
Contributor Author

It depends on what you mean by correctness.

Will it be backwards compatible? No.
The data types are different: uint32 vs. the original uint64 (even though the latter only contains 32 bits of information).
Will the resulting bits, when taking the 32 least significant bits, be identical? Yes.
(This is what I've done in my first iteration of the PR; the equality tests pass.)

However, in my implementation the multiplier a and the adder b are sampled from a new distribution, randint(0, 2**32 - 1), so the numbers are different.

If you are asking whether they will also result in good-quality hashes, the answer is yes*, but the explanation is more involved. (A quick check of the wraparound claim is below.)
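A tiny check of the wraparound claim (illustrative values):

    import numpy as np

    # Unsigned overflow in uint32 is arithmetic mod 2**32, so the wrapped
    # result equals the low 32 bits of the exact computation.
    a, b, hv = 0xDEADBEEF, 0xCAFEBABE, 0x12345678

    exact_low32 = (a * hv + b) % (1 << 32)        # Python ints, explicit mod

    a32 = np.array([a], dtype=np.uint32)
    b32 = np.array([b], dtype=np.uint32)
    hv32 = np.array([hv], dtype=np.uint32)
    wrapped = a32 * hv32 + b32                    # wraps silently for arrays

    print(int(wrapped[0]) == exact_low32)         # True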

@chris-ha458
Contributor Author

Sorry for my last "Fermat's last theorem style" response.
I was trying to access https://link.springer.com/chapter/10.1007/3-540-48340-3_24 but failed to do so, so I will discuss in terms of the paper that the wiki article refers to.

(ax + b) mod mersenne_prime mod max_hash produces mathematically provable universal hashes
(class H1 in the paper).
That is, if we use the above formula to produce a new hash from an older one, the new hash is provably as good as the old one (collision resistance, etc.).

The addition is important: the simplified version (ax) mod mersenne_prime mod max_hash (I think this was class H2), lacking the addition BEFORE the modulo, is of inferior hash quality by as much as a factor of 2 (page 7, last paragraph of the PDF).
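For reference, in my notation (p is the Mersenne prime and the final reduction is to the 32-bit hash range):

    h_{a,b}(x) = ((a x + b) \bmod p) \bmod 2^{32}   % the class with the addition
    h_{a}(x)   = ((a x) \bmod p) \bmod 2^{32}       % the simplified class without it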

(It is unclear whether my "+ b" afterwards repairs it to the original assumption. I think it does, but I am continuing my argument on the assumption that it does not.)

That is, while the former can replicate a 32-bit-quality hash from a 32-bit-quality hash, the latter in the worst case can only recover a 31-bit-quality hash from a 32-bit-quality hash. (This is not general, but concerns input hashes of a specific form.)

Each permutation is not meant to produce hashes, but to serve as a hash function (turning one hash into another).
That is, each (a, b) pair and its ax + b modulo operation is a new projection from the original hash function to another.
Basically, we are using it to build new hash functions, not to build hashes themselves (the result is the same, but the underlying assumptions are not).
Since we usually choose 256~1000 of these, I feel that 2^31 (2_147_483_648, or about 2 billion) is "good enough".

Of course, my understanding or logic might be flawed!
Please feel free to point out any errors or concerns here.

Also, I would like to reiterate that I'm not (only) advocating for my version ((ax + b) mod 2^32).

At least the main point I had for opening this issue was that, since we take the result modulo max_hash (which makes any number fit in np.uint32), we can use np.uint32 for all later operations. This brings significant space and speed benefits with theoretically the same quality as before (and, at least for my PR, the results are identical as well).
