Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Murmur3 hash function as default, remove hashobj parameter. #73

Closed
wants to merge 4 commits into from

Conversation

ekzhu
Copy link
Owner

@ekzhu ekzhu commented Nov 20, 2018

@aastafiev @ae-foster @vmarkovtsev

I had to explain to more than a few people why MinHash by default use SHA1 hash function, or any of the secure hash functions, because these functions tend to be much slower than non-secure ones like Murmur3. The hashlib built-in library only offers secure hash functions, and the digest method outputs bytes, which is awkward, as I have to use struct.unpack to convert to integer.

# This is how we are doing the hashing right now.
import hashlib, struct
timeit struct.unpack("<I", hashlib.sha1(b"foo").digest()[:4])[0]
# 1 µs ± 27.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

# This is what I am trying to do.
import mmh3
timeit mmh3.hash("foo") & 0xffffffff
# 199 ns ± 6.04 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

I am looking at using mmh3 python library to replace the use of SHA1, and removing hashobj parameter all together (MinHash and HyperLogLog) to make the default fast. Do you think the performance gain is worth it? How much of your code am I going to break?

@vmarkovtsev
Copy link
Contributor

@ekzhu I wonder why murmur and not xxhash which is faster, or Google farmhash which is even faster than xx? Both have tested Python packages...

@aastafiev
Copy link
Contributor

Good idea and I agree with @vmarkovtsev. But, installing farmhash on windows is not as easy as on Linux. For example, I couldn't easy install it just from "pip install pyfarmhash". So our users will have troubles during installing too and we should support more on this. So, if this is blocking, I suggest using xxhash. Moreover it's already has "intdigest() – return the current digest as an integer".

@vmarkovtsev
Copy link
Contributor

We could have swappable hashing backends and try to load the best package which is installed. It would also help to measure the performance gains.

@ekzhu
Copy link
Owner Author

ekzhu commented Nov 20, 2018

Thanks guys. I will look more into it.

@ekzhu
Copy link
Owner Author

ekzhu commented Nov 20, 2018

Python built-in SHA1:
SHA1

Murmur3 from mmh3.
Murmur3

xxhash
xxhash

farmhash
farmhash

Take aways:

  1. The performance is mainly determined by num_perm. The relative performance difference is not that big, however. Of course I haven't tested on bigger dataset.
  2. The main difference is in accuracy, and farmhash is crazily good -- likely much more random.
  3. I think the accuracy difference is big enough to justify a code change, especially for num_perm under 50. This is going to save a lot of CPU time since a much smaller num_perm is required to achieve the same accuracy.

I think making the hash function swappable is a good idea. Need to replace hashobj with a more generic interface.

@ekzhu ekzhu closed this Dec 28, 2018
@ekzhu ekzhu deleted the mmh3 branch December 28, 2018 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants