Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

are 128 bit murmurhash3 hashes of data shorter than 127 bits collision free? #73

Open
midjji opened this issue Sep 3, 2019 · 5 comments

Comments

@midjji
Copy link

midjji commented Sep 3, 2019

I tried to test but it might take a while...

@Spencer-Doak
Copy link

Spencer-Doak commented Sep 3, 2019

Not that I can speak authoritatively on the matter, but I would hazard a guess that the output is not necessarily collision free. In 2012, some researchers scrutinized MurmurHash 3 as if it was a cryptographically strong hashing algorithm. They were able to "break" the hashing algorithm such that they could produce collisions within the output regardless of the seed used. This is noted on the MurmurHash Wikipedia page, which cites a blog post that documents the vulnerability. Of course, though, MurmurHash was never designed to be a cryptographic hashing algorithm, so one would not expect it to hold up to that level of security. I have not read over the details of the vulnerability myself to see what input sizes they were using to construct the collisions. Nonetheless, perhaps that source can help you find an answer to your question.

@midjji
Copy link
Author

midjji commented Sep 3, 2019 via email

@aappleby
Copy link
Owner

aappleby commented Sep 3, 2019

Murmur's not a crypto hash, so it won't resist intentionally trying to generate collisions.

That said, its mixing is thorough enough that in general use you should be able to use any subset of the output bits and get uniform distributions.

@midjji
Copy link
Author

midjji commented Sep 4, 2019 via email

@sebres
Copy link

sebres commented Sep 4, 2019

That is no reason why the checksum should have collisions for short
strings.

There is no reasons to hash something shorter as 128 bit using 128-bit hash, unless your subset contains also longer strings, but then you should take care about the collisions in the whole subset and not about the short strings only.

This isn't about cryptography, this is about minimizing collisions when using the hash as a checksum.

Well, it is not (really). But again, then the count of possible collisions is interesting for the whole subset and the hashing with such large hash on subset containing only short strings does not make sense at all, no matter for which purposes one would do that.
Usage of 128-bit hash as a checksum for values up to 128-bit is irrational.
Unless we'd speak about that in cryptographic sense (and as already said above mm3 is not a matter for that).

Fixing it is trivial, at least for up to 124 bits. Just put the data first, then write the number of bytes written at the end.

I don't think such "fix" is expected at all, because "stealing" (overwriting with length) of some bits would not necessary improve the algorithm in sense of distribution quantity.

But the question is rather, if you nevertheless need it:
Why you cannot do this by yourself before supply your string (buffer) to MM3 hashing?
Or either why you cannot use your length as a seed (all MM3 algorithms have parameter seed)?

void MurmurHash3_x86_128 ( const void * key, int len, uint32_t seed, void * out );
void MurmurHash3_x64_128 ( const void * key, int len, uint32_t seed, void * out );

Additionally note:

  • it would be then another algorithm;
  • it can produce some unexpected impact on some other strings (you'd simply relocate the collisions from one strings to another) so previously good distribution on some subsets could get worse (remember due to birthday paradox you've only 264 to catch a collision on directly distributed 128-bit hash if we speak about random data).
  • if you don't have random data (e. g. something continuously growing) you don't need a hash but rather some mixer algorithm; but again 128-bit checksum makes few sense here, either it depends more on your subset.
  • mm3-128 has basically good distribution also on short values (and the length of string is involved indirectly), for example here is a result of hashing \x00 bytes (with length N [1 .. 16], without using of length as seed):
 N |                            Bytes | MM3-128 (x86)
---|----------------------------------|---------------------------------
 1 |                               00 | 88c4adec54d201b954d201b954d201b9
 2 |                             0000 | 04a872bbedcd774bedcd774bedcd774b
 3 |                           000000 | e0d93642acf40e87acf40e87acf40e87
 4 |                         00000000 | cc066f1f9e5178409e5178409e517840
 5 |                       0000000000 | 50a68ecfd01a6609d01a6609d01a6609
 6 |                     000000000000 | 777fa95660bde92360bde92360bde923
 7 |                   00000000000000 | 0d45d85efb848988fb848988fb848988
 8 |                 0000000000000000 | e028ae414772b0844772b0844772b084
 9 |               000000000000000000 | 5ad58a7e543371085433710854337108
10 |             00000000000000000000 | 64010da262e8bc1762e8bc1762e8bc17
11 |           0000000000000000000000 | 2f35ebd169f8166569f8166569f81665
12 |         000000000000000000000000 | 332d18d156b5986456b5986456b59864
13 |       00000000000000000000000000 | 583cbe60ca53c80fca53c80fca53c80f
14 |     0000000000000000000000000000 | a8e046b5855ca909855ca909855ca909
15 |   000000000000000000000000000000 | 3553d0af909796639097966390979663
16 | 00000000000000000000000000000000 | 5a4075d66b2d3d27d3926c2feb228a07

 N |                            Bytes | MM3-128 (x64)
---|----------------------------------|---------------------------------
 1 |                               00 | 4610abe56eff5cb551622daa78f83583
 2 |                             0000 | 3044b81a706c5de818f96bcc37e8a35b
 3 |                           000000 | 79d54dd1bf7137480af5e7f1b766291d
 4 |                         00000000 | cfa0f7ddd84c76bc589623161cf526f1
 5 |                       0000000000 | 3df460ff3e17b53a17874fba56e69767
 6 |                     000000000000 | 7d480f9fa80ec469719af4070b74d89d
 7 |                   00000000000000 | f402c55ac5dec98f2de586f681711c02
 8 |                 0000000000000000 | 28df63b7cc57c3cbf2557dfcc4e8fe52
 9 |               000000000000000000 | 73269217e5476f20f1fa3fc86728ca0c
10 |             00000000000000000000 | 5b3d684f8c57ce161ba63bef94931146
11 |           0000000000000000000000 | 056e0d6c8921404673c2da0104c39955
12 |         000000000000000000000000 | a4d8ece9d7c0dfe3803bbf8eb6f0853f
13 |       00000000000000000000000000 | a10ea8b22762995abb1575409cfb7dc6
14 |     0000000000000000000000000000 | 028b7708fcbbed1e8393f0698afe46ea
15 |   000000000000000000000000000000 | 6ce113b115a56871195953c2230f8db2
16 | 00000000000000000000000000000000 | 4bbd1bf27da918d6b465a9eccd791cb6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants