Fix murmur3_32 with UTF-8 encoding for input with non-BMP character #5649
(obsolete)

0d95453 to 72da93a
00dd9db to ee27fda

@lowasser seems like you authored portions of the code being changed here.

I think we have a concern that fixing this bug may produce problems for systems that were persisting the old wrong hashcode, for example using it as part of a key in persistent storage. We recognize that the current computed hashcode is wrong, and that it is inconsistent with the hashcode you get if you hash the same string in a different but supposedly equivalent way. We haven't yet figured out the best way forward.
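
For readers following along, here is a minimal sketch of why non-BMP input is the awkward case for any code that encodes a string to UTF-8 itself instead of calling `String.getBytes`. It is not Guava's code: the class `SurrogatePairUtf8Sketch` and the helper `encodeUtf8` are hypothetical names made up for illustration, and the sketch assumes well-formed input (no unpaired surrogates). It only shows that one non-BMP code point occupies two Java `char`s and four UTF-8 bytes, so a per-character encoder has to join the surrogate pair before encoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SurrogatePairUtf8Sketch {

  // Hypothetical helper (not Guava code): UTF-8 encodes a well-formed string
  // one code point at a time. Unpaired surrogates are not handled specially.
  static byte[] encodeUtf8(String s) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int i = 0;
    while (i < s.length()) {
      int cp = s.codePointAt(i);       // joins a valid surrogate pair into one code point
      i += Character.charCount(cp);    // advances by 1 char (BMP) or 2 chars (non-BMP)
      if (cp < 0x80) {
        out.write(cp);
      } else if (cp < 0x800) {
        out.write(0xC0 | (cp >> 6));
        out.write(0x80 | (cp & 0x3F));
      } else if (cp < 0x10000) {
        out.write(0xE0 | (cp >> 12));
        out.write(0x80 | ((cp >> 6) & 0x3F));
        out.write(0x80 | (cp & 0x3F));
      } else {
        // Non-BMP code point: four UTF-8 bytes.
        out.write(0xF0 | (cp >> 18));
        out.write(0x80 | ((cp >> 12) & 0x3F));
        out.write(0x80 | ((cp >> 6) & 0x3F));
        out.write(0x80 | (cp & 0x3F));
      }
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    String moneyBag = "\uD83D\uDCB0"; // the single code point U+1F4B0
    byte[] expected = moneyBag.getBytes(StandardCharsets.UTF_8);
    System.out.println(moneyBag.length());                             // 2 chars
    System.out.println(moneyBag.codePointCount(0, moneyBag.length())); // 1 code point
    System.out.println(expected.length);                               // 4 bytes
    System.out.println(Arrays.equals(expected, encodeUtf8(moneyBag))); // true
  }
}
```

A hash computed over the bytes produced this way should match a hash computed over `getBytes(StandardCharsets.UTF_8)`; that equivalence is the kind of consistency being discussed here.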

@eamonnmcmanus thank you for your reply.
I sympathize with the concern. In fact, this is the context in which we were able to identify the bug.
Thanks for making it clear. Note that it's not only about Guava APIs that should return the same results (but do not).
I don't think that not fixing a bug is a long-term option for anyone, would you agree? I think we can consider
Can you think of any other options?

assertStringHash(0x8a5c3699, "surrogate pair: \uD83D\uDCB0", Charsets.UTF_8);

assertStringHash(0, "", Charsets.UTF_16);
assertStringHash(0xae9d4799, "k", Charsets.UTF_16);

There's a subtle problem with this. (At least, it took me a while to figure it out.) A byte array from the UTF_16 encoding starts with a two-byte Byte Order Mark (BOM) which is either fe ff or ff fe and indicates the endianness that the remaining bytes will use. It's not specified which endianness a given Java platform will use, so it's not correct to hash a string using the UTF_16 encoding if you want the result to be portable. I discovered this because some non-public tests were using the opposite endianness and failing. I rewrote these test cases to use UTF_16LE and updated the hashcodes. (You don't need to do that here but if we go ahead with this change then the modified test will be in the version that we use.)
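
To make the byte order mark point concrete, here is a small sketch that uses only `java.nio.charset.StandardCharsets` and is not part of this PR. The class `Utf16BomSketch` and its `hex` helper are hypothetical names. It prints the bytes that the three UTF-16 charsets produce for the string "k", so the extra two BOM bytes in the plain `UTF_16` output are visible.

```java
import java.nio.charset.StandardCharsets;

final class Utf16BomSketch {

  // Hypothetical helper: formats bytes as space-separated lowercase hex.
  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02x", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String s = "k";
    // Plain UTF-16 prepends a two-byte BOM, so the output is four bytes long.
    System.out.println("UTF-16:   " + hex(s.getBytes(StandardCharsets.UTF_16)));
    // The LE and BE variants write no BOM and fix the byte order explicitly.
    System.out.println("UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE))); // 6b 00
    System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 6b
  }
}
```

Because the BOM bytes are part of the hashed input, any platform difference in the plain `UTF_16` encoding changes the hash, whereas `UTF_16LE` pins the byte sequence.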
@eamonnmcmanus indeed, this isn't portable. Thanks for catching this.
UTF-16LE is just as good, since the only point of this test coverage is to exercise the code path that is not optimized for UTF-8.
If there is any value in changing anything in this PR, I am happy to apply changes.
I saw a reference to this conversation, and I dug up Android issue 37074504 (fix), last seen as "https://code.google.com/p/android/issues/detail?id=196848" in ByteStreamsTest. I think what Éamonn is seeing is a bug in the very old version of Android that we test with.
UTF-16LE is still the right pragmatic fix -- and probably the one I should have employed in ByteStreamsTest!
Done, changed to UTF-16LE (and updated the expected hashes accordingly).
This explicitly verifies consistency between `hashString` and `hashBytes` for known inputs, which goes beyond what `testStringInputsUtf8` does.
murmur3_32 has special handling for the UTF-8 encoding, so this also tests one other encoding. The UTF-16LE variant is chosen so that the result does not depend on the endianness of the system.
ee27fda to 6ac0909

Per #5649 (comment) I changed UTF-16 to UTF-16LE. I also rebased on current master to resolve conflicts and let the CI run.

@eamonnmcmanus @cpovirk is there anything I can do to help move this forward?

Deciding what to do here is on our queue of things to do. The fix in this PR will form the basis of whatever we decide on, but we don't yet know what that will be. In answer to the specific question, I think you've given us everything we need. Thanks!

I updated #5648 with our current plans.

@eamonnmcmanus thanks for the update. I still think it would be good to merge the preparatory commits from this PR ("Test ..."). What do you think? Do you want me to change this PR to introduce

Thanks @findepi for finding this bug and preparing such a thorough fix! That fix will be the basis for the change we make, but it is more straightforward for us if we make the change in Google's internal repo and then push it out. We'll be sure to credit you appropriately. I expect the change to land within the next week.
Fixes #5648