Fix murmur3_32 with UTF-8 encoding for input with non-BMP character #5649

findepi · 2021-07-16T17:32:27Z

findepi · 2021-07-16T17:36:45Z

~~Admittedly, this is not the optimal approach as is; i will try to figure out how to fix the code without bailing out.~~
~~However, even in its current state i think it's an improvement ("correctness first").~~

(obsolete)

findepi · 2021-07-16T17:53:49Z

@lowasser i realized the proper fix is a one-liner.

@lowasser @cpovirk can you PTAL?

findepi · 2021-07-22T17:47:47Z

@lowasser seems like you authored portions of the code being changed here.
would you like to drop your two cents on the PR?

eamonnmcmanus · 2021-07-22T20:54:32Z

I think we have a concern that fixing this bug may produce problems for systems that were persisting the old wrong hashcode, for example using it as part of a key in persistent storage. We recognize that the current computed hashcode is wrong, and that it is inconsistent with the hashcode you get if you hash the same string in a different but supposedly equivalent way. We haven't yet figured out the best way forward.

findepi · 2021-07-23T07:53:37Z

@eamonnmcmanus thank you for your reply.

> I think we have a concern that fixing this bug may produce problems for systems that were persisting the old wrong hashcode, for example using it as part of a key in persistent storage.

i sympathize with the concern. Actually, this is the context in which we were able to identify the bug:
Guava-based code was not producing the same hash value as non-Guava-based code.

> We recognize that the current computed hashcode is wrong, and that it is inconsistent with the hashcode you get if you hash the same string in a different but supposedly equivalent way.

Thanks for making it clear.

Note that it's not only about Guava APIs that should return the same results (but do not).
It's also about interoperability between applications that use Guava for calculating murmur3 and those that do not.
For example, as in apache/iceberg#2836 (comment), it would be hard to implement 'compatible' hashing for Python.
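
To illustrate the interoperability point: a non-Guava implementation would typically compute Murmur3 (32-bit) over the UTF-8 bytes of the string, which is also what hashing the encoded bytes should produce. Below is a minimal sketch of that computation in plain Java, assuming seed 0. The class and method names (`Murmur3Sketch`, `hashBytes`, `hashUtf8`) are hypothetical, and this is not Guava's code.

```java
import java.nio.charset.StandardCharsets;

/** Sketch of Murmur3 32-bit over raw bytes; names are illustrative, not Guava's. */
final class Murmur3Sketch {
  private static final int C1 = 0xcc9e2d51;
  private static final int C2 = 0x1b873593;

  /** Scrambles one 32-bit block before it is mixed into the running hash. */
  private static int mixK1(int k1) {
    k1 *= C1;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= C2;
    return k1;
  }

  static int hashBytes(byte[] data, int seed) {
    int h1 = seed;
    int len = data.length;
    int i = 0;
    // Body: consume the input four bytes at a time, little-endian.
    for (; i + 4 <= len; i += 4) {
      int k1 = (data[i] & 0xff)
          | ((data[i + 1] & 0xff) << 8)
          | ((data[i + 2] & 0xff) << 16)
          | ((data[i + 3] & 0xff) << 24);
      h1 ^= mixK1(k1);
      h1 = Integer.rotateLeft(h1, 13);
      h1 = h1 * 5 + 0xe6546b64;
    }
    // Tail: the remaining one to three bytes, if any.
    int tail = len - i;
    int k1 = 0;
    if (tail == 3) k1 ^= (data[i + 2] & 0xff) << 16;
    if (tail >= 2) k1 ^= (data[i + 1] & 0xff) << 8;
    if (tail >= 1) {
      k1 ^= data[i] & 0xff;
      h1 ^= mixK1(k1);
    }
    // Finalization: fold in the length and avalanche the bits.
    h1 ^= len;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }

  /** Encodes to UTF-8 first, then hashes the bytes with seed 0. */
  static int hashUtf8(String s) {
    return hashBytes(s.getBytes(StandardCharsets.UTF_8), 0);
  }

  public static void main(String[] args) {
    // Sample string taken from the test in this PR; it ends with U+1F4B0.
    String s = "surrogate pair: \uD83D\uDCB0";
    int utf8Length = s.getBytes(StandardCharsets.UTF_8).length;
    System.out.printf("chars=%d utf8Bytes=%d hash=0x%08x%n", s.length(), utf8Length, hashUtf8(s));
  }
}
```

The sample string's non-BMP character occupies two Java `char`s but four UTF-8 bytes, which is the case the special UTF-8 handling in `murmur3_32` has to get right.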

> We haven't yet figured out the best way forward.

I don't think not fixing a bug is a long term option for anyone, would you agree?
What are the options you're considering?

I think we can consider

- just fix the bug
- make `hashString` throw for surrogate pairs, to force applications to realize there is a problem; then, after a few releases, fix the bug
- keep the bogus implementation next to `Hashing#murmur3_32()` (e.g. as `Hashing#murmur3_32_bogus`)

can you think of any other options?
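
For the second option above, the check itself would be small. A rough sketch, assuming the rejection happens before hashing; `SurrogateCheck`, `containsSurrogate`, and `rejectSurrogates` are hypothetical names, not existing Guava API:

```java
/** Sketch of a pre-hash guard that rejects input containing surrogate chars. */
final class SurrogateCheck {
  /** Returns true if any char is a high or low surrogate (paired or not). */
  static boolean containsSurrogate(CharSequence input) {
    for (int i = 0; i < input.length(); i++) {
      if (Character.isSurrogate(input.charAt(i))) {
        return true;
      }
    }
    return false;
  }

  /** Throws if the input contains a surrogate, so callers notice the affected inputs. */
  static void rejectSurrogates(CharSequence input) {
    if (containsSurrogate(input)) {
      throw new IllegalArgumentException("input contains a non-BMP character or a lone surrogate");
    }
  }

  public static void main(String[] args) {
    System.out.println(containsSurrogate("k"));            // BMP-only input
    System.out.println(containsSurrogate("\uD83D\uDCB0")); // one non-BMP code point
  }
}
```

Only inputs containing surrogates are affected by the bug, so such a guard would leave BMP-only callers untouched.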

eamonnmcmanus · 2021-07-27T15:36:45Z

guava-tests/test/com/google/common/hash/Murmur3Hash32Test.java

+    assertStringHash(0x8a5c3699, "surrogate pair: \uD83D\uDCB0", Charsets.UTF_8);
+
+    assertStringHash(0, "", Charsets.UTF_16);
+    assertStringHash(0xae9d4799, "k", Charsets.UTF_16);


There's a subtle problem with this. (At least, it took me a while to figure it out.) A byte array from the UTF_16 encoding starts with a two-byte Byte Order Mark (BOM) which is either fe ff or ff fe and indicates the endianness that the remaining bytes will use. It's not specified which endianness a given Java platform will use, so it's not correct to hash a string using the UTF_16 encoding if you want the result to be portable. I discovered this because some non-public tests were using the opposite endianness and failing. I rewrote these test cases to use UTF_16LE and updated the hashcodes. (You don't need to do that here but if we go ahead with this change then the modified test will be in the version that we use.)
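
The difference between the encodings can be seen directly in the encoded bytes. The sketch below prints the bytes of `"k"` under `UTF_16`, `UTF_16LE`, and `UTF_16BE`; the class and helper names are made up for illustration. The Java SE `Charset` documentation describes `UTF-16` encoding as big-endian with a BOM, so the first line should only differ on platforms that deviate from that, such as the old Android version mentioned later in this thread.

```java
import java.nio.charset.StandardCharsets;

/** Prints the encoded bytes of a one-character string under three UTF-16 variants. */
final class Utf16BomDemo {
  /** Formats bytes as space-separated lowercase hex pairs. */
  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // UTF_16 prepends a byte order mark, so the output is four bytes, not two.
    System.out.println("UTF_16:   " + hex("k".getBytes(StandardCharsets.UTF_16)));
    // The LE and BE variants fix the byte order and write no BOM.
    System.out.println("UTF_16LE: " + hex("k".getBytes(StandardCharsets.UTF_16LE))); // 6b 00
    System.out.println("UTF_16BE: " + hex("k".getBytes(StandardCharsets.UTF_16BE))); // 00 6b
  }
}
```

Because the explicit-endianness variants emit no BOM and leave no choice of byte order, a hash over their bytes does not depend on the platform.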

findepi · 2021-07-28

@eamonnmcmanus indeed, this isn't portable. Thanks for catching this.
UTF-16LE is just as good, since the only point of this test coverage is to exercise the code path that is not optimized for UTF-8.

If there is any value in changing anything in this PR, i am happy to apply changes.

cpovirk · 2021-08-04

I saw a reference to this conversation, and I dug up Android issue 37074504 (fix), last seen as "https://code.google.com/p/android/issues/detail?id=196848" in ByteStreamsTest. I think what Éamonn is seeing is a bug in the very old version of Android that we test with.

UTF-16LE is still the right pragmatic fix -- and probably the one I should have employed in ByteStreamsTest!

findepi · 2021-08-04

Done, changed to UTF-16LE (and updated the expected hashes accordingly).


findepi · 2021-08-04T19:52:37Z

Per #5649 (comment) i changed UTF-16 to UTF-16LE.

I also rebased on current master to resolve conflicts and let the CI run.
#5654 was merged which added a couple more test cases.

findepi · 2021-08-08T18:35:14Z

@eamonnmcmanus @cpovirk is there anything i can do to help move this forward?

eamonnmcmanus · 2021-08-09T16:08:28Z

> @eamonnmcmanus @cpovirk is there anything i can do to help move this forward?

Deciding what to do here is on our queue of things to do. The fix in this PR will form the basis of whatever we decide on, but we don't yet know what that will be. In answer to the specific question, I think you've given us everything we need. Thanks!

eamonnmcmanus · 2021-09-02T18:23:19Z

I updated #5648 with our current plans.

findepi · 2021-09-05T18:02:47Z

@eamonnmcmanus thanks for the update.

i still think it would be good to merge preparatory commits from this PR ("Test ..."). What do you think?

do you want me to change this PR to introduce `murmur3_32_fixed` as per #5648 (comment)?

eamonnmcmanus · 2021-09-06T17:44:53Z

Thanks @findepi for finding this bug and preparing such a thorough fix! That fix will be the basis for the change we make, but it is more straightforward for us if we make the change in Google's internal repo and then push it out. We'll be sure to credit you appropriately. I expect the change to land within the next week.

@findepi

The bug was found by @findepi who also contributed the fix and the new tests via #5649. RELNOTES=n/a PiperOrigin-RevId: 386953108

@findepi

The bug was found by @findepi who also contributed the fix and the new tests via #5649. RELNOTES=`hash`: Deprecated buggy `murmur3_32`, and introduced `murmur3_32_fixed`. PiperOrigin-RevId: 386953108

@findepi

The bug was found by @findepi who also contributed the fix and the new tests via #5649. Fixes #5648. RELNOTES=`hash`: Deprecated buggy `murmur3_32`, and introduced `murmur3_32_fixed`. PiperOrigin-RevId: 386953108

@findepi

The bug was found by @findepi who also contributed the fix and the new tests via #5649. Fixes #5648. RELNOTES=`hash`: Deprecated buggy `murmur3_32`, and introduced `murmur3_32_fixed`. PiperOrigin-RevId: 395463974

google-cla bot added the cla: yes label Jul 16, 2021

findepi mentioned this pull request Jul 16, 2021

Incorrect hash result from murmur3_32 with String input containing surrogate pairs #5648

Closed

findepi force-pushed the findepi/murmur32bmp branch from 0d95453 to 72da93a Compare July 16, 2021 17:51

findepi force-pushed the findepi/murmur32bmp branch 2 times, most recently from 00dd9db to ee27fda Compare July 20, 2021 09:37

findepi mentioned this pull request Jul 21, 2021

[Python] support BucketByteBuffer and BucketUUID apache/iceberg#2836

Merged

nick-someone added P3 no SLO package=hash type=defect Bug, not working as expected labels Jul 26, 2021

eamonnmcmanus self-assigned this Jul 26, 2021

eamonnmcmanus reviewed Jul 27, 2021


findepi added 4 commits August 4, 2021 21:47

Test murmur3_32 hashBytes for known inputs as well

fe5eb09

This explicitly verifies consistency between `hashString` and `hashBytes` for known inputs, what is additional to what `testStringInputsUtf8` does.

Test murmur3_32 putString and putBytes for known inputs as well

1c48d1f

Test murmur3_32 with UTF_16 as well

e85ac63

murmur3_32 has a special handling for UTF-8 encoding, so testing one other encoding. UTF-16LE variant is chosen, so that the result does not depend on endianness of the system.

Fix murmur3_32 with UTF-8 encoding for input with non-BMP character

6ac0909

findepi force-pushed the findepi/murmur32bmp branch from ee27fda to 6ac0909 Compare August 4, 2021 19:51

eamonnmcmanus closed this Sep 6, 2021

findepi deleted the findepi/murmur32bmp branch September 6, 2021 20:06

copybara-service bot mentioned this pull request Sep 7, 2021

Deprecate buggy murmur3_32 and introduce murmur3_32_fixed. #5657

Closed

copybara-service bot pushed a commit that referenced this pull request Sep 7, 2021

Fix a bug in hashing strings with non-BMP chars using Murmur3_32.

15d7e82

The bug was found by @findepi who also contributed the fix and the new tests via #5649. RELNOTES=n/a PiperOrigin-RevId: 386953108
