Skip to content

Conversation

@findepi
Copy link
Contributor

@findepi findepi commented Jul 16, 2021

Fixes #5648

@google-cla google-cla bot added the cla: yes label Jul 16, 2021
@findepi
Copy link
Contributor Author

findepi commented Jul 16, 2021

Admittedly, this as is not the most optimal approach, i will try to figure out how to fix the code without bailing out.
However, even with current state i think it's an improvement ("correctness first").

(obsolete)

@findepi
Copy link
Contributor Author

findepi commented Jul 16, 2021

@lowasser i realized the proper fix is a one-liner.

@lowasser @cpovirk can you PTAL?

@findepi
Copy link
Contributor Author

findepi commented Jul 22, 2021

@lowasser seems like you authored portions of the code being changed here.
would you like to drop your two cents on the PR?

@eamonnmcmanus
Copy link
Member

eamonnmcmanus commented Jul 22, 2021

I think we have a concern that fixing this bug may produce problems for systems that were persisting the old wrong hashcode, for example using it as part of a key in persistent storage. We recognize that the current computed hashcode is wrong, and that it is inconsistent with the hashcode you get if you hash the same string in a different but supposedly equivalent way. We haven't yet figured out the best way forward.

@findepi
Copy link
Contributor Author

findepi commented Jul 23, 2021

@eamonnmcmanus thank you for your reply.

I think we have a concern that fixing this bug may produce problems for systems that were persisting the old wrong hashcode, for example using it as part of a key in persistent storage.

i sympathize with concern. Actually this is this context where we were able to identify the bug.
Guava-based code was not producing same hash value as non-Guava-based code.

We recognize that the current computed hashcode is wrong, and that it is inconsistent with the hashcode you get if you hash the same string in a different but supposedly equivalent way.

Thanks for making it clear.

Note that it's not only about Guava APIs that should return same results (but they do not).
It's also about interoperability between applications that use Guava for calculating murmur3 and those that do not.
For example, as in apache/iceberg#2836 (comment), it would be hard to implement 'compatible' hashing for Python.

We haven't yet figured out the best way forward.

I don't think not fixing a bug is a long term option for anyone, would you agree?
What are the options you're considering?

I think we can consider

  • just fix the bug
  • make the hashString throw for surrogate pairs, to force applications to realize there is a problem; and then, after few releases, fix the bug
  • keep bogus implementation next to Hashing#murmur3_32() (eg as Hashing#murmur3_32_bogus)

can you think of any other options?

@nick-someone nick-someone added P3 no SLO package=hash type=defect Bug, not working as expected labels Jul 26, 2021
@eamonnmcmanus eamonnmcmanus self-assigned this Jul 26, 2021
assertStringHash(0x8a5c3699, "surrogate pair: \uD83D\uDCB0", Charsets.UTF_8);

assertStringHash(0, "", Charsets.UTF_16);
assertStringHash(0xae9d4799, "k", Charsets.UTF_16);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a subtle problem with this. (At least, it took me a while to figure it out.) A byte array from the UTF_16 encoding starts with a two-byte Byte Order Mark (BOM) which is either fe ff or ff fe and indicates the endianness that the remaining bytes will use. It's not specified which endianness a given Java platform will use, so it's not correct to hash a string using the UTF_16 encoding if you want the result to be portable. I discovered this because some non-public tests were using the opposite endianness and failing. I rewrote these test cases to use UTF_16LE and updated the hashcodes. (You don't need to do that here but if we go ahead with this change then the modified test will be in the version that we use.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eamonnmcmanus indeed, this isn't portable. Thanks for catching this.
UTF-16LE is as good, as the only point of this test coverage is to exercise the code path that was not optimized for UTF-8.

If there is any value in changing anything in this PR, i am happy to apply changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw a reference to this conversation, and I dug up Android issue 37074504 (fix), last seen as "https://code.google.com/p/android/issues/detail?id=196848" in ByteStreamsTest. I think what Éamonn is seeing is a bug in the very old version of Android that we test with.

UTF-16LE is still the right pragmatic fix -- and probably the one I should have employed in ByteStreamsTest!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, changed to UTF-16LE (and updated the expected hashes accordingly).

findepi added 4 commits August 4, 2021 21:47
This explicitly verifies consistency between `hashString` and
`hashBytes` for known inputs, what is additional to what
`testStringInputsUtf8` does.
murmur3_32 has a special handling for UTF-8 encoding, so testing one
other encoding.

UTF-16LE variant is chosen, so that the result does not depend on
endianness of the system.
@findepi findepi force-pushed the findepi/murmur32bmp branch from ee27fda to 6ac0909 Compare August 4, 2021 19:51
@findepi
Copy link
Contributor Author

findepi commented Aug 4, 2021

Per #5649 (comment) i changed UTF-16 to UTFLE.

I also rebased on current master to resolve conflicts and let the CI run.
#5654 was merged which added a couple more test cases.

@findepi
Copy link
Contributor Author

findepi commented Aug 8, 2021

@eamonnmcmanus @cpovirk is there anything i can do to help move this forward?

@eamonnmcmanus
Copy link
Member

@eamonnmcmanus @cpovirk is there anything i can do to help move this forward?

Deciding what to do here is on our queue of things to do. The fix in this PR will form the basis of whatever we decide on, but we don't yet know what that will be. In answer to the specific question, I think you've given us everything we need. Thanks!

@eamonnmcmanus
Copy link
Member

I updated #5648 with our current plans.

@findepi
Copy link
Contributor Author

findepi commented Sep 5, 2021

@eamonnmcmanus thanks for the update.

i still think it would be good to merge preparatory commits from this PR ("Test ..."). What do you think?

do you want me to change this PR to introducemurmur3_32_fixed as per #5648 (comment)?

@eamonnmcmanus
Copy link
Member

Thanks @findepi for finding this bug and preparing such a thorough fix! That fix will be the basis for the change we make, but it is more straightforward for us if we make the change in Google's internal repo and then push it out. We'll be sure to credit you appropriately. I expect the change to land within the next week.

copybara-service bot pushed a commit that referenced this pull request Sep 7, 2021
The bug was found by @findepi who also contributed the fix and the new tests via #5649.

RELNOTES=n/a
PiperOrigin-RevId: 386953108
copybara-service bot pushed a commit that referenced this pull request Sep 7, 2021
The bug was found by @findepi who also contributed the fix and the new tests via #5649.

RELNOTES=`hash`: Deprecated buggy `murmur3_32`, and introduced `murmur3_32_fixed`.
PiperOrigin-RevId: 386953108
copybara-service bot pushed a commit that referenced this pull request Sep 8, 2021
The bug was found by @findepi who also contributed the fix and the new tests via #5649.

Fixes #5648.

RELNOTES=`hash`: Deprecated buggy `murmur3_32`, and introduced `murmur3_32_fixed`.
PiperOrigin-RevId: 386953108
copybara-service bot pushed a commit that referenced this pull request Sep 8, 2021
The bug was found by @findepi who also contributed the fix and the new tests via #5649.

Fixes #5648.

RELNOTES=`hash`: Deprecated buggy `murmur3_32`, and introduced `murmur3_32_fixed`.
PiperOrigin-RevId: 395463974
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes P3 no SLO package=hash type=defect Bug, not working as expected

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incorrect hash result from murmur3_32 with String input containing surrogate pairs

4 participants