Skip to content

Conversation

@jun-he
Copy link
Collaborator

@jun-he jun-he commented Jul 18, 2021

Follow up #2689 (comment)

@jun-he
Copy link
Collaborator Author

jun-he commented Jul 18, 2021

@TGooch44 and @rymurr can you help to review it? Thanks.

@rymurr
Copy link
Contributor

rymurr commented Jul 19, 2021

Thanks @jun-he ! This looks good. However I am worried about #2837 and if python and Java might return different results. What do you think?

@TGooch44
Copy link
Contributor

Thanks @jun-he ! This looks good. However I am worried about #2837 and if python and Java might return different results. What do you think?

It looks like the following matches the expected output(which I guess is different than the Java output):

>>> import mmh3
>>> (mmh3.hash("💰".encode("utf-8")) & 2147483647) % 32
12

@jun-he
Copy link
Collaborator Author

jun-he commented Jul 20, 2021

@rymurr @TGooch44 they might mismatch. I will take a look at all the hashcode generated in Python versus Java. This might be a problem for many.

@rymurr
Copy link
Contributor

rymurr commented Jul 20, 2021

I guess python is doing the 'right' thing. The question for me is: will Java start doing the right thing or maintain the wrong thing for backwards compatibility. Python should, I think, be consistent w/ Java.

@TGooch44
Copy link
Contributor

TGooch44 commented Jul 20, 2021 via email

@findepi
Copy link
Member

findepi commented Jul 21, 2021

Elevating the 'wrong' behavior to be 'the standard' will be hard for non-Java languages.
Please take a look at my proposed fix in Guava google/guava#5649 for this.
In order to match Iceberg Java implementation, one would perhaps need to translate that code (without the fix) into Python.

Waiting for Java bucketing version to be fixed seems reasonable though. This avoids correctness issues -- no support for bucketing in Python, means no bugs at all.

FWIW, in Trino, bucketing did not have the bug (that's how we found #2837), so 'correctly' bucketed data can be out there too.

@rymurr
Copy link
Contributor

rymurr commented Jul 22, 2021

Elevating the 'wrong' behavior to be 'the standard' will be hard for non-Java languages.

I agree. I think the last thing we want to do is start duplicating guava bugs in other languages. It is hard and unnecessary.

Shall we wait for the guava fix to be merged and propagated to Iceberg then?

@TGooch44
Copy link
Contributor

Waiting seems like the best option here.

@findepi
Copy link
Member

findepi commented Jul 22, 2021

Shall we wait for the guava fix to be merged and propagated to Iceberg then?

Per @rdblue 's #2837 (comment) i posted a proposed fix #2849.
If it gets merged, the work here would be unblocked.

@jun-he
Copy link
Collaborator Author

jun-he commented Jul 23, 2021

Thanks @findepi for the fix.

@TGooch44 @rymurr I think python should just hash the UTF-8 bytes. I will run a quick check to see if Java and Python match in various cases.

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

is there any place in tests where hash values could be asserted?

@jun-he
Copy link
Collaborator Author

jun-he commented Jul 26, 2021

@findepi I added additional tests to check the hash values.
It matched the result in Java test

@TGooch44 TGooch44 merged commit a54ba55 into apache:master Jul 28, 2021
minchowang pushed a commit to minchowang/iceberg that referenced this pull request Aug 2, 2021
* [Python] support BucketByteBuffer and BucketUUID

* Add additional unit tests for bucket hash methods.
jun-he added a commit to jun-he/incubator-iceberg that referenced this pull request Aug 9, 2021
* [Python] support BucketByteBuffer and BucketUUID

* Add additional unit tests for bucket hash methods.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants