[Python] support BucketByteBuffer and BucketUUID #2836

jun-he · 2021-07-18T00:50:54Z

jun-he · 2021-07-18T00:51:53Z

@TGooch44 and @rymurr, could you help review this? Thanks.

rymurr · 2021-07-19T14:13:43Z

Thanks @jun-he! This looks good. However, I am worried about #2837 and whether Python and Java might return different results. What do you think?

TGooch44 · 2021-07-19T18:52:11Z

> Thanks @jun-he! This looks good. However, I am worried about #2837 and whether Python and Java might return different results. What do you think?

It looks like the following matches the expected output (which I guess is different from the Java output):

>>> import mmh3
>>> (mmh3.hash("💰".encode("utf-8")) & 2147483647) % 32
12
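For readers without `mmh3` installed, here is a minimal pure-Python sketch of what the one-liner above computes. It assumes `mmh3.hash` is MurmurHash3 (x86, 32-bit) with seed 0, returning a signed 32-bit integer. The names `murmur3_32`, `bucket_bytes`, `bucket_string`, and `bucket_uuid` are hypothetical and are not the functions added in this PR. Hashing the 16 big-endian bytes of a UUID is likewise an assumption made for illustration.

```python
import uuid

_MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & _MASK32


def _mix(k: int) -> int:
    """Scramble one 32-bit block before it is folded into the hash state."""
    k = (k * 0xCC9E2D51) & _MASK32
    k = _rotl32(k, 15)
    return (k * 0x1B873593) & _MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit over raw bytes, returned as a signed 32-bit int."""
    h = seed & _MASK32
    n_full = len(data) // 4 * 4

    # Body: consume the input in 4-byte little-endian blocks.
    for i in range(0, n_full, 4):
        h ^= _mix(int.from_bytes(data[i:i + 4], "little"))
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & _MASK32

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[n_full:]
    if tail:
        h ^= _mix(int.from_bytes(tail, "little"))

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data) & _MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & _MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & _MASK32
    h ^= h >> 16

    # Reinterpret the unsigned result as a signed 32-bit integer.
    return h - (1 << 32) if h & 0x80000000 else h


def bucket_bytes(data: bytes, num_buckets: int) -> int:
    """Clear the sign bit, then take the remainder, as in the snippet above."""
    return (murmur3_32(data) & 0x7FFFFFFF) % num_buckets


def bucket_string(value: str, num_buckets: int) -> int:
    """Bucket a string by hashing its UTF-8 bytes."""
    return bucket_bytes(value.encode("utf-8"), num_buckets)


def bucket_uuid(value: uuid.UUID, num_buckets: int) -> int:
    """Bucket a UUID by hashing its 16 bytes (assumed big-endian layout)."""
    return bucket_bytes(value.bytes, num_buckets)


if __name__ == "__main__":
    print(bucket_string("\U0001F4B0", 32))  # the same character as in the snippet above
```

Masking with `0x7FFFFFFF` before the modulo keeps the bucket index non-negative even when the signed hash is negative.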

jun-he · 2021-07-20T06:13:49Z

@rymurr @TGooch44 They might mismatch. I will compare the hash codes generated in Python against those generated in Java. This might be a problem for many.

rymurr · 2021-07-20T09:06:03Z

I guess Python is doing the 'right' thing. The question for me is: will Java start doing the right thing, or maintain the wrong thing for backwards compatibility? Python should, I think, be consistent with Java.

TGooch44 · 2021-07-20T14:27:26Z

@rymurr Unfortunately, I think you're right that matching Java is more important than being correct according to the spec. We may want to wait on this until the Java community decides whether they are going to fix this. It seems like they're trying to decide how likely existing users are to have been impacted.


findepi · 2021-07-21T11:54:39Z

Elevating the 'wrong' behavior to be 'the standard' will be hard for non-Java languages.
Please take a look at my proposed fix for this in Guava (google/guava#5649).
In order to match Iceberg Java implementation, one would perhaps need to translate that code (without the fix) into Python.

Waiting for the Java bucketing implementation to be fixed seems reasonable, though. This avoids correctness issues: no support for bucketing in Python means no bugs at all.

FWIW, in Trino, bucketing did not have the bug (that's how we found #2837), so 'correctly' bucketed data can be out there too.
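To make the edge case concrete, the sketch below shows why a non-BMP character such as the one in the earlier `mmh3` snippet is delicate for a UTF-16 based string type like Java's. It is not a reproduction of the Guava bug. It only shows that the character is one code point and four bytes in UTF-8, but a surrogate pair in UTF-16, so any encoder that mishandles the pair feeds different bytes to the hash. The variable names are hypothetical.

```python
ch = "\U0001F4B0"  # the money-bag emoji used earlier in this thread

# Correct UTF-8: a single 4-byte sequence for the code point U+1F4B0.
utf8 = ch.encode("utf-8")

# UTF-16 represents the same code point as a surrogate pair (two 16-bit units).
utf16 = ch.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

# Encoding each surrogate on its own yields 6 bytes instead of 4.
# "surrogatepass" is needed because lone surrogates are not valid UTF-8.
per_unit = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")

print(utf8.hex(), hex(high), hex(low), per_unit.hex())
```

Because the hash is computed over bytes, two implementations agree on a bucket only if they agree on the byte sequence first.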

rymurr · 2021-07-22T09:21:12Z

> Elevating the 'wrong' behavior to be 'the standard' will be hard for non-Java languages.

I agree. I think the last thing we want to do is start duplicating Guava bugs in other languages. It is hard and unnecessary.

Shall we wait for the Guava fix to be merged and propagated to Iceberg, then?

TGooch44 · 2021-07-22T11:25:22Z

Waiting seems like the best option here.

findepi · 2021-07-22T14:59:33Z

> Shall we wait for the Guava fix to be merged and propagated to Iceberg, then?

Per @rdblue's #2837 (comment), I posted a proposed fix, #2849.
If it gets merged, the work here would be unblocked.

jun-he · 2021-07-23T01:49:46Z

Thanks @findepi for the fix.

@TGooch44 @rymurr I think Python should just hash the UTF-8 bytes. I will run a quick check to see whether Java and Python match in various cases.

findepi

Code LGTM.

Is there any place in the tests where hash values could be asserted?

jun-he · 2021-07-26T02:46:50Z

@findepi I added additional tests to check the hash values.
They match the results in the Java tests.

* [Python] support BucketByteBuffer and BucketUUID
* Add additional unit tests for bucket hash methods.

[Python] support BucketByteBuffer and BucketUUID

3dd78c7

github-actions bot added the python label Jul 18, 2021

findepi mentioned this pull request Jul 23, 2021

Fix murmur3_32 with UTF-8 encoding for input with non-BMP character google/guava#5649

Closed

findepi reviewed Jul 25, 2021


Add additional unit tests for bucket hash methods.

cecad17

TGooch44 merged commit a54ba55 into apache:master Jul 28, 2021

minchowang pushed a commit to minchowang/iceberg that referenced this pull request Aug 2, 2021

[Python] support BucketByteBuffer and BucketUUID (apache#2836)

2d5f833


jun-he added a commit to jun-he/incubator-iceberg that referenced this pull request Aug 9, 2021

[Python] support BucketByteBuffer and BucketUUID (apache#2836)

0a9c1a0


jun-he mentioned this pull request Mar 28, 2022

Python: Add bucket transform #4416

Merged
