Skip to content

Conversation

@jun-he
Copy link
Collaborator

@jun-he jun-he commented Mar 28, 2022

Per discussion in #3450 (comment), open a new PR for bucket transform first.

Copy link
Contributor

@kbendick kbendick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jun-he!

Is there any way to re-use the existing hash function from the Java code? I know that's kind of hacky. But I think it would be best to ensure that they are in sync if at all possible by enforcing it through code sharing or some kind of explicit testing of the two.

We had an issue a while back with the hash function not being entirely correct for certain unicode characters (I believe it was 3 code point characters). While that was not good, at least data was in theory all affected the same way. I worry about hash functions differing depending on language / library used.

I was able to find this on differences between the Python and Java mumur3 hash implementations and their differences (and how to make them consistent). TLDR is that they diverge due to the big-endian vs little-endian representation of the data.

Maybe this could with testing to ensure the Java and Python implementations are consistent?

https://stackoverflow.com/questions/29932956/murmur3-hash-different-result-between-python-and-java-implementation

As frameworks like Spark will soon support storage based joins for things like true bucket joins, it's of the utmost importance that the bucket hash functions return the same result in both Java and Python. 🙂

@jun-he
Copy link
Collaborator Author

jun-he commented Mar 28, 2022

@kbendick Thanks for the comment.

I knew Jython can but not sure if cython can re-use Java implementation. Or might change both Java and Python to use a native C/C++ implementation?

Currently, I tried to keep python tests same as the Java tests, which are based on the examples in the spec.

In the past, we discussed a related topic in a PR. One agreed point in the discussion is that it might not be good to start duplicating guava bugs in other languages.

@jun-he jun-he requested a review from rdblue March 28, 2022 05:31
@jun-he
Copy link
Collaborator Author

jun-he commented Mar 28, 2022

@rdblue open a new one just for bucketing. Can you take another look? Thanks.

@kbendick
Copy link
Contributor

Oh that’s right. I do remember that discussion @jun-he. Thanks for the reminder.

We did eventually convert to the corrected version for Java, right? I think just having a handful of pre-chosen Unicode characters and other things that should be equal across the two would be a great start then. Perhaps with an explicit link to each other so the tests are specifically for interoperability.

Especially the (admittedly rare) Unicode characters that were originally causing issues, as well as other (non-English / ASCII characters) and numbers etc.

@jun-he
Copy link
Collaborator Author

jun-he commented Mar 28, 2022

@kbendick yep, totally agree to create an additional set of data to make sure the hash value to be the same for both Java and Python. I will create a doc (eventually should add it to the iceberg doc) and work with you together to make a list. Then add tests in both Java and Python in separate PRs.

return self.apply(value)

def apply(self, value):
raise NotImplementedError()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samredai, do you think we should convert this to an ABC?

return "null"
return str(value)

def dedup_name(self) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to make this a @property.

def result_type(self, source: IcebergType) -> IcebergType:
raise NotImplementedError()

def preserves_order(self) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want this to be @property?

return (self.hash(value) & IntegerType.max) % self._num_buckets

def can_transform(self, source: IcebergType) -> bool:
raise NotImplementedError()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should fall back to the parent's implementation rather than throwing NotImplementedError.


def hash(self, value: Decimal) -> int:
value_tuple = value.as_tuple()
unscaled_value = int(("-" if value_tuple.sign else "") + "".join([str(d) for d in value_tuple.digits]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should call iceberg.conversions.decimal_to_unscaled to get the unscaled value.

And min_num_bytes should be ((unscaled_value).bit_length() + 7) // 8. We can make that a utility function as well.

value_tuple = value.as_tuple()
unscaled_value = int(("-" if value_tuple.sign else "") + "".join([str(d) for d in value_tuple.digits]))
number_of_bytes = int(math.ceil(unscaled_value.bit_length() / 8))
value_in_bytes = unscaled_value.to_bytes(length=number_of_bytes, byteorder="big")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also use signed=True like conversions:

unscaled_value.to_bytes(min_num_bytes, "big", signed=True)

def test_decimal_bucket(test_input, test_type, scale_factor, expected_hash, expected):
getcontext().prec = 38
assert transforms.bucket(test_type, 100).hash(test_input.quantize(scale_factor)) == expected_hash
assert transforms.bucket(test_type, 100).apply(test_input.quantize(scale_factor)) == expected
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this test is needed. No decimal arithmetic is necessary because we can convert directly to the unscaled value using the conversion that @samredai added. There should be no need to change precision and quantize anything, so we don't need tests for it.

@rdblue rdblue changed the title Support bucket transform. Python: Add bucket transform Apr 3, 2022
@rdblue rdblue merged commit 5ba34da into apache:master Apr 3, 2022
@rdblue
Copy link
Contributor

rdblue commented Apr 3, 2022

Thanks, @jun-he! There are minor comments, but we can follow up to fix those the tests that are present match the spec and everything looks good. Let's move on to the next set of transforms!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants