Python: Add bucket transform #4416

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

rdblue merged 3 commits into apache:master from jun-he:jun/bucketing

Apr 3, 2022

Collaborator

jun-he commented Mar 28, 2022 •

edited

Loading

Per discussion in #3450 (comment), open a new PR for bucket transform first.

github-actions bot added the python label

jun-he force-pushed the jun/bucketing branch from b276d9c to ef634b7 Compare

March 28, 2022 02:10


          Support bucket transform.

44ca7b9

jun-he force-pushed the jun/bucketing branch from ef634b7 to 44ca7b9 Compare

March 28, 2022 04:02

kbendick reviewed

View reviewed changes

Contributor

kbendick left a comment •

edited

Loading

Thanks @jun-he!

Is there any way to re-use the existing hash function from the Java code? I know that's kind of hacky. But I think it would be best to ensure that they are in sync if at all possible by enforcing it through code sharing or some kind of explicit testing of the two.

We had an issue a while back with the hash function not being entirely correct for certain unicode characters (I believe it was 3 code point characters). While that was not good, at least data was in theory all affected the same way. I worry about hash functions differing depending on language / library used.

I was able to find this on differences between the Python and Java mumur3 hash implementations and their differences (and how to make them consistent). TLDR is that they diverge due to the big-endian vs little-endian representation of the data.

Maybe this could with testing to ensure the Java and Python implementations are consistent?

https://stackoverflow.com/questions/29932956/murmur3-hash-different-result-between-python-and-java-implementation

As frameworks like Spark will soon support storage based joins for things like true bucket joins, it's of the utmost importance that the bucket hash functions return the same result in both Java and Python. 🙂


          fix the docs

731828a

jun-he force-pushed the jun/bucketing branch from f603cf4 to 731828a Compare

March 28, 2022 05:08

Collaborator Author

jun-he commented Mar 28, 2022 •

edited

Loading

@kbendick Thanks for the comment.

I knew Jython can but not sure if cython can re-use Java implementation. Or might change both Java and Python to use a native C/C++ implementation?

Currently, I tried to keep python tests same as the Java tests, which are based on the examples in the spec.

In the past, we discussed a related topic in a PR. One agreed point in the discussion is that it might not be good to start duplicating guava bugs in other languages.

jun-he requested a review from rdblue

March 28, 2022 05:31

Collaborator Author

jun-he commented Mar 28, 2022

@rdblue open a new one just for bucketing. Can you take another look? Thanks.

Contributor

kbendick commented Mar 28, 2022

Oh that’s right. I do remember that discussion @jun-he. Thanks for the reminder.

We did eventually convert to the corrected version for Java, right? I think just having a handful of pre-chosen Unicode characters and other things that should be equal across the two would be a great start then. Perhaps with an explicit link to each other so the tests are specifically for interoperability.

Especially the (admittedly rare) Unicode characters that were originally causing issues, as well as other (non-English / ASCII characters) and numbers etc.

Collaborator Author

jun-he commented Mar 28, 2022

@kbendick yep, totally agree to create an additional set of data to make sure the hash value to be the same for both Java and Python. I will create a doc (eventually should add it to the iceberg doc) and work with you together to make a list. Then add tests in both Java and Python in separate PRs.

rdblue reviewed

View reviewed changes

python/src/iceberg/transforms.py

+                      return self.apply(value)
+                  def apply(self, value):
+                      raise NotImplementedError()

Contributor

rdblue Apr 3, 2022

@samredai, do you think we should convert this to an ABC?

rdblue reviewed

View reviewed changes

python/src/iceberg/transforms.py

+                          return "null"
+                      return str(value)
+                  def dedup_name(self) -> str:

Contributor

rdblue Apr 3, 2022

We may want to make this a @property.

rdblue reviewed

View reviewed changes

python/src/iceberg/transforms.py

+                  def result_type(self, source: IcebergType) -> IcebergType:
+                      raise NotImplementedError()
+                  def preserves_order(self) -> bool:

Contributor

rdblue Apr 3, 2022

Might want this to be @property?

rdblue reviewed

View reviewed changes

python/src/iceberg/transforms.py

+                      return (self.hash(value) & IntegerType.max) % self._num_buckets
+                  def can_transform(self, source: IcebergType) -> bool:
+                      raise NotImplementedError()

Contributor

rdblue Apr 3, 2022

I think this should fall back to the parent's implementation rather than throwing NotImplementedError.

rdblue reviewed

View reviewed changes

python/src/iceberg/transforms.py

+                  def hash(self, value: Decimal) -> int:
+                      value_tuple = value.as_tuple()
+                      unscaled_value = int(("-" if value_tuple.sign else "") + "".join([str(d) for d in value_tuple.digits]))

Contributor

rdblue Apr 3, 2022

This should call iceberg.conversions.decimal_to_unscaled to get the unscaled value.

And min_num_bytes should be ((unscaled_value).bit_length() + 7) // 8. We can make that a utility function as well.

rdblue reviewed

View reviewed changes

python/src/iceberg/transforms.py

+                      value_tuple = value.as_tuple()
+                      unscaled_value = int(("-" if value_tuple.sign else "") + "".join([str(d) for d in value_tuple.digits]))
+                      number_of_bytes = int(math.ceil(unscaled_value.bit_length() / 8))
+                      value_in_bytes = unscaled_value.to_bytes(length=number_of_bytes, byteorder="big")

Contributor

rdblue Apr 3, 2022

This should also use signed=True like conversions:

unscaled_value.to_bytes(min_num_bytes, "big", signed=True)

rdblue reviewed

View reviewed changes

python/tests/test_transforms.py

+              def test_decimal_bucket(test_input, test_type, scale_factor, expected_hash, expected):
+                  getcontext().prec = 38
+                  assert transforms.bucket(test_type, 100).hash(test_input.quantize(scale_factor)) == expected_hash
+                  assert transforms.bucket(test_type, 100).apply(test_input.quantize(scale_factor)) == expected

Contributor

rdblue Apr 3, 2022

I don't think this test is needed. No decimal arithmetic is necessary because we can convert directly to the unscaled value using the conversion that @samredai added. There should be no need to change precision and quantize anything, so we don't need tests for it.

rdblue approved these changes

View reviewed changes


          Merge branch 'master' into jun/bucketing

a705a2e

rdblue changed the title ~~Support bucket transform.~~ Python: Add bucket transform

rdblue merged commit 5ba34da into apache:master

Contributor

rdblue commented Apr 3, 2022

Thanks, @jun-he! There are minor comments, but we can follow up to fix those the tests that are present match the spec and everything looks good. Let's move on to the next set of transforms!

rdblue mentioned this pull request

Python: Refactor to use common decimal and datetime util #4480

Merged

jun-he deleted the jun/bucketing branch

April 4, 2022 00:51

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels