-
Notifications
You must be signed in to change notification settings - Fork 3k
Python: Add truncate transform #5030
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Found at python/cpython#85160 that singledispatchmethod significantly slower than singledispatch. As the community is working to fix it, wondering if we should just keep singledispatchmethod for simplicity. |
python/src/iceberg/utils/decimal.py
Outdated
| """ | ||
| unscaled_value = decimal_to_unscaled(value) | ||
| applied_value = unscaled_value - (((unscaled_value % width) + width) % width) | ||
| return Decimal(f"{applied_value}e{value.as_tuple().exponent}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should create a decimal from the truncated unscaled value, rather than parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed it to use unscaled_to_decimal(applied_value, -value.as_tuple().exponent)
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for picking this up @jun-he Some small comments below 👍🏻
python/src/iceberg/transforms.py
Outdated
| self._width = width | ||
|
|
||
| @property | ||
| def width(self): |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
python/src/iceberg/transforms.py
Outdated
| raise ValueError(f"Cannot truncate value: {value}") | ||
|
|
||
| @_truncate_value.register(int) | ||
| def _(self, value): |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko I tried it before but mypy threw an error
an python/src/iceberg/transforms.py:332: error: Argument 1 has incompatible type "Callable[[TruncateTransform[S], int], int]"; expected "Callable[..., S]"
Wondering if there is a way to solve it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated it using singledispatch here and then use Any for value.
python/src/iceberg/transforms.py
Outdated
| raise ValueError(f"Cannot truncate type: {self._type} for value: {value}") | ||
|
|
||
| @_truncate_value.register(str) | ||
| def _(self, value): |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar error here that
python/src/iceberg/transforms.py:340: error: Argument 1 has incompatible type "Callable[[TruncateTransform[S], str], str]"; expected "Callable[..., S]"
python/src/iceberg/transforms.py
Outdated
| return value[0 : min(self._width, len(value))] | ||
|
|
||
| @_truncate_value.register(bytes) | ||
| def _(self, value): |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
python/src/iceberg/transforms.py
Outdated
| @_truncate_value.register(int) | ||
| def _(self, value): | ||
| """Truncate a given int value into a given width if feasible.""" | ||
| if type(self._type) in {IntegerType, LongType}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the validation in processing itself feels a bit weird to me, shouldn't we check this when initializing the transform?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The validation here is to catch the case that the caller passes an int into Non-IntegerType, e.g. StringType truncate transformer. But the initialization might not the value type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that we should do this check. We can assume that the type matches because of how we bind expressions. And this is called in a tight loop, so additional checks are going to cause the library to slow down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to not validating inside of the right loop.
I’d like to see general validation of the source_type (though I guess singledispatch takes care of that somewhat, it would be nice to see it a bit more explicitly but that’s just my opinion),
Also validating in the constructor or elsewhere that the width is greater than zero.
But this function will be called in a tight loop and it’s best to avoid expensive checks here.
python/src/iceberg/transforms.py
Outdated
| raise ValueError(f"Cannot truncate type: {self._type}") | ||
|
|
||
| @_truncate_value.register(Decimal) | ||
| def _(self, value): |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
145812c to
d663659
Compare
|
Hey @jun-he [I'm using singledispatch over singledispatchmethod]: iceberg/python/src/iceberg/avro/reader.py Lines 250 to 317 in 2f550cd
For the same reason as you mentioned (performance). Using singledispatch also allows you to set the types in the function signature, which is nice for static analysis. Maybe another reason to go for singledispatch over singledispatchmethod? :) |
python/src/iceberg/transforms.py
Outdated
| return self._type | ||
|
|
||
| def apply(self, value: Optional[S]) -> Optional[S]: | ||
| return self._truncate_value(value) if value is not None else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I'm just trying to popularize the Walrus operator:
| return self._truncate_value(value) if value is not None else None | |
| return truncated if (truncated := self._truncate_value(value)) else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just return self._truncate_value(value)? If that's None, then returning it will return None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I see why: the _truncate_value method doesn't use Optional. So it should only be called if value is not None. I think the original is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering the benefit to use Walrus operator in this case. It seems unnecessarily add a new variable truncated.
python/src/iceberg/transforms.py
Outdated
| @_truncate_value.register(str) | ||
| def _(self, value): | ||
| """Truncate a given string to a given width.""" | ||
| return value[0 : min(self._width, len(value))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears correct to me:
truncate("abc\u2603de") # => "abc\u2603d"We should make sure this or another multi-byte character test is in the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the tests [at]huaxingao added somewhat recently for parquet bloom filter has Chinese characters that I believe are specifically multi-byte for this very purpose.
python/src/iceberg/transforms.py
Outdated
| def _(self, value): | ||
| """Truncate a given binary bytes into a given width.""" | ||
| if isinstance(self._type, BinaryType): | ||
| return value[0 : min(self._width, len(value))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also correct:
truncate(bytes("abc\u2603de", "utf-8")) # => b'abc\xe2\x98'| Decimal: A truncated Decimal instance | ||
| """ | ||
| unscaled_value = decimal_to_unscaled(value) | ||
| applied_value = unscaled_value - (((unscaled_value % width) + width) % width) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This expression doesn't need to be as complex. The purpose of this is to handle negative numbers in languages where % will return a negative value. Python returns positive values:
-4 % 5 # => 1
((-4 % 5) + 5) % 5 # => 1Note that if -4 % 5 resulted in -4 because abs(-4) < 5 then we would need the more complex expression.
| @pytest.mark.parametrize( | ||
| "type_var,value,expected_human_str,expected", | ||
| [ | ||
| (BinaryType(), b"\x00\x01\x02\x03", "AAECAw==", b"\x00"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should include the test cases I posted in comments above to validate that a bytes is truncated by the number of bytes and str is truncated by the number of unicode code points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added both tests with a slight change (removing abc prefix).
|
Thanks, @jun-he! I left a thorough review. I think this is close. The implementations look correct, but there are a few things to fix. |
3c25c13 to
9919249
Compare
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor comment, but looks great @jun-he. Thanks!
| def _(self, _: IcebergType, value: int) -> str: | ||
| return datetime.to_human_timestamptz(value) | ||
| @singledispatch | ||
| def _human_string(value: Any, _type: IcebergType) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having two singledispatches, we could also turn everything into one where we match on the type first. This would simplify the logic a bit.
| def _human_string(value: Any, _type: IcebergType) -> str: | |
| def _human_string(_type: IcebergType, value: Any) -> str: |
rdblue
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I'm going to merge this to unblock the refactor and we can clean up the rest afterwards. Thanks, @jun-he!
To split the PR #3450, open a new PR for truncate transform here.