Skip to content

Conversation

@jun-he
Copy link
Collaborator

@jun-he jun-he commented Jun 14, 2022

To split the PR #3450, open a new PR for truncate transform here.

@jun-he jun-he changed the title add truncate transform [Python] add truncate transform Jun 14, 2022
@jun-he
Copy link
Collaborator Author

jun-he commented Jun 14, 2022

Found at python/cpython#85160 that singledispatchmethod significantly slower than singledispatch. As the community is working to fix it, wondering if we should just keep singledispatchmethod for simplicity.

@jun-he jun-he requested review from kbendick and rdblue June 14, 2022 14:55
"""
unscaled_value = decimal_to_unscaled(value)
applied_value = unscaled_value - (((unscaled_value % width) + width) % width)
return Decimal(f"{applied_value}e{value.as_tuple().exponent}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should create a decimal from the truncated unscaled value, rather than parsing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed it to use unscaled_to_decimal(applied_value, -value.as_tuple().exponent)

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up @jun-he Some small comments below 👍🏻

self._width = width

@property
def width(self):

This comment was marked as resolved.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

raise ValueError(f"Cannot truncate value: {value}")

@_truncate_value.register(int)
def _(self, value):

This comment was marked as outdated.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko I tried it before but mypy threw an error

an python/src/iceberg/transforms.py:332: error: Argument 1 has incompatible type "Callable[[TruncateTransform[S], int], int]"; expected "Callable[..., S]"

Wondering if there is a way to solve it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated it using singledispatch here and then use Any for value.

raise ValueError(f"Cannot truncate type: {self._type} for value: {value}")

@_truncate_value.register(str)
def _(self, value):

This comment was marked as outdated.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar error here that

python/src/iceberg/transforms.py:340: error: Argument 1 has incompatible type "Callable[[TruncateTransform[S], str], str]"; expected "Callable[..., S]"

return value[0 : min(self._width, len(value))]

@_truncate_value.register(bytes)
def _(self, value):

This comment was marked as outdated.

@_truncate_value.register(int)
def _(self, value):
"""Truncate a given int value into a given width if feasible."""
if type(self._type) in {IntegerType, LongType}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the validation in processing itself feels a bit weird to me, shouldn't we check this when initializing the transform?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation here is to catch the case that the caller passes an int into Non-IntegerType, e.g. StringType truncate transformer. But the initialization might not the value type.

Copy link
Contributor

@rdblue rdblue Jun 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we should do this check. We can assume that the type matches because of how we bind expressions. And this is called in a tight loop, so additional checks are going to cause the library to slow down.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to not validating inside of the right loop.

I’d like to see general validation of the source_type (though I guess singledispatch takes care of that somewhat, it would be nice to see it a bit more explicitly but that’s just my opinion),

Also validating in the constructor or elsewhere that the width is greater than zero.

But this function will be called in a tight loop and it’s best to avoid expensive checks here.

raise ValueError(f"Cannot truncate type: {self._type}")

@_truncate_value.register(Decimal)
def _(self, value):

This comment was marked as outdated.

@jun-he jun-he force-pushed the jun/add-truncate-transform branch 2 times, most recently from 145812c to d663659 Compare June 27, 2022 04:54
@jun-he jun-he requested a review from rdblue June 27, 2022 04:58
@Fokko
Copy link
Contributor

Fokko commented Jun 27, 2022

Hey @jun-he [I'm using singledispatch over singledispatchmethod]:

@singledispatch
def primitive_reader(primitive: PrimitiveType) -> Reader:
raise ValueError(f"Unknown type: {primitive}")
@primitive_reader.register(FixedType)
def _(primitive: FixedType) -> Reader:
return FixedReader(primitive.length)
@primitive_reader.register(DecimalType)
def _(primitive: DecimalType) -> Reader:
return DecimalReader(primitive.precision, primitive.scale)
@primitive_reader.register(BooleanType)
def _(_: BooleanType) -> Reader:
return BooleanReader()
@primitive_reader.register(IntegerType)
def _(_: IntegerType) -> Reader:
return IntegerReader()
@primitive_reader.register(LongType)
def _(_: LongType) -> Reader:
return LongReader()
@primitive_reader.register(FloatType)
def _(_: FloatType) -> Reader:
return FloatReader()
@primitive_reader.register(DoubleType)
def _(_: DoubleType) -> Reader:
return DoubleReader()
@primitive_reader.register(DateType)
def _(_: DateType) -> Reader:
return DateReader()
@primitive_reader.register(TimeType)
def _(_: TimeType) -> Reader:
return TimeReader()
@primitive_reader.register(TimestampType)
def _(_: TimestampType) -> Reader:
return TimestampReader()
@primitive_reader.register(TimestamptzType)
def _(_: TimestamptzType) -> Reader:
return TimestamptzReader()
@primitive_reader.register(StringType)
def _(_: StringType) -> Reader:
return StringReader()
@primitive_reader.register(BinaryType)
def _(_: StringType) -> Reader:
return BinaryReader()

For the same reason as you mentioned (performance). Using singledispatch also allows you to set the types in the function signature, which is nice for static analysis. Maybe another reason to go for singledispatch over singledispatchmethod? :)

return self._type

def apply(self, value: Optional[S]) -> Optional[S]:
return self._truncate_value(value) if value is not None else None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I'm just trying to popularize the Walrus operator:

Suggested change
return self._truncate_value(value) if value is not None else None
return truncated if (truncated := self._truncate_value(value)) else None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just return self._truncate_value(value)? If that's None, then returning it will return None.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I see why: the _truncate_value method doesn't use Optional. So it should only be called if value is not None. I think the original is correct.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering the benefit to use Walrus operator in this case. It seems unnecessarily add a new variable truncated.

@_truncate_value.register(str)
def _(self, value):
"""Truncate a given string to a given width."""
return value[0 : min(self._width, len(value))]
Copy link
Contributor

@rdblue rdblue Jun 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears correct to me:

truncate("abc\u2603de") # => "abc\u2603d"

We should make sure this or another multi-byte character test is in the tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the tests [at]huaxingao added somewhat recently for parquet bloom filter has Chinese characters that I believe are specifically multi-byte for this very purpose.

def _(self, value):
"""Truncate a given binary bytes into a given width."""
if isinstance(self._type, BinaryType):
return value[0 : min(self._width, len(value))]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also correct:

truncate(bytes("abc\u2603de", "utf-8")) # => b'abc\xe2\x98'

Decimal: A truncated Decimal instance
"""
unscaled_value = decimal_to_unscaled(value)
applied_value = unscaled_value - (((unscaled_value % width) + width) % width)
Copy link
Contributor

@rdblue rdblue Jun 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This expression doesn't need to be as complex. The purpose of this is to handle negative numbers in languages where % will return a negative value. Python returns positive values:

-4 % 5             # => 1
((-4 % 5) + 5) % 5 # => 1

Note that if -4 % 5 resulted in -4 because abs(-4) < 5 then we would need the more complex expression.

@pytest.mark.parametrize(
"type_var,value,expected_human_str,expected",
[
(BinaryType(), b"\x00\x01\x02\x03", "AAECAw==", b"\x00"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should include the test cases I posted in comments above to validate that a bytes is truncated by the number of bytes and str is truncated by the number of unicode code points.

Copy link
Collaborator Author

@jun-he jun-he Jun 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added both tests with a slight change (removing abc prefix).

@rdblue rdblue changed the title [Python] add truncate transform Python: Add truncate transform Jun 27, 2022
@rdblue
Copy link
Contributor

rdblue commented Jun 27, 2022

Thanks, @jun-he! I left a thorough review. I think this is close. The implementations look correct, but there are a few things to fix.

@jun-he jun-he force-pushed the jun/add-truncate-transform branch from 3c25c13 to 9919249 Compare June 30, 2022 06:59
@jun-he jun-he requested a review from rdblue June 30, 2022 07:03
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment, but looks great @jun-he. Thanks!

def _(self, _: IcebergType, value: int) -> str:
return datetime.to_human_timestamptz(value)
@singledispatch
def _human_string(value: Any, _type: IcebergType) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having two singledispatches, we could also turn everything into one where we match on the type first. This would simplify the logic a bit.

Suggested change
def _human_string(value: Any, _type: IcebergType) -> str:
def _human_string(_type: IcebergType, value: Any) -> str:

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I'm going to merge this to unblock the refactor and we can clean up the rest afterwards. Thanks, @jun-he!

@rdblue rdblue merged commit f72442f into apache:master Jun 30, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants