Python: Add truncate transform #5030

jun-he · 2022-06-14T07:05:55Z

To split PR #3450, I opened a new PR for the truncate transform here.

jun-he · 2022-06-14T14:55:36Z

I found in python/cpython#85160 that singledispatchmethod is significantly slower than singledispatch. Since the community is working on a fix, I'm wondering whether we should just keep singledispatchmethod for simplicity.

rdblue · 2022-06-14T15:01:27Z

python/src/iceberg/utils/decimal.py

+    """
+    unscaled_value = decimal_to_unscaled(value)
+    applied_value = unscaled_value - (((unscaled_value % width) + width) % width)
+    return Decimal(f"{applied_value}e{value.as_tuple().exponent}")


This should create a decimal from the truncated unscaled value, rather than parsing.

Changed it to use unscaled_to_decimal(applied_value, -value.as_tuple().exponent).
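A minimal sketch of the difference, assuming stand-in versions of decimal_to_unscaled and unscaled_to_decimal (the real helpers live in python/src/iceberg/utils/decimal.py and may differ in detail). It builds the result from the truncated unscaled integer and the original scale instead of parsing a formatted string:

```python
from decimal import Decimal


def decimal_to_unscaled(value: Decimal) -> int:
    # Stand-in helper (assumption): drop the exponent, keep sign and digits.
    sign, digits, _ = value.as_tuple()
    return int(Decimal((sign, digits, 0)))


def unscaled_to_decimal(unscaled: int, scale: int) -> Decimal:
    # Stand-in helper (assumption): reattach the scale as a negative exponent.
    sign, digits, _ = Decimal(unscaled).as_tuple()
    return Decimal((sign, digits, -scale))


def truncate_decimal(value: Decimal, width: int) -> Decimal:
    # Truncate the unscaled integer, then rebuild the Decimal with the same scale.
    unscaled = decimal_to_unscaled(value)
    applied = unscaled - (unscaled % width)
    return unscaled_to_decimal(applied, -value.as_tuple().exponent)
```

With a width of 50, Decimal("10.65") becomes Decimal("10.50"), and the scale of the input is preserved.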

Fokko

Thanks for picking this up @jun-he Some small comments below 👍🏻

jun-he · 2022-06-27T02:45:58Z

python/src/iceberg/transforms.py

+        self._width = width
+
+    @property
+    def width(self):


jun-he · 2022-06-27T04:01:54Z

python/src/iceberg/transforms.py

+        raise ValueError(f"Cannot truncate value: {value}")
+
+    @_truncate_value.register(int)
+    def _(self, value):


@Fokko I tried it before, but mypy threw an error:

python/src/iceberg/transforms.py:332: error: Argument 1 has incompatible type "Callable[[TruncateTransform[S], int], int]"; expected "Callable[..., S]"

Is there a way to solve it?

Updated it to use singledispatch here, with Any as the type of value.

jun-he · 2022-06-27T04:02:46Z

python/src/iceberg/transforms.py

+            raise ValueError(f"Cannot truncate type: {self._type} for value: {value}")
+
+    @_truncate_value.register(str)
+    def _(self, value):


Similar error here:

python/src/iceberg/transforms.py:340: error: Argument 1 has incompatible type "Callable[[TruncateTransform[S], str], str]"; expected "Callable[..., S]"

python/src/iceberg/transforms.py

+        return value[0 : min(self._width, len(value))]
+
+    @_truncate_value.register(bytes)
+    def _(self, value):


Fokko · 2022-06-17T08:09:47Z

python/src/iceberg/transforms.py

+    @_truncate_value.register(int)
+    def _(self, value):
+        """Truncate a given int value into a given width if feasible."""
+        if type(self._type) in {IntegerType, LongType}:


Having the validation in processing itself feels a bit weird to me, shouldn't we check this when initializing the transform?

The validation here is to catch the case where the caller passes an int to a transform for a non-integer type, e.g. a StringType truncate transform. But at initialization the transform might not know the value type.

I don't think that we should do this check. We can assume that the type matches because of how we bind expressions. And this is called in a tight loop, so additional checks are going to cause the library to slow down.

+1 to not validating inside of the tight loop.

I’d like to see general validation of the source_type (I guess singledispatch takes care of that somewhat, but it would be nice to see it more explicitly; that’s just my opinion), as well as validation in the constructor or elsewhere that the width is greater than zero.

But this function will be called in a tight loop and it’s best to avoid expensive checks here.
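A rough sketch of that split, using a hypothetical TruncateSketch class rather than the transform from this PR: the width is checked once in the constructor, and the per-value path carries no checks beyond the None guard.

```python
from typing import Optional


class TruncateSketch:
    """Hypothetical stand-in for the transform; not the class from this PR."""

    def __init__(self, width: int) -> None:
        # Validate once, when the transform is created.
        if width <= 0:
            raise ValueError(f"Invalid truncate width: {width} (must be > 0)")
        self._width = width

    def apply(self, value: Optional[int]) -> Optional[int]:
        # Hot path: no type or width checks, only the arithmetic.
        return value - (value % self._width) if value is not None else None
```

Whether the source type should also be checked in the constructor is left open here, as it is in the discussion above.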

python/src/iceberg/transforms.py

+            raise ValueError(f"Cannot truncate type: {self._type}")
+
+    @_truncate_value.register(Decimal)
+    def _(self, value):


Fokko · 2022-06-27T20:42:58Z

Hey @jun-he, I'm using singledispatch over singledispatchmethod:

iceberg/python/src/iceberg/avro/reader.py

Lines 250 to 317 in 2f550cd

    
    @singledispatch
    def primitive_reader(primitive: PrimitiveType) -> Reader:
        raise ValueError(f"Unknown type: {primitive}")

    @primitive_reader.register(FixedType)
    def _(primitive: FixedType) -> Reader:
        return FixedReader(primitive.length)

    @primitive_reader.register(DecimalType)
    def _(primitive: DecimalType) -> Reader:
        return DecimalReader(primitive.precision, primitive.scale)

    @primitive_reader.register(BooleanType)
    def _(_: BooleanType) -> Reader:
        return BooleanReader()

    @primitive_reader.register(IntegerType)
    def _(_: IntegerType) -> Reader:
        return IntegerReader()

    @primitive_reader.register(LongType)
    def _(_: LongType) -> Reader:
        return LongReader()

    @primitive_reader.register(FloatType)
    def _(_: FloatType) -> Reader:
        return FloatReader()

    @primitive_reader.register(DoubleType)
    def _(_: DoubleType) -> Reader:
        return DoubleReader()

    @primitive_reader.register(DateType)
    def _(_: DateType) -> Reader:
        return DateReader()

    @primitive_reader.register(TimeType)
    def _(_: TimeType) -> Reader:
        return TimeReader()

    @primitive_reader.register(TimestampType)
    def _(_: TimestampType) -> Reader:
        return TimestampReader()

    @primitive_reader.register(TimestamptzType)
    def _(_: TimestamptzType) -> Reader:
        return TimestamptzReader()

    @primitive_reader.register(StringType)
    def _(_: StringType) -> Reader:
        return StringReader()

    @primitive_reader.register(BinaryType)
    def _(_: StringType) -> Reader:
        return BinaryReader()

For the same reason as you mentioned (performance). Using singledispatch also allows you to set the types in the function signature, which is nice for static analysis. Maybe another reason to go for singledispatch over singledispatchmethod? :)

Fokko · 2022-06-27T20:48:33Z

python/src/iceberg/transforms.py

+        return self._type
+
+    def apply(self, value: Optional[S]) -> Optional[S]:
+        return self._truncate_value(value) if value is not None else None


Nit: I'm just trying to popularize the Walrus operator:

Suggested change

return self._truncate_value(value) if value is not None else None

return truncated if (truncated := self._truncate_value(value)) else None

Why not just return self._truncate_value(value)? If that's None, then returning it will return None.

Actually, I see why: the _truncate_value method doesn't use Optional. So it should only be called if value is not None. I think the original is correct.

I'm wondering what the benefit of the walrus operator is in this case. It seems to add a new variable, truncated, unnecessarily.
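One further difference between the two forms, shown as a small sketch with a hypothetical truncate_int helper and a fixed width of 10: the walrus version tests the truthiness of the result, so a legitimate falsy result such as 0 is turned into None, and a None input is still passed to the truncation function.

```python
def truncate_int(value: int, width: int = 10) -> int:
    # Hypothetical helper standing in for _truncate_value.
    return value - (value % width)


def apply_original(value):
    # Guards on the input: truncation only runs for non-None values.
    return truncate_int(value) if value is not None else None


def apply_walrus(value):
    # Guards on the result: any falsy result (0, "", b"") becomes None.
    return truncated if (truncated := truncate_int(value)) else None
```

Here apply_original(5) returns 0, while apply_walrus(5) returns None, and apply_walrus(None) raises a TypeError.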

rdblue · 2022-06-27T23:22:35Z

python/src/iceberg/transforms.py

+    @_truncate_value.register(str)
+    def _(self, value):
+        """Truncate a given string to a given width."""
+        return value[0 : min(self._width, len(value))]


This appears correct to me:

truncate("abc\u2603de") # => "abc\u2603d"

We should make sure this or another multi-byte character test is in the tests.

Some of the tests @huaxingao added fairly recently for the Parquet bloom filter have Chinese characters that I believe are multi-byte for this very purpose.

rdblue · 2022-06-27T23:24:51Z

python/src/iceberg/transforms.py

+    def _(self, value):
+        """Truncate a given binary bytes into a given width."""
+        if isinstance(self._type, BinaryType):
+            return value[0 : min(self._width, len(value))]


This is also correct:

truncate(bytes("abc\u2603de", "utf-8")) # => b'abc\xe2\x98'
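The two examples can be checked side by side. This sketch assumes a width of 5 (implied by the results quoted above) and uses plain slicing helpers in place of the transform:

```python
def truncate_str(value: str, width: int) -> str:
    # str slicing counts Unicode code points.
    return value[:width]


def truncate_bytes(value: bytes, width: int) -> bytes:
    # bytes slicing counts raw bytes, so it can cut a character in half.
    return value[:width]


text = "abc\u2603de"  # the snowman is 1 code point but 3 bytes in UTF-8
as_str = truncate_str(text, 5)
as_bytes = truncate_bytes(text.encode("utf-8"), 5)
```

The truncated bytes end in the middle of the snowman's UTF-8 encoding, so they no longer decode as valid UTF-8. That is expected for a binary column.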

rdblue · 2022-06-27T23:29:32Z

python/src/iceberg/utils/decimal.py

+        Decimal: A truncated Decimal instance
+    """
+    unscaled_value = decimal_to_unscaled(value)
+    applied_value = unscaled_value - (((unscaled_value % width) + width) % width)


This expression doesn't need to be this complex. The extra terms handle negative numbers in languages where % can return a negative value. In Python, the result of % takes the sign of the divisor, so it is never negative for a positive width:

-4 % 5 # => 1
((-4 % 5) + 5) % 5 # => 1

Note that if -4 % 5 resulted in -4 because abs(-4) < 5 then we would need the more complex expression.
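A quick check of that claim, as a sketch with hypothetical function names: for a positive width the simple and the portable expression agree on every integer, while math.fmod shows the C-style remainder that the longer form guards against.

```python
import math


def truncate_simple(value: int, width: int) -> int:
    # Sufficient in Python: % already yields a result in [0, width).
    return value - (value % width)


def truncate_portable(value: int, width: int) -> int:
    # Form needed where % keeps the sign of the dividend (e.g. Java, C).
    return value - (((value % width) + width) % width)


# C-style remainder keeps the sign of the dividend.
c_style_remainder = math.fmod(-4, 5)

# The two forms agree for every value and positive width tried here.
forms_agree = all(
    truncate_simple(v, w) == truncate_portable(v, w)
    for v in range(-100, 101)
    for w in range(1, 21)
)
```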

rdblue · 2022-06-27T23:39:36Z

python/tests/test_transforms.py

+@pytest.mark.parametrize(
+    "type_var,value,expected_human_str,expected",
+    [
+        (BinaryType(), b"\x00\x01\x02\x03", "AAECAw==", b"\x00"),


This should include the test cases I posted in the comments above to validate that bytes are truncated by the number of bytes and str is truncated by the number of Unicode code points.

I added both tests with a slight change (removing abc prefix).

rdblue · 2022-06-27T23:40:21Z

Thanks, @jun-he! I left a thorough review. I think this is close. The implementations look correct, but there are a few things to fix.

Fokko

One minor comment, but looks great @jun-he. Thanks!

Fokko · 2022-06-30T07:33:42Z

python/src/iceberg/transforms.py

-    def _(self, _: IcebergType, value: int) -> str:
-        return datetime.to_human_timestamptz(value)
+@singledispatch
+def _human_string(value: Any, _type: IcebergType) -> str:


Instead of having two singledispatches, we could also turn everything into one where we match on the type first. This would simplify the logic a bit.

Suggested change

def _human_string(value: Any, _type: IcebergType) -> str:

def _human_string(_type: IcebergType, value: Any) -> str:

rdblue

Looks good! I'm going to merge this to unblock the refactor and we can clean up the rest afterwards. Thanks, @jun-he!

jun-he changed the title ~~add truncate transform~~ [Python] add truncate transform Jun 14, 2022

github-actions bot added the python label Jun 14, 2022

jun-he requested review from kbendick and rdblue June 14, 2022 14:55

rdblue reviewed Jun 14, 2022

Fokko reviewed Jun 17, 2022

jun-he force-pushed the jun/add-truncate-transform branch 2 times, most recently from 145812c to d663659 Compare June 27, 2022 04:54

jun-he requested a review from rdblue June 27, 2022 04:58

Fokko reviewed Jun 27, 2022

rdblue reviewed Jun 27, 2022

rdblue changed the title ~~[Python] add truncate transform~~ Python: Add truncate transform Jun 27, 2022

jun-he added 3 commits June 29, 2022 23:33

add truncate transform

c8eed22

address the comments

0e53c22

address the comments

9919249

jun-he force-pushed the jun/add-truncate-transform branch from 3c25c13 to 9919249 Compare June 30, 2022 06:59

jun-he requested a review from rdblue June 30, 2022 07:03

Fokko approved these changes Jun 30, 2022

rdblue approved these changes Jun 30, 2022

rdblue merged commit f72442f into apache:master Jun 30, 2022

namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022

Python: Add truncate transform (apache#5030)

9c47065

namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022

Python: Add truncate transform (apache#5030)

17ed760
