Python: Avro write #7873

maxdebayser · 2023-06-21T18:27:29Z

@Fokko

This PR addresses issue #7255 by adding code to:

- convert Iceberg schema classes to an Avro schema
- serialize Iceberg data classes to Avro files
- write Avro files

It also adds tests for the added classes and functions, as well as a direct validation against the fastavro library.

This commit also fixes some small bugs uncovered by the new tests.

Fokko

This is looking great @maxdebayser

python/pyiceberg/avro/encoder.py

python/pyiceberg/avro/resolver.py

python/pyiceberg/avro/writer.py

python/tests/avro/test_encoder.py

python/tests/avro/test_file.py

JonasJ-ap

Thank you for your great work!

python/pyiceberg/avro/file.py

Fokko

Sorry for the long wait, 0.4.0 was taking some time. This PR LGTM, thanks @maxdebayser for working on this!

rdblue · 2023-07-13T21:33:37Z

python/pyiceberg/avro/encoder.py

+
+        It stores the number of milliseconds from midnight, 00:00:00.000
+        """
+        self.write_int(int(time_object_to_micros(dt) / 1000))


I don't think this is needed. Iceberg doesn't allow writing millisecond precision timestamps.

rdblue · 2023-07-13T21:33:48Z

python/pyiceberg/avro/encoder.py

+        """A string is encoded as a long followed by that many bytes of UTF-8 encoded character data."""
+        self.write_bytes(s.encode("utf-8"))
+
+    def write_date_int(self, d: date) -> None:


Why not simply write_date?

rdblue · 2023-07-13T21:34:10Z

python/pyiceberg/avro/encoder.py

+
+        It stores the number of days from the unix epoch, 1 January 1970 (ISO calendar).
+        """
+        self.write_int(date_to_days(d))


Is this method needed? I thought our internal representation was already int and not datetime.date.

I think we can get rid of it, and make it part of the write tree
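For context, the days-from-epoch value discussed above can be computed with integer date arithmetic alone. This is a minimal sketch; `days_from_epoch` is a hypothetical name used for illustration, and the body of the PR's own `date_to_days` helper is not shown in this thread.

```python
from datetime import date

# The Unix epoch as a date; Iceberg dates count days from this point.
EPOCH_DATE = date(1970, 1, 1)


def days_from_epoch(d: date) -> int:
    """Return the number of days between the Unix epoch and d (hypothetical helper)."""
    return (d - EPOCH_DATE).days
```

If the write tree performs this conversion, the encoder only ever sees the resulting int.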

rdblue · 2023-07-13T21:34:37Z

python/pyiceberg/avro/encoder.py

+
+    def write_bytes_fixed(self, b: bytes) -> None:
+        """Writes fixed number of bytes."""
+        self.write(struct.pack(f"{len(b)}s", b))


self.write already accepts bytes. Why does this need to use struct.pack?
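A quick check of the point above: packing with the `"{n}s"` format, where `n` equals the length of the input, returns the same bytes, so the call adds nothing. The sketch below uses an in-memory `io.BytesIO` as a stand-in for the encoder's output stream, which is an assumption made for illustration.

```python
import io
import struct

payload = b"\x00\x01iceberg"

# struct.pack with "<len>s" copies the bytes unchanged when the lengths match.
packed = struct.pack(f"{len(payload)}s", payload)

# Writing the bytes directly produces the same output.
direct = io.BytesIO()
direct.write(payload)

via_struct = io.BytesIO()
via_struct.write(packed)

same_output = direct.getvalue() == via_struct.getvalue()
```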

rdblue · 2023-07-13T21:35:18Z

python/pyiceberg/avro/encoder.py

+            bits_to_write = packed_bits >> (8 * index)
+            self.write(bytearray([bits_to_write & 0xFF]))
+
+    def write_decimal_fixed(self, datum: decimal.Decimal, size: int) -> None:


@Fokko, if I remember correctly, we replaced these implementations with more native Python in the read path. We can probably do the same thing here for faster encoding and simpler code.

Yeah, here it is:

    unscaled_datum = int.from_bytes(data, byteorder="big", signed=True)
    return unscaled_to_decimal(unscaled_datum, scale)

Maybe there's an encoder equivalent to int.from_bytes that we can use to simplify this.

Thanks, that's much nicer indeed.
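The write-side counterpart to `int.from_bytes` is `int.to_bytes`. The following is a sketch of how the fixed-size decimal encoding could look with it; the helper names `decimal_to_unscaled`, `decimal_to_fixed_bytes`, and `fixed_bytes_to_decimal` are hypothetical and are not the names used in the PR.

```python
from decimal import Decimal


def decimal_to_unscaled(value: Decimal) -> int:
    """Return the unscaled integer of a decimal, e.g. -123.45 -> -12345."""
    sign, digits, _ = value.as_tuple()
    unscaled = int("".join(str(digit) for digit in digits))
    return -unscaled if sign else unscaled


def decimal_to_fixed_bytes(value: Decimal, size: int) -> bytes:
    """Encode the unscaled value as big-endian two's complement in `size` bytes."""
    return decimal_to_unscaled(value).to_bytes(size, byteorder="big", signed=True)


def fixed_bytes_to_decimal(data: bytes, scale: int) -> Decimal:
    """Inverse of decimal_to_fixed_bytes, mirroring the read-path snippet above."""
    unscaled = int.from_bytes(data, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)
```

`int.to_bytes` raises `OverflowError` when the value does not fit in `size` bytes, which replaces the manual bit shifting and also guards against silent truncation.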

rdblue · 2023-07-13T21:43:40Z

python/pyiceberg/utils/datetime.py


+def time_object_to_micros(t: time) -> int:
+    """Converts an datetime.time object to microseconds from midnight."""
+    return int(t.hour * 60 * 60 * 1e6 + t.minute * 60 * 1e6 + t.second * 1e6 + t.microsecond)


I think this should use the same logic as line 67 above. There are a couple of good things about that form:

It stays in int and doesn't introduce any floating point values. That avoids the cast and, more importantly, avoids any floating point math that may introduce errors.

It's easier to read and understand that it is correct.
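To illustrate the suggestion, here is the integer-only form applied to a `datetime.time` object. This is a sketch; the function name `time_object_to_micros_int` is made up for this example so that it does not clash with the functions under review.

```python
from datetime import time


def time_object_to_micros_int(t: time) -> int:
    """Microseconds from midnight, computed with integer arithmetic only."""
    # Nest the multiplications: hours -> minutes -> seconds -> microseconds.
    return (((t.hour * 60 + t.minute) * 60) + t.second) * 1_000_000 + t.microsecond


last_micro_of_day = time_object_to_micros_int(time(23, 59, 59, 999999))
```

Because every operand is an int, the result is an int without a cast, and no `1e6` float literal enters the computation.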

rdblue · 2023-07-13T21:44:13Z

python/pyiceberg/avro/encoder.py

+        """
+        self.write_int(time_object_to_micros(dt))
+
+    def write_timestamp_millis_long(self, dt: datetime) -> None:


Again, no need for the millis.

rdblue · 2023-07-13T21:49:28Z

python/pyiceberg/avro/encoder.py

+        """
+        self.write_int(int(datetime_to_micros(dt) / 1000))
+
+    def write_timestamp_micros_long(self, dt: datetime) -> None:


There is no check here for datetime's zone. I think we need to validate that there is no zone. We may also need a write_timestamptz method. And can we rename this to write_timestamp?

I've removed these methods. The encoder should only accept the physical types. I've added the check to the write tree.
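A sketch of what such a zone check could look like when it lives outside the encoder. The function names `timestamp_to_micros` and `timestamptz_to_micros` are assumptions for illustration; the check that was actually added to the write tree is not shown in this thread.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1)
EPOCH_TZ = datetime(1970, 1, 1, tzinfo=timezone.utc)


def _delta_to_micros(delta: timedelta) -> int:
    """Convert a timedelta to whole microseconds using integer arithmetic."""
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def timestamp_to_micros(dt: datetime) -> int:
    """Zone-less timestamp -> microseconds from epoch; rejects aware datetimes."""
    if dt.tzinfo is not None:
        raise ValueError(f"Expected a datetime without a zone, got: {dt!r}")
    return _delta_to_micros(dt - EPOCH)


def timestamptz_to_micros(dt: datetime) -> int:
    """Zoned timestamp -> microseconds from epoch in UTC; rejects naive datetimes."""
    if dt.tzinfo is None:
        raise ValueError(f"Expected a datetime with a zone, got: {dt!r}")
    return _delta_to_micros(dt - EPOCH_TZ)
```

With this split, the encoder only needs to write the resulting long, which matches the decision to keep it limited to physical types.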

rdblue · 2023-07-13T21:54:46Z

python/pyiceberg/avro/encoder.py

+from pyiceberg.utils.datetime import date_to_days, datetime_to_micros, time_object_to_micros
+
+
+class BinaryEncoder:


The decoder has a UUID method that is missing here. Might be good to add it so that we have a round-trip test.

Great one, I'll add it to the fastavro roundtrip test as well.

rdblue · 2023-07-13T23:32:36Z

python/pyiceberg/utils/datetime.py

    return (((t.hour * 60 + t.minute) * 60) + t.second) * 1_000_000 + t.microsecond


+def time_object_to_micros(t: time) -> int:


Looks like we did a poor job naming the time_to_micros function because the other similar ones are date_str_to_days so this would be time_str_to_micros.

It would be nice to fix this by adding both types to time_to_micros and adding a time_str_to_micros method.
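One possible shape for that naming split, sketched under the assumption that the string form is ISO 8601 and can be parsed with `time.fromisoformat`. The existing string-parsing logic in `pyiceberg/utils/datetime.py` is not shown in this thread and may differ.

```python
from datetime import time


def time_to_micros(t: time) -> int:
    """Convert a datetime.time object to microseconds from midnight."""
    return (((t.hour * 60 + t.minute) * 60) + t.second) * 1_000_000 + t.microsecond


def time_str_to_micros(time_str: str) -> int:
    """Parse an ISO 8601 time string, then delegate to time_to_micros."""
    return time_to_micros(time.fromisoformat(time_str))
```

This keeps the `*_str_to_*` naming consistent with `date_str_to_days` while leaving a single place for the arithmetic.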

rdblue · 2023-07-14T00:02:36Z

python/pyiceberg/utils/schema_conversion.py

+        return {"type": "array", "element-id": self.last_list_field_id, "items": element_result}
+
+    def map(self, map_type: MapType, key_result: AvroType, value_result: AvroType) -> AvroType:
+        if isinstance(key_result, StringType):


@Fokko, should we just use an Iceberg map in all cases? Why fall back to an Avro map?

If I read an Avro schema that has a list of records, that's harder for me to read than a native Avro map.

rdblue · 2023-07-14T00:06:20Z

python/pyiceberg/utils/schema_conversion.py

+            # Avro Maps does not support other keys than a String,
+            return {
+                "type": "map",
+                "values": value_result,


I think this needs to set key-id and value-id properties.
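A sketch of the two map representations being discussed, with the `key-id` and `value-id` properties set on the native Avro map. The function name `map_to_avro`, the `key_is_string` flag, and the `k{key_id}_v{value_id}` record name are assumptions for illustration. The array-of-records layout follows my reading of the Iceberg spec's Avro appendix and should be checked against it.

```python
from typing import Any, Dict, Union

AvroType = Union[str, Dict[str, Any]]


def map_to_avro(
    key_id: int, value_id: int, key_schema: AvroType, value_schema: AvroType, key_is_string: bool
) -> AvroType:
    """Build the Avro schema for an Iceberg map (hypothetical helper)."""
    if key_is_string:
        # Native Avro maps only allow string keys; keep the Iceberg field IDs as properties.
        return {"type": "map", "values": value_schema, "key-id": key_id, "value-id": value_id}
    # Other key types fall back to an array of key/value records tagged as a logical map.
    return {
        "type": "array",
        "logicalType": "map",
        "items": {
            "type": "record",
            "name": f"k{key_id}_v{value_id}",
            "fields": [
                {"name": "key", "type": key_schema, "field-id": key_id},
                {"name": "value", "type": value_schema, "field-id": value_id},
            ],
        },
    }
```

Without the two ID properties, a reader has no way to map the key and value of a native Avro map back to Iceberg field IDs.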

rdblue · 2023-07-14T00:07:24Z

python/pyiceberg/utils/schema_conversion.py

+
+    def visit_timestamptz(self, timestamptz_type: TimestamptzType) -> AvroType:
+        # Iceberg only supports micro's
+        return {"type": "long", "logicalType": "timestamp-micros"}


This needs to set adjust-to-utc to true to signal that the value is a timestamptz.

Great catch, this actually uncovered another bug. Thanks!
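For illustration, the two timestamp schemas differ only in the `adjust-to-utc` flag. This is a sketch with hypothetical function names, not the visitor methods from the PR.

```python
from typing import Any, Dict


def timestamp_avro_schema() -> Dict[str, Any]:
    """Avro schema for an Iceberg timestamp (no zone)."""
    return {"type": "long", "logicalType": "timestamp-micros", "adjust-to-utc": False}


def timestamptz_avro_schema() -> Dict[str, Any]:
    """Avro schema for an Iceberg timestamptz; the flag marks the value as UTC-adjusted."""
    return {"type": "long", "logicalType": "timestamp-micros", "adjust-to-utc": True}
```

Without the flag, both types serialize to the same schema and a reader cannot tell them apart.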

rdblue · 2023-07-14T00:08:43Z

python/pyiceberg/utils/schema_conversion.py

+        return "string"
+
+    def visit_uuid(self, uuid_type: UUIDType) -> AvroType:
+        return {"type": "string", "logicalType": "uuid"}


UUIDs are stored as 16-byte fixed, not strings. See https://github.com/apache/iceberg/blob/master/format/spec.md#appendix-a-format-specific-requirements
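A sketch of the 16-byte fixed representation using only the standard library. The schema `name` value and the helper names are assumptions made for this example; the spec linked above is the authority on the required schema.

```python
import uuid
from typing import Any, Dict

# Assumed shape of the fixed schema; the "name" value is illustrative only.
UUID_FIXED_SCHEMA: Dict[str, Any] = {"type": "fixed", "size": 16, "logicalType": "uuid", "name": "uuid_fixed"}


def uuid_to_fixed(value: uuid.UUID) -> bytes:
    """Return the 16 big-endian bytes of a UUID."""
    return value.bytes


def fixed_to_uuid(data: bytes) -> uuid.UUID:
    """Rebuild a UUID from its 16-byte form, rejecting any other length."""
    if len(data) != 16:
        raise ValueError(f"Expected 16 bytes, got {len(data)}")
    return uuid.UUID(bytes=data)


original = uuid.UUID("f79c3e09-677c-4bbd-a479-3f349cb785e7")
round_tripped = fixed_to_uuid(uuid_to_fixed(original))
```

A round trip like this is the kind of check that an encoder/decoder test pair would cover.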

rdblue · 2023-07-14T00:11:23Z

python/tests/avro/test_encoder.py

+    assert output.getbuffer() == struct.pack("??", True, False)
+
+
+def test_write_int() -> None:


I think this should also have a suite of tests that validates round-trip serialization using encoder and decoder. That would have caught the UUID issue because the decoder is implemented correctly.

rdblue · 2023-07-14T00:12:01Z

python/tests/avro/test_encoder.py

+    _5byte_input = 2510416930
+    _6byte_input = 734929016866
+    _7byte_input = 135081528772642
+    _8byte_input = 35124861473277986


Why are there no tests for negative values or odd numbers?
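A sketch of a round-trip test that includes negative and odd values, using a standalone zigzag varint codec as described in the Avro specification for int and long. The functions `encode_long` and `decode_long` are self-contained stand-ins written for this example and are not the PR's `BinaryEncoder` or `BinaryDecoder`.

```python
def encode_long(n: int) -> bytes:
    """Zigzag-encode a 64-bit signed integer, then emit it as a base-128 varint."""
    zigzag = (n << 1) ^ (n >> 63)
    out = bytearray()
    while zigzag & ~0x7F:
        out.append((zigzag & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def decode_long(data: bytes) -> int:
    """Inverse of encode_long: read the varint, then undo the zigzag mapping."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


# Mix of zero, odd, negative, and boundary values.
SAMPLES = [0, 1, -1, 3, -3, 63, -64, 64, -65, 2**31 - 1, -(2**31), 2**63 - 1, -(2**63)]
all_round_trip = all(decode_long(encode_long(value)) == value for value in SAMPLES)
```

Running the same set of samples through the real encoder and decoder would cover the sign handling that the positive, even inputs above do not exercise.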

Fokko and others added 4 commits June 19, 2023 13:52

Python: Avro write support

8c967ab

Add class to write Avro files and add PoC update_table api call

b0ab83a

Add tests for the avro readers and writers

cdaa19d

This commit also fixes some small bugs uncovered by the new tests

Merge remote-tracking branch 'iceberg/master' into avro_write

331379a

github-actions bot added the python label Jun 21, 2023

maxdebayser added 2 commits June 21, 2023 16:27

Appease pylint and pydocstyle

33e67a3

Appease pre-commit hooks

b797fe9

Fokko changed the title ~~Avro write~~ Python: Avro write Jun 22, 2023

Fokko mentioned this pull request Jun 22, 2023

Python: Avro write support #7491

Closed

Fokko added this to the PyIceberg 0.5.0 release milestone Jun 22, 2023

Fokko reviewed Jun 27, 2023

maxdebayser added 3 commits June 27, 2023 13:41

Merge remote-tracking branch 'iceberg/master' into avro_write

785ffc2

Address PR review comments

3a5efc5

Appease pre-commit hooks

bcd8406

JonasJ-ap reviewed Jul 3, 2023

python/pyiceberg/avro/file.py (outdated, resolved)

maxdebayser added 2 commits July 4, 2023 09:59

Merge remote-tracking branch 'iceberg/master' into avro_write

011d3c0

Add additional metadata to avro output file

fd3a7d0

Fokko approved these changes Jul 6, 2023

Fokko merged commit 46f57cc into apache:master Jul 6, 2023

Fokko mentioned this pull request Jul 6, 2023

Python: Avro write support #7255

Closed

JonasJ-ap mentioned this pull request Jul 8, 2023

Python: ManifestWriter and ManifestListWriter #8012

Closed

rdblue reviewed Jul 13, 2023

rdblue reviewed Jul 14, 2023

Fokko mentioned this pull request Jul 14, 2023

Python: Add more tests for the Avro writer #8067

Merged

HonahX mentioned this pull request Aug 3, 2023

Python: Restore Inferring Iceberg UUIDType from parquet files #8215

Merged

HonahX mentioned this pull request Sep 23, 2023

Python: ManifestWriter and ManifestListWriter #8622

Merged
