@maxdebayser

@Fokko

This PR addresses issue #7255 by adding code to:

  • convert Iceberg schema classes to an Avro schema
  • serialize Iceberg data classes to Avro files
  • write Avro files

It also adds tests for the added classes and functions, as well as a direct validation against the fastavro library.

@Fokko changed the title from "Avro write" to "Python: Avro write" on Jun 22, 2023
@Fokko added this to the PyIceberg 0.5.0 release milestone on Jun 22, 2023

@Fokko left a comment


This is looking great @maxdebayser


@JonasJ-ap left a comment


Thank you for your great work!


@Fokko left a comment


Sorry for the long wait, 0.4.0 was taking some time. This PR LGTM, thanks @maxdebayser for working on this!

@Fokko merged commit 46f57cc into apache:master on Jul 6, 2023
@Fokko mentioned this pull request on Jul 6, 2023
It stores the number of milliseconds from midnight, 00:00:00.000
"""
self.write_int(int(time_object_to_micros(dt) / 1000))

I don't think this is needed. Iceberg doesn't allow writing millisecond-precision timestamps.

"""A string is encoded as a long followed by that many bytes of UTF-8 encoded character data."""
self.write_bytes(s.encode("utf-8"))

def write_date_int(self, d: date) -> None:

Why not simply write_date?

It stores the number of days from the unix epoch, 1 January 1970 (ISO calendar).
"""
self.write_int(date_to_days(d))

Is this method needed? I thought our internal representation was already int and not datetime.date.


I think we can get rid of it and make it part of the write tree.


def write_bytes_fixed(self, b: bytes) -> None:
    """Writes fixed number of bytes."""
    self.write(struct.pack(f"{len(b)}s", b))

self.write already accepts bytes. Why does this need to use struct.pack?
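
To illustrate the point, here is a minimal standalone sketch (not the PR's code) showing that packing with an exact-length `"<n>s"` format returns the input bytes unchanged, so the value could be written directly. The `io.BytesIO` stream is only a stand-in for the encoder's output.

```python
import io
import struct

payload = b"\x00\x01\xfe\xff"

# With an exact-length "<n>s" format, struct.pack neither pads nor truncates,
# so the result is byte-for-byte identical to the input.
packed = struct.pack(f"{len(payload)}s", payload)
assert packed == payload

# Stand-in for the encoder's output stream (an assumption for this sketch).
output = io.BytesIO()
output.write(payload)  # writing the bytes directly gives the same result
assert output.getvalue() == packed
```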

bits_to_write = packed_bits >> (8 * index)
self.write(bytearray([bits_to_write & 0xFF]))

def write_decimal_fixed(self, datum: decimal.Decimal, size: int) -> None:

@Fokko, if I remember correctly, we replaced these implementations with more native Python in the read path. We can probably do the same thing here for faster encoding and simpler code.


Yeah, here it is:

        unscaled_datum = int.from_bytes(data, byteorder="big", signed=True)
        return unscaled_to_decimal(unscaled_datum, scale)

Maybe there's an encoder equivalent to int.from_bytes that we can use to simplify this.
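
For reference, the standard-library counterpart of `int.from_bytes` is `int.to_bytes`. The sketch below shows how a fixed-size decimal could be encoded with it and read back. The helper names (`decimal_to_unscaled`, `decimal_to_fixed_bytes`, `fixed_bytes_to_decimal`) are hypothetical and chosen for this example; they are not taken from the PR, and the sketch assumes finite decimal values.

```python
import decimal


def decimal_to_unscaled(value: decimal.Decimal) -> int:
    """Return the unscaled integer of a finite Decimal, e.g. Decimal("12.34") -> 1234."""
    sign, digits, _exponent = value.as_tuple()
    unscaled = int("".join(map(str, digits)))
    return -unscaled if sign else unscaled


def decimal_to_fixed_bytes(value: decimal.Decimal, size: int) -> bytes:
    """Encode the unscaled value as big-endian two's complement in `size` bytes."""
    return decimal_to_unscaled(value).to_bytes(size, byteorder="big", signed=True)


def fixed_bytes_to_decimal(data: bytes, scale: int) -> decimal.Decimal:
    """Inverse of decimal_to_fixed_bytes for a known scale."""
    unscaled = int.from_bytes(data, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Building from a tuple avoids any context-dependent rounding.
    return decimal.Decimal((sign, digits, -scale))


encoded = decimal_to_fixed_bytes(decimal.Decimal("-12.34"), size=4)
assert encoded == b"\xff\xff\xfb\x2e"
assert fixed_bytes_to_decimal(encoded, scale=2) == decimal.Decimal("-12.34")
```

`int.to_bytes` raises `OverflowError` when the value does not fit in `size` bytes, which may be a useful guard in an encoder.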


Thanks, that's much nicer indeed.


def time_object_to_micros(t: time) -> int:
    """Converts an datetime.time object to microseconds from midnight."""
    return int(t.hour * 60 * 60 * 1e6 + t.minute * 60 * 1e6 + t.second * 1e6 + t.microsecond)

I think this should use the same logic as line 67 above. There are a couple of good things about that form:

  1. It stays in int and doesn't introduce any floating point values. That avoids the cast and, more importantly, avoids any floating point math that may introduce errors.
  2. It's easier to read and understand that it is correct.
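
A small sketch comparing the two forms may help here. The function names are hypothetical and exist only for this comparison. For time-of-day values both forms agree, because the result stays far below 2**53; the integer form simply never raises the question, and the last assertion shows the kind of loss that float arithmetic can introduce for larger integers.

```python
from datetime import time


def time_to_micros_int(t: time) -> int:
    """Integer-only form: no floats, no cast."""
    return (((t.hour * 60 + t.minute) * 60) + t.second) * 1_000_000 + t.microsecond


def time_to_micros_float(t: time) -> int:
    """Float-based form: goes through 1e6 and needs a cast back to int."""
    return int(t.hour * 60 * 60 * 1e6 + t.minute * 60 * 1e6 + t.second * 1e6 + t.microsecond)


last_tick = time(23, 59, 59, 999_999)
assert time_to_micros_int(last_tick) == 86_399_999_999
assert time_to_micros_float(last_tick) == 86_399_999_999

# Floats cannot represent every integer above 2**53, so a float round trip
# is not safe in general, even though it happens to be exact for times of day.
big = 2**53 + 1
assert int(float(big)) != big
```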

"""
self.write_int(time_object_to_micros(dt))

def write_timestamp_millis_long(self, dt: datetime) -> None:

Again, no need for the millis.

"""
self.write_int(int(datetime_to_micros(dt) / 1000))

def write_timestamp_micros_long(self, dt: datetime) -> None:

There is no check here for datetime's zone. I think we need to validate that there is no zone. We may also need a write_timestamptz method. And can we rename this to write_timestamp?
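
One way the requested validation could look is sketched below. The helper names (`timestamp_to_micros`, `timestamptz_to_micros`) and the error messages are assumptions made for this example, not the PR's final code.

```python
from datetime import datetime, timedelta, timezone

EPOCH_TIMESTAMP = datetime(1970, 1, 1)
EPOCH_TIMESTAMPTZ = datetime(1970, 1, 1, tzinfo=timezone.utc)


def _delta_to_micros(delta: timedelta) -> int:
    # Integer-only arithmetic on the normalized timedelta fields.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def timestamp_to_micros(dt: datetime) -> int:
    """Microseconds since epoch for a zone-less datetime; rejects aware values."""
    if dt.tzinfo is not None:
        raise ValueError(f"Expected a datetime without a zone, got: {dt!r}")
    return _delta_to_micros(dt - EPOCH_TIMESTAMP)


def timestamptz_to_micros(dt: datetime) -> int:
    """Microseconds since epoch (UTC) for an aware datetime; rejects naive values."""
    if dt.tzinfo is None:
        raise ValueError(f"Expected a datetime with a zone, got: {dt!r}")
    return _delta_to_micros(dt - EPOCH_TIMESTAMPTZ)


assert timestamp_to_micros(datetime(1970, 1, 1, 0, 0, 1)) == 1_000_000
# 01:00 at UTC+1 is the epoch instant itself.
plus_one = timezone(timedelta(hours=1))
assert timestamptz_to_micros(datetime(1970, 1, 1, 1, 0, tzinfo=plus_one)) == 0
```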


I've removed these methods. The encoder should only accept the physical types. I've added the check to the write tree.

from pyiceberg.utils.datetime import date_to_days, datetime_to_micros, time_object_to_micros


class BinaryEncoder:

The decoder has a UUID method that is missing here. Might be good to add it so that we have a round-trip test.
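
A round-trip check of this kind could be sketched as follows. It assumes the 16-byte fixed representation that the Iceberg spec describes for UUIDs in Avro; `write_uuid` and `read_uuid` are hypothetical stand-ins, not the actual encoder and decoder methods.

```python
import io
import uuid


def write_uuid(output: io.BytesIO, value: uuid.UUID) -> None:
    """Write a UUID as its 16 raw big-endian bytes (assumed fixed(16) layout)."""
    output.write(value.bytes)


def read_uuid(stream: io.BytesIO) -> uuid.UUID:
    """Read 16 bytes and rebuild the UUID."""
    return uuid.UUID(bytes=stream.read(16))


original = uuid.UUID("12345678-1234-5678-1234-567812345678")
buffer = io.BytesIO()
write_uuid(buffer, original)
buffer.seek(0)
assert read_uuid(buffer) == original
```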


Great one, I'll add it to the fastavro roundtrip test as well.

return (((t.hour * 60 + t.minute) * 60) + t.second) * 1_000_000 + t.microsecond


def time_object_to_micros(t: time) -> int:

Looks like we did a poor job naming the time_to_micros function: the other similar ones are named like date_str_to_days, so this one would be time_str_to_micros.

It would be nice to fix this by adding both types to time_to_micros and adding a time_str_to_micros method.
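
The suggested naming could be sketched roughly like this. The signatures are an assumption based on the comment above, and ISO-8601 input for the string form is assumed as well.

```python
from datetime import time
from typing import Union


def time_str_to_micros(time_str: str) -> int:
    """Convert an ISO-8601 time string to microseconds from midnight."""
    return time_to_micros(time.fromisoformat(time_str))


def time_to_micros(t: Union[str, time]) -> int:
    """Accept either a datetime.time or an ISO-8601 string."""
    if isinstance(t, str):
        t = time.fromisoformat(t)
    return (((t.hour * 60 + t.minute) * 60) + t.second) * 1_000_000 + t.microsecond


assert time_to_micros(time(1, 0)) == 3_600_000_000
assert time_str_to_micros("00:00:01.000001") == 1_000_001
```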

    return {"type": "array", "element-id": self.last_list_field_id, "items": element_result}

def map(self, map_type: MapType, key_result: AvroType, value_result: AvroType) -> AvroType:
    if isinstance(key_result, StringType):

@Fokko, should we just use an Iceberg map in all cases? Why fall back to an Avro map?


If I were reading an Avro schema, a list of records would be harder for me to read than a native Avro map.

# Avro Maps does not support other keys than a String,
return {
    "type": "map",
    "values": value_result,

I think this needs to set key-id and value-id properties.
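
As a sketch of what that might look like for the string-keyed case, the helper below builds a native Avro map schema that carries both IDs. The function name and parameters are hypothetical; only the `key-id` and `value-id` property names come from the comment above.

```python
from typing import Any, Dict


def string_keyed_map_schema(key_id: int, value_id: int, value_schema: Any) -> Dict[str, Any]:
    """Build a native Avro map schema annotated with Iceberg field IDs."""
    return {
        "type": "map",
        "values": value_schema,
        "key-id": key_id,
        "value-id": value_id,
    }


schema = string_keyed_map_schema(key_id=5, value_id=6, value_schema="long")
assert schema == {"type": "map", "values": "long", "key-id": 5, "value-id": 6}
```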


def visit_timestamptz(self, timestamptz_type: TimestamptzType) -> AvroType:
    # Iceberg only supports micro's
    return {"type": "long", "logicalType": "timestamp-micros"}

This needs to set adjust-to-utc to true to signal that the value is a timestamptz.
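
A minimal sketch of the distinction being asked for follows. The helper name is hypothetical, and setting the flag to false for the plain timestamp case is an assumption made for symmetry.

```python
from typing import Any, Dict


def timestamp_micros_schema(adjust_to_utc: bool) -> Dict[str, Any]:
    """Avro schema for a microsecond timestamp; the flag marks timestamptz."""
    return {
        "type": "long",
        "logicalType": "timestamp-micros",
        "adjust-to-utc": adjust_to_utc,
    }


timestamptz_schema = timestamp_micros_schema(adjust_to_utc=True)
timestamp_schema = timestamp_micros_schema(adjust_to_utc=False)

# Without the flag the two schemas would be indistinguishable to a reader.
assert timestamptz_schema != timestamp_schema
```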


Great catch, this actually uncovered another bug. Thanks!

    return "string"

def visit_uuid(self, uuid_type: UUIDType) -> AvroType:
    return {"type": "string", "logicalType": "uuid"}

@rdblue rdblue Jul 14, 2023


assert output.getbuffer() == struct.pack("??", True, False)


def test_write_int() -> None:

I think this should also have a suite of tests that validates round-trip serialization using encoder and decoder. That would have caught the UUID issue because the decoder is implemented correctly.

_5byte_input = 2510416930
_6byte_input = 734929016866
_7byte_input = 135081528772642
_8byte_input = 35124861473277986

Why are there no tests for negative values or odd numbers?
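
A round-trip test that also covers negative and odd values could be sketched as below. `write_long` and `read_long` are standalone stand-ins written for this example, following Avro's zigzag variable-length integer encoding; they are not the PR's `BinaryEncoder` or the existing decoder.

```python
import io


def write_long(output: io.BytesIO, value: int) -> None:
    """Zigzag-encode a 64-bit signed int and write it as a variable-length integer."""
    zigzag = (value << 1) ^ (value >> 63)
    while zigzag & ~0x7F:
        output.write(bytes([(zigzag & 0x7F) | 0x80]))
        zigzag >>= 7
    output.write(bytes([zigzag]))


def read_long(stream: io.BytesIO) -> int:
    """Read a variable-length integer and undo the zigzag mapping."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def encode(value: int) -> bytes:
    buffer = io.BytesIO()
    write_long(buffer, value)
    return buffer.getvalue()


# Small known encodings, including negative and odd inputs.
assert encode(1) == b"\x02"
assert encode(-1) == b"\x01"
assert encode(-64) == b"\x7f"
assert encode(64) == b"\x80\x01"

# Round trip across signs, parities, and the 64-bit boundaries.
for value in [0, 1, -1, 3, -3, 63, -64, 64, 2510416931, -2510416930, 2**63 - 1, -(2**63)]:
    assert read_long(io.BytesIO(encode(value))) == value
```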

4 participants