Python: Add more tests for the Avro writer #8067

Fokko · 2023-07-14T10:54:31Z

Based on the feedback of Ryan on #7873! Thanks, this actually also uncovered some bugs in the read path.

python/tests/avro/test_encoder.py

python/pyiceberg/avro/encoder.py

rdblue · 2023-07-31T17:46:36Z

python/pyiceberg/avro/reader.py

 class DateReader(Reader):
-    def read(self, decoder: BinaryDecoder) -> int:
-        return decoder.read_int()
+    def read(self, decoder: BinaryDecoder) -> date:


I don't think this should produce a date. Our internal representation for date/time objects are:

int days from epoch for date

int (long) micros from epoch for timestamp and timestamptz

int micros from midnight for time

We do not want to use native date/time representations internally because they require a lot of extra logic to compare and work with in other ways.

Yes, I reverted this, and also fixed this for the datetime, that would actually break. I've also added some integration tests to cover this.

rdblue · 2023-07-31T17:47:08Z

python/pyiceberg/avro/writer.py

 class DateWriter(Writer):
    def write(self, encoder: BinaryEncoder, val: Any) -> None:
-        encoder.write_date_int(val)
+        encoder.write_int(date_to_days(val))


Here as well, the encoders and decoders should not use datetime objects.

This is correct, right? The encoders only work with physical types. The write_date_int accepted a date, and now we do the conversion in the writer itself.

It's fine to have a writer that work with date, but right now we need one that works with int because that's our internal representation. If someone passes a filter that is d < DATE '2023-08-01' for example, we'll receive that as something like this: LessThan(Ref("d"), IntLiteral(1234)). We want to compare integers rather than dates because that's faster and more reliable. When we read/write Avro, we want to go to that internal representation.

python/pyiceberg/avro/writer.py

python/pyiceberg/utils/datetime.py

rdblue · 2023-07-31T20:15:05Z

python/pyiceberg/utils/schema_conversion.py

 LOGICAL_FIELD_TYPE_MAPPING: Dict[Tuple[str, str], PrimitiveType] = {
    ("date", "int"): DateType(),
-    ("time-millis", "int"): TimeType(),
+    ("time-micros", "long"): TimeType(),


We also want to be able to read time-millis right? We have readers for it in Java where we just multiple by 1_000 when we get the value.

This isn't part of the Iceberg spec, I'd rather leave it out unless you have strong concerns here.

I'm fine either way. Eventually, we'll need to read files that were written for Hive and imported into an Iceberg table, so we are generally more permissive with reads. Since this is mostly about Iceberg metadata right now, we don't need to worry about it.

python/pyiceberg/utils/schema_conversion.py

python/tests/avro/test_encoder.py

…into fd-add-testssss

python/dev/provision.py

rdblue · 2023-08-02T19:15:17Z

python/pyiceberg/avro/reader.py

-        return decoder.read_decimal_from_bytes(self.precision, self.scale)
+        data = decoder.read(decoder.read_int())
+        unscaled_datum = int.from_bytes(data, byteorder="big", signed=True)
+        return unscaled_to_decimal(unscaled_datum, self.scale)


Why move this here? The encoder still has decimal logic. I don't mind either way whether decimal is handled by the encoder/decoder or in the reader/writer, but it seems like we should be consistent.

Also, this reader assumes that the decimal is stored as variable-length binary and not fixed (because it is reading the length). The spec requires that decimals are written in Avro as fixed, length minBytesRequired(P): https://iceberg.apache.org/spec/#avro

Why move this here? The encoder still has decimal logic. I don't mind either way whether decimal is handled by the encoder/decoder or in the reader/writer, but it seems like we should be consistent.

I've moved it to the decimal.py, and created a function bytes_to_decimal, since we already have decimal_to_bytes.

Also, this reader assumes that the decimal is stored as variable-length binary and not fixed (because it is reading the length). The spec requires that decimals are written in Avro as fixed, length minBytesRequired(P): https://iceberg.apache.org/spec/#avro

😱 Ah, this was always wrong, thanks for catching this. Let me add a test there

rdblue · 2023-08-06T21:17:06Z

python/pyiceberg/avro/reader.py

-        data = decoder.read(decoder.read_int())
-        unscaled_datum = int.from_bytes(data, byteorder="big", signed=True)
-        return unscaled_to_decimal(unscaled_datum, self.scale)
+        data = decoder.read(decimal_required_bytes(self.precision))


decimal_required_bytes(self.precision) should be done in the constructor and reused, right? Or does a frozen dataclass make that hard?

The frozen class makes it a bit awkward indeed, but did it anyway

rdblue · 2023-08-06T21:17:55Z

python/pyiceberg/avro/writer.py


    def write(self, encoder: BinaryEncoder, val: Any) -> None:
-        return encoder.write_decimal_bytes(val)
+        return encoder.write(decimal_to_bytes(val, byte_length=decimal_required_bytes(self.precision)))


Like the comment above, we don't need to calculate the size every time since this is in a tight loop.

rdblue · 2023-08-06T21:35:05Z

python/pyiceberg/utils/decimal.py

+
+
+@lru_cache
+def decimal_required_bytes(precision: int) -> int:


I think I prefer the way Java does this because it would be a couple of list comprehensions rather than computing the max precision for each byte length for each call.

MAX_PRECISION = tuple( math.floor(math.log10(math.fabs(math.pow(2, 8 * l - 1) - 1))) for l in range(24) ) REQUIRED_LENGTH = tuple( next(l for l in range(24) if p <= MAX_PRECISION[l]) for p in range(40) ) def required_bytes(precision: int) -> int: if precision <= 0 or precision >= len(REQUIRED_LENGTH): raise ValueError(...) return REQUIRED_LENGTH[precision]

It is cached, and likely that you would call it with the same argument many time, therefore the @lru_cache

I also like the tuples very much, I'll go with those

rdblue · 2023-08-06T21:40:04Z

python/tests/utils/test_decimal.py

+    assert decimal_required_bytes(precision=8) == 4
+    assert decimal_required_bytes(precision=10) == 5
+    assert decimal_required_bytes(precision=32) == 14
+    assert decimal_required_bytes(precision=40) == 17


We should technically only go to 38 because that's the max precision for 16 bytes. I'm not sure why the Java code went to 40. That's one more entry than needed in the REQUIRED_BYTES array.

rdblue

Overall looks good. Using @lru_cache vs storing the length on the reader/writer is minor, as is the way the required length is calculated.

Fokko · 2023-08-07T19:28:00Z

Thanks again @rdblue for the comprehensive review, appreciate it! 🙏🏻

rustyconover · 2023-08-11T02:32:51Z

Hi Friends,

Tonight I'm trying to rebase my Avro reader branch and this PR really wasn't easy to follow.

The PR change description didn't really describe what was decided to be changed without reading all of the comments from the review.

In the future can you please split things like this up with clearer logic written in the change message especially decisions around architecture (such as don't use native date/datetime types). It took me a while to understand why tests were removed.

I'd like to see the commits/PRs say things like:

"Removing use of native datetime types in the Avro reader path, removing relevant tests."

Talking about fixing bugs in the read path just didn't give the justifications why the changes were made.

Fokko · 2023-08-11T06:23:04Z

Sorry @rustyconover for the confusion here. The PR grew a bit bigger than originally intended, but I can see how it can be confusing. The decoder/encoded should only handle primitive types, and therefore I removed them, but it shouldn't have happened under an ambiguous PR like this one.

Python: Add more tests for the Avro writer

8cc9578

github-actions bot added the python label Jul 14, 2023

Fokko added 2 commits July 14, 2023 14:00

Fix the tests

2c56348

WIP

8a8727c

Fokko commented Jul 19, 2023

View reviewed changes

python/tests/avro/test_encoder.py Outdated Show resolved Hide resolved

Merge branch 'master' of github.com:apache/iceberg into fd-add-testssss

6cfaa43

Fokko force-pushed the fd-add-testssss branch from 8ca0894 to 6cfaa43 Compare July 31, 2023 13:50

Update

19aa56d