Conversation

@Fokko (Contributor) commented Jul 14, 2023

Based on Ryan's feedback on #7873! Thanks, this actually also uncovered some bugs in the read path.

cc @maxdebayser

@Fokko Fokko force-pushed the fd-add-testssss branch from 8ca0894 to 6cfaa43 Compare July 31, 2023 13:50
```diff
 class DateReader(Reader):
-    def read(self, decoder: BinaryDecoder) -> date:
+    def read(self, decoder: BinaryDecoder) -> int:
+        return decoder.read_int()
```
Contributor:

I don't think this should produce a date. Our internal representations for date/time values are:

  • int days from epoch for date
  • int (long) micros from epoch for timestamp and timestamptz
  • int micros from midnight for time

We do not want to use native date/time representations internally because they require a lot of extra logic to compare and work with in other ways.
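As a minimal sketch of that convention (the helper names `date_to_days` and `days_to_date` below are illustrative, not necessarily the project's actual API):

```python
from datetime import date, timedelta

EPOCH_DATE = date(1970, 1, 1)  # reference point for days-from-epoch


def date_to_days(d: date) -> int:
    """Native date -> internal int representation (days from epoch)."""
    return (d - EPOCH_DATE).days


def days_to_date(days: int) -> date:
    """Internal int representation -> native date, e.g. at API boundaries."""
    return EPOCH_DATE + timedelta(days=days)
```

The native objects appear only at the edges; everything in between compares plain ints.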

Contributor Author:

Yes, I reverted this, and also fixed it for the datetime case, which would actually break. I've also added some integration tests to cover this.

```diff
 class DateWriter(Writer):
     def write(self, encoder: BinaryEncoder, val: Any) -> None:
-        encoder.write_date_int(val)
+        encoder.write_int(date_to_days(val))
```
Contributor:

Here as well, the encoders and decoders should not use datetime objects.

Contributor Author:

This is correct, right? The encoders only work with physical types. The write_date_int accepted a date, and now we do the conversion in the writer itself.

Contributor:

It's fine to have a writer that works with date, but right now we need one that works with int because that's our internal representation. If someone passes a filter like d < DATE '2023-08-01', for example, we'll receive it as something like LessThan(Ref("d"), IntLiteral(1234)). We want to compare integers rather than dates because that's faster and more reliable. When we read/write Avro, we want to go to that internal representation.
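For illustration, here is how such a date literal collapses to the int form before comparison; the `row_value` below is a made-up example, not taken from the PR:

```python
from datetime import date

EPOCH = date(1970, 1, 1)

# The literal DATE '2023-08-01' bound as days from epoch:
days = (date(2023, 8, 1) - EPOCH).days  # 19570

# A hypothetical value already in the internal int form, as read from Avro:
row_value = 19_500
matches = row_value < days  # plain int comparison, no datetime machinery
```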

```python
LOGICAL_FIELD_TYPE_MAPPING: Dict[Tuple[str, str], PrimitiveType] = {
    ("date", "int"): DateType(),
    ("time-millis", "int"): TimeType(),
    ("time-micros", "long"): TimeType(),
```
Contributor:

We also want to be able to read time-millis, right? We have readers for it in Java where we just multiply by 1_000 when we get the value.
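A sketch of that Java-style widening reader; both `TimeMillisReader` and the stub decoder here are illustrative, not the project's real API:

```python
class FakeDecoder:
    """Stub standing in for a binary decoder, for illustration only."""

    def __init__(self, value: int) -> None:
        self.value = value

    def read_int(self) -> int:
        return self.value


class TimeMillisReader:
    """Hypothetical reader that widens Avro time-millis to internal micros."""

    def read(self, decoder: FakeDecoder) -> int:
        # time-millis is stored as an Avro int; multiply by 1_000 for micros
        return decoder.read_int() * 1_000
```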

Contributor Author (@Fokko) commented Aug 1, 2023:

This isn't part of the Iceberg spec, I'd rather leave it out unless you have strong concerns here.

Contributor:

I'm fine either way. Eventually, we'll need to read files that were written for Hive and imported into an Iceberg table, so we are generally more permissive with reads. Since this is mostly about Iceberg metadata right now, we don't need to worry about it.

```diff
-        return decoder.read_decimal_from_bytes(self.precision, self.scale)
+        data = decoder.read(decoder.read_int())
+        unscaled_datum = int.from_bytes(data, byteorder="big", signed=True)
+        return unscaled_to_decimal(unscaled_datum, self.scale)
```
Contributor:

Why move this here? The encoder still has decimal logic. I don't mind either way whether decimal is handled by the encoder/decoder or in the reader/writer, but it seems like we should be consistent.

Also, this reader assumes that the decimal is stored as variable-length binary and not fixed (because it is reading the length). The spec requires that decimals are written in Avro as fixed, length minBytesRequired(P): https://iceberg.apache.org/spec/#avro

Contributor Author:

> Why move this here? The encoder still has decimal logic. I don't mind either way whether decimal is handled by the encoder/decoder or in the reader/writer, but it seems like we should be consistent.

I've moved it to the decimal.py, and created a function bytes_to_decimal, since we already have decimal_to_bytes.

> Also, this reader assumes that the decimal is stored as variable-length binary and not fixed (because it is reading the length). The spec requires that decimals are written in Avro as fixed, length minBytesRequired(P): https://iceberg.apache.org/spec/#avro

😱 Ah, this was always wrong; thanks for catching it. Let me add a test there.
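A sketch of what such a bytes_to_decimal helper could look like (the real function in decimal.py may differ in signature; this is just the two's-complement unscaled-bytes idea):

```python
from decimal import Decimal


def bytes_to_decimal(data: bytes, scale: int) -> Decimal:
    """Rebuild a Decimal from big-endian two's-complement unscaled bytes.

    Works for both fixed- and variable-length payloads, since leading sign
    bytes do not change the decoded integer.
    """
    unscaled = int.from_bytes(data, byteorder="big", signed=True)
    # Shift the unscaled integer right by `scale` decimal digits
    return Decimal(unscaled).scaleb(-scale)
```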

```diff
-        data = decoder.read(decoder.read_int())
+        data = decoder.read(decimal_required_bytes(self.precision))
         unscaled_datum = int.from_bytes(data, byteorder="big", signed=True)
         return unscaled_to_decimal(unscaled_datum, self.scale)
```
Contributor:

decimal_required_bytes(self.precision) should be done in the constructor and reused, right? Or does a frozen dataclass make that hard?

Contributor Author:

The frozen class makes it a bit awkward indeed, but I did it anyway.
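For reference, one way to precompute a derived value on a frozen dataclass is object.__setattr__ inside __post_init__; the class and helper names below are illustrative, not the PR's actual reader:

```python
import math
from dataclasses import dataclass, field


def required_bytes_for(precision: int) -> int:
    """Hypothetical stand-in for decimal_required_bytes."""
    return next(
        n for n in range(1, 24)
        if precision <= math.floor(math.log10(2 ** (8 * n - 1) - 1))
    )


@dataclass(frozen=True)
class FixedDecimalReader:
    """Illustrative reader that caches the byte length once at construction."""

    precision: int
    scale: int
    _length: int = field(init=False)

    def __post_init__(self) -> None:
        # frozen=True blocks normal attribute assignment, so bypass it once
        object.__setattr__(self, "_length", required_bytes_for(self.precision))
```

Each read can then use self._length directly instead of recomputing the size in a tight loop.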


```diff
     def write(self, encoder: BinaryEncoder, val: Any) -> None:
-        return encoder.write_decimal_bytes(val)
+        return encoder.write(decimal_to_bytes(val, byte_length=decimal_required_bytes(self.precision)))
```
Contributor:

Like the comment above, we don't need to calculate the size every time since this is in a tight loop.



```python
@lru_cache
def decimal_required_bytes(precision: int) -> int:
```
Contributor:

I think I prefer the way Java does this because it would be a couple of list comprehensions rather than computing the max precision for each byte length for each call.

```python
MAX_PRECISION = tuple(
    math.floor(math.log10(math.fabs(math.pow(2, 8 * l - 1) - 1))) for l in range(24)
)
REQUIRED_LENGTH = tuple(
    next(l for l in range(24) if p <= MAX_PRECISION[l]) for p in range(40)
)


def required_bytes(precision: int) -> int:
    if precision <= 0 or precision >= len(REQUIRED_LENGTH):
        raise ValueError(...)
    return REQUIRED_LENGTH[precision]
```

Contributor Author:

It is cached, and it's likely that you would call it with the same argument many times, hence the @lru_cache.

Contributor Author:

I also like the tuples very much; I'll go with those.
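A runnable version of that tuple-based approach might look like this (following the suggestion above; the merged code may differ in naming and bounds):

```python
import math

# Maximum decimal precision representable in l bytes (index 0 is a sentinel)
MAX_PRECISION = tuple(
    math.floor(math.log10(math.fabs(math.pow(2, 8 * l - 1) - 1))) for l in range(24)
)
# Minimum byte length required to hold each precision p
REQUIRED_LENGTH = tuple(
    next(l for l in range(24) if p <= MAX_PRECISION[l]) for p in range(40)
)


def decimal_required_bytes(precision: int) -> int:
    """Return the fixed byte length needed for a decimal of this precision."""
    if precision <= 0 or precision >= len(REQUIRED_LENGTH):
        raise ValueError(f"Unsupported precision: {precision}")
    return REQUIRED_LENGTH[precision]
```

Note that with range(40) the largest accepted precision is 39; precision 0 and anything larger raise.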

```python
assert decimal_required_bytes(precision=8) == 4
assert decimal_required_bytes(precision=10) == 5
assert decimal_required_bytes(precision=32) == 14
assert decimal_required_bytes(precision=40) == 17
```
Contributor:

We should technically only go to 38 because that's the max precision for 16 bytes. I'm not sure why the Java code went to 40. That's one more entry than needed in the REQUIRED_BYTES array.

Contributor @rdblue left a comment:

Overall looks good. Using @lru_cache vs storing the length on the reader/writer is minor, as is the way the required length is calculated.

@Fokko Fokko merged commit 427d23b into apache:master Aug 7, 2023
@Fokko Fokko deleted the fd-add-testssss branch August 7, 2023 19:27
@Fokko (Contributor Author) commented Aug 7, 2023:

Thanks again @rdblue for the comprehensive review, appreciate it! 🙏🏻

@rustyconover (Contributor):

Hi Friends,

Tonight I'm trying to rebase my Avro reader branch, and this PR really wasn't easy to follow.

The PR description didn't really describe what was changed; you had to read all of the review comments to find out.

In the future, can you please split things like this up, with clearer reasoning written in the change message, especially for decisions around architecture (such as not using native date/datetime types)? It took me a while to understand why tests were removed.

I'd like to see the commits/PRs say things like:

"Removing use of native datetime types in the Avro reader path, removing relevant tests."

Talking about fixing bugs in the read path just didn't give the justification for why the changes were made.

@Fokko (Contributor Author) commented Aug 11, 2023:

Sorry @rustyconover for the confusion here. The PR grew a bit bigger than originally intended, and I can see how that is confusing. The decoder/encoder should only handle primitive types, and therefore I removed those code paths, but that shouldn't have happened under an ambiguously described PR like this one.
