Data: Fix Parquet and Avro defaults date/time representation #11811
Conversation
```diff
-      readPlan.add(Pair.of(pos, ValueReaders.constant(field.initialDefault())));
+      readPlan.add(
+          Pair.of(
+              pos, ValueReaders.constant(convert.apply(field.type(), field.initialDefault()))));
```
This raises a question: should we also convert the constants passed by idToConstant?
I think we should but not yet. The problem is that this conversion probably can't be done twice because of assumptions in the conversion (like casting strings to UTF8String in Spark) so it is worth a separate PR to refactor all of the conversions to be done in one place. That will also be different for each file format so it could get messy.
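To illustrate why running such a conversion twice is unsafe, here is a minimal sketch (not Iceberg code; the converter is a hypothetical stand-in for Spark's String-to-UTF8String step, which assumes its input is still in the source representation):

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class DoubleConvert {
  public static void main(String[] args) {
    // Hypothetical converter: like Spark's String -> UTF8String cast, it
    // assumes the value has not been converted yet.
    Function<Object, Object> convert = v -> ((String) v).getBytes(StandardCharsets.UTF_8);

    Object converted = convert.apply("default-value"); // fine: String -> byte[]

    boolean failed = false;
    try {
      convert.apply(converted); // byte[] is no longer a String
    } catch (ClassCastException e) {
      failed = true;
    }

    // Converting an already-converted constant blows up, which is why the
    // conversion needs to happen in exactly one place.
    System.out.println("second conversion failed: " + failed); // true
  }
}
```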
Let's do that in a separate PR indeed 👍
```java
assertEquals(
    field.type(),
    GenericDataUtil.internalToGeneric(field.type(), field.initialDefault()),
    actual.getField(field.name()));
```
This was the test bug. Rather than asserting that the value was passed back unchanged, it should have delegated to the assertEquals helper that checks the representation of the value.
The expected value (the default) also needs to be converted to generic, because the generic data model tests use the generic data model as the source representation.
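As a sketch of the representation mismatch (the day value here is hypothetical): an internal date default is an Integer day ordinal, while the generic reader materializes a java.time.LocalDate, so the raw default never equals the value read back:

```java
import java.time.LocalDate;

public class GenericCompare {
  public static void main(String[] args) {
    int internalDefault = 19000; // internal form: days since 1970-01-01
    LocalDate generic = LocalDate.ofEpochDay(internalDefault); // generic form

    // Asserting the raw default against the generic value always fails:
    System.out.println(Integer.valueOf(internalDefault).equals(generic)); // false

    // Converting the expected side first makes the comparison meaningful:
    System.out.println(LocalDate.ofEpochDay(internalDefault).equals(generic)); // true
  }
}
```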
Force-pushed from 9ad4789 to 3f90aa8
```diff
     List<Pair<Integer, ValueReader<?>>> readPlan =
-        ValueReaders.buildReadPlan(expected, record, fieldReaders, idToConstant);
+        ValueReaders.buildReadPlan(
+            expected, record, fieldReaders, idToConstant, GenericDataUtil::internalToGeneric);
```
This change is only needed for the generic data model and not for the internal data model because defaults are already using the correct classes for internal.
```java
  protected abstract void writeAndValidate(Schema schema) throws IOException;

  protected void writeAndValidate(Schema writeSchema, Schema expectedSchema) throws IOException {
```
To deduplicate code, I moved the new default test cases into the DataTest and AvroDataTest classes and disabled them with assumptions where defaults aren't needed or supported.
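A minimal sketch of the pattern (class and method names are hypothetical, not the actual Iceberg test classes): a shared base test hosts the default-value cases, and subclasses for formats without default support opt out via an overridable capability check, analogous to a JUnit assumption aborting the test:

```java
public class AssumeSketch {
  // Stand-in for a shared base test class like DataTest/AvroDataTest.
  abstract static class BaseDataTest {
    protected boolean supportsDefaultValues() {
      return true; // most implementations run the default-value cases
    }

    final String runDefaultValueTest() {
      if (!supportsDefaultValues()) {
        return "skipped"; // analogous to JUnit's assumeTrue(...) aborting the test
      }
      return "ran"; // the shared default-value assertions would go here
    }
  }

  static class FormatWithDefaults extends BaseDataTest {}

  static class FormatWithoutDefaults extends BaseDataTest {
    @Override
    protected boolean supportsDefaultValues() {
      return false; // this format doesn't support defaults yet
    }
  }

  public static void main(String[] args) {
    System.out.println(new FormatWithDefaults().runDefaultValueTest()); // ran
    System.out.println(new FormatWithoutDefaults().runDefaultValueTest()); // skipped
  }
}
```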
```java
  }

  @Override
  protected void writeAndValidate(Schema writeSchema, Schema expectedSchema) throws IOException {
```
Adding default tests to the Avro read path as well.
Fokko left a comment:
A few nits around unboxing, but looks good 👍 Thanks for cleaning up the tests!
```java
    switch (type.typeId()) {
      case DATE:
        return DateTimeUtil.dateFromDays((Integer) value);
```
Suggested change:
```diff
-        return DateTimeUtil.dateFromDays((Integer) value);
+        return DateTimeUtil.dateFromDays((int) value);
```
Value is an Object, so I cast to the object type rather than the primitive. You're right that, given the method signature, we can cast directly and unbox at the same time, but I typically don't do that for style reasons. If you cast to the primitive, there's a risk of introducing an unnecessary failure when the value is null and the receiving function accepts an Integer that can be null. I think casting to the object version is better to avoid those problems.
My reasoning is similar, but the other way around. As you mentioned, the method accepts an int:
https://github.com/apache/iceberg/blob/91a1505d09cebcd1d088ac53cd42732c343883de/api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java#L49-L51
Casting to an Integer gives me the impression that a null is acceptable here, while it is not. I'm not very strong on this one; let's pull in @nastra as a tiebreaker :)
I agree with Ryan's reasoning. Casting to int will throw an NPE on a null value, because the cast first goes through Integer and then throws when Integer.intValue() is called. That being said, it's safer to always cast to the object type instead of the primitive whenever the value can be null.
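The behavior difference is easy to demonstrate with a minimal sketch: casting a null Object reference to Integer succeeds, while casting it to int compiles to an implicit Integer.intValue() call and throws:

```java
public class CastNull {
  public static void main(String[] args) {
    Object value = null;

    Integer boxed = (Integer) value; // a reference cast: null passes through
    System.out.println(boxed == null); // true

    boolean npe = false;
    try {
      int primitive = (int) value; // compiles to ((Integer) value).intValue(): NPE
      System.out.println(primitive);
    } catch (NullPointerException e) {
      npe = true;
    }
    System.out.println(npe); // true
  }
}
```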
```java
      case DATE:
        return DateTimeUtil.dateFromDays((Integer) value);
      case TIME:
        return DateTimeUtil.timeFromMicros((Long) value);
```
Suggested change:
```diff
-        return DateTimeUtil.timeFromMicros((Long) value);
+        return DateTimeUtil.timeFromMicros((long) value);
```
```java
        return DateTimeUtil.timeFromMicros((Long) value);
      case TIMESTAMP:
        if (((Types.TimestampType) type).shouldAdjustToUTC()) {
          return DateTimeUtil.timestamptzFromMicros((Long) value);
```
Suggested change:
```diff
-          return DateTimeUtil.timestamptzFromMicros((Long) value);
+          return DateTimeUtil.timestamptzFromMicros((long) value);
```
```java
        if (((Types.TimestampType) type).shouldAdjustToUTC()) {
          return DateTimeUtil.timestamptzFromMicros((Long) value);
        } else {
          return DateTimeUtil.timestampFromMicros((Long) value);
```
Suggested change:
```diff
-          return DateTimeUtil.timestampFromMicros((Long) value);
+          return DateTimeUtil.timestampFromMicros((long) value);
```
```java
    switch (type.typeId()) {
      case DECIMAL:
        return Decimal.apply((BigDecimal) value);
      case UUID:
```
Thanks for reviewing, @Fokko!
This updates the Avro and Parquet readers for Spark and Iceberg's generic data model to produce the expected date/time classes. It also adds a test for primitive type defaults.
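The date/time conversions at the heart of the fix follow the pattern below. This is a sketch using java.time arithmetic directly; the specific day/microsecond values are illustrative, and Iceberg's DateTimeUtil helpers wrap equivalent logic:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

public class DefaultConversions {
  private static final OffsetDateTime EPOCH = Instant.EPOCH.atOffset(ZoneOffset.UTC);

  public static void main(String[] args) {
    // date: stored internally as days since 1970-01-01
    LocalDate date = LocalDate.ofEpochDay(1);
    System.out.println(date); // 1970-01-02

    // time: stored internally as microseconds from midnight
    LocalTime time = LocalTime.ofNanoOfDay(1_000_000L * 1000); // 1s in micros
    System.out.println(time); // 00:00:01

    // timestamp without zone: microseconds from epoch -> LocalDateTime
    LocalDateTime ts = EPOCH.plus(1_000_000L, ChronoUnit.MICROS).toLocalDateTime();
    System.out.println(ts); // 1970-01-01T00:00:01

    // timestamptz: microseconds from epoch -> OffsetDateTime at UTC
    OffsetDateTime tstz = EPOCH.plus(1_000_000L, ChronoUnit.MICROS);
    System.out.println(tstz); // 1970-01-01T00:00:01Z
  }
}
```

This is what the readers now produce for default values instead of handing back the raw Integer/Long internal representation.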