-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: Align ContentFile partition JSON with REST spec #14702
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
singhpk234
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this @geruh !
|
IMHO i think we should get this in 1.10.1 if the 1.10.1 is not frozen cc @huaxingao |
singhpk234
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM !
| // Handle partition struct object format, which serializes by field ID and skips | ||
| // null partition values | ||
| Preconditions.checkState( | ||
| partitionNode.size() <= fields.size(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might wanna assert now that what is missing was null ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it’s hard to draw a line there, only because the "legacy" object form omits null values. This makes it hard to know the difference between something that was intentionally Null and something that we forgot to serialize. for example, PartitionData("1000": "a", "1001":null) would serialize to {"1000": "a"}.
huaxingao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
@geruh Could you please address #14702 (comment) and then we can merge. |
|
Thanks @geruh for the PR! Thanks @singhpk234 for the review! |
|
@geruh could you cherry-pick the change to 1.10.x? |
| List<Types.NestedField> fields = partitionType.fields(); | ||
| for (int pos = 0; pos < fields.size(); ++pos) { | ||
| Types.NestedField field = fields.get(pos); | ||
| Object partitionValue = partitionData.get(pos, Object.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: We can use field.type().javaClass() here, instead of `Object.class).
| PartitionData partitionData = new PartitionData(partitionType); | ||
|
|
||
| if (partitionNode.isArray()) { | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add a comment like this
In 1.11 and after, partition data is serialized as an array with just partition field values.
| partitionData.set(pos, partitionValue); | ||
| } | ||
| } else if (partitionNode.isObject()) { | ||
| // Handle partition struct object format, which serializes by field ID and skips |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe update the comment to clarify when the serialization format changed. e.g.
In 1.10 and before, partition data is serialized as a struct with partition field ID and partition field value. Null partition field values are skipped.
@huaxingao We shouldn't include this in the 1.10.1 patch release. It change the serialization for the file scan task which is checkpointed in Flink state. @geruh we should also add a note for the 1.11.0 release. When a Flink job upgrade the Iceberg version to 1.11.0, it shouldn't roll back to 1.10 or lower due to this serialization change. |
Summary
This PR aligns the
ContentFilepartition field JSON serialization inContentFileParserwith the REST spec, that specifies partition data as an ordered array of values.The REST spec uses an array of primitives to represent partitions, and that was discussed here in #9717 (comment) when the spec was designed. However, while testing it looks like
ContentFileParserhasn't had the updates to reflect this yet and was usingSingleValueParser, which produces a field ID map like{"1000": 1}instead of an array[1].Testing
Serialize a partitioned data file:
Unpartitioned table
non array:
cc: @singhpk234 @amogh-jahagirdar