Skip to content

Conversation

@geruh
Copy link
Contributor

@geruh geruh commented Nov 27, 2025

Summary

This PR aligns the ContentFile partition field JSON serialization in ContentFileParser with the REST spec, that specifies partition data as an ordered array of values.

The REST spec uses an array of primitives to represent partitions, and that was discussed here in #9717 (comment) when the spec was designed. However, while testing it looks like ContentFileParser hasn't had the updates to reflect this yet and was using SingleValueParser, which produces a field ID map like {"1000": 1} instead of an array [1].

Testing

Serialize a partitioned data file:

> ContentFileParser.toJson(dataFile, spec)

{"spec-id":0,"content":"DATA","file-path":"data.parquet", "file-format":"PARQUET","partition":[1],"file-size-in-bytes":10, "record-count":1,"sort-order-id":0....}

Unpartitioned table

{"partition":[]}

non array:

> ContentFileParser.partitionFromJson({"partition": {"1000": 1}}, specs)

  java.lang.IllegalArgumentException: Invalid partition data for content file: non-array...

cc: @singhpk234 @amogh-jahagirdar

@github-actions github-actions bot added the core label Nov 27, 2025
Copy link
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this @geruh !

@singhpk234
Copy link
Contributor

singhpk234 commented Nov 28, 2025

IMHO i think we should get this in 1.10.1 if the 1.10.1 is not frozen

cc @huaxingao

Copy link
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

// Handle partition struct object format, which serializes by field ID and skips
// null partition values
Preconditions.checkState(
partitionNode.size() <= fields.size(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might wanna assert now that what is missing was null ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it’s hard to draw a line there, only because the "legacy" object form omits null values. This makes it hard to know the difference between something that was intentionally Null and something that we forgot to serialize. for example, PartitionData("1000": "a", "1001":null) would serialize to {"1000": "a"}.

Copy link
Contributor

@huaxingao huaxingao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@huaxingao huaxingao added this to the Iceberg 1.10.1 milestone Nov 29, 2025
@huaxingao
Copy link
Contributor

@geruh Could you please address #14702 (comment) and then we can merge.

@huaxingao huaxingao merged commit 9896e8c into apache:main Dec 2, 2025
44 checks passed
@huaxingao
Copy link
Contributor

Thanks @geruh for the PR! Thanks @singhpk234 for the review!

@huaxingao
Copy link
Contributor

@geruh could you cherry-pick the change to 1.10.x?

List<Types.NestedField> fields = partitionType.fields();
for (int pos = 0; pos < fields.size(); ++pos) {
Types.NestedField field = fields.get(pos);
Object partitionValue = partitionData.get(pos, Object.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We can use field.type().javaClass() here, instead of `Object.class).

PartitionData partitionData = new PartitionData(partitionType);

if (partitionNode.isArray()) {
Preconditions.checkArgument(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a comment like this

In 1.11 and after, partition data is serialized as an array with just partition field values.

partitionData.set(pos, partitionValue);
}
} else if (partitionNode.isObject()) {
// Handle partition struct object format, which serializes by field ID and skips
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe update the comment to clarify when the serialization format changed. e.g.

In 1.10 and before, partition data is serialized as a struct with partition field ID and partition field value. Null partition field values are skipped.

@stevenzwu
Copy link
Contributor

stevenzwu commented Dec 3, 2025

IMHO i think we should get this in 1.10.1 if the 1.10.1 is not frozen

cc @huaxingao

@huaxingao We shouldn't include this in the 1.10.1 patch release. It change the serialization for the file scan task which is checkpointed in Flink state.

@geruh we should also add a note for the 1.11.0 release. When a Flink job upgrade the Iceberg version to 1.11.0, it shouldn't roll back to 1.10 or lower due to this serialization change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants