-
Notifications
You must be signed in to change notification settings - Fork 3k
Adding specId for partitions metadata table #4516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| private int fileCount; | ||
|
|
||
| Partition(StructLike key) { | ||
| this.specId = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering one thing , didn't test it myself.
If we have two partitions with different spec id but different key, it will show up with a random one. Could this case happen (ie, adding and removing a field in partition spec should give a new spec id but can still produce the same partition value?)? If so, we may need to add spec-id to key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you meant "different spec id but same key"?
Normally if you remove a field from partition spec and then re-add it, you will end up with a former spec and Iceberg will not create a new spec, just to point to the older one. In some cases though, if you re-add the column along with some other columns (forming the new spec together), then you could end up with the problem described here: #3411 (comment). That's something we have to solve too, but in another PR, as the scope of that actually covers other metadata tables too that use Partitioning.partitionType()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, yea you are right, seems Iceberg re-uses the spec id if we revert the spec to a former one.
| Types.NestedField.required(1, "partition", table.spec().partitionType()), | ||
| Types.NestedField.required(2, "record_count", Types.LongType.get()), | ||
| Types.NestedField.required(3, "file_count", Types.IntegerType.get()) | ||
| Types.NestedField.required(1, "spec_id", Types.IntegerType.get()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is fine to add a new field, but you should not change the IDs of the other fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, in this case spec_id will be appended to the schema then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm I thought it means, we just need to keep the id of the other fields, but could append in any position? ex,
Types.NestedField.required(4, "spec_id", Types.IntegerType.get()),
Types.NestedField.required(1, "partition", table.spec().partitionType()),
Types.NestedField.required(2, "record_count", Types.LongType.get()),
Types.NestedField.required(3, "file_count", Types.IntegerType.get())
Though I think its not a big deal, and we can put it in the end.
44cf239 to
db6e8a0
Compare
szehon-ho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks ok to me.
|
Thanks a lot for the review @szehon-ho and @rdblue |
|
I tried to test these cases, while this fixes the PartitionsTable schema, some of the partition entries may have problems (as current spec is still used in a few other places). I made #4560 pr as a follow up (@szlta sorry I was not sure if you were working on going to work on this or something else). |
(cherry picked from commit b3a27c6)
As per #4292, PartitionsTable should call Partitioning.partitionType() to gather the partition struct in its schema. Also the result should show specId too, as the partition key might not be considered a unique key on its own without specId in V2 format. (E.g a partition column removed then re-added along with another column..)
This is just the initial change so that I can see what tests would need fixing and to check with community if the new schema is okay like this:
I think in other meta tables the spec_id precedes the partition key in the output, so I prepended the existing schema here with it. I'm not sure if this counts as a backward-incompatible change though. What do you think @aokolnychyi ?