-
Notifications
You must be signed in to change notification settings - Fork 2.9k
add lastAssignedFieldId to table metadata #2089
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
yyanyy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the change! LGTM.
One question: do you intend to add lastAssignedFieldId to the JSON file in another PR, or you don't think it's necessary (so that this change is mostly to make lastAssignedFieldId available in table metadata class)? I thought we want to persist it in JSON file, and if that's the case, I'll echo my question from this PR:
Is this just to ensure implementation correctness per spec? As it seems that from the current implementation getting this from PartitionSpec::lastAssignedFieldId should be good enough, as there's no table operation to actually remove a partition field from specs?
|
@yyanyy Thanks for the review. It seems table metadata is a better place to hold this info than partition spec updater as it is an info for a table, which might be used by other code paths in the future. |
|
The last assigned partition field ID does need to be added to the spec and the table metadata JSON. It is possible to reuse IDs, but I don't think it is a good idea to do it. There is also no need to reuse them. I think that reuse could only cause problems. |
|
I just thought of an example where ID reuse could be bad. The Nessie catalog allows branching at the table metadata level by keeping different metadata file references. A Nessie branch could remove data for one spec and reuse the IDs, which would make compatibility with another branch using the old spec confusing. |
|
@rdblue Thanks for the comment. |
yyanyy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just one comment
| } | ||
|
|
||
| int lastAssignedFieldId = Optional.ofNullable(JsonUtil.getIntOrNull(LAST_ASSIGNED_FIELD_ID, node)) | ||
| .orElseGet(() -> specs.stream().mapToInt(PartitionSpec::lastAssignedFieldId).max().orElse(999)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want it to be a required field in v2 and fail validation if it's missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense as the V2 is not released yet. I will update it accordingly.
|
Will update the doc after the change is merged. |
|
@rdblue can you please take another look? thanks. |
| return specsById; | ||
| } | ||
|
|
||
| int lastAssignedFieldId() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this name is potentially confusing. It is the last assigned partition field ID, and "partition" is a key part of it. How about renaming it to lastAssignedPartitionId instead? Same with the instance field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the field name in TableMetadata.
In PartitionSpec, there is also a lastAssignedFieldId field. It seems to be fine as it is a field inside partition spec, which already indicates it is about partition.
|
|
||
| int lastAssignedFieldId; | ||
| if (formatVersion > 1) { | ||
| lastAssignedFieldId = JsonUtil.getInt(LAST_ASSIGNED_FIELD_ID, node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use a precondition to check that the last assigned partition ID is present? That would result in a better error message if it is missing, like the other cases: last-assigned-partition-id must exist in format v2
| static final String PARTITION_SPEC = "partition-spec"; | ||
| static final String PARTITION_SPECS = "partition-specs"; | ||
| static final String DEFAULT_SPEC_ID = "default-spec-id"; | ||
| static final String LAST_ASSIGNED_FIELD_ID = "last-assigned-field-Id"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be last-assigned-partition-id.
| return new TableMetadata(null, formatVersion, uuid, newLocation, | ||
| lastSequenceNumber, System.currentTimeMillis(), newLastColumnId.get(), freshSchema, | ||
| specId, specListBuilder.build(), orderId, sortOrdersBuilder.build(), ImmutableMap.copyOf(newProperties), | ||
| specId, specListBuilder.build(), Math.max(lastAssignedFieldId, freshSpec.lastAssignedFieldId()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we need to fix the implementation of freshSpec because it doesn't accept a lastAssignedFieldId or reuse existing partition field IDs. In fact, I think what would currently happen is that the spec would be created independent of the rest of the table metadata and could reuse existing IDs. That's not good, but we also don't want to reassign all of the IDs (though that would be less bad).
We need to create a new spec similar to how we use assignFreshIds for the schema, which will reuse IDs for existing fields and only assign new IDs for new fields, starting at the last assigned ID.
|
Thank you, @jun-he! I think there are few main improvements needed:
We can do all 3 in this PR, or just the first one if you prefer to separate them. Just let me know. |
|
@rdblue thanks for the comments. Updated the PR with the renamed field. I will send separate PRs for item 2 and 3. Thanks. |
| "location": "s3://bucket/test/location", | ||
| "last-sequence-number": 34, | ||
| "last-updated-ms": 1602638573590, | ||
| "last-column-id": 3, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that this is last-column-id and not last-assigned-column-id. I think we should make the metadata property last-partition-id instead of last-assigned-partition-id to match. I'll merge this and we can fix it in a follow-up.
|
Thanks, @jun-he! |
Follow up #2083.
Add
lastAssignedFieldIdto table metadata.To keep it simple,
lastAssignedFieldIdacross all partition specs will be computed at initialization without persisting to the JSON file.