Add partition spec evolution #922
Conversation
Thanks for working on this, @jun-he! Besides my comment about replacing the entire spec, I have two other main concerns. First, I think we might want to consider a change-based API, which is related to my comment about not replacing the whole spec. For updating schemas, the API expresses things like `addColumn`, `renameColumn`, etc. that correspond to SQL DDL statements. I think it would make sense to have a similar API here:
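A minimal sketch of what such a change-based API could look like, mirroring the `UpdateSchema` style (method names here are illustrative, for discussion only, not a committed design):

```java
// Illustrative sketch only: a change-based partition spec update API in the
// style of UpdateSchema (addColumn, renameColumn, ...). Method names are
// assumptions for discussion, not the final Iceberg API.
public interface UpdatePartitionSpec extends PendingUpdate<PartitionSpec> {
  // add a new partition field derived from a source column
  UpdatePartitionSpec addField(String sourceName);

  // rename an existing partition field
  UpdatePartitionSpec renameField(String name, String newName);

  // remove an existing partition field
  UpdatePartitionSpec removeField(String name);
}
```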
I'm not sure how we should pass the transform to an `addField`-style method, though.

Second, I think we are going to need a v1 version and a v2 version. In v1, we have to maintain the order and number of fields, so we are more limited for partition spec evolution: we can rename partition fields, replace existing partition fields with the void transform, and add new partition fields at the end.
#924 just made it in, so this can implement it.
@rdblue Thanks for the comments. If we keep the existing spec immutable, …
@jun-he, thanks for working on this. I don't think that the API here is the right one. I think that the change-based API is the only one we need, so it doesn't make sense to have a replacement API in addition. My comment above has more context.
@rdblue I updated the PR with all change-based APIs. Can you please take a look? Thanks.
Looks like the test failure is due to importing Guava classes directly instead of the relocated versions. We now rely on a bundled and relocated version of Guava, introduced in #1068.
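Concretely, that means switching imports to the relocated package (the package path below matches the relocation introduced in #1068):

```java
// Direct Guava imports break after relocation:
// import com.google.common.collect.Lists;

// Use the bundled, relocated Guava instead:
import org.apache.iceberg.relocated.com.google.common.collect.Lists;
```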
@rdblue thanks for the comments. I addressed them and updated the PR accordingly. In this way, we can also support partition spec evolution. Additionally, it does not need to consider the v1 vs. v2 differences discussed above.

Can you please take another look and let me know your comments? Thanks!
Thanks for the update, @jun-he! I'll take a look.
```java
public void testMultipleTimestampPartitions() {
  AssertHelpers.assertThrows("Should not allow year(ts) and year(ts)",
      IllegalArgumentException.class, "Cannot use partition name more than once",
      () -> PartitionSpec.builderFor(SCHEMA).year("ts", "year").year("another_ts", "year").build());
}
```
The context for these checks is incorrect. The problem is not a duplicate transform for a field; it is that the name is reused.

Also, these should not be in this test case for multiple timestamp partitions. These should be in a test case for duplicate names.
I think this change is a side effect of the duplicate partition detection taking precedence over the duplicate name detection, as discussed in #922 (comment).
The context, "Should not allow year(ts) and year(ts)", isn't correct because the call uses two different source columns, `ts` and `another_ts`. It should be "Should not allow partition fields with the same name".
And since this is passing in the partition name, you can change it to avoid hitting the wrong error case. You could move this to a new test for name collisions, and update this to avoid the name collision by removing the explicit partition field name.
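For example, the same-source variant might look like this (a sketch; the exact error message string is an assumption):

```java
// Sketch: both fields use the same source column and default names, so the
// redundant-transform check fires instead of the duplicate-name check.
AssertHelpers.assertThrows("Should not allow year(ts) and year(ts)",
    IllegalArgumentException.class, "Cannot add redundant partition",
    () -> PartitionSpec.builderFor(SCHEMA).year("ts").year("ts").build());
```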
Thanks for the explanation! Updated accordingly.
```java
() -> PartitionSpec.builderFor(SCHEMA)
    .bucket("id", 8, "id_bucket1")
    .bucket("id", 16, "id_bucket2").build());

```
Nit: unnecessary empty line.
Done
```java
private final Schema schema;
private final Map<String, PartitionField> curSpecFields;
private final List<Consumer<PartitionSpec.Builder>> newSpecFields = Lists.newArrayList();
private final Map<String, PartitionField> newRemovedFields = Maps.newHashMap();
```
I think the way that this operation keeps track of state makes it very confusing to read and understand. I can see the appeal of trying to just delegate to the spec builder, but the current approach requires complicated checks after the builder is configured (like `checkIfRemoved` on line 97) and methods that need to rewrite the entire spec anyway (like `freshSpecFieldIds` and `fillGapsByNonNullFields`).

I think that this would be easier to understand if it did validation and assignment incrementally. Instead of adding a method to rewrite the spec with different field IDs, I think this should check for deleted fields when the configuration method is called and use the correct ID from the start. This should help avoid the need for work-arounds like `getLastPartitionField` in the spec builder.
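As a rough sketch of the incremental approach (names and structure are illustrative, not the actual implementation):

```java
// Rough sketch of incremental validation and ID assignment: each
// configuration call validates immediately and assigns the final field ID
// up front, so no rewrite pass like freshSpecFieldIds is needed later.
// All names here are illustrative.
public UpdatePartitionSpec addField(String sourceName) {
  Types.NestedField sourceColumn = schema.findField(sourceName);
  Preconditions.checkArgument(sourceColumn != null,
      "Cannot find source column: %s", sourceName);

  PartitionField justRemoved = removedFields.remove(sourceName);
  // if this field was removed earlier in the same update, reuse its ID so
  // remove-then-add collapses to a no-op (or a rename)
  int fieldId = justRemoved != null ? justRemoved.fieldId() : nextFieldId();

  adds.add(new PartitionField(
      sourceColumn.fieldId(), fieldId, sourceName,
      Transforms.identity(sourceColumn.type())));
  return this;
}
```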
@rdblue Thanks for the comments. I think we can actually remove `newRemovedFields`, `checkIfRemoved`, and `getLastPartitionField`, as there is no need to check whether a newly added field was just removed.

As we discussed in #922 (comment), the main concern with removing a field and then adding it back is polluting the metadata. In the current implementation it won't add duplicates: because the partition field ID is reused, the sequence is equivalent to doing nothing or to renaming the field. I added a few unit tests for those cases.
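One of those tests might look roughly like this (a sketch; `updateSpec()` and the field names are assumptions):

```java
// Sketch: removing a partition field and adding it back in the same update
// reuses the partition field ID, so the committed spec is unchanged.
@Test
public void testRemoveThenReAddFieldIsNoop() {
  PartitionSpec original = table.spec();

  table.updateSpec()
      .removeField("ts_day")
      .addField("ts_day")
      .commit();

  Assert.assertEquals("Spec should be unchanged", original, table.spec());
}
```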
I also updated the code to use the correct ID from the start, to avoid `freshSpecFieldIds` at the end.
```java
  } else {
    builder.add(spec);
  }
}
```
I'm not sure that this is the right change. I think it's a good idea to allow calling this method with the same spec, but that should be detected and should return the exact same `TableMetadata`. That way, the commit will appear to succeed, but won't actually change the table (no-op commits are used elsewhere, too).
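A sketch of that behavior (illustrative, not the actual `TableMetadata` code); note that a renamed field is compatible but not equal, so it still produces new metadata:

```java
// Sketch of the no-op handling: committing an identical spec returns the
// same TableMetadata instance, so the commit appears to succeed without
// changing the table.
public TableMetadata updatePartitionSpec(PartitionSpec newSpec) {
  if (spec().equals(newSpec)) {
    return this; // no-op commit: nothing changed
  }
  // ... otherwise validate the new spec and build updated TableMetadata ...
  return buildReplacement(newSpec); // illustrative helper
}
```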
@rdblue Thanks for the comment. It is a good idea to support a no-op commit.

The main reason to remove the precondition check here was to support renaming a partition field: in the rename case, the partition spec is compatible but different, and needs to be committed.

Updated the code to support no-op commits and added unit tests.
```diff
 @Test
-public void testToJsonForV1Table() {
+public void testToJson() {
```
Instead of changing existing test cases, let's add new test cases for the case you want to exercise.
Thanks for the comment. This test was updated because `TableTestBase` was removed; there are no new test cases added to it.

I updated the test name because, with the current implementation, this test is not specific to a v1 table and there is no special `toJson` handling needed for v2 tables.
```java
PartitionField field = new PartitionField(
    sourceColumn.fieldId(), nextFieldId(), targetName, Transforms.day(sourceColumn.type()));
checkForRedundantPartitions(field);
checkAndAddPartitionName(targetName);
```
I think this was needed because of my comment that the duplicate field should throw an exception before the duplicate name.
```java
} else if (transform.startsWith("bucket[")) {
  type = "bucket";
} else {
  return null;
```
What about using the transform name in this situation? Then this would always catch duplicate transforms.
Also, should we add a case for truncate with different lengths?
Yep, that is better. I added a `getName()` method to the `Transform` interface.

Similar to `bucket`, we may only allow one `truncate` transform for a given field. Do you know if there is a valid use case with multiple `truncate`s for a field?
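With a `getName()` on every transform, the dedupe key can be built generically instead of special-casing string prefixes (a sketch, assuming the `getName()` method mentioned above):

```java
// Sketch: key redundancy checks on source column + transform name
// (e.g. "bucket" for bucket[16], "truncate" for truncate[4]). Keying on the
// name alone also rejects truncate[4] and truncate[8] on the same column,
// matching the one-truncate-per-field stance above.
private String dedupeKey(PartitionField field) {
  return field.sourceId() + "." + field.transform().getName();
}
```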
@rdblue can you take another look? Thanks.
@rdblue can you let me know your comments? Thanks.
For issues #281, #836, and #1091.

This PR implements change-based APIs to support partition spec evolution.
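Using the illustrative method names from the sketch earlier in the thread, usage would look roughly like:

```java
// Rough usage sketch of change-based partition spec evolution
// (method names follow the discussion above; the final API may differ)
table.updateSpec()
    .addField("category")             // add a new identity partition field
    .removeField("level")             // drop an existing partition field
    .renameField("ts_day", "day_ts")  // rename a partition field
    .commit();
```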