
Conversation

@jun-he (Collaborator) commented Apr 15, 2020

For issues #281, #836, and #1091.

This PR implements change-based APIs to support partition spec evolution.

@rdblue (Contributor) commented Apr 15, 2020

Thanks for working on this, @jun-he!

Besides my comment about replacing the entire spec, I have two other main concerns. First, I think we might want to consider a change-based API, which is related to my comment about not replacing the whole spec. For updating schemas, the API expresses things like addColumn, renameColumn, etc. that correspond to SQL DDL statements. I think it would make sense to have a similar API here:

  • addField(String source, ...)
  • renameField(String name, String newName)
  • replaceField(String name, ...)
  • removeField(String name)

I'm not sure how we should pass the transform to addField and replaceField just yet. We have expressions for the various transforms, so we could use an expression like addField("ts_hour", hour("ts")). We could also add a lot more variations of the method names, like addHourField, addDayField, etc.
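
A minimal sketch of what that change-based surface might look like (hypothetical: the interface did not exist at this point, and the Term parameter type is an assumption since how to pass the transform was still an open question):

// Hypothetical sketch of the change-based API surface described above.
// "Term" stands in for however the transform ends up being passed.
public interface UpdatePartitionSpec extends PendingUpdate<PartitionSpec> {
  // add a new partition field, e.g. addField("ts_hour", hour("ts"))
  UpdatePartitionSpec addField(String name, Term transform);
  // rename an existing partition field, keeping its transform and ID
  UpdatePartitionSpec renameField(String name, String newName);
  // replace the transform behind an existing partition field name
  UpdatePartitionSpec replaceField(String name, Term newTransform);
  // drop a partition field (v1 may replace it with a void transform instead)
  UpdatePartitionSpec removeField(String name);
}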

Second, I think we are going to need a v1 version and a v2 version. In v1, we have to maintain the order and number of fields, so we are more limited for partition spec evolution: we can rename partition fields, replace existing partition fields with the void transform to "remove" them, and add new partition fields at the end of the spec. But in v2, we have reliable IDs, so we can actually remove fields, replace a field with a different transform (drop days and add hours in the same position), and also rename and add partition fields. I think the easiest thing is to build separate implementations that validate and apply those changes.

@rdblue (Contributor) commented Apr 15, 2020

#924 just made it in, so this can implement dropField for v1 by replacing the field with a void transform. That will avoid ID problems.
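
For v1, "dropping" a field then means swapping its transform for the void transform while keeping the field's position, name, and ID. Roughly (illustrative only: PartitionField construction is internal to the library, and Transforms.alwaysNull() is assumed to be the accessor for the void transform from #924):

// Illustrative only: "remove" a v1 partition field via the void transform.
// The field keeps its position and field ID, so older metadata stays valid,
// but every new row partitions to null for this field.
PartitionField dropped = new PartitionField(
    field.sourceId(),          // same source column
    field.fieldId(),           // same partition field ID
    field.name(),              // same name, for v1 compatibility
    Transforms.alwaysNull());  // void transform from #924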

@jun-he (Collaborator, Author) commented Apr 24, 2020

@rdblue Thanks for the comments.
I am thinking we will always create a new spec with a new spec ID. The change-based API will first clone an existing spec (either the latest one or another one specified by its spec ID) and then mutate the clone, as sketched after the list below. This maintains the history of spec evolution and keeps existing specs immutable, so the change-based API can then treat the result the same way as any new spec.
But this means existing data will still use the original spec, because we never modify it.

If we keep the existing spec immutable,

  • addField, renameField, and removeField will be straightforward for V1 and V2.
  • replaceField will be equivalent to removeField followed by addField. One issue is reusing the same field ID. This should be fine for V1, since V1 always resets the field IDs for a new spec. But for V2, we cannot reuse the same field ID because it has already been used by a partition field in one or more old partition specs.
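
A rough sketch of that copy-on-write flow, with hypothetical helper names (cloneOf, nextSpecId, and applyChanges are not existing Iceberg methods):

// Rough sketch of the clone-then-mutate flow described above.
PartitionSpec base = table.spec();               // existing spec stays immutable
PartitionSpec.Builder builder = cloneOf(base)    // start from a copy of the base spec
    .withSpecId(nextSpecId(table));              // every evolution gets a fresh spec ID
applyChanges(builder);                           // change-based mutations touch only the clone
PartitionSpec evolved = builder.build();         // old data files keep the original spec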

@jun-he force-pushed the jun/partitionspec-evolution branch from 3e6111d to befa310 on April 25, 2020
@jun-he force-pushed the jun/partitionspec-evolution branch from befa310 to 57fe84c on May 25, 2020
@jun-he changed the title from "[WIP] Add partition spec evolution" to "Add partition spec evolution" on May 25, 2020
@jun-he (Collaborator, Author) commented May 25, 2020

@rdblue Wondering if it would be better to have a separate PR to implement change-based partition spec evolution, as this PR has already addressed #836 and added the basic evolution features to the Table API. Thanks.

@jun-he changed the title from "Add partition spec evolution" to "[Part 1] Add partition spec evolution" on May 25, 2020
@rdblue (Contributor) commented May 25, 2020

@jun-he, thanks for working on this.

I don't think that the API here is the right one. I think that the change-based API is the only one we need, so it doesn't make sense to have a replacement API in addition. My comment above has more context.

@jun-he changed the title from "[Part 1] Add partition spec evolution" to "Add partition spec evolution" on Jun 4, 2020
@jun-he (Collaborator, Author) commented Jun 4, 2020

@rdblue I updated the PR with all change-based APIs. Can you please take a look? Thanks.

@rdblue (Contributor) commented Jun 4, 2020

Looks like the test failure is due to importing Guava classes instead of the relocated versions. We now rely on a bundled and relocated version of Guava, introduced in #1068.
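
For reference, the fix is an import swap (using Lists as an example):

// wrong: importing Guava directly
import com.google.common.collect.Lists;
// right: the bundled, relocated Guava shipped with Iceberg
import org.apache.iceberg.relocated.com.google.common.collect.Lists;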

@jun-he force-pushed the jun/partitionspec-evolution branch from 8487eb4 to 8413d1a on June 28, 2020
@jun-he (Collaborator, Author) commented Jun 28, 2020

@rdblue thanks for the comments. I addressed them and updated the PR accordingly.
In this change, UpdatePartitionSpec now ignores all soft-deleted fields in V1 and applies all changes to create a V2 partition spec. After that, it fills each partition field ID gap with a VoidTransform field.

In this way, we can also support partition spec evolution cases such as:

  • reusing a field ID previously removed in V1
  • keeping the spec concise, without useless trailing void fields

Additionally, the V1 case does not need to be considered before committing, which makes the code cleaner and easier to read.
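
A sketch of that gap-filling step (voidField is an illustrative helper, not the PR's actual identifier):

// Sketch: after building the spec as if the table were V2, back-fill any
// missing V1 partition field IDs with void-transform placeholders so the
// V1 field list stays dense and ordered. voidField() is a hypothetical
// helper that builds a placeholder field using the void transform.
List<PartitionField> v1Fields = Lists.newArrayList();
int nextId = 1000;  // Iceberg partition field IDs start at 1000
for (PartitionField field : evolvedFields) {
  while (nextId < field.fieldId()) {
    v1Fields.add(voidField(nextId));  // fill the ID gap with a void placeholder
    nextId += 1;
  }
  v1Fields.add(field);
  nextId = field.fieldId() + 1;
}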

Can you please take another look and let me know your comments? Thanks!

@rdblue (Contributor) commented Jun 30, 2020

Thanks for the update, @jun-he! I'll take a look.

public void testMultipleTimestampPartitions() {
  AssertHelpers.assertThrows("Should not allow year(ts) and year(ts)",
      IllegalArgumentException.class, "Cannot use partition name more than once",
      () -> PartitionSpec.builderFor(SCHEMA).year("ts", "year").year("another_ts", "year").build());
rdblue (Contributor):

The context for these checks is incorrect. The problem is not a duplicate transform for a field; it is that the name is reused.

Also, these should not be in this test case for multiple timestamp partitions. They should be in a test case for duplicate names.

jun-he (Collaborator, Author):

I think this change is a side effect of the duplicate partition detection taking precedence over the duplicate name detection, as discussed in #922 (comment).

rdblue (Contributor):

The context, "Should not allow year(ts) and year(ts)", isn't correct because the call uses two different source columns, ts and another_ts. It should be "Should not allow partition fields with the same name".

And since this is passing in the partition name, you can change it to avoid hitting the wrong error case. You could move this to a new test for name collisions, and update this one to avoid the name collision by removing the explicit partition field name.
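
Concretely, the suggested separate test might look like this (a sketch based on the snippet above, with a hypothetical test name):

// Sketch of a dedicated test for duplicate partition names.
@Test
public void testDuplicatePartitionNames() {
  AssertHelpers.assertThrows("Should not allow partition fields with the same name",
      IllegalArgumentException.class, "Cannot use partition name more than once",
      () -> PartitionSpec.builderFor(SCHEMA)
          .year("ts", "year")
          .year("another_ts", "year")
          .build());
}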

jun-he (Collaborator, Author):

Thanks for the explanation! Updated accordingly.

() -> PartitionSpec.builderFor(SCHEMA)
    .bucket("id", 8, "id_bucket1")
    .bucket("id", 16, "id_bucket2").build());

rdblue (Contributor):

Nit: unnecessary empty line.

jun-he (Collaborator, Author):

Done

private final Schema schema;
private final Map<String, PartitionField> curSpecFields;
private final List<Consumer<PartitionSpec.Builder>> newSpecFields = Lists.newArrayList();
private final Map<String, PartitionField> newRemovedFields = Maps.newHashMap();
rdblue (Contributor):

I think the way that this operation keeps track of state makes it very confusing to read and understand. I can see the appeal of just delegating to the spec builder, but the current approach requires complicated checks after the builder is configured (like checkIfRemoved on line 97) and methods that rewrite the entire spec anyway (like freshSpecFieldIds and fillGapsByNonNullFields).

I think this would be easier to understand if it did validation and assignment incrementally. Instead of adding a method to rewrite the spec with different field IDs, this should check for deleted fields when the configuration method is called and use the correct ID from the start. This should help avoid the need for work-arounds like getLastPartitionField in the spec builder.
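
A sketch of the incremental style being suggested (nameToField, deletedFields, key, buildField, adds, and the Term type are all illustrative names, not the PR's actual code):

// Sketch: validate and assign the field ID at the moment addField is called,
// instead of rewriting all IDs at commit time.
public UpdatePartitionSpec addField(String name, Term transform) {
  Preconditions.checkArgument(!nameToField.containsKey(name),
      "Cannot use partition name more than once: %s", name);
  PartitionField deleted = deletedFields.get(key(name, transform));
  int fieldId = deleted != null
      ? deleted.fieldId()   // re-adding a deleted field reuses its original ID
      : nextFieldId();      // otherwise, assign the next ID immediately
  adds.add(buildField(fieldId, name, transform));
  return this;
}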

jun-he (Collaborator, Author):

@rdblue Thanks for the comments. I think we can actually remove newRemovedFields, checkIfRemoved, and getLastPartitionField, as there is no need to check whether a newly added field was just removed.
As we discussed in #922 (comment), the main concern with removing a field and then adding it back is polluting the metadata.
In the current implementation, because the partition field ID is reused, this won't add duplicates; it is equivalent to doing nothing or renaming the field.
I added a few unit tests for those cases.

jun-he (Collaborator, Author):

I also updated the code to use the correct ID from the start to avoid freshSpecFieldIds at the end.

  } else {
    builder.add(spec);
  }
}
rdblue (Contributor):

I'm not sure that this is the right change. I think it's a good idea to allow calling this method with the same spec, but that should be detected and should return the exact same TableMetadata. That way, the commit will appear to succeed, but won't actually change the table (no-op commits are used elsewhere, too).
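
A sketch of that detection in TableMetadata (the method shape and buildReplacement helper are assumptions):

// Sketch: if the "updated" spec is identical to the current one, return
// this TableMetadata unchanged so the commit becomes a no-op.
public TableMetadata updatePartitionSpec(PartitionSpec newSpec) {
  if (spec().equals(newSpec)) {
    return this;  // no-op commit: same spec, so no new metadata version
  }
  return buildReplacement(newSpec);  // hypothetical: append newSpec and update the default
}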

jun-he (Collaborator, Author):

@rdblue Thanks for the comment. It is a good idea to support a no-op commit.
The main reason to remove the precondition check here was to support renaming a partition field: in the renaming case, the partition spec is compatible but different, and it needs to be committed.
I updated the code to support no-op commits and added unit tests.


  @Test
- public void testToJsonForV1Table() {
+ public void testToJson() {
rdblue (Contributor):

Instead of changing existing test cases, let's add new test cases for the case you want to exercise.

jun-he (Collaborator, Author):

Thanks for the comment. This test was updated because TableTestBase was removed; no new test cases were added to it.
I updated the test name because, with the current implementation, this test is not specific to a V1 table and there is no special toJson handling needed for a V2 table.

PartitionField field = new PartitionField(
    sourceColumn.fieldId(), nextFieldId(), targetName, Transforms.day(sourceColumn.type()));
checkForRedundantPartitions(field);
checkAndAddPartitionName(targetName);
rdblue (Contributor):

I think this was needed because of my comment that the duplicate field should throw an exception before the duplicate name.

} else if (transform.startsWith("bucket[")) {
  type = "bucket";
} else {
  return null;
rdblue (Contributor):

What about using the transform name in this situation? Then this would always catch duplicate transforms.

Also, should we add a case for truncate with different lengths?
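
A sketch of keying the duplicate check on the transform name (dedupKey is a hypothetical helper; stripping the bracket from toString() stands in for the getName() method discussed below):

// Sketch: derive a duplicate-check key from the transform name so any two
// transforms of the same kind on the same column collide, e.g. bucket[8]
// and bucket[16] both reduce to "bucket".
private String dedupKey(PartitionField field) {
  String transform = field.transform().toString();  // e.g. "bucket[16]"
  int bracket = transform.indexOf('[');
  String name = bracket < 0 ? transform : transform.substring(0, bracket);
  return field.sourceId() + ":" + name;             // one transform kind per source column
}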

jun-he (Collaborator, Author):

Yep, that is better. I added a getName() method to the Transform interface.

Similar to bucket, we may want to allow only one truncate transform for a given field.
Do you know if there is a valid use case for multiple truncate transforms on a field?

@jun-he force-pushed the jun/partitionspec-evolution branch from 8413d1a to 99373a7 on July 14, 2020
@jun-he (Collaborator, Author) commented Jul 23, 2020

@rdblue can you take another look? Thanks.

@jun-he requested a review from rdblue on July 23, 2020
@jun-he (Collaborator, Author) commented Jul 29, 2020

@rdblue can you let me know your comments? Thanks.

@rdblue (Contributor) commented Dec 17, 2020

Most of the changes here are included in #1942, so I'll close this. It would be nice to get some of the test updates in another PR, though. Thanks @jun-he!
