Skip to content

Conversation

@aokolnychyi
Copy link
Contributor

This PR adds a utility method to derive a common type for all partition specs.

Prior to this change, querying metadata tables in v2 format with evolved partitioning failed with runtime exceptions.

@aokolnychyi
Copy link
Contributor Author


TableOperations ops = ((HasTableOperations) table).operations();
TableMetadata current = ops.current();
ops.commit(current, current.updatePartitionSpec(newSpec));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make change for the method updatePartitionSpec to avoid conflicts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is hidden from users. The user-facing API is UpdatePartitionSpec accessible via Table. That one actually ensures we don't hit this case. The spec evolution in v1 tables is actually limited as described here.

There could be some tables where people evolved partitioning before the public API appeared. It is an edge case but this test ensures we get a reasonable exception for such tables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@flyrain flyrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +217 to +218
table.updateSpec()
.removeField("data")
.commit();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could be in one line.

@RussellSpitzer
Copy link
Member

Do we have to worry about the Manifests Table? Or is it ok because we are always displaying in the context of the current spec?

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments, but looks good to me overall

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, thanks for the fix!

@aokolnychyi aokolnychyi force-pushed the metadata-table-specs branch from 3e5ba47 to 93a7fd8 Compare August 5, 2021 16:05
@aokolnychyi
Copy link
Contributor Author

@RussellSpitzer, somehow tests for all_data_files started to fail after #2877. It looks like we did not update that metadata table in the PR. Is that on purpose?

Checking what is actually going on.

@aokolnychyi
Copy link
Contributor Author

aokolnychyi commented Aug 5, 2021

@RussellSpitzer, I think simply using schema() instead of fileSchema in AllDataFilesTable fixes the problem.
Could you check?

@RussellSpitzer
Copy link
Member

@aokolnychyi Sounds good to me, I thought I mimicd the way we were doing it in the FilesTable using the "fileSchema" as the projected schema, if that isn't the scan.schema we should change it

@aokolnychyi
Copy link
Contributor Author

Well, I am not sure. I'll need your help to verify whether my assumption is correct. We are using the same ManifestReadTask in both but AllDataFilesTable still passes fileSchema instead of schema().

@aokolnychyi aokolnychyi force-pushed the metadata-table-specs branch from 93a7fd8 to ec1f573 Compare August 5, 2021 17:53

List<NestedField> sortedCommonFields = commonFields.values().stream()
.sorted(Comparator.comparingInt(NestedField::fieldId))
.collect(Collectors.toList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for sorting by fieldId.

Copy link
Contributor

@kbendick kbendick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks good to me.

Same question that Ryan has regarding checking name during validation in partitionType, but overall this looks good to me. Thanks Anton!

@github-actions github-actions bot added the API label Aug 10, 2021
List<NestedField> structFields = Lists.newArrayList();

// sort the spec IDs in descending order to pick up the most recent field names
List<Integer> specIds = table.specs().keySet().stream()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a sort by spec ID to make sure we pick up the most recent field name (see a dedicated test too).

}

@Test
public void testPartitionTypeWithAddingBackSamePartitionFieldInV1Table() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test validates we ignore field names when building the common type. The original spec will have 1000:data and the last spec will have 1000:data_1000 as the old field was renamed to avoid naming conflicts.

structFields.add(structField);
} else {
// verify the fields are compatible as they may conflict in v1 tables
ValidationException.check(field.compatibleWith(existingField),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably just make equivalentIgnoringName a private method in this class for this.

NestedField.optional(1001, "data", Types.StringType.get())
);
StructType actualType = Partitioning.partitionType(table);
Assert.assertEquals("Types must match", expectedType, actualType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could argue that adding data back should re-use the old ID. Not something to fix here, but we should probably fix it at some point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed wierd. However, I would not worry too much here as we have already stated not to rename and drop fields in v1 tables.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, other than the compatibleWith method that conflicts with the one in PartitionSpec.

@aokolnychyi
Copy link
Contributor Author

Yeah, I wasn't sure about the method name. Updated.

@aokolnychyi aokolnychyi requested a review from rdblue August 14, 2021 03:15
@aokolnychyi aokolnychyi merged commit 95cde3a into apache:master Aug 14, 2021
@aokolnychyi
Copy link
Contributor Author

Thanks for reviewing, @flyrain @karuppayya @RussellSpitzer @rdblue @kbendick @jackye1995!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants