-
Notifications
You must be signed in to change notification settings - Fork 2.9k
AWS: support registerTable in GlueCatalog #4099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| @Override | ||
| public org.apache.iceberg.Table registerTable(TableIdentifier identifier, String metadataFileLocation) { | ||
| Preconditions.checkArgument(isValidIdentifier(identifier), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these can likely be generalized to the base metastore class, will do that after we have some other implementations to see how much common code we can extract
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, looks like we are missing pre-conditions on metadataFileLocation in HiveCatalog. CodePointer
Adding it at BaseMetaStoreClass will unify this stuff.
| metadataFileLocation)) | ||
| .build()) | ||
| .build()); | ||
| } catch (software.amazon.awssdk.services.glue.model.AlreadyExistsException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the full class path here? Is there another AlreadyExistsException?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this is rethrown as Iceberg's AlreadyExistsException
| .name(IcebergToGlueConverter.getTableName(identifier)) | ||
| .tableType(GlueTableOperations.GLUE_EXTERNAL_TABLE_TYPE) | ||
| .parameters(ImmutableMap.of( | ||
| BaseMetastoreTableOperations.TABLE_TYPE_PROP, | ||
| BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toLowerCase(Locale.ENGLISH), | ||
| BaseMetastoreTableOperations.METADATA_LOCATION_PROP, | ||
| metadataFileLocation)) | ||
| .build()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think it may be more readable to build the table input separately and then pass it in here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
singhpk234
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, Thanks @jackye1995
| Assert.assertEquals("Table type should be set", | ||
| GlueTableOperations.GLUE_EXTERNAL_TABLE_TYPE, response.table().tableType()); | ||
| Assert.assertNull("Storage descriptor should be empty", response.table().storageDescriptor()); | ||
| Assert.assertTrue("Partition spec should be empty", response.table().partitionKeys().isEmpty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit / question] : should we use !response.table().hasPartitionKeys()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right, updating
| } | ||
|
|
||
| @Override | ||
| public org.apache.iceberg.Table registerTable(TableIdentifier identifier, String metadataFileLocation) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to do this more generically in BaseMetastoreCatalog similar to a regular create?
public Table registerTable(TableIdentifier identifier, String metadataFileLocation) {
TableOperations ops = newTableOps(identifier);
if (ops.current() != null) {
throw new AlreadyExistsException("Table already exists: %s", identifier);
}
FileIO io = ops.io();
TableMetadata metadata = TableMetadataParser.read(io, metadataFileLocation);
try {
// use temporary ops to pick up the table metadata settings
ops.temp(metadata).commit(null, metadata);
} catch (CommitFailedException ignored) {
throw new AlreadyExistsException("Table was created concurrently: %s", identifier);
}
return new BaseTable(ops, fullTableName(name(), identifier));
}That will rewrite the metadata file rather than using it directly, but it seems like it would work in most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jackye1995, what do you think about this suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| table.refresh(); | ||
| long v1SnapshotId = table.currentSnapshot().snapshotId(); | ||
| String v1MetadataLocation = ((BaseTable) table).operations().current().metadataFileLocation(); | ||
| table.newDelete().deleteFile(dataFile).commit(); | ||
| table.refresh(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can avoid the refresh calls.
| try { | ||
| glue.createTable(CreateTableRequest.builder() | ||
| .databaseName(IcebergToGlueConverter.getDatabaseName(identifier)) | ||
| .tableInput(tableInput) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we save awsProperties.glueCatalogId() as well?
|
This seemed to be not working for converting hadoop catalog based tables to glue based tables. because tables written using hadoop catalog writes metadata.json file in the form v<V+1>.metadata.json And tables written using glue and hive catalog have format <V+1>-.metadata.json Hence when we call registerTable with HadoopCatalog based table, parseVersion function would fail after renaming the latest metadata.json file I was able to register and query the data. I was thinking to raise a PR to handle this in registerTable function as follows. 1.) We shall call parseVersion() in registerTable. If parsing gets failed in BaseMetastoreTableOperations then the file can be in File-System-Tables spec and we can reparse the file using File-System-Tables spec. 2.) We can then rename the metadata file in required spec and use the renamed file path as metadataFileLocation. Created a issue for the same : #4794 |
|
@jackye1995, @rdblue: I think we can close this PR as glue now supports the register table (via #5037) |
@amogh-jahagirdar