@jackye1995
Contributor

@github-actions github-actions bot added the AWS label Feb 11, 2022
@jackye1995
Contributor Author


    @Override
    public org.apache.iceberg.Table registerTable(TableIdentifier identifier, String metadataFileLocation) {
      Preconditions.checkArgument(isValidIdentifier(identifier),
Contributor Author

These checks can likely be generalized to the base metastore class. I will do that after we have some other implementations, to see how much common code we can extract.

Contributor

@singhpk234 Feb 12, 2022

+1, it looks like we are missing preconditions on metadataFileLocation in HiveCatalog. CodePointer

Adding the check to the base metastore class would unify this.
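
For illustration, the kind of precondition being discussed could look roughly like the sketch below. `RegisterTableChecks` and `checkMetadataLocation` are hypothetical names, not existing Iceberg code, and plain Java is used in place of Guava's `Preconditions` so the snippet stands alone.

```java
/** Hypothetical helper (not part of Iceberg) sketching a shared check on metadataFileLocation. */
final class RegisterTableChecks {
  private RegisterTableChecks() {}

  /** Rejects a null or blank metadata file location before any catalog call is made. */
  static void checkMetadataLocation(String metadataFileLocation) {
    if (metadataFileLocation == null || metadataFileLocation.trim().isEmpty()) {
      throw new IllegalArgumentException(
          "Cannot register an empty metadata file location as a table");
    }
  }
}
```

Placing a check like this in a shared base class would let every catalog implementation fail in the same way on bad input.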

                  metadataFileLocation))
              .build())
          .build());
    } catch (software.amazon.awssdk.services.glue.model.AlreadyExistsException e) {
Contributor

Do we need the fully qualified class name here? Is there another AlreadyExistsException?

Contributor Author

Yes, the Glue exception is caught here and rethrown as Iceberg's AlreadyExistsException, so the two classes share a simple name.
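
A minimal, self-contained sketch of that translation pattern follows. The nested holder classes `SdkExceptions` and `CatalogExceptions` are hypothetical stand-ins for the AWS SDK and Iceberg packages, used only so that two exceptions with the same simple name can live in one file; the real code catches the fully qualified Glue class instead.

```java
/** Hypothetical sketch of catching one AlreadyExistsException and rethrowing another. */
final class ExceptionTranslationSketch {
  private ExceptionTranslationSketch() {}

  /** Stand-in for the SDK package that defines its own AlreadyExistsException. */
  static final class SdkExceptions {
    static class AlreadyExistsException extends RuntimeException {
      AlreadyExistsException(String message) {
        super(message);
      }
    }
  }

  /** Stand-in for the catalog package whose AlreadyExistsException callers expect. */
  static final class CatalogExceptions {
    static class AlreadyExistsException extends RuntimeException {
      AlreadyExistsException(Throwable cause, String message) {
        super(message, cause);
      }
    }
  }

  /** Runs the metastore call and translates the SDK exception into the catalog one. */
  static void register(String tableName, Runnable createInMetastore) {
    try {
      createInMetastore.run();
    } catch (SdkExceptions.AlreadyExistsException e) {
      // Qualifying the name is what disambiguates the two same-named types.
      throw new CatalogExceptions.AlreadyExistsException(e, "Table already exists: " + tableName);
    }
  }
}
```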

Comment on lines 446 to 453
    .name(IcebergToGlueConverter.getTableName(identifier))
    .tableType(GlueTableOperations.GLUE_EXTERNAL_TABLE_TYPE)
    .parameters(ImmutableMap.of(
        BaseMetastoreTableOperations.TABLE_TYPE_PROP,
        BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toLowerCase(Locale.ENGLISH),
        BaseMetastoreTableOperations.METADATA_LOCATION_PROP,
        metadataFileLocation))
    .build())
Contributor

Nit: I think it may be more readable to build the table input separately and then pass it in here.

Contributor Author

updated

Contributor

@singhpk234 left a comment

Looks good to me. Thanks, @jackye1995!

    Assert.assertEquals("Table type should be set",
        GlueTableOperations.GLUE_EXTERNAL_TABLE_TYPE, response.table().tableType());
    Assert.assertNull("Storage descriptor should be empty", response.table().storageDescriptor());
    Assert.assertTrue("Partition spec should be empty", response.table().partitionKeys().isEmpty());
Contributor

[Nit / question]: Should we use !response.table().hasPartitionKeys() instead?

Contributor Author

You are right; updating.

    }

    @Override
    public org.apache.iceberg.Table registerTable(TableIdentifier identifier, String metadataFileLocation) {
Contributor

@rdblue Feb 13, 2022

Is it possible to do this more generically in BaseMetastoreCatalog similar to a regular create?

    public Table registerTable(TableIdentifier identifier, String metadataFileLocation) {
      TableOperations ops = newTableOps(identifier);
      if (ops.current() != null) {
        throw new AlreadyExistsException("Table already exists: %s", identifier);
      }

      FileIO io = ops.io();
      TableMetadata metadata = TableMetadataParser.read(io, metadataFileLocation);

      try {
        // use temporary ops to pick up the table metadata settings
        ops.temp(metadata).commit(null, metadata);
      } catch (CommitFailedException ignored) {
        throw new AlreadyExistsException("Table was created concurrently: %s", identifier);
      }

      return new BaseTable(ops, fullTableName(name(), identifier));
    }

That will rewrite the metadata file rather than using it directly, but it seems like it would work in most cases.

Contributor

@jackye1995, what do you think about this suggestion?

Contributor

@rdblue, I have considered this suggestion and implemented it in #5037. Please have a look at the implementation...

Comment on lines +383 to +387
table.refresh();
long v1SnapshotId = table.currentSnapshot().snapshotId();
String v1MetadataLocation = ((BaseTable) table).operations().current().metadataFileLocation();
table.newDelete().deleteFile(dataFile).commit();
table.refresh();
Contributor

I think we can avoid the refresh calls.

    try {
      glue.createTable(CreateTableRequest.builder()
          .databaseName(IcebergToGlueConverter.getDatabaseName(identifier))
          .tableInput(tableInput)
Contributor

Should we save awsProperties.glueCatalogId() as well?

@vamen commented May 17, 2022

This did not seem to work for converting Hadoop catalog tables to Glue tables, because tables written with the Hadoop catalog name their metadata files v<V+1>.metadata.json,

while tables written with the Glue and Hive catalogs use the format <V+1>-<random-uuid>.metadata.json.

Hence, when we call registerTable with a HadoopCatalog-based table, the parseVersion function fails.

After renaming the latest metadata.json file, I was able to register the table and query the data.

I was thinking of raising a PR to handle this in registerTable as follows:

1.) Call parseVersion() in registerTable. If parsing fails in BaseMetastoreTableOperations, the file may follow the File System Tables naming from the spec, and we can reparse its name according to that scheme.

2.) Rename the metadata file to the required format and use the renamed file path as metadataFileLocation.

Created an issue for this: #4794
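
As a rough sketch of step 1, a parser that accepts both naming schemes might look like the following. `MetadataVersionParser` is a hypothetical helper, not existing Iceberg code; it assumes metastore-style names of the form `<version>-<suffix>.metadata.json` and Hadoop-style names of the form `v<version>.metadata.json`, and it ignores compressed variants.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Iceberg): reads the version from either file-name scheme. */
final class MetadataVersionParser {
  // Assumed metastore-style name: digits, a dash, any suffix, then ".metadata.json".
  private static final Pattern METASTORE_STYLE =
      Pattern.compile("^(\\d{1,9})-.*\\.metadata\\.json$");
  // Assumed Hadoop-style name: "v", digits, then ".metadata.json".
  private static final Pattern HADOOP_STYLE =
      Pattern.compile("^v(\\d{1,9})\\.metadata\\.json$");

  private MetadataVersionParser() {}

  /** Returns the parsed version, or empty if the file name matches neither scheme. */
  static OptionalInt parseVersion(String metadataLocation) {
    // Only the last path segment carries the version.
    String fileName = metadataLocation.substring(metadataLocation.lastIndexOf('/') + 1);
    for (Pattern pattern : new Pattern[] {METASTORE_STYLE, HADOOP_STYLE}) {
      Matcher matcher = pattern.matcher(fileName);
      if (matcher.matches()) {
        return OptionalInt.of(Integer.parseInt(matcher.group(1)));
      }
    }
    return OptionalInt.empty();
  }
}
```

Returning an empty result instead of throwing leaves the caller free to decide whether to rename the file, as in step 2, or to reject the registration.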

@jackye1995 @rdblue @rajarshisarkar @RussellSpitzer

Member

@ajantha-bhat commented Jul 22, 2022

@jackye1995, @rdblue: I think we can close this PR, as Glue now supports registerTable (via #5037).

8 participants