@theoxu31 theoxu31 commented Feb 4, 2023

Add support for registerTable in GlueCatalog.
Customizations:

  • allowing GlueDataCatalog registerTable API for existing Table
  • remove the commit part in registerTable API to avoid creating new metadata file
  • copy the Iceberg parameters at table level to the Glue storage descriptor level as well, for consistency with other Glue catalog tables
  • allow cross-region S3 FileIO calls for Glue catalog registerTable when using assumeRole

Reference: #4099

@github-actions github-actions bot added the AWS label Feb 4, 2023
@jackye1995
Contributor

Thanks for picking this up from me. Pinging a few people for review: @amogh-jahagirdar @singhpk234 @rajarshisarkar @aajisaka @JonasJ-ap

}

@Override
public org.apache.iceberg.Table registerTable(TableIdentifier identifier, String metadataFileLocation) {
Contributor

@jackye1995 jackye1995 Feb 4, 2023

I think we might want to have a feature flag like glue.register-table.create-new-metadata (for the lack of a better name) in AwsProperties to distinguish 2 behaviors, to either create a new metadata file or not. If the flag is true (by default), it can call the base class method directly.
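
A minimal sketch of how such a flag could be read, assuming it is parsed from the catalog property map with a default of true. The class RegisterTableFlags and its method are hypothetical stand-ins for whatever AwsProperties would expose, and the property name is the placeholder suggested above, not an existing Iceberg property.

```java
import java.util.Map;

/** Hypothetical stand-in for the flag parsing that AwsProperties would own. */
final class RegisterTableFlags {
  // Placeholder name from the review comment; not an existing Iceberg property.
  static final String CREATE_NEW_METADATA = "glue.register-table.create-new-metadata";
  static final boolean CREATE_NEW_METADATA_DEFAULT = true;

  private RegisterTableFlags() {}

  /** Returns the flag value, falling back to the default when the key is absent. */
  static boolean createNewMetadata(Map<String, String> properties) {
    String value = properties.get(CREATE_NEW_METADATA);
    return value == null ? CREATE_NEW_METADATA_DEFAULT : Boolean.parseBoolean(value);
  }
}
```

With the flag left unset, the catalog would keep calling the base class method; only an explicit false would opt in to the path that skips writing a new metadata file.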

.tableInput(tableInput)
.build());
} catch (software.amazon.awssdk.services.glue.model.AlreadyExistsException e) {
glue.updateTable(UpdateTableRequest.builder()
Contributor

Similarly, I think we should at least make this a feature flag like glue.register-table.replace-if-exists.

We could further argue if we should make this an API feature or not, but that is subject to debate. Any thoughts?

TableInput tableInput = TableInput.builder()
.name(IcebergToGlueConverter.getTableName(identifier, awsProperties.glueCatalogSkipNameValidation()))
.tableType(GlueTableOperations.GLUE_EXTERNAL_TABLE_TYPE)
.parameters(tableParameters)
Contributor

I think here we need to still get the metadata and use the IcebergToGlueConverter to get the merged schema for display.

@ajantha-bhat
Member

remove the commit part in registerTable API to avoid creating new metadata file

#6591 already fixed this part for Glue too right?

@jackye1995
Contributor

#6591 already fixed this part for Glue too right?

great, did not notice that one!

In that case I think the only missing feature is just replace

@ajantha-bhat
Member

ajantha-bhat commented Feb 4, 2023

  • allowing GlueDataCatalog registerTable API for existing Table

There is also related ongoing work: #5327

@jackye1995
Contributor

jackye1995 commented Feb 4, 2023

Nice, I anticipated similar concerns as in that thread, that's why I'd like to just put it up and see how the community reacts to this.

I think the conversation there was around the fact that registerTable is used for a recovery use case, so there is no requirement for atomic operation and the original metadata location might be broken.

What Theo is trying to achieve (based on my understanding) is a use case of continuous registration, where a user sends notification of the latest metadata location, and then the metadata location is updated for an existing table in Glue. This is developed for migration use cases of, for example, HiveCatalog or HadoopCatalog to GlueCatalog.

In this use case, an atomic switch of the metadata location is a requirement, compared to a drop + register case. And it can clearly be achieved by calling glue.updateTable with all the right information. I think we would like to know if that is something worth adding to Iceberg OSS, or if it needs to just remain custom logic.

I think it has benefit in OSS, as people naturally think of using registerTable when talking about this use case, and as we have so many catalog offerings, it's worth supporting cross-catalog migration explicitly.

Any thoughts? @RussellSpitzer @yabola @flyrain @szehon-ho @rdblue

@yabola
Contributor

yabola commented Feb 5, 2023

@jackye1995 Thanks for pinging me. I agree with you: there is no such requirement in a recovery use case, but it can be a requirement for an automatic switch of the metadata location. I am not sure whether it should be custom logic, though. Looking forward to other people's perspectives. If it makes sense, I can continue to complete my PR.

@ajantha-bhat
Member

I think it has benefit in OSS, as people naturally think of using registerTable when talking about this use case, and as we have so many catalog offerings, it's worth supporting cross-catalog migration explicitly.

@jackye1995: Totally agree that we need to support cross-catalog migration.
A while back I raised #5492 and got a comment that it is better to handle this in a separate project (this was also discussed in the previous Iceberg sync). So I am working on it as a side project (https://github.com/ajantha-bhat/iceberg-catalog-migrator/blob/first/README.md). In a week or two I can publish it as an open source project under the projectNessie repo.

@jackye1995
Contributor

I briefly read the project you linked; that's a very cool CLI! But we are not really trying to build a new migration project out of this. The ask is much simpler.

What I want to get out of the discussion is really which way we go out of the following:

  1. atomic replace can be a catalog-specific feature for registerTable, such that a glue.force-register-table flag could be added to support it
  2. an API change in Iceberg catalog can be performed to make this a generic feature, such as adding a force boolean flag to registerTable
  3. it should not be a part of Iceberg core catalog API at all
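
To make option 2 concrete, here is a rough sketch of what a generic force overload might look like. SketchCatalog and InMemorySketchCatalog are hypothetical, simplified to string identifiers, and are not the Iceberg Catalog interface; the default method shows how catalogs that do not support replacement could keep the current behavior.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified catalog API; not the Iceberg Catalog interface. */
interface SketchCatalog {
  /** Registers a new table and fails if the identifier is already taken. */
  String registerTable(String identifier, String metadataFileLocation);

  /** Option 2: a generic overload. Catalogs without replace support reject force. */
  default String registerTable(String identifier, String metadataFileLocation, boolean force) {
    if (force) {
      throw new UnsupportedOperationException("Force registration is not supported");
    }
    return registerTable(identifier, metadataFileLocation);
  }
}

/** In-memory implementation that opts in to the force behavior. */
final class InMemorySketchCatalog implements SketchCatalog {
  private final Map<String, String> locations = new HashMap<>();

  @Override
  public String registerTable(String identifier, String metadataFileLocation) {
    if (locations.putIfAbsent(identifier, metadataFileLocation) != null) {
      throw new IllegalStateException("Table already exists: " + identifier);
    }
    return metadataFileLocation;
  }

  @Override
  public String registerTable(String identifier, String metadataFileLocation, boolean force) {
    if (!force) {
      return registerTable(identifier, metadataFileLocation);
    }
    // Replace the pointer whether or not the table already exists.
    locations.put(identifier, metadataFileLocation);
    return metadataFileLocation;
  }

  /** Returns the registered metadata location, or null if unknown. */
  String location(String identifier) {
    return locations.get(identifier);
  }
}
```

Option 1 would instead keep the two-argument signature and hide the same branch behind a catalog property such as the glue.force-register-table flag mentioned above.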

.databaseName(IcebergToGlueConverter.getDatabaseName(identifier,
awsProperties.glueCatalogSkipNameValidation()))
.tableInput(tableInput)
.build());
Contributor

this should first get the last table version to avoid commit conflict

Author

I think the Glue catalog already handles this internally?

If we call glue.getTable to get the table version first and then call glue.updateTable with nextVersionId, it will cause a ConcurrentModificationException.

Contributor

Yes this is what I mean. It needs to explicitly pass in the version number of the current version to ensure atomic update.
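
A small illustration of why the current version has to be passed explicitly. VersionedTableStore is a hypothetical in-memory stand-in, not the Glue client, and it uses integer versions for brevity; it only shows the read-version-then-conditional-update pattern described above, where a writer holding a stale version is rejected instead of silently overwriting.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a versioned table store; not the Glue client. */
final class VersionedTableStore {
  /** Immutable snapshot of one table entry. */
  static final class Entry {
    final String metadataLocation;
    final int versionId;

    Entry(String metadataLocation, int versionId) {
      this.metadataLocation = metadataLocation;
      this.versionId = versionId;
    }
  }

  private final Map<String, Entry> tables = new HashMap<>();

  /** Creates the table at version 1, failing if it already exists. */
  synchronized void create(String name, String metadataLocation) {
    if (tables.putIfAbsent(name, new Entry(metadataLocation, 1)) != null) {
      throw new IllegalStateException("Table already exists: " + name);
    }
  }

  /** Returns the current entry, or null if the table is unknown. */
  synchronized Entry get(String name) {
    return tables.get(name);
  }

  /** Replaces the metadata location only if the caller saw the latest version. */
  synchronized Entry update(String name, String metadataLocation, int expectedVersionId) {
    Entry current = tables.get(name);
    if (current == null) {
      throw new IllegalArgumentException("No such table: " + name);
    }
    if (current.versionId != expectedVersionId) {
      throw new ConcurrentModificationException(
          "Expected version " + expectedVersionId + " but found " + current.versionId);
    }
    Entry next = new Entry(metadataLocation, current.versionId + 1);
    tables.put(name, next);
    return next;
  }
}
```

In this sketch, two callers that both read version 1 cannot both succeed: the first update advances the table to version 2, and the second fails with a conflict rather than replacing the first caller's metadata location.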

Author

Ack, will update

@jackye1995 jackye1995 changed the title from "support registerTable in GlueCatalog" to "AWS: support force register table in GlueCatalog" Feb 7, 2023
glue.createTable(
CreateTableRequest.builder().databaseName(databaseName).tableInput(tableInput).build());
} catch (software.amazon.awssdk.services.glue.model.AlreadyExistsException e) {
if (awsProperties.glueCatalogForceRegisterTable()) {
Contributor

I think we can do something better to keep the original behavior when the flag is off:

if (!awsProperties.glueCatalogForceRegisterTable()) {
  return super.registerTable(.....);
}

// the replace-based logic below
...

@theoxu31
Author

Resolved comments, pinging people for review @jackye1995 @yabola @amogh-jahagirdar @singhpk234 @rajarshisarkar @aajisaka @JonasJ-ap

Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Overall looking good. I think there are some fixes we need to make in the exception handling and in the tests we are adding.

Comment on lines -272 to -279
AssertHelpers.assertThrows(
"should fail to rename",
ValidationException.class,
"Input Glue table is not an iceberg table",
() ->
glueCatalog.renameTable(
TableIdentifier.of(namespace, tableName),
TableIdentifier.of(namespace, newTableName)));
Contributor

Why is this assertion removed? It looks like it's for rename

Author

This assertion was removed due to the logic change in renameTable that allows the renamed table to use the previous table's Iceberg properties (metadata location). With that change, the related integration test would always fail at this assertion.

Comment on lines 579 to 639
Table table = glueCatalog.loadTable(identifier);
Table table = glueCatalogWithForceRegisterTable.loadTable(identifier);
String metadataLocation = ((BaseTable) table).operations().current().metadataFileLocation();
Assertions.assertThatThrownBy(() -> glueCatalog.registerTable(identifier, metadataLocation))
.isInstanceOf(AlreadyExistsException.class);
Assertions.assertThat(glueCatalog.dropTable(identifier, true)).isTrue();
Assertions.assertThat(glueCatalog.dropNamespace(Namespace.of(namespace))).isTrue();
Assertions.assertThat(glueCatalogWithForceRegisterTable.dropTable(identifier, false)).isTrue();
Table registeredTable =
glueCatalogWithForceRegisterTable.registerTable(identifier, metadataLocation);
Assertions.assertThat(registeredTable).isNotNull();

GetTableResponse response =
glue.getTable(GetTableRequest.builder().databaseName(namespace).name(tableName).build());
Assert.assertEquals(
"external table type is set after register",
"EXTERNAL_TABLE",
response.table().tableType());
String actualMetadataLocation =
response.table().parameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP);
Assert.assertEquals(
"metadata location should be updated with registerTable call",
metadataLocation,
actualMetadataLocation);

// commit new transaction, should create a new metadata file
DataFile dataFile =
DataFiles.builder(partitionSpec)
.withPath("/path/to/data-a.parquet")
.withFileSizeInBytes(10)
.withRecordCount(1)
.build();
table.newAppend().appendFile(dataFile).commit();

metadataLocation = ((BaseTable) table).operations().current().metadataFileLocation();
// update metadata location
glueCatalogWithForceRegisterTable.registerTable(identifier, metadataLocation);
response =
glue.getTable(GetTableRequest.builder().databaseName(namespace).name(tableName).build());
String updatedMetadataLocation =
response.table().parameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP);
Assert.assertEquals(
"metadata location should be updated with registerTable call",
metadataLocation,
updatedMetadataLocation);
Assert.assertEquals("Table Version should be updated", "2", response.table().versionId());
Contributor

I think we'll want separate tests for testRegisterWhenTableExists: one when force registration is enabled and one when it is not. We still want to validate that the case when force is false works as expected, but this change seems to remove the validation of the previous case. Let me know if I missed something when reading the code!

Author

Yes, that's a good callout. I have separated this into two tests in the new revision, thanks!

Comment on lines +500 to +501
throw new NoSuchNamespaceException(
e, "Namespace %s is not found in Glue", identifier.namespace());
Contributor

The exception handling seems off, can't EntityNotFoundException also be thrown when the table is not found?

Author

This exception handling is meant to catch exceptions from the glue.createTable call only. Any exception during the getTable and updateTable calls should be thrown as expected. Thoughts?
Also, I think EntityNotFoundException won't be thrown for a missing table in the case where createTable has already thrown AlreadyExistsException.

Comment on lines -114 to +121
this.catalogProperties = ImmutableMap.copyOf(properties);
this.catalogProperties = new HashMap<>();
catalogProperties.putAll(properties);
Contributor

Making CatalogProperties non-immutable in order to effectively side-load information into the AWS client factory post-initialization is a dangerous precedent to set. In addition, it doesn't seem to actually accomplish anything besides allowing the GlueCatalog to suddenly switch regions post-initialization, which is likely to introduce some dangerous side effects.
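
For illustration, a sketch of what the immutable copy guards against. CatalogConfigSketch is hypothetical, and it uses java.util.Map.copyOf as a standard-library stand-in for the ImmutableMap.copyOf call in the original line; the point is only that neither the caller nor later code can change the properties after initialization.

```java
import java.util.Map;

/** Hypothetical holder for catalog properties captured at initialization. */
final class CatalogConfigSketch {
  private final Map<String, String> properties;

  CatalogConfigSketch(Map<String, String> properties) {
    // Unmodifiable snapshot: later changes to the caller's map are not visible,
    // and put() on the returned view throws UnsupportedOperationException.
    this.properties = Map.copyOf(properties);
  }

  /** Returns the unmodifiable snapshot taken at construction time. */
  Map<String, String> properties() {
    return properties;
  }
}
```

With a mutable HashMap in its place, any code path holding the catalog could rewrite a region setting after clients have been created, which is the side effect the comment warns about.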


String factoryImpl =
PropertyUtil.propertyAsString(catalogProperties, AwsProperties.CLIENT_FACTORY, null);
if (factoryImpl != null && factoryImpl.equals(AssumeRoleAwsClientFactory.class.getName())) {
Contributor

It hurts extensibility to make logic specific to a particular implementation class. If, for example, a customer needs to extend AssumeRoleAwsClientFactory to add some functionality for their own use, this logic will break as the class name will no longer match.
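
The extensibility concern can be shown with a short sketch. The factory classes here are hypothetical stand-ins, and the type-based check is just one possible alternative for comparison, not something proposed in this review.

```java
/** Compares a class-name check with a type-based check; all names are hypothetical. */
final class FactoryCheckSketch {
  /** Stand-in for the built-in assume-role factory. */
  static class AssumeRoleFactory {}

  /** Stand-in for a customer extension of that factory. */
  static class CustomAssumeRoleFactory extends AssumeRoleFactory {}

  private FactoryCheckSketch() {}

  /** Mirrors the reviewed logic: matches only the exact class name. */
  static boolean matchesByName(String factoryImpl) {
    return AssumeRoleFactory.class.getName().equals(factoryImpl);
  }

  /** Alternative: matches the base class and any subclass. */
  static boolean matchesByType(Class<?> factoryClass) {
    return AssumeRoleFactory.class.isAssignableFrom(factoryClass);
  }
}
```

Here matchesByName returns false for CustomAssumeRoleFactory even though it is an assume-role factory, which is the breakage described above; matchesByType accepts it.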

Comment on lines +465 to +468
String catalogFileIORegion = awsProperties.getGlueCatalogFileIORegion();
if (catalogFileIORegion != null) {
catalogProperties.put(AwsProperties.CLIENT_ASSUME_ROLE_REGION, catalogFileIORegion);
}
Contributor

Can't this logic and the associated parameter be removed and replaced with just setting AwsProperties.CLIENT_ASSUME_ROLE_REGION directly? This change is not in any way scoped to this call or AWS service, it effectively creates a case where what region the GlueCatalog is calling can suddenly change post-initialization after the first time registerTable is called.

public static final boolean GLUE_CATALOG_FORCE_REGISTER_TABLE_DEFAULT = false;

/** Configure the Glue Catalog S3 FileIO Region to allow cross region s3 access */
public static final String GLUE_CATALOG_FILE_IO_REGION = "glue.catalog-file-io-region";
Contributor

We already have client.region, which sets the region for all services. I am going to introduce a change in the near future that will allow setting the region per service for the default AWS client factory (and we can extend it to the assume-role client factory as well), so we probably want a more general per-service naming scheme. Given that glue. and s3. are already established prefixes in existing parameters, it would most likely be something like glue.region, s3.region, kms.region, etc.
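
A sketch of how a per-service scheme along these lines might resolve a region, assuming a service-specific key such as s3.region takes precedence and client.region is the fallback. RegionResolverSketch and the fallback order are assumptions for illustration; only the key names come from the comment above.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical resolver for per-service region properties. */
final class RegionResolverSketch {
  static final String CLIENT_REGION = "client.region";

  private RegionResolverSketch() {}

  /**
   * Looks up "<servicePrefix>.region" first, then falls back to client.region.
   * Returns empty when neither key is set.
   */
  static Optional<String> regionFor(String servicePrefix, Map<String, String> properties) {
    String specific = properties.get(servicePrefix + ".region");
    if (specific != null) {
      return Optional.of(specific);
    }
    return Optional.ofNullable(properties.get(CLIENT_REGION));
  }
}
```

Under this assumption, a catalog configured with client.region and s3.region would send Glue calls to the former and S3 calls to the latter, with the choice fixed at initialization rather than changed during registerTable.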

Comment on lines +471 to +514
TableOperations ops = newTableOps(identifier);
InputFile metadataFile = ops.io().newInputFile(metadataFileLocation);
TableMetadata metadata = TableMetadataParser.read(ops.io(), metadataFile);

Map<String, String> tableParameters =
ImmutableMap.of(
BaseMetastoreTableOperations.TABLE_TYPE_PROP,
BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toLowerCase(Locale.ENGLISH),
BaseMetastoreTableOperations.METADATA_LOCATION_PROP,
metadataFileLocation);

String databaseName =
IcebergToGlueConverter.getDatabaseName(
identifier, awsProperties.glueCatalogSkipNameValidation());
String tableName =
IcebergToGlueConverter.getTableName(
identifier, awsProperties.glueCatalogSkipNameValidation());

TableInput tableInput =
TableInput.builder()
.applyMutation(
builder ->
IcebergToGlueConverter.setTableInputInformation(
builder, metadata, tableParameters))
.name(tableName)
.tableType(GlueTableOperations.GLUE_EXTERNAL_TABLE_TYPE)
.parameters(tableParameters)
.build();

try {
glue.createTable(
CreateTableRequest.builder().databaseName(databaseName).tableInput(tableInput).build());
} catch (software.amazon.awssdk.services.glue.model.AlreadyExistsException e) {
GetTableResponse response =
glue.getTable(
GetTableRequest.builder().databaseName(databaseName).name(tableName).build());
String versionId = response.table().versionId();
glue.updateTable(
UpdateTableRequest.builder()
.databaseName(databaseName)
.tableInput(tableInput)
.versionId(versionId)
.build());
} catch (EntityNotFoundException e) {
Contributor

So this logic is a combination of a fork of the logic in BaseMetastoreCatalog.registerTable and GlueTableOperations.persistGlueTable. As GlueTableOperations also has access to AwsProperties, I would recommend refactoring the logic in GlueTableOperations so that persistGlueTable can conditionally fall back to its update mode as that improves code reuse.

If the concern is the creation of an extra metadata file, it looks like GlueTableOperations already checks whether it needs to during the writeNewMetadataIfRequired function and considering tableMetadata is always set for registerTable, it will never choose to write a new metadata file anyways.

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the project mailing list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 25, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 12, 2024