[590] Add Iceberg Glue Catalog Sync implementation #599
base: 590-CatalogSync
Resolved review threads:
xtable-core/src/main/java/org/apache/xtable/reflection/ReflectionUtils.java
xtable-aws/src/main/java/org/apache/xtable/glue/GlueCatalogSyncClient.java
xtable-aws/src/main/java/org/apache/xtable/glue/GlueSchemaExtractor.java
xtable-aws/src/main/java/org/apache/xtable/glue/IcebergGlueCatalogSyncClient.java
xtable-core/src/main/java/org/apache/xtable/exception/CatalogSyncException.java
xtable-core/src/main/java/org/apache/xtable/catalog/glue/GlueCatalogConfig.java
xtable-core/src/main/java/org/apache/xtable/catalog/glue/GlueCatalogConversionSource.java
xtable-core/src/main/java/org/apache/xtable/catalog/glue/GlueCatalogSyncClient.java
xtable-core/src/main/java/org/apache/xtable/catalog/glue/IcebergGlueCatalogSyncOperations.java
xtable-core/src/main/java/org/apache/xtable/catalog/TableFormatUtils.java
String catalogConversionSourceImpl;
switch (catalogType) {
  case CatalogType.GLUE:
    catalogSyncClientImpl = GlueCatalogSyncClient.class.getName();
We won't have these classes on the classpath once the code is broken out into modules. We'll want to adopt a similar pattern using the service loader, like we did for the conversion targets.
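A minimal sketch of what a service-loader lookup could look like, using only `java.util.ServiceLoader` from the JDK. The `CatalogSyncClientProvider` interface and `CatalogSyncClientLookup` class are hypothetical names for illustration and are not part of this PR; each catalog module would register its implementation in a `META-INF/services` file.

```java
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical SPI: each catalog module ships one implementation and lists it in
// META-INF/services/<fully qualified name of CatalogSyncClientProvider>.
interface CatalogSyncClientProvider {
  /** Catalog type this provider handles, for example "GLUE". */
  String catalogType();

  /** Fully qualified class name of the sync client to instantiate. */
  String syncClientClassName();
}

final class CatalogSyncClientLookup {
  private CatalogSyncClientLookup() {}

  /** Returns the first registered provider for the catalog type, if any module supplies one. */
  static Optional<CatalogSyncClientProvider> find(String catalogType) {
    for (CatalogSyncClientProvider provider :
        ServiceLoader.load(CatalogSyncClientProvider.class)) {
      if (provider.catalogType().equalsIgnoreCase(catalogType)) {
        return Optional.of(provider);
      }
    }
    // No module on the classpath registered a provider for this catalog type.
    return Optional.empty();
  }
}
```

With this shape, the core module never references `GlueCatalogSyncClient` directly, so the `switch` on catalog type is no longer needed there.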
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-bundle</artifactId>
<version>1.12.328</version>
<groupId>software.amazon.awssdk</groupId>
We should define the glue artifact and version here in the dependencyManagement section as well.
import org.apache.xtable.model.catalog.CatalogTableIdentifier;
import org.apache.xtable.model.catalog.HierarchicalTableIdentifier;

public class CatalogUtils {
Add a `@NoArgsConstructor(access = AccessLevel.PRIVATE)` if this is meant to only expose methods and not allow developers to create instances.
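For reference, a sketch of the hand-written equivalent of that Lombok annotation, so the example compiles without Lombok. `CatalogUtilsSketch` and its `normalize` helper are hypothetical stand-ins, not the class from the PR.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Modifier;
import java.util.Locale;

// Hand-written equivalent of Lombok's @NoArgsConstructor(access = AccessLevel.PRIVATE).
final class CatalogUtilsSketch {
  private CatalogUtilsSketch() {
    // Not instantiable: the class only exposes static helpers.
  }

  // Hypothetical helper standing in for the real static methods.
  static String normalize(String name) {
    return name.trim().toLowerCase(Locale.ROOT);
  }

  /** True if any declared constructor is non-private, i.e. other classes could instantiate it. */
  static boolean isInstantiableFromOutside() {
    for (Constructor<?> ctor : CatalogUtilsSketch.class.getDeclaredConstructors()) {
      if (!Modifier.isPrivate(ctor.getModifiers())) {
        return true;
      }
    }
    return false;
  }
}
```

Callers use `CatalogUtilsSketch.normalize(...)` directly; `new CatalogUtilsSketch()` from another class fails to compile.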
public class CatalogUtils {

  public static HierarchicalTableIdentifier castToHierarchicalTableIdentifier(
Let's add a basic unit test for this?
    "create",
    new Class<?>[] {Map.class},
    new Object[] {glueConfig.getClientCredentialConfigs()});
} catch (Exception e) {
Can you define more specific exceptions to catch here?
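As an illustration of what narrower catches could look like, here is a sketch that assumes the factory is invoked through plain `java.lang.reflect` calls. The `ReflectiveFactory` class is hypothetical; the helper actually used in the PR may surface different exception types.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.Map;

final class ReflectiveFactory {
  private ReflectiveFactory() {}

  /** Invokes a public static create(Map) factory on the named class, one catch per failure mode. */
  static Object create(String className, Map<String, String> configs) {
    try {
      Method factory = Class.forName(className).getMethod("create", Map.class);
      return factory.invoke(null, configs);
    } catch (ClassNotFoundException e) {
      throw new IllegalArgumentException("Class not found on classpath: " + className, e);
    } catch (NoSuchMethodException e) {
      throw new IllegalArgumentException(className + " has no public create(Map) method", e);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException("create(Map) is not accessible on " + className, e);
    } catch (InvocationTargetException e) {
      // The factory itself threw; report its cause rather than the reflection wrapper.
      throw new IllegalStateException("create(Map) failed for " + className, e.getCause());
    }
  }
}
```

Splitting the catches this way lets the error message say whether the class is missing, the method signature is wrong, or the provider failed during construction.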
    OBJECT_MAPPER.writeValueAsString(properties), GlueCatalogConfig.class);
Map<String, String> clientCredentialProperties =
    propertiesWithPrefix(properties, CLIENT_CREDENTIAL_PROVIDER_PREFIX);
glueCatalogConfig.setClientCredentialConfigs(clientCredentialProperties);
Is there a way to just do this parsing with jackson? Similar to what was done here: f676788#diff-ac70a3d50d9eed96fd75f20baf8289cddec6834b15ad6165c077769c3729760dR287
Will check if this can be parsed directly using jackson
public static GlueCatalogConfig of(Map<String, String> properties) {
  try {
    GlueCatalogConfig glueCatalogConfig =
        OBJECT_MAPPER.readValue(
Can you just use `convertValue` for this instead of writing out to an intermediate string?
  return parameters;
}

private BaseTable loadTableFromFs(String tableBasePath) {
Will this work with an external catalog for Iceberg?
What do you mean by "external catalog for Iceberg" in this context?
Production tables for Iceberg are managed with a catalog that handles the transactions, especially when dealing with multiple writer cases. This is why we allow users to specify that configuration when using Iceberg as a source or target for conversion today. In the conversion, we use that catalog to get the state.
Got it. There should not be any issues if an external catalog is configured as well
Has this been tested? I did not think this would work
Source: Iceberg table with Glue catalog; target: Iceberg HMS.
Is this the right scenario to test?
Yes, specifically an Iceberg table managed with the Glue catalog: https://iceberg.apache.org/docs/1.6.0/aws/?h=glue#glue-catalog
package org.apache.xtable.catalog;

public class Constants {
Similar nitpick here, use NoArgsConstructor with private access
public class TableFormatUtils {

  public static String getTableDataLocation(
I am worried that this is very Iceberg-specific and will also limit our extensibility with new formats. Since this will throw an exception if a new format is detected, the user cannot plug in a new implementation for a new format.
Instead of throwing exception, I can default to tableLocation?
I think that is fine for now. I am wondering if there is a better place for this logic though, so that someone who adds an implementation can override it to fit their needs without making core changes to XTable.
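A sketch of the fallback discussed in this thread, under two assumptions: the method takes the format name and table location as strings (the real signature is not visible in the snippet above), and Iceberg data files sit under `<tableLocation>/data`, which is Iceberg's default layout. `DataLocationSketch` is a hypothetical name.

```java
import java.util.Locale;

final class DataLocationSketch {
  private DataLocationSketch() {}

  /**
   * Resolves the directory holding a table's data files. Unknown formats fall back to the
   * table location instead of throwing, so adding a format does not require a core change.
   */
  static String getTableDataLocation(String tableFormat, String tableLocation) {
    String format = tableFormat == null ? "" : tableFormat.toUpperCase(Locale.ROOT);
    if ("ICEBERG".equals(format)) {
      // Assumption: data files live under "<tableLocation>/data", Iceberg's default layout.
      return stripTrailingSlash(tableLocation) + "/data";
    }
    return tableLocation;
  }

  private static String stripTrailingSlash(String path) {
    return path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
  }
}
```

To make the behaviour overridable rather than only defaulted, the same logic could live on a per-format interface that implementers supply, which would keep format-specific rules out of the shared utility.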
  return dataLocation;
}

// Get table format name from table properties
Use proper javadocs here please.
Also will all catalogs be required to set this property?
Why will the Spark type be set? There is no guarantee Spark created the table.
What should the code do if tableFormat is still null at the end of the method?
- As per the documentation (https://docs.aws.amazon.com/athena/latest/ug/delta-lake-tables-syncing-metadata.html), the Athena and AWS Glue consoles require the `spark.sql.sources.provider` property to be set when creating a Delta Lake table using those methods.
- We will throw an error if we are unable to get tableFormat, as I believe this is needed to create a ConversionSource from a SourceTable: https://github.com/apache/incubator-xtable/pull/599/files#diff-90350a03dbbf828930b2ad5b2629bb8c9da7aaa24b7f984600e03e807a185f7bR79
Does it make more sense to move that logic into the method instead of making each caller throw that exception?
If you think it makes more sense to force the caller to handle it, let's consider using Optional so it is clear the value may not be set and the caller must handle it appropriately.
Have moved the null check inside the method
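The two options raised in this thread could be sketched as below. This is an illustration only: the `TableFormatLookup` class, the `table_type` key for Iceberg tables, and the `IllegalStateException` are assumptions, and only `spark.sql.sources.provider` is named in the discussion above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class TableFormatLookup {
  // Assumption: Iceberg tables record their format under "table_type".
  static final String TABLE_TYPE_PROP = "table_type";
  // Property named in the thread above, set for Delta Lake tables.
  static final String SPARK_PROVIDER_PROP = "spark.sql.sources.provider";

  private TableFormatLookup() {}

  /** Option A: make absence explicit so every caller has to handle it. */
  static Optional<String> findTableFormat(Map<String, String> properties) {
    String format = properties.get(TABLE_TYPE_PROP);
    if (format == null || format.isEmpty()) {
      format = properties.get(SPARK_PROVIDER_PROP);
    }
    return Optional.ofNullable(format)
        .filter(f -> !f.isEmpty())
        .map(f -> f.toUpperCase(Locale.ROOT));
  }

  /** Option B: fail in one place so callers never receive a null format. */
  static String getTableFormat(Map<String, String> properties) {
    return findTableFormat(properties)
        .orElseThrow(
            () -> new IllegalStateException("Table format not found in table properties"));
  }
}
```

Option B matches the resolution reached here (the check lives inside the method), while Option A remains available to callers that want to treat a missing format as a normal case.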
@@ -55,15 +55,15 @@
</property>
<property>
<name>fs.s3.aws.credentials.provider</name>
<value>com.amazonaws.auth.DefaultAWSCredentialsProviderChain</value>
<value>software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider</value>
It could be useful to pull the aws and hadoop upgrade related code into a smaller PR that can be merged quickly for anyone that may be running into the compatibility issues you saw with hadoop and aws sdk
PR for aws sdk and hadoop version upgrade: #614
What is the purpose of the pull request
Add Glue catalog sync client implementation for Iceberg