Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[590] Add Iceberg Glue Catalog Sync implementation #599

Open
wants to merge 6 commits into
base: 590-CatalogSync
Choose a base branch
from

Conversation

kroushan-nit
Copy link
Contributor

@kroushan-nit kroushan-nit commented Dec 12, 2024

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

Add Glue catalog sync client implementation for Iceberg

Brief change log

Verify this pull request

  • Unit tests

@kroushan-nit kroushan-nit changed the base branch from main to 590-CatalogSync December 12, 2024 17:22
@kroushan-nit kroushan-nit changed the title 590 iceberg glue sync Add Iceberg Glue Catalog Sync implementation Dec 12, 2024
@kroushan-nit kroushan-nit marked this pull request as draft December 12, 2024 17:31
xtable-aws/pom.xml Outdated Show resolved Hide resolved
@kroushan-nit kroushan-nit marked this pull request as ready for review December 13, 2024 19:17
@vinishjail97 vinishjail97 force-pushed the 590-CatalogSync branch 3 times, most recently from a723014 to ced6288 Compare December 19, 2024 22:12
@kroushan-nit kroushan-nit changed the title Add Iceberg Glue Catalog Sync implementation [590] Add Iceberg Glue Catalog Sync implementation Dec 21, 2024
@kroushan-nit kroushan-nit force-pushed the 590-IcebergGlueSync branch 2 times, most recently from b4d8cf8 to 5daece8 Compare December 27, 2024 05:07
String catalogConversionSourceImpl;
switch (catalogType) {
case CatalogType.GLUE:
catalogSyncClientImpl = GlueCatalogSyncClient.class.getName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't have these classes on the path once the code is broken out into modules, we'll want to adopt a similar pattern using the service loader like we did for the conversion targets

<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-bundle</artifactId>
<version>1.12.328</version>
<groupId>software.amazon.awssdk</groupId>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should define the glue artifact and version here in the dependencyManagement section as well

import org.apache.xtable.model.catalog.CatalogTableIdentifier;
import org.apache.xtable.model.catalog.HierarchicalTableIdentifier;

public class CatalogUtils {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a @NoArgsConstructor(access = Private) if this is meant to only expose methods and not allow developers to create instances


public class CatalogUtils {

public static HierarchicalTableIdentifier castToHierarchicalTableIdentifier(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a basic unit test for this?

"create",
new Class<?>[] {Map.class},
new Object[] {glueConfig.getClientCredentialConfigs()});
} catch (Exception e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you define more specific exceptions to catch here?

OBJECT_MAPPER.writeValueAsString(properties), GlueCatalogConfig.class);
Map<String, String> clientCredentialProperties =
propertiesWithPrefix(properties, CLIENT_CREDENTIAL_PROVIDER_PREFIX);
glueCatalogConfig.setClientCredentialConfigs(clientCredentialProperties);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to just do this parsing with jackson? Similar to what was done here: f676788#diff-ac70a3d50d9eed96fd75f20baf8289cddec6834b15ad6165c077769c3729760dR287

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will check if this can be parsed directly using jackson

public static GlueCatalogConfig of(Map<String, String> properties) {
try {
GlueCatalogConfig glueCatalogConfig =
OBJECT_MAPPER.readValue(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just use convertValue for this instead of writing out to an intermediate string?

return parameters;
}

private BaseTable loadTableFromFs(String tableBasePath) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work with an external catalog for Iceberg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean by "external catalog for iceberg" in this context?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Production tables for Iceberg are managed with a catalog that handles the transactions, especially when dealing with multiple writer cases. This is why we allow users to specify that configuration when using Iceberg as a source or target for conversion today. In the conversion, we use that catalog to get the state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. There should not be any issues if an external catalog is configured as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been tested? I did not think this would work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source: Iceberg table with glue catalog, target: Iceberg HMS

Is this a right scenario to test?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, specifically an iceberg table managed with the glue catalog https://iceberg.apache.org/docs/1.6.0/aws/?h=glue#glue-catalog

@kroushan-nit kroushan-nit force-pushed the 590-IcebergGlueSync branch 2 times, most recently from 3d5605e to 4c816ea Compare December 31, 2024 10:37

package org.apache.xtable.catalog;

public class Constants {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar nitpick here, use NoArgsConstructor with private access


public class TableFormatUtils {

public static String getTableDataLocation(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am worried that this is very Iceberg specific and will also limit our extensibility with new formats. Since this will throw an exception if a new format is detected, the user cannot plugin in a new implementation for a new format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of throwing exception, I can default to tableLocation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is fine for now, wondering if there is a better place for this logic though so if someone adds an implementation that they can override this logic to fit their needs without needing to make core changes to XTable

return dataLocation;
}

// Get table format name from table properties
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use proper javadocs here please.

Also will all catalogs be required to set this property?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Btw, this property should be set automatically by the engine (which created the table) in most of the cases

Does this javadoc make sense -

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why will the spark type be set? There is no guarantee spark created the table.
What should the code do if tableFormat is still null at the end of the method.

Copy link
Contributor Author

@kroushan-nit kroushan-nit Dec 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@the-other-tim-brown the-other-tim-brown Dec 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make more sense to move that logic into the method instead of making each caller throw that exception?

If you think it makes more sense to force the caller to handle it, let's consider using Optional so it is clear the value may not be set and the caller must handle it appropriately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have moved the null check inside the method

@@ -55,15 +55,15 @@
</property>
<property>
<name>fs.s3.aws.credentials.provider</name>
<value>com.amazonaws.auth.DefaultAWSCredentialsProviderChain</value>
<value>software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider</value>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be useful to pull the aws and hadoop upgrade related code into a smaller PR that can be merged quickly for anyone that may be running into the compatibility issues you saw with hadoop and aws sdk

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR for aws sdk and hadoop version upgrade: #614

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants