Conversation

Contributor

@jackye1995 jackye1995 commented Nov 28, 2020

As we get initial feedback on AWS module usage, we notice that many users would like the ability to use their own customized mechanism to load AWS credentials and regions, because resources like S3 buckets and Glue catalogs are commonly centralized in a single AWS account. In particular, multiple users have asked for support of the STS assume-role credential provider. So this PR adds support for loading a custom AwsClientConfigFactory to satisfy these needs, and uses the assume-role feature as an example, so that people can use AssumeRoleConfigFactory to assume a role when accessing AWS resources in cross-account, cross-region cases.

In addition, we notice that, similar to the work done in the FileIO interface, introducing these features requires all the related factory interfaces to have an initialize(Map catalogProperties) method so that the catalog properties can be passed down the chain. Therefore, we also introduce the CatalogConfigurable interface in iceberg-api to keep the whole thing organized.
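
For reference, here is a minimal sketch of what such an interface could look like, based on the description above (the package and javadoc are assumptions, not the final API):

package org.apache.iceberg.catalog;

import java.util.Map;

/**
 * Sketch only: a component that can be dynamically loaded with a no-arg constructor
 * and then configured with the catalog properties map.
 */
public interface CatalogConfigurable {

  void initialize(Map<String, String> catalogProperties);
}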

DynConstructors.Ctor<T> ctor;
try {
ctor = DynConstructors.builder(FileIO.class).impl(impl).buildChecked();
ctor = DynConstructors.builder(resultClass).impl(impl).buildChecked();


Looks like the class in DynConstructors.builder(Class) is only used during hiddenImpl. Can probably just use DynConstructors.builder() unless you want to keep that open for later?


@johnclara johnclara Nov 29, 2020


I'm also wondering if we should just use a ServiceLoader instead of using the dyn constructors stuff? DynConstructors seems to be more heavy duty/less safe. Like: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L656


@johnclara johnclara Nov 29, 2020


it's also a little ridiculous having to serialize all the config for FileIO and others into Map<String, String> so that it can be loaded from properties if it's already serializable.

Is there some way Spark could support a Map<String, Serializable> which can get passed through to DataSourceV2? Is that already getting discussed somewhere?

In the short term, can we make builders loadable, which then build the object?

That way, whether something is valid config or null is up to the object/builder during initialization instead of after initialization.

It also allows final configuration variables instead of this weird, private null for the first couple of seconds of the object's lifetime.

What I've been working on:


/**
 * Should have an empty constructor.
 *
 * Necessary for getting past the Hadoop Configuration / Spark Options / Iceberg Options configuration barriers.
 */
public interface LoadableBuilder<T extends LoadableBuilder<T, K>, K> {
    /**
     * Method name for our custom {@link Conf}
     *
     * Should load config from conf
     * @param conf
     */
    T load(Conf conf);

    K build();
}

public interface Dumpable {
    /**
     * Method name for our custom {@link Conf}
     *
     * Should add current config to conf
     * @param conf
     */
    void dump(Conf conf);
}

Contributor


> I'm also wondering if we should just use a ServiceLoader instead of using the dyn constructors stuff?

I don't think this would be a good choice. ServiceLoader will scan the entire classpath for matching classes, which takes a really long time in large applications for initialization. It also introduces runtime issues because service files need to be merged when creating shaded Jars and it is easy to overlook.

Since we already know the class to load from configuration that is passed to the catalog, I don't think ServiceLoader is a good option compared to using reflection to load a class. And the Dyn* classes just help make that easier to read and maintain.
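
For illustration, a hedged sketch of that reflection-based loading, mirroring the DynConstructors pattern in the diff above (the helper method and error message are hypothetical):

import org.apache.iceberg.common.DynConstructors;

class DynLoadingExample {

  // impl is the class name taken from catalog configuration, so no classpath scanning is needed
  static <T> T loadImpl(Class<T> resultClass, String impl) {
    DynConstructors.Ctor<T> ctor;
    try {
      ctor = DynConstructors.builder(resultClass).impl(impl).buildChecked();
    } catch (NoSuchMethodException e) {
      throw new IllegalArgumentException(
          "Cannot initialize " + resultClass.getSimpleName() + ", missing no-arg constructor: " + impl, e);
    }

    return ctor.newInstance();
  }
}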

Contributor


> it's also a little ridiculous having to serialize all the config for FileIO and others into Map<String, String> so that it can be loaded from properties if it's already serializable.

I don't think I'm following what you are suggesting. The configuration passed in comes from a string map, like a Spark config or Hadoop config. Serialization is a concern after catalogs are configured because the components that were created on the driver need to be serialized to be sent for use on executors. Those are separate enough that I'm not sure how to improve on this.

Contributor Author


> it's also a little ridiculous having to serialize all the config for FileIO and others into Map<String, String> so that it can be loaded from properties if it's already serializable.

Yes, as Ryan said, I don't think it serializes all the config. The Map<String, String> properties is passed in at Spark initialization time and only initialized once. For the AWS module, AwsProperties is serialized, but it is a much smaller subset of all the configurations.


@johnclara johnclara Dec 1, 2020


You're right, serialization is unrelated

Contributor Author


I don't think the user needs to reimplement the AWS builders for every service, see the example I provided in the conversation below. In the v2 AWS client, the following aspects can be overridden:

  1. credential provider
  2. region
  3. endpoint (url + port)
  4. http client (for things like proxy)
  5. client configuration
  6. service configuration

We can either provide a way to dynamically configure each type of AWS client, or dynamically configure each aspect of the clients. From my current observation it is more common for users to want the same configuration across all clients, but maybe with some exceptions for specific clients. So it seems to make more sense to go with this proposed approach.

For your given example, yes, information like the port needs to be passed in, but even in a normal application it has to come from some configuration file. In this method you can also read from your own configuration system instead of the catalog properties passed in.

You can read the example I gave below for more information, and there is no need for the user to initialize the client.


@jackye1995 for sure. It's awesome that it's so configurable. I've been pulling apart our codebase to try to strip out s3a (which is way harder to configure on the version I'm on than this) and I'm excited for when I can test this out!

The line "it's also a little ridiculous having to serialize all the config" was worded poorly and was meant to be sympathetic about all the config mapping you're having to add, not the implementation. (It was written after hours of trying to translate between my own internal configuration.)

@rdblue rdblue requested a review from danielcweeks December 1, 2020 01:20

/**
* Interface for loading a custom factory to configure AWS credentials for different services.
* See {@link AssumeRoleCredentialsFactory} as an example.
Contributor


Can you give a quick example of how this is used and how it simplifies configuration?

Contributor Author


The AssumeRoleCredentialsFactory is one example; here is another example with a more complicated customer use case:

package com.my.team;

import software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider;
import software.amazon.awssdk.awscore.client.builder.AwsClientBuilder;
import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.GlueClientBuilder;
import software.amazon.awssdk.services.s3.S3ClientBuilder;
import software.amazon.awssdk.services.sts.auth.StsAssumeRoleCredentialsProvider;
import software.amazon.awssdk.services.sts.auth.StsAssumeRoleWithSamlCredentialsProvider;

public class CustomCredentialsFactory implements AwsClientCredentialsFactory {


  @Override
  public void configure(AwsClientBuilder clientBuilder) {
    if (clientBuilder instanceof S3ClientBuilder) {
      clientBuilder.credentialsProvider(StsAssumeRoleWithSamlCredentialsProvider.builder()
          .refreshRequest(...)
          .build());
    } else if (clientBuilder instanceof GlueClientBuilder) {
      clientBuilder.credentialsProvider(StsAssumeRoleCredentialsProvider.builder()
          .refreshRequest(...)
          .build());
    } else {
      clientBuilder.credentialsProvider(ContainerCredentialsProvider.builder().build());
    }
  }
}

And it can be started with:

spark-sql --jars iceberg-spark3-runtime.jar \
    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket \
    --conf spark.sql.catalog.my_catalog.client.credentials-factory=com.my.team.CustomCredentialsFactory

Contributor

@danielcweeks danielcweeks Dec 1, 2020


So, I'm just catching up on all of the discussion and may not be following everything just yet, but I feel like we should try to avoid most S3Client configuration in Iceberg. Part of the reason the S3FileIO takes a supplier for an S3Client is that we can externalize all of the configuration.

The places where we have exposed configuration (part size, acl, etc.) are request-level options. Since credentials are at the client level, we should be able to have just an S3ClientFactory that supplies all of the custom client configuration as opposed to exposing features (like credentials) directly.

I feel like that would really reduce the complexity in Iceberg and at the same time expose all of the possible configuration for the S3Client that anyone may want to leverage.

Contributor Author


This PR mostly comes from the following use cases:

  1. use a customized credential provider across all the AWS clients
  2. use some kind of proxy configuration for HTTP client to bypass network boundary (not included in this PR, plan to put in another one)

I completely agree that we should expose a limited number of configurations for S3FileIO, but these use cases are not specific to S3; they apply to AWS clients in general.

And maybe I missed something, but based on my understanding, for people using open source Iceberg directly, it is not straightforward to use the supplier, unless they do something like

public class MyS3ClientFileIO extends S3FileIO {
  public MyS3ClientFileIO() {
    super(new MyS3ClientSupplier());
  }
}

And load MyS3ClientFileIO instead of S3FileIO as the FileIO implementation, which is (1) not very convenient, and (2) forces people with the requirements I listed to override basically all the classes that touch AWS clients.

Contributor


@jackye1995 I'm not proposing that users would subclass the S3FileIO, but rather that we take your approach of dynamically loading and just up-level it to the whole S3Client supplier as opposed to just parts of the client configuration. This would mean configuring with something like --conf spark.sql.catalog.my_catalog.client.s3.supplier=com.my.team.S3ClientSupplier. We could then just change the S3FileIO to check for a configured supplier before using the default client.

This allows anyone to supply a fully configured client based on their own requirements and we wouldn't need to expose the specific configurations through the Iceberg APIs.
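
As a rough illustration of that idea, a supplier implementation might look like the following; the class name matches the config example above, while the Supplier/Serializable shape and the specific client settings are assumptions:

package com.my.team;

import java.io.Serializable;
import java.util.function.Supplier;
import software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class S3ClientSupplier implements Supplier<S3Client>, Serializable {

  @Override
  public S3Client get() {
    // the client is fully configured here, entirely outside of Iceberg
    return S3Client.builder()
        .region(Region.US_WEST_2)
        .credentialsProvider(ContainerCredentialsProvider.builder().build())
        .build();
  }
}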

Contributor Author


I see. Yes, I have been personally debating this, and that's why I put up this PR for feedback. We can either, as you suggest, provide integration points to load each AWS client, or, as in my current PR, provide integration points to configure important aspects like credential providers; there are pros and cons to both.

Currently what I see is that we will at least have the following clients in the AWS module:

  • Glue
  • DynamoDB
  • S3
  • KMS
  • (planned) CloudWatch
  • and might be more

And in my use cases listed, it is much cleaner to just expose a holistic factory for credentials rather than to load every single AWS client dynamically.

So what about exposing a single factory named AwsClientConfigurationFactory, for people to configure any client configuration? It will look like:

public interface AwsClientConfigurationFactory extends CatalogConfigurable, Serializable {

    <T extends AwsClientBuilder & AwsSyncClientBuilder> T configure(T clientBuilder);
}

public class MyClientConfigurationFactory implements AwsClientConfigurationFactory {

    @Override
    public <T extends AwsClientBuilder & AwsSyncClientBuilder> T configure(T clientBuilder) {
        if (clientBuilder instanceof S3ClientBuilder) {
            S3ClientBuilder s3ClientBuilder = (S3ClientBuilder) clientBuilder;
            // do whatever you want with s3
        }
        // do common stuff across all client builders
        clientBuilder.credentialsProvider(...);
        return clientBuilder;
    }
}

By doing so, there is only one integration point for all AWS clients, and it can satisfy both use cases: configuring a single client, and configuring common properties across all clients.

Contributor Author


@danielcweeks see updated PR for more details.

}

public static Integer propertyAsInt(Map<String, String> properties, String property) {
String value = properties.get(property);


This can throw an NPE if properties is null.

Contributor Author


All methods in this util assume properties is not null.


@Override
public void initialize(Map<String, String> properties) {
roleArn = properties.get(AwsProperties.CLIENT_ASSUME_ROLE_ARN);


properties can be null at this point.

Contributor Author


initialize is called from engines like Spark, which pass user configurations that are guaranteed to be non-null. That is why I do not check for null here.

@Override
public void configure(AwsClientBuilder clientBuilder) {
// STS will use default credential in the environment as the root credential to assume the role
if (!(clientBuilder instanceof StsClientBuilder)) {


You need to check whether the caller has called initialize before calling configure.

Contributor Author


initialize is called from engines like Spark, which pass user configurations, and it is guaranteed to be called as the first method during dynamic loading.

public void initialize(Map<String, String> properties) {
String regionStr = properties.get(AwsProperties.CLIENT_ASSUME_ROLE_REGION);
Preconditions.checkNotNull(regionStr, "Cannot initialize AssumeRoleSingleRegionFactory with null region");
region = Region.of(regionStr);


regionStr can be an incorrect value. We need to check that the region is valid.

Contributor Author


There are special regions that cannot be validated; that is why we only check for null.

@jackye1995 jackye1995 changed the title AWS: support custom credentials, region and assume role functionality AWS: support custom client configuration Dec 2, 2020
Contributor

@danielcweeks danielcweeks left a comment


@jackye1995 A few things I feel like we should take a second look at. It might actually make sense to break out the STS Assume role implementation into a separate, follow-on PR (though I'd like to hear thoughts from others on that, @johnclara?). I'm just thinking it might simplify reviewing the approach for dynamic loading so we can get that in.

* The same catalog properties map is passed in to {@link CatalogConfigurable#initialize(Map catalogProperties)}
* to cascadingly complete the full catalog loading process.
*/
public interface CatalogConfigurable {
Contributor


As it seems that we're standardizing on Map<String, String> for properties in a number of areas (catalog, table properties, etc.), it might make sense to just make this Configurable and promote it outside of Catalog. I'm just not certain we add a lot of value by making this catalog-specific.

Contributor Author


The reason I named it CatalogConfigurable is that Hadoop has its own configurable interface called Configurable, and reusing the name would cause class name duplication. And I put it in the catalog package to avoid confusing these properties with table properties. What do you think?

/**
* Example of {@link AwsClientConfigFactory} for the assume role use case.
*/
public class AssumeRoleClientConfigFactory implements AwsClientConfigFactory {
Contributor


I'm somewhat mixed on whether we want to include this in Iceberg. At a high level, we want to leave as much AWS client configuration outside of Iceberg as possible (this allows for more flexibility without adding configuration/complexity to the Iceberg project). At the same time, I see we might want to include functionality in cases where it's so commonly used that everyone would end up implementing the same thing.

At Netflix, our situation is more complicated than this would allow for, so we would likely have a custom implementation regardless. In the more general case, I would assume that instance credentials and the default credential provider chain would be sufficient. Should we just leave this out?

Contributor Author


Yes, agreed; internally we also have our own credential provider implementation, but I think this will be a very common use case, because it is very likely for organizations to have a centralized Glue catalog / S3 bucket and run compute in other AWS accounts. Plus, it also serves as an example implementation.

Contributor Author


And per the discussion in #1887, we will mark all AWS dependencies as provided, which means there is no additional jar size increase from adding this example.

* such as credentials provider, region, endpoint, http client, etc.
* See {@link AssumeRoleClientConfigFactory} as an example.
*/
public interface AwsClientConfigFactory extends CatalogConfigurable, Serializable {
Contributor


While I generally agree with dynamically loading/configuring AWS clients, something about this implementation feels a little awkward. For example, this interface technically isn't a "Factory" per se (that's mostly just a naming issue, and I still think the general approach is sound; maybe AwsClientConfigurer). It configures a client, but doesn't actually create anything new. Also, the actual configuration is applied via AwsProperties::configure, which seems a little strange.

I'd suggest making a small change to move the dynamic loading into the AwsClientUtil (just keep the class string reference in properties) and remove the configure from AwsProperties (keeping logic out of the properties class). AwsClientConfigurer would initialize with AwsProperties or take it as a parameter so that configure can be called directly (not through the properties class).

Contributor


Another thought (and I'm starting to think this may just make more sense) is to merge the idea of the AwsClientConfigFactory and AwsClientUtil and just make a single dynamically loadable class that would be a factory for creating the clients. All the AwsClientUtil does at this point is instantiate the classes, and all the AwsClientConfigFactory does is configure them, so it would be pretty sensible to just make one class that does both as an AwsClientFactory.

Then you just have

AwsClientFactory::<service>Client() with a DefaultAwsClientFactory that is loaded if a custom implementation is not supplied.
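
To make the shape of that idea concrete, a hedged sketch of such a factory interface (the method names and the exact set of clients are assumptions based on this thread); a DefaultAwsClientFactory implementing it would be loaded whenever no custom class is configured:

import java.io.Serializable;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.kms.KmsClient;
import software.amazon.awssdk.services.s3.S3Client;

public interface AwsClientFactory extends Serializable {

  // one accessor per service client used by the AWS module
  S3Client s3();

  GlueClient glue();

  KmsClient kms();

  DynamoDbClient dynamo();

  // called with the catalog properties right after the factory is dynamically loaded
  void initialize(Map<String, String> properties);
}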

Contributor Author


Yes, I definitely agree that factory is not a good name; I put it in AwsProperties::configure just to satisfy the interface of client.applyMutation(...), but I also feel it is a bit awkward. DefaultAwsClientFactory sounds like a good idea, let me do that.

Contributor Author


However, I don't want people implementing the class to have to keep updating it when we add new AWS clients; let me think about that for a moment.

Contributor Author


@danielcweeks So after testing a few different approaches, I think keeping an AwsClientConfigurer and a single AwsClientFactory is the most flexible approach. Users can configure all clients using AwsClientConfigurer, and we can introduce new clients in AwsClientFactory without users changing their configurer implementation. Please let me know what you think, thank you!

Contributor


@jackye1995 I like this better, but I feel like I'm not following what value the AwsClientConfigurer adds at this point. It seems that the AwsClientConfigurer interface is just a duplication of setDefaultsAndConfigure in AwsClientFactory (they have the same signature). Why wouldn't we just drop the AwsClientConfigurer and dynamically load the factory? Then if someone wants to extend the configuration, they just extend AwsClientFactory and override the configure method. It seems like we would still be able to add new clients, and the configure method would still apply to any of the standard client builder options.

The added value this provides is that since the factory is responsible for instantiation, we can now override that behavior (e.g. cache/reuse/proxy clients or force new instances). That may not be necessary for the Glue catalog, as I assume there would only be one instance, but for S3FileIO it may be helpful.

Contributor Author


Yes, I thought about having just a factory, but here are my reasons for the additional AwsClientConfigurer delegation; please let me know whether you think they are important:

  1. the interface of the factory would continue to evolve as more clients are added. I think for a public interface used for dynamic loading, it is preferred to keep it stable and unchanged for most of the time, so there won't be incompatibility errors when the package versions do not match.
  2. I see there are 3 layers of configuration, (1) defaults to initialize a bare-minimum client (http client impl, etc.) (2) user-specific defaults for all their clients (credentials, etc.), (3) user-specific client configurations (s3 client retry, etc.), and the intention of setDefaultsAndConfigure is to configure (1) and then leave (2) and (3) for customers to configure using the configurer.
  3. If a user can directly load a factory, there is nothing to stop the user from overriding the specific get client methods and skip config layer 1. This also defeats the purpose of using a general configure method.

Contributor Author


@danielcweeks any thoughts on this?

Contributor

@danielcweeks danielcweeks Dec 10, 2020


Sorry, this just dropped off my radar.

I don't actually disagree in concept with any of the items above, but I would propose handling it differently.

Make AwsClientFactory an interface, which would just expose the client calls (e.g. factory.s3(), factory.glue()). Create a DefaultAwsClientFactory that has the common configuration in setDefaultsAndConfigure. Create an StsAwsClientFactory that overrides setDefaultsAndConfigure for the common functionality.

Now I can see your point in item 3 that if someone were to override the client creation, they would have to call setDefaultsAndConfigure directly, but I don't feel like preventing users from overriding client creation is a good tradeoff for ensuring that clients are consistently configured.

We could separate the instantiation and access of the client in the factory so that you can enforce the configure path, but I feel like that's just overcomplicating the situation. Doing the above just gets us back to a more traditional interface-and-implementation design that I feel will make more sense for anyone looking to provide a custom implementation.
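
A sketch of the hierarchy being proposed, with the hook method named above; the generic signature follows the raw builder bound used elsewhere in this thread, and the assume-role details (session name, bare StsClient) are placeholders:

import software.amazon.awssdk.awscore.client.builder.AwsClientBuilder;
import software.amazon.awssdk.awscore.client.builder.AwsSyncClientBuilder;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.sts.StsClient;
import software.amazon.awssdk.services.sts.auth.StsAssumeRoleCredentialsProvider;
import software.amazon.awssdk.services.sts.model.AssumeRoleRequest;

@SuppressWarnings({"rawtypes", "unchecked"})
class DefaultAwsClientFactory {

  public S3Client s3() {
    return setDefaultsAndConfigure(S3Client.builder()).build();
  }

  // common defaults (http client implementation, etc.) would be applied here
  protected <T extends AwsClientBuilder & AwsSyncClientBuilder> T setDefaultsAndConfigure(T clientBuilder) {
    return clientBuilder;
  }
}

@SuppressWarnings({"rawtypes", "unchecked"})
class StsAwsClientFactory extends DefaultAwsClientFactory {

  private String roleArn;  // assumed to be read from catalog properties in initialize()

  @Override
  protected <T extends AwsClientBuilder & AwsSyncClientBuilder> T setDefaultsAndConfigure(T clientBuilder) {
    // apply the shared defaults first, then layer the assume-role credentials on top
    super.setDefaultsAndConfigure(clientBuilder);
    clientBuilder.credentialsProvider(StsAssumeRoleCredentialsProvider.builder()
        .stsClient(StsClient.builder().build())
        .refreshRequest(AssumeRoleRequest.builder()
            .roleArn(roleArn)
            .roleSessionName("iceberg")
            .build())
        .build());
    return clientBuilder;
  }
}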

Contributor Author


Thanks for the feedback, updated, please let me know if this looks good to you.


johnclara commented Dec 9, 2020

> though I'd like to hear thoughts from others on that, @johnclara?

I'm still catching up on all the different configuration locations/barriers. Clearly from my above comments, I'm confused about how it all works together.

Ryan's email helped me a lot: https://lists.apache.org/thread.html/r1c357c7867234fed58b12f0cba6e9d2c13ed1af94b8d922ffe3a3def%40%3Cdev.iceberg.apache.org%3E

For this PR it sounds like it's still in flux but my understanding is (please correct the things I have wrong):

For spark as an example the config barriers are:

  • User passes:
    • command line args + hadoop config files on disk
  • Spark will push down:
    • DataSourceV2 options + hadoop config
  • Spark source will initialize the catalog object (catalog client?) with:
    • Map<String, String> argument (called properties?) + hadoop config

Inside the catalog the config sources seem like:

  • Map<String, String> arguments: system/compute constraints (like running on prem vs on an ec2 instance)
  • HadoopConfig: for setting up HadoopFileSystem
  • CatalogProperties: I'm not sure what these are? If it's hive or glue would it be a Map<String, String> per table entry or would it be for the catalog as a whole? In ryan's email I think it says that it should just give the location and that it is an iceberg table (but what if it should also say if it's an iceberg table backed by aws vs gcp and how to read it?) [edit: I'm wrong. I don't understand what he's saying fully here: Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it’s abstract) but it is available so they also don’t fail trying to load the class. Table-specific config should not be stored in table or serde properties.] These are the same as the arguments
  • TableProperties: table-specific information

It sounds like for tables which use S3FileIO:

  • HadoopConfig should be ignored since we're not using HadoopFileIO.
  • Map<String, String> arguments should give enough info to reach the catalog.
  • Then catalog properties should be mixed with some arguments which have constraints on where the compute is running (eg proxy for on prem vs an ec2 machine). This should give enough info to be able to read the table's json file.
  • Then the catalog properties should be ignored and the table properties should be mixed with the arguments. This should give enough information to read and write to the table?

Should a file io be created just for the json file and then a new file io be created for the other part of the table after reading the table properties? (manifest list/ manifest + data files?)

Where should the kms key id go? Is that a per table thing to write new data files (table properties) or is it a user/system defined property which should go in the arguments?

It would help me a ton if that was written out somewhere (it might be in Jack's docs that he's written up; I still have to go take a look).

@jackye1995
Contributor Author

> * CatalogProperties: I'm not sure what these are?

In the context of Spark, it is the Map<String, String> argument of spark source. It is just generalized also for Flink (and any other use case that can accept a map).

> Should a file io be created just for the json file and then a new file io be created for the other part of the table after reading the table properties? (manifest list/ manifest + data files?)

That is why FileIO is treated as a catalog property instead of a table property, to avoid this complication.

> Where should the kms key id go? Is that a per table thing to write new data files (table properties) or is it a user/system defined property which should go in the arguments?

With the same reasoning as FileIO, if you put KMS key as a table property, then you need to create a table to know the KMS key, which needs to write to S3 and cause a circular dependency.

I do agree there is a use case for one KMS key per table, that can be added in the future if there is such a request, with the KMS key in catalog property as default, and the one in table property as an override.

Please let me know if these answer your question.


johnclara commented Dec 10, 2020

Thanks this makes a lot of sense!

> * CatalogProperties: I'm not sure what these are?

> In the context of Spark, it is the Map<String, String> argument of spark source. It is just generalized also for Flink (and any other use case that can accept a map).

So CatalogProperties are the arguments that are passed in. What about the HMS Metastore properties (or any other information that is kept alongside the pointer to the Iceberg table)? In my team's Dynamo-backed catalog implementation, we only have two columns: TableName and Location. But I could also see having other columns (e.g. "this is a gcp table, this is an aws table"). Are these entirely ignored? I was confusing these Metastore properties with Catalog properties.

So CatalogProperties will have a mix of system/compute constraints plus how to access the table?

Contributor

@danielcweeks danielcweeks left a comment


So, I feel like we're making progress on this, but there are still a few items that I mention in the comments.

* it off to a separate module that would then interact with the streams.
*/
public interface FileIO extends Serializable {
public interface FileIO extends Serializable, CatalogConfigurable {
Contributor


I'm not sure it actually makes sense for the FileIO to implement CatalogConfigurable. For S3FileIO, the initialization is just setting up the AwsProperties, which is also being set as part of the constructor and those two paths actually conflict (as you note in the comments). However, this makes S3FileIO pretty confusing. (I'll add more comments on S3FileIO).

Contributor Author


In that case, let me remove the CatalogConfigurable-related changes first. I do think it is worth adding, but it seems too much to discuss in this PR.

Contributor


I just want to be sure that we understand how the properties are getting propagated to the FileIO. Right now, it seems that we're loading it via CatalogUtils, but only in the Glue catalog, so it seems somewhat specific at this point. Since the Glue catalog can just load the properties and use the S3FileIO constructor, it seems like we should pull it out of the PR and revisit with a broader change that addresses the default dynamic loading.

I agree that it might be a good addition because if you dynamically load FileIO right now, I don't believe you can set custom AwsProperties.

private String glueCatalogId;
private boolean glueCatalogSkipArchive;

private AwsClientFactory clientFactory;
Contributor


I don't feel like the factory reference should be held by AwsProperties. This also complicates serialization, as AwsClientFactory now needs to be Serializable, and I'm not even sure that it is at this point due to the UrlConnectionHttpClient.create() reference. It seems like we should separate AwsProperties and the AwsClientFactory.

Contributor Author


UrlConnectionHttpClient.create() is assigned to a static final field, so it should not affect serialization.

I think the client factory has to be serializable anyway, because AwsClientFactory::s3 will be used as the supplier for the S3 client in S3FileIO. Is there any use case that you think needs it to be non-serializable?

}

private <T extends AwsClientBuilder & AwsSyncClientBuilder> T configure(T clientBuilder) {
clientBuilder.credentialsProvider(StsAssumeRoleCredentialsProvider.builder()
Contributor


As we did in a few other places (like S3OutputStream) it would be good to break out the builders a little as the nested builders tend to be a little harder to follow.
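
For example, a sketch of how the method body could be broken out (field names such as HTTP_CLIENT_DEFAULT and roleArn follow the diff context here, and the session name is a placeholder):

private <T extends AwsClientBuilder & AwsSyncClientBuilder> T configure(T clientBuilder) {
  // build the STS client on its own line instead of inline
  StsClient stsClient = StsClient.builder()
      .httpClient(HTTP_CLIENT_DEFAULT)
      .build();

  // then the refresh request
  AssumeRoleRequest assumeRoleRequest = AssumeRoleRequest.builder()
      .roleArn(roleArn)
      .roleSessionName("iceberg-assume-role")
      .build();

  // finally compose the credentials provider from the named pieces
  clientBuilder.credentialsProvider(StsAssumeRoleCredentialsProvider.builder()
      .stsClient(stsClient)
      .refreshRequest(assumeRoleRequest)
      .build());
  return clientBuilder;
}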


private <T extends AwsClientBuilder & AwsSyncClientBuilder> T configure(T clientBuilder) {
clientBuilder.credentialsProvider(StsAssumeRoleCredentialsProvider.builder()
.stsClient(StsClient.builder().httpClient(HTTP_CLIENT_DEFAULT).build())
Contributor


A small issue I ran into when configuring the StsClient was that region() was required in some cases (it may have been due to cross-account assume role in the same region, but I had to set it explicitly). If that's not a common issue, we can ignore it for now.

Contributor Author


I don't see this issue in integration tests on my side; let me add them here and see if I can reproduce the situation you describe.

Contributor


It may be too obscure of a case to deal with at this point. I just don't know how others are using assume role, so it's entirely up to your judgement. Just wanted to point it out for the future.

@jackye1995
Contributor Author

@danielcweeks since we are approaching the next release and also the holiday season, I would like to minimize the change here and focus only on finalizing the client factory interface, so that at least this PR can join the current release train. I have moved the AssumeRole part out of the picture and will use another PR for that discussion.

I agree that it would be cleaner to move the factory out of the AwsProperties class. I have created an AwsClientFactories class for that, following a similar structure to the LocationProviders class, and made DefaultAwsClientFactory an inner class. I think this should make the whole thing much cleaner.

The only open discussion at this point, I believe, is about whether to make AwsClientFactory serializable. I tried not serializing it, and as a result in S3FileIO I have to do:

this.s3 = () -> AwsClientFactories.from(properties)::s3

In that call above, properties got serialized as part of the serializable supplier, which defeats the purpose of using AwsProperties to store only useful information. So I think making AwsClientFactory serializable should still be the way to go there.
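
To make the tradeoff concrete, an illustrative comparison, assuming S3FileIO holds a serializable Supplier<S3Client> (the field and method names follow the snippet above):

// if AwsClientFactory is Serializable, the supplier only captures the factory instance
this.s3 = AwsClientFactories.from(properties)::s3;

// if it is not, the lambda has to capture and serialize the whole properties map instead
this.s3 = () -> AwsClientFactories.from(properties).s3();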

Please let me know what you think, thank you!

@danielcweeks
Contributor

@jackye1995 thanks for breaking this apart, I think that really helps and we can follow on with the STS implementation and discuss the dynamic catalog configuration separately.

As to the AwsClientFactory being serializable, I agree with you (my concern was more that it was part of AwsProperties, which felt awkward). This looks good.

Appears there are some checkstyle failures, but other than that, I'm good with this.

@jackye1995
Contributor Author

@danielcweeks checkstyle fixed, please let me know if there is any other concern

@danielcweeks
Contributor

+1 Thanks @jackye1995!

