
ADLS integration #120

Open
kmcclenn wants to merge 16 commits into kai/azure-sandbox (base) from kai/azure_sandbox/adls_integration

Conversation

@kmcclenn kmcclenn (Collaborator) commented Jun 6, 2024

Summary

Adds ADLS integration, with OpenHouse deployed locally in a Docker container using an ABS (Azure Blob Storage) image for Docker.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Packages introduced/removed

OpenHouse currently uses Iceberg version 1.2.0. However, this version does not include the Iceberg/Azure implementation code we need for this PR (it was introduced in version 1.4.0). These files were therefore copied into the iceberg/azure folder so they can be used for the ADLS integration. Two future changes will allow us to remove these copied files:

  • Adding a git submodule for the Iceberg repo. This is not ideal because it requires us to submodule the entire repo, but it is better than the copied code. I will add this in a future PR.
  • If/when OpenHouse is made compatible with Iceberg version >= 1.4.0, we can simply use the org.apache.iceberg:iceberg-azure package.

We also use the packages com.azure:azure-storage-file-datalake:12.19.1 and com.azure:azure-identity:1.12.1 to get the objects associated with the Data Lake client and client identity verification. However, these packages pull in netty version 4.1.108.Final, while the OpenHouse repo requires netty 4.1.75.Final, so we need to exclude the groups io.netty and io.projectreactor.netty when these Azure SDKs are imported.

Testing Done

  • Manually tested on a local Docker setup. Please include the commands run and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

I was able to run the openhouse-tables container locally on Docker by running docker compose up -d in the infra/recipes/docker-compose/oh-abs-spark/ folder. I was then able to make requests with Postman that created and modified a table, and the changes were reflected in my Azure account. I was also able to run Spark SQL calls in the spark-shell that successfully updated the Azure storage account. See the video below.

Screen.Recording.2024-06-21.at.7.44.45.PM.mov

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

To replicate this PR, make sure to create your own storage account and blob container. Make sure to allow for anonymous read access for containers and blobs (as in this link).

Future PRs

  • Will submit another PR for the local integration once I figure it out. Right now, this only works for an actual Azure Storage account.
  • Will also submit another PR to fix the constructTablePath function and change its response type to URI so the schema is included in the Table location. This will allow me to add submodules for the iceberg/azure code because I won't need the temporary modification of the ADLSLocation.java file anymore.

@kmcclenn kmcclenn changed the title code for ADLS integration; untested [WIP] code for ADLS integration; untested Jun 6, 2024
@kmcclenn kmcclenn marked this pull request as draft June 6, 2024 22:42
@HotSushi HotSushi (Collaborator) left a comment

Neat PR, all the interfaces are nicely implemented! Left some comments. Please use BaseStorage and BaseStorageClient so that code replication can be removed.

@@ -9,4 +9,6 @@ plugins {
dependencies {
implementation project(':cluster:configs')
implementation 'org.springframework.boot:spring-boot-autoconfigure:' + spring_web_version
implementation 'org.apache.iceberg:iceberg-azure:1.5.2'
Collaborator:

would be good to move these to conventions file like aws-conventions


private DataLakeFileClient dataLake;

private static final String DEFAULT_ENDPOINT = "abfs:/";
Collaborator:

endpoint needs to be "abfs://"
default_rootpath/bucket can be "tmp"

reason: we want all object stores to have this pattern so that provisioning bucket becomes easier

Collaborator:

Why do you need to configure these defaults here?

if (storageProperties.getTypes() != null && !storageProperties.getTypes().isEmpty()) {

// fail if properties are invalid
Preconditions.checkArgument(
Collaborator:

should be able to call BaseStorageClient:validateProperties() instead of re-defining logic

Collaborator:

+1

@@ -14,6 +14,10 @@ dependencies {
implementation project(':cluster:storage')
implementation project(':cluster:metrics')

// for ADLS interfacing
implementation 'org.apache.iceberg:iceberg-azure:1.5.2'
Collaborator:

same conventions comment as before

@@ -33,9 +33,15 @@ public class FileIOManager {
@Qualifier("LocalFileIO")
FileIO localFileIO;

@Autowired(required = false) // doesn't inject if null
@Qualifier("ADLSFileIO") // avoid ambiguity with multiple beans of the same type
FileIO adlsFileIO;
Collaborator:

Let it autowire ADLSFileIO directly instead of qualifier + FileIO.

Reason why we used qualifier before: Because HDFS and Local use same HdfsFileIO
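
To make the suggestion concrete, here is a minimal sketch of the field wiring this comment asks for, assuming ADLSFileIO is the only bean of that concrete type; the surrounding class body is trimmed to just the two fields shown in the diff, and the @Component annotation is an assumption:

```java
import org.apache.iceberg.azure.adlsv2.ADLSFileIO;
import org.apache.iceberg.io.FileIO;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;

@Component
public class FileIOManager {

  // Qualifier is still required here because HDFS and Local share HdfsFileIO.
  @Autowired(required = false)
  @Qualifier("LocalFileIO")
  FileIO localFileIO;

  // Autowire the concrete type directly; no qualifier needed, and the field
  // stays null when ADLS storage is not configured.
  @Autowired(required = false)
  ADLSFileIO adlsFileIO;
}
```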

*
* @return ADLSFileIO bean for ADLS storage type, or null if ADLS storage type is not configured
*/
@Bean("ADLSFileIO")
Collaborator:

lets take out the qualifier, explained the reasoning in other comment.


@Autowired @Lazy
private ADLSStorageClient
adlsStorageClient; // declare client class to interact with ADLS filesystem
Collaborator:

Can you please be consistent with the comment placement? The rest of code places comment on the line before the function.

private ADLSStorageClient
adlsStorageClient; // declare client class to interact with ADLS filesystem

// do we need an isConfigured method? to check if ADLS is configured. Leaving out for now
Collaborator:

You can leave it out.

import org.springframework.context.annotation.Lazy;
import org.springframework.stereotype.Component;

/**
Collaborator:

The comment does not match the implementation below.
The implementation uses DataLakeFileClient, but the comment mentions BlobClient.

import org.springframework.stereotype.Component;

/**
* ABSStorageClient is an implementation of the StorageClient interface for Azure Blob Storage. It
Collaborator:

nit: typo. ADLSStorageClient. Also, should we name it AdlsStorageClient to be consistent with Hdfs?


private DataLakeFileClient dataLake;

private static final String DEFAULT_ENDPOINT = "abfs:/";
Collaborator:

Why do you need to configure these defaults here?


private static final String DEFAULT_ROOTPATH = "/tmp/";

private String endpoint;
Collaborator:

Again, why do you need these? Endpoint and rootPath should come from the configuration.

if (storageProperties.getTypes() != null && !storageProperties.getTypes().isEmpty()) {

// fail if properties are invalid
Preconditions.checkArgument(
Collaborator:

+1

Map properties =
new HashMap(storageProperties.getTypes().get(ADLS_TYPE.getValue()).getParameters());

endpoint = storageProperties.getTypes().get(LOCAL_TYPE.getValue()).getEndpoint();
Collaborator:

Validation ensures that endpoint and rootPath are configured. You shouldn't need to override it.


@Autowired private StorageProperties storageProperties;

private DataLakeFileClient dataLake;
Collaborator:

Please name this dataLakeClient.


@Override
public String getEndpoint() {
return endpoint;
Collaborator:

I would recommend getting rid of these class-local variables.

Also, no need to override getEndpoint() and getRootPrefix(). You can use those defines in the BaseStorageClient.
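
As an illustration of this suggestion, a rough sketch of an init method that relies on BaseStorageClient instead of class-local defaults; the accessor names (validateProperties(), getEndpoint(), getRootPrefix()) come from the review comments, while the class shape and annotations are hypothetical:

```java
import javax.annotation.PostConstruct;
import org.springframework.stereotype.Component;

@Component
public class AdlsStorageClient extends BaseStorageClient {

  @PostConstruct
  public synchronized void init() {
    // Validation already guarantees that the adls type, endpoint, and rootpath
    // are present in the cluster storage properties, so no local defaults are needed.
    validateProperties();

    // Read configuration through the base-class accessors rather than overriding
    // getEndpoint()/getRootPrefix() or caching copies in this class.
    String endpoint = getEndpoint();     // e.g. "abfs://" per the earlier suggestion
    String rootPrefix = getRootPrefix(); // e.g. "tmp"

    // ...build the DataLake client from endpoint + rootPrefix here...
  }
}
```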

@kmcclenn kmcclenn closed this Jun 20, 2024
@kmcclenn kmcclenn reopened this Jun 20, 2024
@kmcclenn kmcclenn force-pushed the kai/azure_sandbox/adls_integration branch from 7c18a8c to e716a20 Compare June 20, 2024 18:42
@kmcclenn kmcclenn closed this Jun 20, 2024
@kmcclenn kmcclenn reopened this Jun 20, 2024
@kmcclenn kmcclenn force-pushed the kai/azure_sandbox/adls_integration branch from 1e930e6 to 65115d8 Compare June 20, 2024 19:05
@kmcclenn kmcclenn closed this Jun 20, 2024
@kmcclenn kmcclenn reopened this Jun 20, 2024
@kmcclenn kmcclenn changed the title [WIP] code for ADLS integration; untested [WIP] code for ADLS integration Jun 20, 2024
@sumedhsakdeo sumedhsakdeo (Collaborator) left a comment

Left some comments on code layout and copied code

@@ -0,0 +1,337 @@

Collaborator:

Why is this file needed?

@@ -0,0 +1,25 @@

Apache Iceberg
Collaborator:

Ditto.

* specific language governing permissions and limitations
* under the License.
*/
package org.apache.iceberg.azure;
Collaborator:

Why is the azure code in the org.apache.iceberg package?

Collaborator:

+1

Collaborator Author:

Because it is the azure integration with iceberg: https://github.com/apache/iceberg/tree/main/azure

* specific language governing permissions and limitations
* under the License.
*/
package org.apache.iceberg.azure.adlsv2;
Collaborator:

Ditto. @jiang95-dev / @abhisheknath2011 can you help with the packaging?

Member:

I think we can have package as com.linkedin.openhouse.azure.adlsv2

Collaborator:

Nvm. I got the context from Kai why we are copying code. Either way is ok.

Collaborator:

I asked Kai to explore git submodules instead of copying code.

Member:

Sure, sounds good. Yes copying is required due to iceberg version compatibility issue.

Collaborator:

hmm..I am interested in this as well.

Collaborator Author:

For an update, it seems like you can only submodule an entire repository, not a subfolder. Is that still worth it? There were some sketchy solutions that I tried to get just a subfolder, but none of them worked for me.

}

@Override
public int read(byte[] b, int off, int len) throws IOException {
Collaborator:

Why is all this code copied? Is it also available in Iceberg library artifacts to use as a library dependency?

Collaborator Author:

The submodule approach is difficult because I need to change one small thing in the Iceberg Azure code for it to work. I am not going to include submodules for now; instead I will create a new PR that fixes the root issue with the Tables URI, which will then allow me to add the submodules as well.

@kmcclenn kmcclenn force-pushed the kai/azure_sandbox/adls_integration branch from cfaf8f9 to 141511e Compare June 22, 2024 01:27
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.wait.strategy.LogMessageWaitStrategy;

public class AzuriteContainer extends GenericContainer<AzuriteContainer> {
Collaborator:

This file is found in iceberg/azure/src/test not iceberg/azure/src/main

Comment on lines +13 to +37
# minioS3:
# image: minio/minio
# environment:
# - MINIO_ROOT_USER=admin
# - MINIO_ROOT_PASSWORD=password
# ports:
# - 9871:9001
# - 9870:9000
# command: [ "server", "/data", "--console-address", ":9001" ]
# mc:
# depends_on:
# - minioS3
# image: minio/mc
# environment:
# - AWS_ACCESS_KEY_ID=admin
# - AWS_SECRET_ACCESS_KEY=password
# - AWS_REGION=us-east-1
# entrypoint: >
# /bin/sh -c "
# until (/usr/bin/mc config host add minio http://minioS3:9000 admin password) do echo '...waiting...' && sleep 1; done;
# /usr/bin/mc rm -r --force minio/openhouse-bucket;
# /usr/bin/mc mb minio/openhouse-bucket;
# /usr/bin/mc policy set public minio/openhouse-bucket;
# tail -f /dev/null
# "
Collaborator:

Suggested change: remove the commented-out minioS3 and mc service definitions above.

@kmcclenn kmcclenn changed the title [WIP] code for ADLS integration ADLS integration Jun 24, 2024
@kmcclenn kmcclenn marked this pull request as ready for review June 24, 2024 14:47
@kmcclenn kmcclenn changed the title ADLS integration [WIP] ADLS integration Jun 24, 2024
@kmcclenn kmcclenn marked this pull request as draft June 24, 2024 14:48
@kmcclenn kmcclenn closed this Jun 24, 2024
@kmcclenn kmcclenn reopened this Jun 24, 2024
* Support</a>
*/
class ADLSLocation {
private static final Pattern URI_PATTERN = Pattern.compile("^(abfss?://)?([^/?#]+)(.*)?$");
Collaborator Author:

I added the optional 'abfs' to fix the bug where the table location was dropping the prefix. Feels a bit hacky though -- any other ideas for solutions?
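
For context, a standalone snippet showing how the amended pattern behaves; only the regex itself comes from the diff, while the class name and sample locations are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AdlsLocationRegexDemo {

  // Pattern from the modified ADLSLocation: the scheme group is optional, so a
  // location whose "abfs(s)://" prefix was dropped still parses instead of failing.
  private static final Pattern URI_PATTERN = Pattern.compile("^(abfss?://)?([^/?#]+)(.*)?$");

  public static void main(String[] args) {
    String[] samples = {
      "abfss://container@account.dfs.core.windows.net/db/table/data.parquet",
      "container@account.dfs.core.windows.net/db/table/data.parquet"
    };
    for (String location : samples) {
      Matcher m = URI_PATTERN.matcher(location);
      if (m.matches()) {
        // group(1) = scheme or null, group(2) = container@account authority, group(3) = path
        System.out.printf("scheme=%s authority=%s path=%s%n", m.group(1), m.group(2), m.group(3));
      }
    }
  }
}
```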

Collaborator Author:

I will fix this in a new PR -- dealing with the constructTablePath function to change its return type to a URI

@@ -181,6 +181,11 @@ private static long safeParseLong(String keyName, Map<String, String> megaProps)
* in HTS and client-visible table location.
*/
static String getSchemeLessPath(String rawPath) {
return URI.create(rawPath).getPath();
URI uri = URI.create(rawPath);
Collaborator Author:

Also added this to allow for locations with and without the prefix to be equal. However, also feels a bit hacky -- is this the best way to go about it or does anyone see a better solution?

@kmcclenn kmcclenn changed the title [WIP] ADLS integration ADLS integration Jun 24, 2024
@kmcclenn kmcclenn marked this pull request as ready for review June 24, 2024 17:08
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.linkedin.openhouse.spark.extensions.OpenhouseSparkSessionExtensions \
--conf spark.sql.catalog.openhouse=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.openhouse.catalog-impl=com.linkedin.openhouse.spark.OpenHouseCatalog \
--conf spark.sql.catalog.openhouse.metrics-reporter-impl=com.linkedin.openhouse.javaclient.OpenHouseMetricsReporter \
--conf spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080 \
--conf spark.sql.catalog.openhouse.auth-token=$(cat /var/config/$(whoami).token) \
--conf spark.sql.catalog.openhouse.cluster=LocalHadoopCluster
--conf spark.sql.catalog.openhouse.cluster=LocalFSCluster \
--conf spark.sql.catalog.openhouse.io-impl=org.apache.iceberg.azure.adlsv2.ADLSFileIO \
Collaborator:

This is needed only for the ADLS storage.

My recommendation will be to provide different spark-shell commands here for different recipes.

Member:

+1 to the above recommendation.

Collaborator Author:

Sounds good, fixed

Collaborator:

I am actually not sure why io-impl is needed to be passed as part of spark conf. An instance of this class should be returned by OpenHouseTableOperations. CC: @HotSushi

id 'openhouse.iceberg-conventions'
id 'openhouse.maven-publish'
}

dependencies {
implementation project(':cluster:configs')
implementation project(':iceberg:azure')
Collaborator:

Are you sure this is needed?

Collaborator Author:

Yes because we need to import the copied files. I will revisit if I get the submodules to work

@Component
public class AdlsStorage extends BaseStorage {

// Declare client class to interact with ADLS filesystem
Collaborator:

s/filesystem/storage


// Declare client class to interact with ADLS filesystem
@Autowired @Lazy private AdlsStorageClient adlsStorageClient;

Collaborator:

Please add javadoc comments.
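
For example, a hedged sketch of the kind of Javadoc this field could carry; the wording is illustrative, not taken from the PR:

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Lazy;
import org.springframework.stereotype.Component;

@Component
public class AdlsStorage extends BaseStorage {

  /**
   * Storage client used by this storage implementation to interact with the
   * underlying ADLS storage account. Wired lazily so the ADLS client is only
   * created when this storage type is actually used.
   */
  @Autowired @Lazy private AdlsStorageClient adlsStorageClient;
}
```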


private DataLakeFileClient dataLakeClient;

@Getter private ADLSFileIO fileIO;
Collaborator:

Why do you need FileIO in the storage client?

Collaborator Author:

In order to return the fileIO object in provideADLSFileIO() in the FileIOConfig.java file, I had to initialize it in the storage client. Unlike other storage clients such as S3, the DataLakeClient doesn't natively contain the properties needed to initialize the ADLSFileIO, so I had to create it beforehand and then pass the object. Let me know if I should try to find a different solution.

Collaborator:

+1 to @jainlavina's question

I don't see FileIO in the S3 storage client https://github.com/linkedin/openhouse/blob/main/cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/s3/S3StorageClient.java

Can we have consistency in layering of objects? @HotSushi can you help @kmcclenn here.


validateProperties();

endpoint = storageProperties.getTypes().get(ADLS_TYPE.getValue()).getEndpoint();
Collaborator:

Use getEndpoint() from BaseStorageClient.


URI uri;
Map properties;
String endpoint;
Collaborator:

No need to declare these and initialize them separately.

IF you remove the if block because the validation is done in validateProperties(), then you can have code that looks like:
String endpoint = getEndpoint();

or alternatively, just use getEndpoint() directly.

if (dataLakeClient == null) {
this.fileIO = new ADLSFileIO();
fileIO.initialize(properties);
DataLakeFileSystemClient client = fileIO.client(uri.toString());
Collaborator:

Is FileIO the only way to create a ADLS client?

Collaborator Author:

I don't think so, but it provides a nice abstraction that lets us do it, plus we likely need to create the ADLSFileIO anyway to use in FileIOConfig. Let me know if I should find another way though.
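
For reference, a condensed sketch of the flow discussed above, based on the excerpt from AdlsStorageClient; the property values and paths are placeholders, and the client(String) call follows this PR's copied ADLSFileIO rather than a guaranteed public upstream API:

```java
import com.azure.storage.file.datalake.DataLakeFileClient;
import com.azure.storage.file.datalake.DataLakeFileSystemClient;
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.azure.adlsv2.ADLSFileIO;

public class AdlsClientBootstrapSketch {

  public static void main(String[] args) {
    // Placeholder ADLS properties; in the PR these come from the cluster storage configuration.
    Map<String, String> properties = new HashMap<>();
    properties.put("adls.auth.shared-key.account.name", "<account name>");
    properties.put("adls.auth.shared-key.account.key", "<account key>");

    // ADLSFileIO has no constructor that accepts a pre-built client, so it is
    // initialized from properties and then used to hand back the filesystem client.
    ADLSFileIO fileIO = new ADLSFileIO();
    fileIO.initialize(properties);

    // Placeholder location; the PR passes the configured endpoint + root path URI here.
    DataLakeFileSystemClient fsClient =
        fileIO.client("abfss://container@account.dfs.core.windows.net/tmp");

    // A file-level client can then be obtained via the Azure SDK filesystem client.
    DataLakeFileClient fileClient = fsClient.getFileClient("db/table/data.parquet");
    System.out.println(fileClient.getFileUrl());
  }
}
```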

* specific language governing permissions and limitations
* under the License.
*/
package org.apache.iceberg.azure;
Collaborator:

+1

* specific language governing permissions and limitations
* under the License.
*/
package org.apache.iceberg.azure.adlsv2;
Collaborator:

hmm..I am interested in this as well.

@abhisheknath2011 abhisheknath2011 (Member) left a comment

Great job @kmcclenn! Thanks for debugging and fixing the issues. Left some comments. It would be good to capture the below details in the PR:

  • The Azure-specific libs that are introduced here, the version incompatibility with Iceberg, and the approach that was followed to integrate.
  • Details of the lib exclusions that are needed to test the integration on the client/server side.
  • Azure integration doc. Need not be in this PR.

bin/spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.2.0 \
--jars openhouse-spark-runtime_2.12-*-all.jar \
bin/spark-shell --packages org.apache.iceberg:iceberg-azure:1.5.0,org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.2.0 \
Member:

So we still need to pass Iceberg-azure as part of packages instead of dependency on spark-runtime as OH currently supports iceberg v1.2?

Collaborator Author:

I think so because version 1.2 doesn't have the iceberg-azure implementation

--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.linkedin.openhouse.spark.extensions.OpenhouseSparkSessionExtensions \
--conf spark.sql.catalog.openhouse=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.openhouse.catalog-impl=com.linkedin.openhouse.spark.OpenHouseCatalog \
--conf spark.sql.catalog.openhouse.metrics-reporter-impl=com.linkedin.openhouse.javaclient.OpenHouseMetricsReporter \
--conf spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080 \
--conf spark.sql.catalog.openhouse.auth-token=$(cat /var/config/$(whoami).token) \
--conf spark.sql.catalog.openhouse.cluster=LocalHadoopCluster
--conf spark.sql.catalog.openhouse.cluster=LocalFSCluster \
--conf spark.sql.catalog.openhouse.io-impl=org.apache.iceberg.azure.adlsv2.ADLSFileIO \
Member:

+1 to the above recommendation.

@@ -6,6 +6,6 @@ dependencies {
implementation('org.apache.iceberg:iceberg-bundled-guava:' + icebergVersion)
implementation('org.apache.iceberg:iceberg-data:' + icebergVersion)
implementation('org.apache.iceberg:iceberg-core:' + icebergVersion)

testImplementation('org.apache.iceberg:iceberg-common:' + icebergVersion)
implementation('org.apache.iceberg:iceberg-common:' + icebergVersion)
Member:

So iceberg common is needed as implementation with ADLS integration?

Collaborator Author:

As of right now, yes. If I do the submodules later I may reevaluate

@@ -43,3 +43,6 @@ MAPRED_CONF_yarn_app_mapreduce_am_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/
MAPRED_CONF_mapreduce_map_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/
MAPRED_CONF_mapreduce_reduce_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/

AZURE_CLIENT_ID=<client-id>
Member:

So these Azure specific details need to be manually populated, right?

Collaborator Author:

Yes, correct

Collaborator:

why are these needed?

Collaborator Author:

They are not actually. Good catch

@@ -37,6 +38,11 @@ dependencies {
runtimeOnly "org.antlr:antlr4-runtime:4.7.1"
antlr "org.antlr:antlr4:4.7.1"

fatJarRuntimeDependencies("com.azure:azure-storage-file-datalake:12.19.1") {
  // exclude the netty groups that conflict with the netty 4.1.75.Final required by OpenHouse
  exclude group: 'io.netty'
  exclude group: 'io.projectreactor.netty'
}
Member:

Please add some details in the PR description regarding the netty version brought in by these changes and the libs that are excluded. It would also be good to add some details in the PR on the Iceberg Azure libs that are integrated.

@abhisheknath2011 abhisheknath2011 (Member) left a comment

Overall looking good. Left some minor comments. There are some open items related to handling the URI, the copied Azure code, etc. Please add a comment in the PR description noting that you plan to handle these in the next PR, and also update the PR description based on my last review.

} else {
throw new IllegalArgumentException("Unknown storage type: " + type);
throw new IllegalArgumentException("Unknown storage type " + type);
Member:

Suggested change
throw new IllegalArgumentException("Unknown storage type " + type);
throw new IllegalArgumentException("Unknown storage type: " + type);

id 'openhouse.iceberg-azure-conventions'
}

dependencies {
Member:

Is this section empty? if so we can remove this.

import org.apache.iceberg.relocated.com.google.common.collect.Maps;
import org.apache.iceberg.util.PropertyUtil;

public class AzureProperties implements Serializable {
Member:

Shall we put some class level comment here? Although this is copied it would be helpful.

@@ -33,9 +36,15 @@ public class FileIOManager {
@Qualifier("LocalFileIO")
FileIO localFileIO;

@Autowired(required = false)
Member:

Suggested change
@Autowired(required = false)
@Qualifier("ADLSFileIO")
@Autowired(required = false)

Do we need qualifier here?

Collaborator Author:

Sushant said we don't because we only have one storage type using ADLSFileIO so it Autowires directly. See this comment: #120 (comment)

Comment on lines +259 to +262
--conf spark.sql.catalog.openhouse.adls.auth.shared-key.account.name= <account name> \
--conf spark.sql.catalog.openhouse.adls.auth.shared-key.account.key= <account key>
Collaborator:

Again, is this something we need the user to pass, or something we were letting the catalog retrieve? How were we thinking about this for s3/minio? CC: @HotSushi

Collaborator Author:

It seemed like in the s3 PR variables were passed in a similar fashion.

Comment on lines +6 to +9
// Ideally, we have these, but they are only supported for iceberg version >= 1.4.0, which is not compatible
// with the current Openhouse implementation.
// implementation('org.apache.iceberg:iceberg-azure:' + icebergAzureVersion)
// implementation('org.apache.iceberg:iceberg-azure-bundle:' + icebergAzureVersion)
Collaborator:

Suggested change
// Ideally, we have these, but they are only supported for iceberg version >= 1.4.0, which is not compatible
// with the current Openhouse implementation.
// implementation('org.apache.iceberg:iceberg-azure:' + icebergAzureVersion)
// implementation('org.apache.iceberg:iceberg-azure-bundle:' + icebergAzureVersion)
// Ideally, we have these, but they are only supported for iceberg version >= 1.4.0, which is not compatible
// with the current Openhouse implementation.
implementation('org.apache.iceberg:iceberg-azure:' + icebergAzureVersion)
implementation('org.apache.iceberg:iceberg-azure-bundle:' + icebergAzureVersion)

If we do this, then do we need to copy code? It might be ok to pull in library for iceberg-azure at version 1.5.2 for use in openhouse repo. Thoughts @HotSushi ?

Collaborator Author:

I tried this - it gives more dependency issues.

Comment on lines +77 to +78
ADLSFileIO fileIO =
((AdlsStorageClient) storageManager.getStorage(StorageType.ADLS).getClient()).getFileIO();
Collaborator:

Can we create the ADLSFileIO object here, like we do S3FileIO?

Collaborator Author:

Unfortunately, ADLSFileIO doesn't have the same constructor that allows us to pass in a created client. Instead, we have to initialize it with the inputted properties. https://github.com/kmcclenn/openhouse/blob/ed0d2293904b995178c2614679b7c6aae3eb0d33/cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/adls/AdlsStorageClient.java#L65

@kmcclenn kmcclenn force-pushed the kai/azure_sandbox/adls_integration branch 4 times, most recently from 5e22664 to 02bf559 Compare July 16, 2024 21:18
@kmcclenn kmcclenn closed this Jul 16, 2024
@kmcclenn kmcclenn reopened this Jul 16, 2024
@kmcclenn kmcclenn force-pushed the kai/azure_sandbox/adls_integration branch from ef0d92e to 7d74714 Compare July 17, 2024 00:44
@kmcclenn kmcclenn closed this Jul 17, 2024
@kmcclenn kmcclenn reopened this Jul 17, 2024
@kmcclenn kmcclenn closed this Jul 17, 2024
@kmcclenn kmcclenn reopened this Jul 17, 2024