Skip to content

Conversation

@openinx
Copy link
Member

@openinx openinx commented Feb 9, 2021

Made this pull request to support aliyun OSS (Object Storage Service). Aliyun OSS is a very popular object storage service among public could services, it's necessary to support it (similar to aws s3) in apache iceberg official repo (In fact I saw many asia users are asking whether there's an official support for aliyun oss).

When I prepare this patch, the biggest obstacle is: How to write unit tests for the OSS+iceberg integration work. For aws s3, adobe create a great tool named s3mock, saying it could just setup a local mini cluster that support aws s3 protocal, then we could easily write unit tests based that mini s3 clusters.

For aliyun OSS services, we don't have that great tool (the official aliyun oss sdk also don't provide the mini cluster jar). I don't expect to commit plenty of codes with integration tests only but without unit tests. So I implemented a similar local oss mini clusters, it's similar to adobe's work but it's more lightweight because we don't want every API to be implemented in apache iceberg. Implementing the basic OSS API in a local mini cluster is enough, it takes hundreds lines of code to accomplish this.

Currently, I've implemented all the iceberg FileIO's interfaces by aliyun oss sdk, and make both the unit tests and integration tests work well.

  • How to run the aliyun oss unit tests ?
./gradlew :iceberg-aliyun:test
  • How to run the aliyun oss integration tests ?
export OSS_TEST_RULE_CLASS_IMPL=org.apache.iceberg.aliyun.oss.OSSIntegrationTestRule
export OSS_TEST_ENDPOINT=<aliyun-oss-endpoint>
export OSS_TEST_ACCESS_KEY=<your-aliyun-oss-access-key>
export OSS_TEST_ACCESS_SECRET=<your-aliyun-oss-access-secret-key>
export OSS_TEST_BUCKET=<your-oss-bucket-name>
export OSS_TEST_KEY_PREFIX=<your-oss-object-key-prefix>. # optional.

./gradlew :iceberg-aliyun:test

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just took a first glance, 2 major questions:

  1. this has a lot of similarity with the S3FileIO. As more services add support for Iceberg, should we extract the common logic to have a ObjectStorageFileIO in core? @rdblue what do you think?
  2. I really like the approach for the mock application, but it feels that the whole thing can be published by aliyun for Iceberg to import. So just curious, any plan for that in the long term?

* encryption and verify the identity of a requester. The AccessKey ID is used to identify a user.
* <p>
* For more information about how to obtain an AccessKey pair, see:
* https://www.alibabacloud.com/help/doc-detail/53045.htm?spm=a2c63.p38356.879954.8.5dcb7d9186qY48#task968
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the spm=a2c63.p38356.879954.8.5dcb7d9186qY48#task968 be removed to make the URL cleaner, similar comment for all the URLs below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, those parameters could be removed. Thanks for the reminding ! @jackye1995

try {
return new S3OutputStream(client(), uri(), awsProperties());
} catch (IOException e) {
throw new UncheckedIOException("Filed to create output stream for location: " + uri(), e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for noticing this! But I think we should not have it in this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could just address the s3 typo in a separate PR. Let's just revert it here.

org.apache.hive:hive-service = 2.3.8
org.apache.tez:tez-dag = 0.8.4
org.apache.tez:tez-mapreduce = 0.8.4
org.springframework:* = 5.1.10.RELEASE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it's not the latest release?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I think we could align it with the s3mock testing framework, I don't expect to introduce two different versions in apache iceberg repo for testing purpose.

@openinx
Copy link
Member Author

openinx commented Feb 18, 2021

  1. this has a lot of similarity with the S3FileIO. As more services add support for Iceberg, should we extract the common logic to have a ObjectStorageFileIO in core? @rdblue what do you think?

I think it may not worth to abstract the ObjectStorageFileIO. In this PR we could see that aliyun OSSFileIO is quite similar to aws S3FileIO because aliyun object storage services are defining their API which are quite similar to aws API (When I implements the OSSMockApplication, I found that most of the arguments & API names are the same), though we could make abstraction between aliyun & aws, azure & google cloud storage API will be a quite different story, we won't know whether does the abstraction match azure blob storage or GCS unless we implement the azure blob storage FileIO and GCS FileIO.

  1. I really like the approach for the mock application, but it feels that the whole thing can be published by aliyun for Iceberg to import. So just curious, any plan for that in the long term?

Aliyun Object storage services team is another team in our company, seems currently they don't have the plan to maintain in their aliyun oss sdk release. For our iceberg integration work, we are just depending on few APIs so we don't have to maintain all the API from the aliyun official documents ( It's actually lightweight enough to get this OSSMockApplication in iceberg repo). Another side, according to my observation, aliyun OSS API has been relatively stable for a long time, the maintenance cost is controllable for us.

@jackye1995
Copy link
Contributor

I think it may not worth to abstract the ObjectStorageFileIO. In this PR we could see that aliyun OSSFileIO is quite similar to aws S3FileIO because aliyun object storage services are defining their API which are quite similar to aws API

Cool, in that case let's just take a note for each other when there are major changes happening to any of the FileIO, to see if the change is applicable to the other side.

@openinx
Copy link
Member Author

openinx commented Mar 4, 2021

@rdblue How is your feeling about integrating aliyun services to apache iceberg format ? In Asia iceberg community, I found that many users have the need to save iceberg to aliyun OSS. I think it would be really helpful if we could make the integration work into apache iceberg repo. Besides the aliyun OSS, aliyun also provide the catalog services name Data Lake Format ( DLF) , which is similar to amazon glue, I'd like to push those work in another PR.

@rdblue
Copy link
Contributor

rdblue commented Mar 4, 2021

@openinx, it sounds good to me. I'll take a look as soon as I get some time. Thanks for working on this!

@haormj
Copy link
Contributor

haormj commented May 13, 2021

@openinx @rdblue thanks for your work on this, this feature helps me a lot, could you provide the merged schedule?

@openinx
Copy link
Member Author

openinx commented May 13, 2021

@haormj , I would expect this feature could be merged in the next release 0.12. That means we will need more bandwidth to get this reviewing work done. will try to email @rdblue to see if he has any bandwidth.

@jackye1995
Copy link
Contributor

I can help review when you think it's ready. And one more suggestion, because OSS can also refer to any generic object storage service, I think it is clearer to name this with prefix Aliyun, such as AliyunOSSFileIO. What do you think?

@openinx
Copy link
Member Author

openinx commented May 13, 2021

It will be great if @jackye1995 has any extra bandwith ! Renaming the OSS to AliyunOSS sounds good to me, I'd look around the whole PR to see how could I make this PR to be ready for reviewing.

@xingbowu
Copy link
Contributor

After walking through this issue, I agree with that it is helpful if we take this pr into community . I would like to pick up this issue in separated link, thanks for the contribution and great work @openinx

@openinx
Copy link
Member Author

openinx commented Aug 30, 2021

Thanks for picking this up, @xingbowu . I would like to be the reviewer for this PR if you would like to make this into several small PRs.

@rdblue
Copy link
Contributor

rdblue commented Aug 30, 2021

@openinx and @xingbowu, Let me know if I can help get this in!

@openinx
Copy link
Member Author

openinx commented Dec 15, 2021

Since we've got almost of the work merged into apache iceberg repo, closing this PR now.

@openinx openinx closed this Dec 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants