-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Support aliyun OSS (Object Storage Service) #2230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just took a first glance, 2 major questions:
- this has a lot of similarity with the S3FileIO. As more services add support for Iceberg, should we extract the common logic to have a
ObjectStorageFileIOin core? @rdblue what do you think? - I really like the approach for the mock application, but it feels that the whole thing can be published by aliyun for Iceberg to import. So just curious, any plan for that in the long term?
| * encryption and verify the identity of a requester. The AccessKey ID is used to identify a user. | ||
| * <p> | ||
| * For more information about how to obtain an AccessKey pair, see: | ||
| * https://www.alibabacloud.com/help/doc-detail/53045.htm?spm=a2c63.p38356.879954.8.5dcb7d9186qY48#task968 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the spm=a2c63.p38356.879954.8.5dcb7d9186qY48#task968 be removed to make the URL cleaner, similar comment for all the URLs below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, those parameters could be removed. Thanks for the reminding ! @jackye1995
| try { | ||
| return new S3OutputStream(client(), uri(), awsProperties()); | ||
| } catch (IOException e) { | ||
| throw new UncheckedIOException("Filed to create output stream for location: " + uri(), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for noticing this! But I think we should not have it in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we could just address the s3 typo in a separate PR. Let's just revert it here.
| org.apache.hive:hive-service = 2.3.8 | ||
| org.apache.tez:tez-dag = 0.8.4 | ||
| org.apache.tez:tez-mapreduce = 0.8.4 | ||
| org.springframework:* = 5.1.10.RELEASE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why it's not the latest release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I think we could align it with the s3mock testing framework, I don't expect to introduce two different versions in apache iceberg repo for testing purpose.
I think it may not worth to abstract the
Aliyun Object storage services team is another team in our company, seems currently they don't have the plan to maintain in their aliyun oss sdk release. For our iceberg integration work, we are just depending on few APIs so we don't have to maintain all the API from the aliyun official documents ( It's actually lightweight enough to get this OSSMockApplication in iceberg repo). Another side, according to my observation, aliyun OSS API has been relatively stable for a long time, the maintenance cost is controllable for us. |
Cool, in that case let's just take a note for each other when there are major changes happening to any of the FileIO, to see if the change is applicable to the other side. |
|
@rdblue How is your feeling about integrating aliyun services to apache iceberg format ? In Asia iceberg community, I found that many users have the need to save iceberg to aliyun OSS. I think it would be really helpful if we could make the integration work into apache iceberg repo. Besides the aliyun OSS, aliyun also provide the catalog services name Data Lake Format ( DLF) , which is similar to amazon glue, I'd like to push those work in another PR. |
|
@openinx, it sounds good to me. I'll take a look as soon as I get some time. Thanks for working on this! |
|
I can help review when you think it's ready. And one more suggestion, because OSS can also refer to any generic object storage service, I think it is clearer to name this with prefix |
|
It will be great if @jackye1995 has any extra bandwith ! Renaming the OSS to |
|
After walking through this issue, I agree with that it is helpful if we take this pr into community . I would like to pick up this issue in separated link, thanks for the contribution and great work @openinx |
|
Thanks for picking this up, @xingbowu . I would like to be the reviewer for this PR if you would like to make this into several small PRs. |
|
Since we've got almost of the work merged into apache iceberg repo, closing this PR now. |
Made this pull request to support aliyun OSS (Object Storage Service). Aliyun OSS is a very popular object storage service among public could services, it's necessary to support it (similar to aws s3) in apache iceberg official repo (In fact I saw many asia users are asking whether there's an official support for aliyun oss).
When I prepare this patch, the biggest obstacle is: How to write unit tests for the OSS+iceberg integration work. For aws s3, adobe create a great tool named s3mock, saying it could just setup a local mini cluster that support aws s3 protocal, then we could easily write unit tests based that mini s3 clusters.
For aliyun OSS services, we don't have that great tool (the official aliyun oss sdk also don't provide the mini cluster jar). I don't expect to commit plenty of codes with integration tests only but without unit tests. So I implemented a similar local oss mini clusters, it's similar to adobe's work but it's more lightweight because we don't want every API to be implemented in apache iceberg. Implementing the basic OSS API in a local mini cluster is enough, it takes hundreds lines of code to accomplish this.
Currently, I've implemented all the iceberg FileIO's interfaces by aliyun oss sdk, and make both the unit tests and integration tests work well.