-
Notifications
You must be signed in to change notification settings - Fork 3k
[2806]Dell EMC ECS catalog implementation #2807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2. Add ecs.client.factory to create external EcsClient. 3. Add unit test for EcsCatalog.
…ion. 2. Move PropertiesSerDes into EcsClient that provide an all-in-on EcsClient abstraction. 3. Add more comments of details.
|
One suggestion: You might consider making the top level folder |
|
@wang-x-xia, thanks for working on this. I'm glad to see proposed support for EMC! I think the first thing to do is to get this into more manageable chunks to review and commit. Is it possible to divide this into a FileIO implementation PR and then a Catalog and TableOperations PR? It would also be really helpful to add a bit more about what you're proposing to the description. For example: How does the catalog work? What is the atomic operation you're using? |
2. Simplify package and module name.
|
@wang-x-xia, it's really hard to review PRs that are larger than necessary. If you want to get this in more quickly, I suggest making it easier to review by dividing it up into reasonable sized PRs. |
I agree with Ryan that it’s very hard to review PRs that are so large in scope. Sometimes, I’ve seen people have one main PR / mother PR, kept as a reference (which is updated as other PRs are reviewed). And then smaller PRs of some components (like the ones Ryan mentioned) are broken out for review, with possibly a reference to the whole PR for people to see the desired end picture (marking it as a draft or [DO NOT MERGE] etc). This way, contributors can review PRs that are more manageable in size, but the overview can still be provided if it really is that important. Just be sure to update the reference / mother PR based on updates you make to the others. Ideally, parts are well enough contained to be reviewable on their own. But I do agree with Ryan, that if you want to get this in more quickly, it would be most advisable to break it up into more manageable chunks (along the API lines he mentioned would be a good place to start). 🙂 |
|
Thanks for your suggestion!
Maybe the first part will separate to different PR if it contains too many files. |
|
Out of curiosity, do you use feature branches for large features in Iceberg or you typically prefer to merge the parts directly onto |
|
@fpj, we prefer merging into master to avoid the need to re-review feature branches to get them into master. I think it works best when we can take a working branch and divide it up into working PRs that can be committed separately. |
|
While we are refining the PR, I would like to hear feedback on the how to do the regression test moving forward. In this init PR, we will provide an ECS in-memory simulation as test suite. Will this approach work for community? |
|
Before we start to split the PR into smaller PRs, I think we iceberg community need to reach the consistence about the public/private vendor integration contribution. The iceberg-aws module is a great example, it provides independent mock unit tests for the small feature, the most important point is : Adobe has provided the s3 integration test utility : com.adobe.testing:s3mock-junit4, it could just launch a local mini s3 cluster for accessing the HTTP API (the S3Mock pretend as a real S3 http server by implementing the S3 API under a local fs directory). The S3Mock simulator have fully covered test cases to guarantee the local S3 has the same semantics as the aws s3. When I implement the aliyun OSS integration, I thought I should provide a similar object storage simulator to align between the local tests and public aliyun oss, so I provided a OSSMockApplication and TestLocalOSS to align the semantics. For my personal view, I would prefer to provide a fully tested simulator for private vendor integration so that we could build unit tests on top of it to verify the correctness. As we will introduce more and more public/private vendor integration in future, I think we should consider agreeing on the details of introducing the vendor as soon as possible, and provide a more complete guide for community contributors to follow and implement. FYI @rdblue & @danielcweeks . |
|
FYI @jackye1995 & @yyanyy |
|
I generally agree that we want to be able to run tests that exercise the actual code against a working back-end, and not tests that use custom mocking at some level within the code being tested. |
I think in the ideal world we should, but I'm not sure if we need to completely block new contributions for cloud vendor integration if there is no working backend library for storage services that are available for unit test. In aws module we have an integration test package that talks to the actual service. However we don't run them during PR submission and they are run manually before each release. I think we should try to integrate them as one of the auto tests to catch regression. With or without a library that provides full functionality for unit testing, I think this integration test is still valuable. |
|
Let's make this more clear, I write the following table:
|
@rdblue , your prefer is definitely right if we don't consider the private vendor services. The Dell ECS cannot be publicly accessed when we release manager decide to check the candidate release, it will need to deploy their software in their required hardware + hosts to verify the correctness ( free or charge ? @mechgouki ) . So that's why I think we need a services simulator provided from Dell ECS to align the protocol between iceberg tests and Dell real production services. |
|
Thanks @rdblue and @openinx for the feedback. For the real integration with our customers, we will take the responsibility, instead of community. |
|
We are also moving to cloud native approach, so maybe in the future we could deploy on cloud and run the integration test there. But right now we would like to explore a new testing strategy with community and get ball rolling |
|
A few points I'd like to discuss:
I remember @rdblue you talked about the possibility of having a RESTful Catalog implementation to plugin, would that help this Dell use case?
I have been thinking recently a lot about the SDK version, and maybe we could consider reverting to v1, and Dell can contribute just a The reason I am thinking about reverting to v1 is because of client side encryption support. V2 was promised to offer client side encryption this summer which would let v2 SDK have full functionality compatibility with v1 plus supposedly better performance, but the whole project was significantly delayed and won't be done until years later. So there is also an ask for adding S3 client side encryption from user side, for which the only way to achieve that is through reverting to v1. I think this version change could be done given the fact that nothing around AWS client is publicly exposed. Some work is needed to update documentation around the dependency jars to add. But if we see enough benefits in adopting S3-like private vendors by reintroducing V1, I think this seems to be the best way to go. @danielcweeks what do you think about the S3 SDK situation? @mechgouki if we reintroduce v1 SDK, do you think you still need the dell module, or could you just implement a |
Basically we have 2 areas which Dell EMC features could benefit as below: So in order to support these 2 changes, we need change both in FileIO and S3Catalog, which we can not directly use AWS module while we could based on V1 SDK to extend. |
|
had some offline discussion with @mechgouki . Here are some conclusions:
|
|
Had some offline discussion with @danielcweeks, and we are exploring why SDK V2 could not achieve the goal. It seems that we can still set header through: The SDK V1 just has that as a util method in Could you validate if that is the case? If we can set headers like this, can we move to implement this through the V2 SDK? @mechgouki |
|
Based on the latest conversation with @mechgouki , Dell has decided to open source their own client SDK under BSD license, and will not go through the S3 SDK. So they will rewrite the PR to contribute their catalog and FileIO. |
Yes, this is correct. And we will take responsibility for the support the client SDK |
|
The Dell EMC ECS SDK already open source at https://github.com/EMCECS/ecs-object-client-java |
|
I closed the first PR which create an abstraction of ECS APIs. Due to using our own SDK, this PR is redundant. And the second PR is now available: #3376 |
|
@wang-x-xia do you plan to also create a new PR for the |
|
No. The code of this PR won't update. I'll create a new PR for catalog implementation. |
|
@jackye1995 @openinx Just record the offline discussion about the integration test:
|
|
close the PR based on conversations above. |
|
@wang-x-xia Hello, we are trying to use ecs and iceberg to build lake follow by this Dell doc https://www.delltechnologies.com/asset/zh-cn/products/storage/industry-market/apache-iceberg-dell-emc-ecs.pdf |
We was planned to merged this change into Iceberg 0.13 - but due to holiday , it did not happened. So please send mail to dell emc channel to get the official support. |
|
@wang-x-xia Ecs support hudi? |
Use S3 protocol. Apache Hudi uses the HDFS as its storage abstraction. So it won't use additional benefits from ECS. |
|
Hudi supports S3, it should be possible, streaming hudi is easier to write,
hopefully support HUDi
Xia ***@***.***> 于2022年2月22日周二 12:01写道:
… @wang-x-xia <https://github.com/wang-x-xia> Ecs support hudi?
Use S3 protocol. Apache Hudi uses the HDFS as its storage abstraction. So
it won't use additional benefits from ECS.
—
Reply to this email directly, view it on GitHub
<#2807 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIXXZSGFUBQ7LLNJFS46SDU4MDBFANCNFSM5AGGJQIQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you commented.Message ID:
***@***.***>
|
|
@mechgouki Hi, |
|
@guillaumBrisard |
I create a new module for Dell EMC ECS Catalog implementation.
The package "org.apache.iceberg.dell.emc.ecs" contains following things:
Then, the package "org.apache.iceberg.dell.emc.ecs.impl" impls the EcsClient and some related interfaces.
Because Dell EMC ECS extends standard Amazon S3 API, we use Amazon S3 SDK v1 (v2 SDK doesn't allow the custom behavior).
For unit tests, I create an EcsClient impl named MemoryEcsClient. It provides the same assumptions that EcsClient provided.
Original issue: #2806