
Conversation

@wang-x-xia
Contributor

I created a new module for the Dell EMC ECS Catalog implementation.

The package "org.apache.iceberg.dell.emc.ecs" contains the following:

  1. Abstractions of object storage:

| Class | Description |
| --- | --- |
| ObjectBaseKey | The prefix of an object key. |
| ObjectKey | The object key. |
| ObjectKeys | Object key operations. |
| ObjectHeadInfo | The basic information of an object. |
| EcsClient | The abstract client of object storage. |
| PropertiesSerDes | Properties de/serialization. |

  2. ECS catalog implementations:

| Impl | Interface |
| --- | --- |
| EcsCatalog | Catalog |
| EcsFile | InputFile, OutputFile |
| EcsFileIO | FileIO |
| EcsTableOperations | TableOperations |

The package "org.apache.iceberg.dell.emc.ecs.impl" then implements EcsClient and the related interfaces.

Because Dell EMC ECS extends the standard Amazon S3 API, we use the Amazon S3 SDK v1 (the v2 SDK doesn't allow the custom behavior).

| Feature | Method | Doc |
| --- | --- | --- |
| Replace an existing object with an eTag check | EcsClient#replace | Undocumented |
| Create an object only if absent | EcsClient#writeIfAbsent, EcsClient#copyObjectIfAbsent | If-None-Match |
| Append bytes | EcsClient#outputStream | Range |

For unit tests, I created an EcsClient implementation named MemoryEcsClient. It provides the same guarantees that EcsClient promises.
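A minimal sketch of what such an in-memory client can look like, using only the JDK. The class and method names below mirror the feature table (writeIfAbsent, replace) but are hypothetical; this is not the actual MemoryEcsClient code:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model of the conditional-write semantics described
// above; a sketch for illustration, not the actual MemoryEcsClient.
public class ConditionalStore {
    private final Map<String, String> values = new ConcurrentHashMap<>();
    private final Map<String, String> eTags = new ConcurrentHashMap<>();

    // Models "If-None-Match: *": succeeds only when the key does not exist yet.
    public synchronized boolean writeIfAbsent(String key, String value) {
        if (values.containsKey(key)) {
            return false;
        }
        values.put(key, value);
        eTags.put(key, UUID.randomUUID().toString());
        return true;
    }

    // Models an eTag-checked replace: succeeds only when the caller holds the
    // eTag of the latest version, so concurrent writers cannot clobber each other.
    public synchronized boolean replace(String key, String expectedETag, String newValue) {
        if (!expectedETag.equals(eTags.get(key))) {
            return false;
        }
        values.put(key, newValue);
        eTags.put(key, UUID.randomUUID().toString());
        return true;
    }

    public String eTag(String key) {
        return eTags.get(key);
    }

    public static void main(String[] args) {
        ConditionalStore store = new ConditionalStore();
        System.out.println(store.writeIfAbsent("table/metadata.json", "v1")); // true
        System.out.println(store.writeIfAbsent("table/metadata.json", "v1")); // false: key exists
        String tag = store.eTag("table/metadata.json");
        System.out.println(store.replace("table/metadata.json", tag, "v2")); // true
        System.out.println(store.replace("table/metadata.json", tag, "v3")); // false: stale eTag
    }
}
```

A test client like this lets the catalog's conflict-handling paths be exercised without a real ECS endpoint.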

Original issue: #2806

2. Add ecs.client.factory to create external EcsClient.
3. Add unit test for EcsCatalog.
…ion.

2. Move PropertiesSerDes into EcsClient to provide an all-in-one EcsClient abstraction.
3. Add more detailed comments.
@github-actions github-actions bot added the build label Jul 12, 2021
@kbendick
Contributor

kbendick commented Jul 12, 2021

One suggestion: you might consider making the top-level folder dell, similar to our aws package.

@rdblue
Contributor

rdblue commented Jul 13, 2021

@wang-x-xia, thanks for working on this. I'm glad to see proposed support for EMC!

I think the first thing to do is to get this into more manageable chunks to review and commit. Is it possible to divide this into a FileIO implementation PR and then a Catalog and TableOperations PR? It would also be really helpful to add a bit more about what you're proposing to the description. For example: How does the catalog work? What is the atomic operation you're using?

2. Simplify package and module name.
@wang-x-xia
Contributor Author

@kbendick

The package and module names have been changed.

@rdblue

We want to provide a complete catalog solution. It's hard to separate parts of the implementation into different PRs. I think I can give you more background about this catalog; I'm preparing some material on it.

@rdblue
Contributor

rdblue commented Jul 14, 2021

@wang-x-xia, it's really hard to review PRs that are larger than necessary. If you want to get this in more quickly, I suggest making it easier to review by dividing it up into reasonable sized PRs.

@kbendick
Contributor

kbendick commented Jul 15, 2021

@rdblue

> We want to provide a complete catalog solution. It's hard to separate parts of the implementation into different PRs. I think I can give you more background about this catalog; I'm preparing some material on it.

I agree with Ryan that it’s very hard to review PRs that are so large in scope.

Sometimes, I’ve seen people have one main PR / mother PR, kept as a reference (which is updated as other PRs are reviewed). And then smaller PRs of some components (like the ones Ryan mentioned) are broken out for review, with possibly a reference to the whole PR for people to see the desired end picture (marking it as a draft or [DO NOT MERGE] etc).

This way, contributors can review PRs that are more manageable in size, but the overview can still be provided if it really is that important. Just be sure to update the reference / mother PR based on updates you make to the others.

Ideally, parts are well enough contained to be reviewable on their own. But I do agree with Ryan, that if you want to get this in more quickly, it would be most advisable to break it up into more manageable chunks (along the API lines he mentioned would be a good place to start). 🙂

@wang-x-xia
Contributor Author

@rdblue and @kbendick

Thanks for your suggestion!
I'll separate this into 3 or 4 parts:

  1. EcsClient, which provides object access methods used in Catalog and FileIO.
  2. Implementations related to FileIO.
  3. Implementations related to Catalog.

Maybe the first part will be separated into multiple PRs if it contains too many files.
I need some time to finish this.

@fpj

fpj commented Jul 15, 2021

Out of curiosity, do you use feature branches for large features in Iceberg, or do you typically prefer to merge the parts directly onto master?

@rdblue
Contributor

rdblue commented Jul 19, 2021

@fpj, we prefer merging into master to avoid the need to re-review feature branches to get them into master. I think it works best when we can take a working branch and divide it up into working PRs that can be committed separately.

@mechgouki

While we are refining the PR, I would like to hear feedback on how to do regression testing moving forward. In this initial PR, we will provide an ECS in-memory simulation as the test suite. Will this approach work for the community?
The reason is that today ECS mainly focuses on on-premise/hybrid cloud use cases with different appliance models (with different performance/capacity profiles), so it would be hard for the community to run regression tests on its own; that's why we suggest using the simulation here.

@openinx
Member

openinx commented Sep 2, 2021

Before we start to split the PR into smaller PRs, I think the Iceberg community needs to reach consensus about public/private vendor integration contributions. The iceberg-aws module is a great example: it provides independent mock unit tests for each small feature. The most important point is that Adobe has provided the S3 integration test utility com.adobe.testing:s3mock-junit4, which can launch a local mini S3 cluster for accessing the HTTP API (the S3Mock pretends to be a real S3 HTTP server by implementing the S3 API on top of a local fs directory). The S3Mock simulator has fully covered test cases to guarantee that the local S3 has the same semantics as AWS S3.

When I implemented the Aliyun OSS integration, I thought I should provide a similar object storage simulator to align the local tests with the public Aliyun OSS, so I provided an OSSMockApplication and TestLocalOSS to align the semantics. In my personal view, I would prefer to provide a fully tested simulator for each private vendor integration so that we can build unit tests on top of it to verify correctness.

As we will introduce more and more public/private vendor integrations in the future, I think we should agree on the details of introducing vendors as soon as possible, and provide a more complete guide for community contributors to follow.

FYI @rdblue & @danielcweeks .

@openinx
Member

openinx commented Sep 2, 2021

FYI @jackye1995 & @yyanyy

@rdblue
Contributor

rdblue commented Sep 2, 2021

I generally agree that we want to be able to run tests that exercise the actual code against a working back-end, and not tests that use custom mocking at some level within the code being tested.

@yyanyy
Contributor

yyanyy commented Sep 2, 2021

> Before we start to split the PR into smaller PRs, I think the Iceberg community needs to reach consensus about public/private vendor integration contributions. The iceberg-aws module is a great example: it provides independent mock unit tests for each small feature. The most important point is that Adobe has provided the S3 integration test utility com.adobe.testing:s3mock-junit4, which can launch a local mini S3 cluster for accessing the HTTP API (the S3Mock pretends to be a real S3 HTTP server by implementing the S3 API on top of a local fs directory). The S3Mock simulator has fully covered test cases to guarantee that the local S3 has the same semantics as AWS S3.
>
> When I implemented the Aliyun OSS integration, I thought I should provide a similar object storage simulator to align the local tests with the public Aliyun OSS, so I provided an OSSMockApplication and TestLocalOSS to align the semantics. In my personal view, I would prefer to provide a fully tested simulator for each private vendor integration so that we can build unit tests on top of it to verify correctness.
>
> As we will introduce more and more public/private vendor integrations in the future, I think we should agree on the details of introducing vendors as soon as possible, and provide a more complete guide for community contributors to follow.
>
> FYI @rdblue & @danielcweeks .

I think in an ideal world we should, but I'm not sure we need to completely block new cloud vendor integration contributions when there is no working backend library for the storage service available for unit tests. In the aws module we have an integration test package that talks to the actual service; however, we don't run those tests during PR submission, and they are run manually before each release. I think we should try to integrate them as one of the automated tests to catch regressions. With or without a library that provides full functionality for unit testing, I think this integration test is still valuable.

@openinx
Member

openinx commented Sep 3, 2021

Let's make this clearer; I've written the following table:

| Tests | Run in unit tests | Run at release | Public vendor services | Private vendor services |
| --- | --- | --- | --- | --- |
| API mock tests | YES | YES | Required | Required |
| Service simulator | YES | YES | Optional | Required |
| Integration tests against real vendor services | NO | YES? (private services cannot be checked) | Required | Required |

@openinx
Member

openinx commented Sep 3, 2021

> I generally agree that we want to be able to run tests that exercise the actual code against a working back-end

@rdblue , your preference is definitely right if we don't consider private vendor services. Dell ECS cannot be publicly accessed when the release manager decides to check a candidate release; it would require deploying their software on the required hardware + hosts to verify correctness (free or charged? @mechgouki). That's why I think we need a service simulator provided by Dell ECS to align the protocol between the Iceberg tests and Dell's real production services.

@mechgouki

Thanks @rdblue and @openinx for the feedback.
Yes, today we (Dell EMC Object Storage) mainly run as a private service, so even though we do have a process for customers to try it, that could be overkill for the community. So I would like to suggest that we provide a new S3 mock service (based on the Adobe one and focused on the special extension APIs, since we have good compatibility with AWS S3).

For the real integration with our customers, we will take the responsibility, instead of community.

@mechgouki

We are also moving to a cloud-native approach, so maybe in the future we could deploy on the cloud and run the integration tests there.

But right now we would like to explore a new testing strategy with the community and get the ball rolling.

@jackye1995
Contributor

A few points I'd like to discuss:

  1. around the private vendor catalog implementation

I remember @rdblue you talked about the possibility of having a RESTful Catalog implementation to plug in; would that help this Dell use case?

  2. around the S3 SDK version

I have been thinking a lot recently about the SDK version, and maybe we could consider reverting to v1, and Dell could contribute just an S3Catalog instead.

The reason I am thinking about reverting to v1 is client-side encryption support. V2 was promised to offer client-side encryption this summer, which would give the v2 SDK full functional compatibility with v1 plus supposedly better performance, but the whole project was significantly delayed and won't be done for years. There is also an ask from users for S3 client-side encryption, and the only way to achieve that is to revert to v1.

I think this version change could be done given that nothing around the AWS client is publicly exposed. Some work is needed to update the documentation around the dependency jars to add. But if we see enough benefit in supporting S3-like private vendors by reintroducing v1, this seems to be the best way to go.

@danielcweeks what do you think about the S3 SDK situation?

@mechgouki if we reintroduce the v1 SDK, do you think you still need the dell module, or could you just implement an S3Catalog in the AWS module instead?

@mechgouki

> A few points I'd like to discuss:
>
>   1. around the private vendor catalog implementation
>
> I remember @rdblue you talked about the possibility of having a RESTful Catalog implementation to plug in; would that help this Dell use case?
>
>   2. around the S3 SDK version
>
> I have been thinking a lot recently about the SDK version, and maybe we could consider reverting to v1, and Dell could contribute just an S3Catalog instead.
>
> The reason I am thinking about reverting to v1 is client-side encryption support. V2 was promised to offer client-side encryption this summer, which would give the v2 SDK full functional compatibility with v1 plus supposedly better performance, but the whole project was significantly delayed and won't be done for years. There is also an ask from users for S3 client-side encryption, and the only way to achieve that is to revert to v1.
>
> I think this version change could be done given that nothing around the AWS client is publicly exposed. Some work is needed to update the documentation around the dependency jars to add. But if we see enough benefit in supporting S3-like private vendors by reintroducing v1, this seems to be the best way to go.
>
> @danielcweeks what do you think about the S3 SDK situation?
>
> @mechgouki if we reintroduce the v1 SDK, do you think you still need the dell module, or could you just implement an S3Catalog in the AWS module instead?

@jackye1995

Basically we have two areas where Dell EMC features could help:
(1) Append operations in addition to MPU; if the client has limited local cache for large objects (like at the edge), the Dell EMC object service could help here.
(2) Atomic rename. We have If-Match and If-None-Match semantics, as we have supported a strong consistency model (within one site) from the very beginning.

So in order to support these two features, we need changes in both FileIO and S3Catalog, which means we cannot use the AWS module directly, but we could extend it based on the v1 SDK.
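The atomic rename/commit that If-Match enables can be illustrated with a tiny JDK-only sketch, where AtomicReference#compareAndSet stands in for the server-side eTag check (the class and method names here are hypothetical, not part of any real client):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a table commit built on conditional replace. With a real client,
// commit() would be a PUT with "If-Match: <eTag>" so that only the writer who
// saw the latest metadata pointer can swap it; compareAndSet models that check.
public class CommitSketch {
    private final AtomicReference<String> metadataLocation =
        new AtomicReference<>("v1.metadata.json");

    public boolean commit(String expectedLocation, String newLocation) {
        return metadataLocation.compareAndSet(expectedLocation, newLocation);
    }

    public static void main(String[] args) {
        CommitSketch table = new CommitSketch();
        // First committer wins because it saw the current pointer.
        System.out.println(table.commit("v1.metadata.json", "v2.metadata.json")); // true
        // A second committer based on the stale pointer fails and must retry.
        System.out.println(table.commit("v1.metadata.json", "v3.metadata.json")); // false
    }
}
```

The losing committer re-reads the pointer and retries, which is the usual optimistic-concurrency pattern Iceberg catalogs rely on.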

@jackye1995
Contributor

Had some offline discussion with @mechgouki. Here are some conclusions:

  1. We can make this the S3Catalog implementation in the AWS module. In the future, even if other vendors including AWS S3 come up with similar semantics, we can add a catalog config to switch across implementations.
  2. We can add a catalog config like use-append to switch to the append-based output stream implementation in S3FileIO. Overall the Netflix S3OutputStream still seems more performant, but the use case around append-optimized object storage looks like a reasonable one to support.
  3. This all depends on the switch to the v1 SDK, but it seems mutually beneficial given that v2 does not support custom headers for third-party vendors. Reverting to v1 can give Iceberg more vendor integrations, more features, and reduce the number of modules we need to create for new vendors.

@jackye1995
Contributor

Had some offline discussion with @danielcweeks, and we are exploring why the SDK v2 could not achieve the goal. It seems that we can still set the header through:

    s3.putObject(PutObjectRequest.builder()
        .bucket(bucketName)
        .key(objectKey)
        .overrideConfiguration(AwsRequestOverrideConfiguration.builder()
            .putHeader("If-None-Match", "*")
            .build())
        .build(),
        RequestBody.fromBytes(content)); // body argument required by the v2 putObject API

The SDK v1 just exposes this as a util method on ObjectMetadata; all the user metadata are just headers with the x-amz-meta- prefix, per https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html.

Could you validate whether that is the case? If we can set headers like this, can we implement this through the v2 SDK? @mechgouki
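At the wire level, both SDK versions ultimately just attach an HTTP header to the PUT. A JDK-only sketch that builds (but does not send) such a request against a placeholder endpoint:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds a conditional PUT carrying "If-None-Match: *", which asks the server
// to create the object only if it does not already exist. The endpoint is a
// placeholder; the request is constructed but never sent.
public class ConditionalPutHeader {
    public static void main(String[] args) {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://ecs.example.com/bucket/metadata.json"))
            .header("If-None-Match", "*")
            .PUT(HttpRequest.BodyPublishers.ofString("{}"))
            .build();
        // Inspect the header the SDK would otherwise set for us.
        System.out.println(request.headers().firstValue("If-None-Match").orElse("missing"));
    }
}
```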

@jackye1995
Contributor

Based on the latest conversation with @mechgouki, Dell has decided to open source their own client SDK under the BSD license, and will not go through the S3 SDK. So they will rewrite the PR to contribute their catalog and FileIO.

@mechgouki

> Based on the latest conversation with @mechgouki, Dell has decided to open source their own client SDK under the BSD license, and will not go through the S3 SDK. So they will rewrite the PR to contribute their catalog and FileIO.

Yes, this is correct. And we will take responsibility for supporting the client SDK.

@mechgouki

The Dell EMC ECS SDK is already open source at https://github.com/EMCECS/ecs-object-client-java

@wang-x-xia
Contributor Author

I closed the first PR, which created an abstraction over the ECS APIs. Since we are using our own SDK, that PR is redundant.

The second PR is now available: #3376

@jackye1995
Contributor

@wang-x-xia do you plan to also create a new PR for the EcsCatalog, or keep updating this one? If you plan to create a new one, I will close this PR.

@wang-x-xia
Contributor Author

@jackye1995

No, the code in this PR won't be updated; I'll create a new PR for the catalog implementation.
Some discussion on this PR was still active, so I didn't close it yesterday.
Closing this PR is fine with me.

@mechgouki

@jackye1995 @openinx Just to record the offline discussion about the integration test:

  • We will first try to merge the implementation and the client mock test suites, for which you have already started the review process.
  • We will take responsibility for our customers running Iceberg on our products, not the community.
  • In parallel, we are developing the full integration mock service, but that needs time to pass our internal review first.

@jackye1995
Contributor

Closing the PR based on the conversation above.

@jackye1995 jackye1995 closed this Nov 3, 2021
@figurant

figurant commented Jan 6, 2022

@wang-x-xia Hello, we are trying to use ECS and Iceberg to build a data lake by following this Dell doc: https://www.delltechnologies.com/asset/zh-cn/products/storage/industry-market/apache-iceberg-dell-emc-ecs.pdf
Where can we find this jar: iceberg-ecs-catalog-0.12.0.jar?
Thanks.

@mechgouki

> @wang-x-xia Hello, we are trying to use ECS and Iceberg to build a data lake by following this Dell doc: https://www.delltechnologies.com/asset/zh-cn/products/storage/industry-market/apache-iceberg-dell-emc-ecs.pdf Where can we find this jar: iceberg-ecs-catalog-0.12.0.jar? Thanks.

We had planned to merge this change into Iceberg 0.13, but due to the holiday it did not happen. So please send mail to the Dell EMC channel to get official support.

@melin

melin commented Feb 19, 2022

@wang-x-xia Does ECS support Hudi?

@wang-x-xia
Contributor Author

> @wang-x-xia Does ECS support Hudi?

Use the S3 protocol. Apache Hudi uses HDFS as its storage abstraction, so it won't get the additional benefits from ECS.

@melin

melin commented Feb 23, 2022 via email

@guillaumBrisard

@mechgouki Hi, is it planned to be merged into Iceberg 0.14, please?

@mechgouki

@guillaumBrisard
Yes, we are waiting for the doc to be merged. If everything goes smoothly, we could consider it for the 0.14 release.

