
Conversation

@wang-x-xia
Contributor

I created a new module for the Dell EMC ECS Catalog implementation.

The package "org.apache.iceberg.dell.emc.ecs" contains the following:

  1. Abstractions of object storage:

| Class | Description |
| --- | --- |
| ObjectBaseKey | The prefix of an object key. |
| ObjectKey | The object key. |
| ObjectKeys | Object key operations. |
| ObjectHeadInfo | The basic information of an object. |
| EcsClient | The abstract client of object storage. |
| PropertiesSerDes | Properties de/serialization. |

  2. ECS catalog implementations:

| Impl | Interface |
| --- | --- |
| EcsCatalog | Catalog |
| EcsFile | InputFile, OutputFile |
| EcsFileIO | FileIO |
| EcsTableOperations | TableOperations |

The package "org.apache.iceberg.dell.emc.ecs.impl" then implements EcsClient and the related interfaces.

Because Dell EMC ECS extends the standard Amazon S3 API, we use the Amazon S3 SDK v1 (the v2 SDK doesn't allow the custom behavior).

| Feature | Method | Doc |
| --- | --- | --- |
| Replace an existing object with an eTag check | EcsClient#replace | Undocumented |
| Create an object only if absent | EcsClient#writeIfAbsent, EcsClient#copyObjectIfAbsent | If-None-Match |
| Append bytes | EcsClient#outputStream | Range |

For unit tests, I created an EcsClient implementation named MemoryEcsClient. It provides the same guarantees that EcsClient promises.
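A minimal sketch of what such an in-memory client can look like, using only the JDK. The class and method names below mirror the feature table (writeIfAbsent, replace) but are hypothetical; this is not the actual MemoryEcsClient code:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model of the conditional-write semantics described
// above; a sketch for illustration, not the actual MemoryEcsClient.
public class ConditionalStore {
    private final Map<String, String> values = new ConcurrentHashMap<>();
    private final Map<String, String> eTags = new ConcurrentHashMap<>();

    // Models "If-None-Match: *": succeeds only when the key does not exist yet.
    public synchronized boolean writeIfAbsent(String key, String value) {
        if (values.containsKey(key)) {
            return false;
        }
        values.put(key, value);
        eTags.put(key, UUID.randomUUID().toString());
        return true;
    }

    // Models an eTag-checked replace: succeeds only when the caller holds the
    // eTag of the latest version, so concurrent writers cannot clobber each other.
    public synchronized boolean replace(String key, String expectedETag, String newValue) {
        if (!expectedETag.equals(eTags.get(key))) {
            return false;
        }
        values.put(key, newValue);
        eTags.put(key, UUID.randomUUID().toString());
        return true;
    }

    public String eTag(String key) {
        return eTags.get(key);
    }

    public static void main(String[] args) {
        ConditionalStore store = new ConditionalStore();
        System.out.println(store.writeIfAbsent("table/metadata.json", "v1")); // true
        System.out.println(store.writeIfAbsent("table/metadata.json", "v1")); // false: key exists
        String tag = store.eTag("table/metadata.json");
        System.out.println(store.replace("table/metadata.json", tag, "v2")); // true
        System.out.println(store.replace("table/metadata.json", tag, "v3")); // false: stale eTag
    }
}
```

A test client like this lets the catalog's conflict-handling paths be exercised without a real ECS endpoint.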

Original issue: #2806

2. Add ecs.client.factory to create external EcsClient.
3. Add unit test for EcsCatalog.
…ion.

2. Move PropertiesSerDes into EcsClient to provide an all-in-one EcsClient abstraction.
3. Add more detailed comments.
@github-actions github-actions bot added the build label Jul 12, 2021
@kbendick
Contributor

kbendick commented Jul 12, 2021

One suggestion: you might consider making the top-level folder dell, similar to our aws package.

@rdblue
Contributor

rdblue commented Jul 13, 2021

@wang-x-xia, thanks for working on this. I'm glad to see proposed support for EMC!

I think the first thing to do is to get this into more manageable chunks to review and commit. Is it possible to divide this into a FileIO implementation PR and then a Catalog and TableOperations PR? It would also be really helpful to add a bit more about what you're proposing to the description. For example: How does the catalog work? What is the atomic operation you're using?

2. Simplify package and module name.
@wang-x-xia
Contributor Author

@kbendick

The package and module names have been changed.

@rdblue

We want to provide a complete catalog solution. It's hard to separate parts of the implementation into different PRs. I think I can give you more background about this catalog; I'm preparing some material on it.

@rdblue
Contributor

rdblue commented Jul 14, 2021

@wang-x-xia, it's really hard to review PRs that are larger than necessary. If you want to get this in more quickly, I suggest making it easier to review by dividing it up into reasonable sized PRs.

@kbendick
Contributor

kbendick commented Jul 15, 2021

@rdblue

> We want to provide a complete catalog solution. It's hard to separate parts of the implementation into different PRs. I think I can give you more background about this catalog; I'm preparing some material on it.

I agree with Ryan that it’s very hard to review PRs that are so large in scope.

Sometimes, I’ve seen people have one main PR / mother PR, kept as a reference (which is updated as other PRs are reviewed). And then smaller PRs of some components (like the ones Ryan mentioned) are broken out for review, with possibly a reference to the whole PR for people to see the desired end picture (marking it as a draft or [DO NOT MERGE] etc).

This way, contributors can review PRs that are more manageable in size, but the overview can still be provided if it really is that important. Just be sure to update the reference / mother PR based on updates you make to the others.

Ideally, parts are well enough contained to be reviewable on their own. But I do agree with Ryan, that if you want to get this in more quickly, it would be most advisable to break it up into more manageable chunks (along the API lines he mentioned would be a good place to start). 🙂

@wang-x-xia
Contributor Author

@rdblue and @kbendick

Thanks for your suggestion!
I'll separate this into 3 or 4 parts:

  1. EcsClient, which provides object access methods used in Catalog and FileIO.
  2. Implementations related to FileIO.
  3. Implementations related to Catalog.

Maybe the first part will be separated into multiple PRs if it contains too many files.
I need some time to finish this.

@fpj

fpj commented Jul 15, 2021

Out of curiosity, do you use feature branches for large features in Iceberg, or do you typically prefer to merge the parts directly onto master?

@rdblue
Contributor

rdblue commented Jul 19, 2021

@fpj, we prefer merging into master to avoid the need to re-review feature branches to get them into master. I think it works best when we can take a working branch and divide it up into working PRs that can be committed separately.

@mechgouki

While we are refining the PR, I would like to hear feedback on how to do regression testing moving forward. In this initial PR, we will provide an ECS in-memory simulation as the test suite. Will this approach work for the community?
The reason is that today ECS mainly focuses on on-premise/hybrid cloud use cases with different appliance models (with different performance/capacity profiles), so it would be hard for the community to run regression tests on its own; that's why we suggest using the simulation here.

@openinx
Member

openinx commented Sep 2, 2021

Before we start to split the PR into smaller PRs, I think the Iceberg community needs to reach consensus about public/private vendor integration contributions. The iceberg-aws module is a great example: it provides independent mock unit tests for each small feature. The most important point is that Adobe has provided the S3 integration test utility com.adobe.testing:s3mock-junit4, which can launch a local mini S3 cluster for accessing the HTTP API (the S3Mock pretends to be a real S3 HTTP server by implementing the S3 API on top of a local fs directory). The S3Mock simulator has fully covered test cases to guarantee that the local S3 has the same semantics as AWS S3.

When I implemented the Aliyun OSS integration, I thought I should provide a similar object storage simulator to align the local tests with the public Aliyun OSS, so I provided an OSSMockApplication and TestLocalOSS to align the semantics. In my personal view, I would prefer to provide a fully tested simulator for each private vendor integration so that we can build unit tests on top of it to verify correctness.

As we will introduce more and more public/private vendor integrations in the future, I think we should agree on the details of introducing vendors as soon as possible, and provide a more complete guide for community contributors to follow.

FYI @rdblue & @danielcweeks .

@openinx
Member

openinx commented Sep 2, 2021

FYI @jackye1995 & @yyanyy

@rdblue
Contributor

rdblue commented Sep 2, 2021

I generally agree that we want to be able to run tests that exercise the actual code against a working back-end, and not tests that use custom mocking at some level within the code being tested.

@yyanyy
Contributor

yyanyy commented Sep 2, 2021

> Before we start to split the PR into smaller PRs, I think the Iceberg community needs to reach consensus about public/private vendor integration contributions. The iceberg-aws module is a great example: it provides independent mock unit tests for each small feature. The most important point is that Adobe has provided the S3 integration test utility com.adobe.testing:s3mock-junit4, which can launch a local mini S3 cluster for accessing the HTTP API (the S3Mock pretends to be a real S3 HTTP server by implementing the S3 API on top of a local fs directory). The S3Mock simulator has fully covered test cases to guarantee that the local S3 has the same semantics as AWS S3.
>
> When I implemented the Aliyun OSS integration, I thought I should provide a similar object storage simulator to align the local tests with the public Aliyun OSS, so I provided an OSSMockApplication and TestLocalOSS to align the semantics. In my personal view, I would prefer to provide a fully tested simulator for each private vendor integration so that we can build unit tests on top of it to verify correctness.
>
> As we will introduce more and more public/private vendor integrations in the future, I think we should agree on the details of introducing vendors as soon as possible, and provide a more complete guide for community contributors to follow.
>
> FYI @rdblue & @danielcweeks .

I think in an ideal world we should, but I'm not sure we need to completely block new cloud vendor integration contributions when there is no working backend library for the storage service available for unit tests. In the aws module we have an integration test package that talks to the actual service; however, we don't run those tests during PR submission, and they are run manually before each release. I think we should try to integrate them as one of the automated tests to catch regressions. With or without a library that provides full functionality for unit testing, I think this integration test is still valuable.

@openinx
Member

openinx commented Sep 3, 2021

Let's make this clearer; I've written the following table:

| Tests | Run in unit tests | Run at release | Public vendor services | Private vendor services |
| --- | --- | --- | --- | --- |
| API mock tests | YES | YES | Required | Required |
| Service simulator | YES | YES | Optional | Required |
| Integration tests against real vendor services | NO | YES? (private services cannot be checked) | Required | Required |

@openinx
Member

openinx commented Sep 3, 2021

> I generally agree that we want to be able to run tests that exercise the actual code against a working back-end

@rdblue , your preference is definitely right if we don't consider private vendor services. Dell ECS cannot be publicly accessed when the release manager decides to check a candidate release; it would require deploying their software on the required hardware + hosts to verify correctness (free or charged? @mechgouki). That's why I think we need a service simulator provided by Dell ECS to align the protocol between the Iceberg tests and Dell's real production services.

@mechgouki

Thanks @rdblue and @openinx for the feedback.
Yes, today we (Dell EMC Object Storage) mainly run as a private service, so even though we do have a process for customers to try it, that could be overkill for the community. So I would like to suggest that we provide a new S3 mock service (based on the Adobe one and focused on the special extension APIs, since we have good compatibility with AWS S3).

For the real integration with our customers, we will take the responsibility, instead of community.

@mechgouki

We are also moving to a cloud-native approach, so maybe in the future we could deploy on the cloud and run the integration tests there.

But right now we would like to explore a new testing strategy with the community and get the ball rolling.

@jackye1995
Contributor

A few points I'd like to discuss:

  1. around the private vendor catalog implementation

I remember @rdblue you talked about the possibility of having a RESTful Catalog implementation to plug in; would that help this Dell use case?

  2. around the S3 SDK version

I have been thinking a lot recently about the SDK version, and maybe we could consider reverting to v1, and Dell could contribute just an S3Catalog instead.

The reason I am thinking about reverting to v1 is client-side encryption support. V2 was promised to offer client-side encryption this summer, which would give the v2 SDK full functional compatibility with v1 plus supposedly better performance, but the whole project was significantly delayed and won't be done for years. There is also an ask from users for S3 client-side encryption, and the only way to achieve that is to revert to v1.

I think this version change could be done given that nothing around the AWS client is publicly exposed. Some work is needed to update the documentation around the dependency jars to add. But if we see enough benefit in supporting S3-like private vendors by reintroducing v1, this seems to be the best way to go.

@danielcweeks what do you think about the S3 SDK situation?

@mechgouki if we reintroduce the v1 SDK, do you think you still need the dell module, or could you just implement an S3Catalog in the AWS module instead?

@mechgouki

> A few points I'd like to discuss:
>
>   1. around the private vendor catalog implementation
>
> I remember @rdblue you talked about the possibility of having a RESTful Catalog implementation to plug in; would that help this Dell use case?
>
>   2. around the S3 SDK version
>
> I have been thinking a lot recently about the SDK version, and maybe we could consider reverting to v1, and Dell could contribute just an S3Catalog instead.
>
> The reason I am thinking about reverting to v1 is client-side encryption support. V2 was promised to offer client-side encryption this summer, which would give the v2 SDK full functional compatibility with v1 plus supposedly better performance, but the whole project was significantly delayed and won't be done for years. There is also an ask from users for S3 client-side encryption, and the only way to achieve that is to revert to v1.
>
> I think this version change could be done given that nothing around the AWS client is publicly exposed. Some work is needed to update the documentation around the dependency jars to add. But if we see enough benefit in supporting S3-like private vendors by reintroducing v1, this seems to be the best way to go.
>
> @danielcweeks what do you think about the S3 SDK situation?
>
> @mechgouki if we reintroduce the v1 SDK, do you think you still need the dell module, or could you just implement an S3Catalog in the AWS module instead?

@jackye1995

Basically we have two areas where Dell EMC features could help:
(1) Append operations in addition to MPU; if the client has limited local cache for large objects (like at the edge), the Dell EMC object service could help here.
(2) Atomic rename. We have If-Match and If-None-Match semantics, as we have supported a strong consistency model (within one site) from the very beginning.

So in order to support these two features, we need changes in both FileIO and S3Catalog, which means we cannot use the AWS module directly, but we could extend it based on the v1 SDK.
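The atomic rename/commit that If-Match enables can be illustrated with a tiny JDK-only sketch, where AtomicReference#compareAndSet stands in for the server-side eTag check (the class and method names here are hypothetical, not part of any real client):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a table commit built on conditional replace. With a real client,
// commit() would be a PUT with "If-Match: <eTag>" so that only the writer who
// saw the latest metadata pointer can swap it; compareAndSet models that check.
public class CommitSketch {
    private final AtomicReference<String> metadataLocation =
        new AtomicReference<>("v1.metadata.json");

    public boolean commit(String expectedLocation, String newLocation) {
        return metadataLocation.compareAndSet(expectedLocation, newLocation);
    }

    public static void main(String[] args) {
        CommitSketch table = new CommitSketch();
        // First committer wins because it saw the current pointer.
        System.out.println(table.commit("v1.metadata.json", "v2.metadata.json")); // true
        // A second committer based on the stale pointer fails and must retry.
        System.out.println(table.commit("v1.metadata.json", "v3.metadata.json")); // false
    }
}
```

The losing committer re-reads the pointer and retries, which is the usual optimistic-concurrency pattern Iceberg catalogs rely on.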

@jackye1995
Contributor

Had some offline discussion with @mechgouki. Here are some conclusions:

  1. We can make this the S3Catalog implementation in the AWS module. In the future, even if other vendors including AWS S3 come up with similar semantics, we can add a catalog config to switch across implementations.
  2. We can add a catalog config like use-append to switch to the append-based output stream implementation in S3FileIO. Overall the Netflix S3OutputStream still seems more performant, but the use case around append-optimized object storage looks like a reasonable one to support.
  3. This all depends on the switch to the v1 SDK, but it seems mutually beneficial given that v2 does not support custom headers for third-party vendors. Reverting to v1 can give Iceberg more vendor integrations, more features, and reduce the number of modules we need to create for new vendors.

@jackye1995
Contributor

Had some offline discussion with @danielcweeks, and we are exploring why the SDK v2 could not achieve the goal. It seems that we can still set the header through:

    s3.putObject(PutObjectRequest.builder()
        .bucket(bucketName)
        .key(objectKey)
        .overrideConfiguration(AwsRequestOverrideConfiguration.builder()
            .putHeader("If-None-Match", "*")
            .build())
        .build(),
        RequestBody.fromBytes(content)); // body argument required by the v2 putObject API

The SDK v1 just exposes this as a util method on ObjectMetadata; all the user metadata are just headers with the x-amz-meta- prefix, per https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html.

Could you validate whether that is the case? If we can set headers like this, can we implement this through the v2 SDK? @mechgouki
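At the wire level, both SDK versions ultimately just attach an HTTP header to the PUT. A JDK-only sketch that builds (but does not send) such a request against a placeholder endpoint:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds a conditional PUT carrying "If-None-Match: *", which asks the server
// to create the object only if it does not already exist. The endpoint is a
// placeholder; the request is constructed but never sent.
public class ConditionalPutHeader {
    public static void main(String[] args) {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://ecs.example.com/bucket/metadata.json"))
            .header("If-None-Match", "*")
            .PUT(HttpRequest.BodyPublishers.ofString("{}"))
            .build();
        // Inspect the header the SDK would otherwise set for us.
        System.out.println(request.headers().firstValue("If-None-Match").orElse("missing"));
    }
}
```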

@jackye1995
Contributor

Based on the latest conversation with @mechgouki, Dell has decided to open source their own client SDK under the BSD license, and will not go through the S3 SDK. So they will rewrite the PR to contribute their catalog and FileIO.

@mechgouki

> Based on the latest conversation with @mechgouki, Dell has decided to open source their own client SDK under the BSD license, and will not go through the S3 SDK. So they will rewrite the PR to contribute their catalog and FileIO.

Yes, this is correct. And we will take responsibility for supporting the client SDK.

@mechgouki

The Dell EMC ECS SDK is already open source at https://github.com/EMCECS/ecs-object-client-java

@wang-x-xia
Contributor Author

I closed the first PR, which created an abstraction over the ECS APIs. Since we are using our own SDK, that PR is redundant.

The second PR is now available: #3376

@jackye1995
Contributor

@wang-x-xia do you plan to also create a new PR for the EcsCatalog, or keep updating this one? If you plan to create a new one, I will close this PR.

@wang-x-xia
Contributor Author

@jackye1995

No, the code in this PR won't be updated; I'll create a new PR for the catalog implementation.
Some discussion on this PR was still active, so I didn't close it yesterday.
Closing this PR is fine with me.

@mechgouki

@jackye1995 @openinx Just to record the offline discussion about the integration test:

  • We will first try to merge the implementation and the client mock test suites, for which you have already started the review process.
  • We will take responsibility for our customers running Iceberg on our products, not the community.
  • In parallel, we are developing the full integration mock service, but that needs time to pass our internal review first.

@jackye1995
Contributor

Closing the PR based on the conversation above.

@jackye1995 jackye1995 closed this Nov 3, 2021
@figurant

figurant commented Jan 6, 2022

@wang-x-xia Hello, we are trying to use ECS and Iceberg to build a data lake by following this Dell doc: https://www.delltechnologies.com/asset/zh-cn/products/storage/industry-market/apache-iceberg-dell-emc-ecs.pdf
Where can we find this jar: iceberg-ecs-catalog-0.12.0.jar?
Thanks.

@mechgouki

> @wang-x-xia Hello, we are trying to use ECS and Iceberg to build a data lake by following this Dell doc: https://www.delltechnologies.com/asset/zh-cn/products/storage/industry-market/apache-iceberg-dell-emc-ecs.pdf Where can we find this jar: iceberg-ecs-catalog-0.12.0.jar? Thanks.

We had planned to merge this change into Iceberg 0.13, but due to the holiday it did not happen. So please send mail to the Dell EMC channel to get official support.

@melin

melin commented Feb 19, 2022

@wang-x-xia Does ECS support Hudi?

@wang-x-xia
Contributor Author

> @wang-x-xia Does ECS support Hudi?

Use the S3 protocol. Apache Hudi uses HDFS as its storage abstraction, so it won't get the additional benefits from ECS.

@melin

melin commented Feb 23, 2022 via email

@guillaumBrisard

@mechgouki Hi, is it planned to be merged into Iceberg 0.14, please?

@mechgouki

@guillaumBrisard
Yes, we are waiting for the doc to be merged. If everything goes smoothly, we could consider it for the 0.14 release.

