Skip to content

Conversation

@wang-x-xia
Copy link
Contributor

It's the first part of #2807 .

Dell EMC ECS is an Object Storage which delivers rich S3-compatibility.

This PR is to add new interfaces that provide several features from the Dell EMC ECS.

The main interface is EcsClient, it provides methods like:

  1. To get object info.
  2. To read and write object with Java InputStream/OutputStream that for data files.
  3. To read, replace, write and copy object that for metadata files.
  4. To list objects that for fetch concepts such as namespaces and databases.

And there are 2 additional interfaces which provide helper method of client:

  1. PropertiesSerDes is used to convert Map<String, String> to file content. I add a default version value to avoid compatible problems in future.
  2. ObjectKeys is used to convert name of Iceberg resources to object key.

The EcsClientImpl provide the impl that use Amazon AWS S3 Java SDK (V1) to connect with the ECS.
The Amazon AWS S3 Java SDK (V2), used in the :iceberg-aws module, doesn't provide extension.

The MemoryEcsClient provide the impl that mock all operations of EcsClient in memory. I will use in future tests.

2. Create EcsClient to define the related methods used in Iceberg.
@github-actions github-actions bot added the build label Jul 21, 2021
2. remove EcsClient#getProperties.
@mechgouki
Copy link

mechgouki commented Aug 13, 2021

@kbendick @rdblue
We would like to hear the suggestion whether we should wait for this the first new PR(1 of 3) pass the review and then start to post the rest ?
Or
Should we just go ahead to post the rest 2 PRs now?

@openinx

build.gradle Outdated
compile(project(':iceberg-nessie')) {
exclude group: 'com.google.code.findbugs', module: 'jsr305'
}
compile(project(':iceberg-dell'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove additions to the runtime Jars until later. We add new integrations to the runtime Jars after they are finished, and we also look at the dependencies that are added to the runtime Jar as a result. We need to be careful with runtime Jar licensing.

I don't think that we want to start worrying about updating LICENSE files and looking at sizes just yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed!

*
* @return utils class
*/
default PropertiesSerDes getPropertiesSerDes() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Iceberg community doesn't use get in method names unless something needs to conform to Java Bean expectations. The reason is that get is almost never valuable: either there is a more specific verb that should be used instead, or it is basically useless and should be omitted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

import java.util.Objects;

/**
* a immutable record class of object key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Please use sentence case in Javadoc. Sentences should start with a capital letter and end with punctuation. Also, it should be "an" when followed by a noun that begins with a vowel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

*/
public class ObjectKey {
/**
* bucket
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please omit useless comments. Repeating the field name is not helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed!

*/
private final String key;

public ObjectKey(String bucket, String key) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is not right here. The class is called ObjectKey, but it is created by passing a String called key. Is the string the key, or is it part of the key? Maybe this should be similar to S3, where the bucket/string combination is called a URI, S3URI.

Copy link
Contributor Author

@wang-x-xia wang-x-xia Sep 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In object storage, the object key contains both a bucket name and a key.
And in my design, the user can provide different delimiters for object storage.
For example, the user can create an object which bucket is a and key is folder2021-folder09-obj1 with delimiter -.
It's hard to format it as a string, so I move this to the ObjectKeys interface.

And I need some time to check whether using emc://bucket/key as a general string.

}

@Override
public String toString() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this not form a location, like emc://bucket/path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

/**
* use ECS append api to write data
*/
public class EcsAppendOutputStream extends OutputStream {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't this implement PositionOutputStream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PositionOutputStream is just a simple extension of OutputStream. To implement it in this class is a little wired in OOP. I'll create an adapter class in the next PR.


public ContentAndETagImpl(ObjectHeadInfo headInfo, byte[] content) {
this.headInfo = headInfo;
this.content = content;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is content here? Is it an etag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Content is all bytes data of an object. It's for properties objects.

* <p>
* this properties is only a placeholder and may use in future.
*/
String CURRENT_VERSION = "0";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the value of this if it isn't serialized into the blob?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's in object user metadata. We wish to use it if the properties format needs to change.

@rdblue
Copy link
Contributor

rdblue commented Sep 2, 2021

@wang-x-xia, if EMC is S3 compatible and uses the v1 S3 API, why not make this an implementation of FileIO that uses the older S3 API? Can you help me understand why this is specific to EMC?

@wang-x-xia
Copy link
Contributor Author

@wang-x-xia, if EMC is S3 compatible and uses the v1 S3 API, why not make this an implementation of FileIO that uses the older S3 API? Can you help me understand why this is specific to EMC?

ECS can append data to object. So the complex data cache of the S3 FileIO impl can be omit.

The ECSClient interface is for that if Amazon deprecated their V1 SDK, it's easily to create a new implementation of this interface.

Actually, the V2 SDK is same with V1 SDK. But just V1 SDK allow to custom HTTP headers that can help ECS and other object services to provide some extension feature.
If Iceberg need a V1-SDKed FileIO, I can help to implement one. The object writes will be same with current AWS implementation with a local data cache.

ECS also have its own SDK, but the license may cause some problems. Finally I used the S3 V1 SDK.

@mechgouki
Copy link

@wang-x-xia, if EMC is S3 compatible and uses the v1 S3 API, why not make this an implementation of FileIO that uses the older S3 API? Can you help me understand why this is specific to EMC?

The short answer is the V2 SDK added limitation for extension, while some extension APIs will benefit Iceberg to have better performance run on our product.
As Xia said, we could switch to use DellEMC Client which also open source https://github.com/EMCECS/ecs-object-client-java
But we might need additional check on legal as the license is BSD.

2. Remove most redundant methods in EcsClient.
@mechgouki
Copy link

@wang-x-xia, if EMC is S3 compatible and uses the v1 S3 API, why not make this an implementation of FileIO that uses the older S3 API? Can you help me understand why this is specific to EMC?

Discussed internally, we plan to switch to DellEMC ECS Client lib here as it is fully open source and we are supporting/will support it

@wang-x-xia
Copy link
Contributor Author

@rdblue Here is a problem with using our own SDK.
I noticed that https://github.com/EMCECS/ecs-object-client-java uses a BSD-3 license.
If I use this, what more actions should I take?

@rdblue
Copy link
Contributor

rdblue commented Sep 9, 2021

@wang-x-xia, the 3-clause BSD license is compatible with the Apache License: https://www.apache.org/legal/resolved.html

It is listed as Category A. So it should be fine.

What is the rationale for using a library specific to EMC rather than the S3 v1 API? It seems like the S3 v1 API is probably going to be useful to more people. Are there EMC features that we can only get using the EMC client? You mentioned append support, but all of the formats that Iceberg uses are immutable. So we never need to re-open a file and append more data.

@mechgouki
Copy link

mechgouki commented Sep 10, 2021

@wang-x-xia, the 3-clause BSD license is compatible with the Apache License: https://www.apache.org/legal/resolved.html

It is listed as Category A. So it should be fine.

What is the rationale for using a library specific to EMC rather than the S3 v1 API? It seems like the S3 v1 API is probably going to be useful to more people. Are there EMC features that we can only get using the EMC client? You mentioned append support, but all of the formats that Iceberg uses are immutable. So we never need to re-open a file and append more data.

Basically we have 2 areas which Dell EMC features could benefit :
(1)Append operation in additional of MPU, if the client has less local cache for large object( like edge), Dell EMC object service could help here.
(2)Atomic rename. We have If-Match and If-None-Match semantic as we support strong consistency model (within one site) from the very beginning.

And talked internally, for better support the customer, if we use Dell EMC features, we'd better use our own client with BSD License. AWS S3 V1 could be workaround here, but V2 can not be an option .

Looking forward for your suggestion here

@rdblue
Copy link
Contributor

rdblue commented Sep 10, 2021

@mechgouki, that sounds like reasonable motivation to go with the EMC client to me. What are the dependencies of the EMC artifact?

@mechgouki
Copy link

@mechgouki, that sounds like reasonable motivation to go with the EMC client to me. What are the dependencies of the EMC artifact?

@rdblue
We listed all the dependencies at https://github.com/EMCECS/ecs-object-client-java/tree/master/3rd-party-licenses

@wang-x-xia wang-x-xia closed this Oct 26, 2021
@wang-x-xia
Copy link
Contributor Author

Using the https://github.com/EMCECS/ecs-object-client-java, the PR is redundant to create an abstraction for ECS APIs.

The following File IO PR is #3376.
And the parent PR of ECS Catalog with related discussion is #2807.

You can check these PRs for more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants