-
Notifications
You must be signed in to change notification settings - Fork 3k
[2806]Dell EMC ECS features required by a new ECS Catalog impl #2847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2. Create EcsClient to define the related methods used in Iceberg.
2. remove EcsClient#getProperties.
build.gradle
Outdated
| compile(project(':iceberg-nessie')) { | ||
| exclude group: 'com.google.code.findbugs', module: 'jsr305' | ||
| } | ||
| compile(project(':iceberg-dell')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove additions to the runtime Jars until later. We add new integrations to the runtime Jars after they are finished, and we also look at the dependencies that are added to the runtime Jar as a result. We need to be careful with runtime Jar licensing.
I don't think that we want to start worrying about updating LICENSE files and looking at sizes just yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed!
| * | ||
| * @return utils class | ||
| */ | ||
| default PropertiesSerDes getPropertiesSerDes() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Iceberg community doesn't use get in method names unless something needs to conform to Java Bean expectations. The reason is that get is almost never valuable: either there is a more specific verb that should be used instead, or it is basically useless and should be omitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
| import java.util.Objects; | ||
|
|
||
| /** | ||
| * a immutable record class of object key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Please use sentence case in Javadoc. Sentences should start with a capital letter and end with punctuation. Also, it should be "an" when followed by a noun that begins with a vowel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
| */ | ||
| public class ObjectKey { | ||
| /** | ||
| * bucket |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please omit useless comments. Repeating the field name is not helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed!
| */ | ||
| private final String key; | ||
|
|
||
| public ObjectKey(String bucket, String key) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something is not right here. The class is called ObjectKey, but it is created by passing a String called key. Is the string the key, or is it part of the key? Maybe this should be similar to S3, where the bucket/string combination is called a URI, S3URI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In object storage, the object key contains both a bucket name and a key.
And in my design, the user can provide different delimiters for object storage.
For example, the user can create an object which bucket is a and key is folder2021-folder09-obj1 with delimiter -.
It's hard to format it as a string, so I move this to the ObjectKeys interface.
And I need some time to check whether using emc://bucket/key as a general string.
| } | ||
|
|
||
| @Override | ||
| public String toString() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this not form a location, like emc://bucket/path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
| /** | ||
| * use ECS append api to write data | ||
| */ | ||
| public class EcsAppendOutputStream extends OutputStream { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why doesn't this implement PositionOutputStream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PositionOutputStream is just a simple extension of OutputStream. To implement it in this class is a little wired in OOP. I'll create an adapter class in the next PR.
|
|
||
| public ContentAndETagImpl(ObjectHeadInfo headInfo, byte[] content) { | ||
| this.headInfo = headInfo; | ||
| this.content = content; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is content here? Is it an etag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Content is all bytes data of an object. It's for properties objects.
| * <p> | ||
| * this properties is only a placeholder and may use in future. | ||
| */ | ||
| String CURRENT_VERSION = "0"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the value of this if it isn't serialized into the blob?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's in object user metadata. We wish to use it if the properties format needs to change.
|
@wang-x-xia, if EMC is S3 compatible and uses the v1 S3 API, why not make this an implementation of FileIO that uses the older S3 API? Can you help me understand why this is specific to EMC? |
ECS can append data to object. So the complex data cache of the S3 FileIO impl can be omit. The ECSClient interface is for that if Amazon deprecated their V1 SDK, it's easily to create a new implementation of this interface. Actually, the V2 SDK is same with V1 SDK. But just V1 SDK allow to custom HTTP headers that can help ECS and other object services to provide some extension feature. ECS also have its own SDK, but the license may cause some problems. Finally I used the S3 V1 SDK. |
The short answer is the V2 SDK added limitation for extension, while some extension APIs will benefit Iceberg to have better performance run on our product. |
2. Remove most redundant methods in EcsClient.
Discussed internally, we plan to switch to DellEMC ECS Client lib here as it is fully open source and we are supporting/will support it |
|
@rdblue Here is a problem with using our own SDK. |
|
@wang-x-xia, the 3-clause BSD license is compatible with the Apache License: https://www.apache.org/legal/resolved.html It is listed as Category A. So it should be fine. What is the rationale for using a library specific to EMC rather than the S3 v1 API? It seems like the S3 v1 API is probably going to be useful to more people. Are there EMC features that we can only get using the EMC client? You mentioned append support, but all of the formats that Iceberg uses are immutable. So we never need to re-open a file and append more data. |
Basically we have 2 areas which Dell EMC features could benefit : And talked internally, for better support the customer, if we use Dell EMC features, we'd better use our own client with BSD License. AWS S3 V1 could be workaround here, but V2 can not be an option . Looking forward for your suggestion here |
|
@mechgouki, that sounds like reasonable motivation to go with the EMC client to me. What are the dependencies of the EMC artifact? |
@rdblue |
|
Using the https://github.com/EMCECS/ecs-object-client-java, the PR is redundant to create an abstraction for ECS APIs. The following File IO PR is #3376. You can check these PRs for more details. |
It's the first part of #2807 .
Dell EMC ECS is an Object Storage which delivers rich S3-compatibility.
This PR is to add new interfaces that provide several features from the Dell EMC ECS.
The main interface is
EcsClient, it provides methods like:And there are 2 additional interfaces which provide helper method of client:
PropertiesSerDesis used to convertMap<String, String>to file content. I add a default version value to avoid compatible problems in future.ObjectKeysis used to convert name of Iceberg resources to object key.The
EcsClientImplprovide the impl that use Amazon AWS S3 Java SDK (V1) to connect with the ECS.The Amazon AWS S3 Java SDK (V2), used in the
:iceberg-awsmodule, doesn't provide extension.The
MemoryEcsClientprovide the impl that mock all operations ofEcsClientin memory. I will use in future tests.