@wang-x-xia
Contributor

Use the Dell EMC ECS object SDK to implement the FileIO APIs:

  1. Add the :iceberg-dell module and its dependencies.
  2. Create classes for FileIO:
    1. EcsAppendOutputStream, which uses the Put and Append APIs to concatenate data bytes without a local file cache.
    2. EcsSeekableInputStream, which uses the Get-by-range API to fetch data bytes from an object.
    3. EcsURI, EcsFile and EcsFileIO for the basic functions.
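
To illustrate how a Put-then-Append output stream can avoid a local file cache, here is a minimal sketch. It is not the code in this PR: `AppendableStore`, `InMemoryStore` and `AppendOutputStreamSketch` are hypothetical names, and the in-memory map only stands in for the ECS Put and Append APIs.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an object client that supports put and append. */
interface AppendableStore {
  void put(String key, byte[] data);

  void append(String key, byte[] data);
}

/** In-memory store, used only to make the sketch runnable. */
class InMemoryStore implements AppendableStore {
  final Map<String, byte[]> objects = new HashMap<>();

  @Override
  public void put(String key, byte[] data) {
    objects.put(key, data.clone());
  }

  @Override
  public void append(String key, byte[] data) {
    byte[] old = objects.get(key);
    if (old == null) {
      throw new IllegalStateException("Object does not exist: " + key);
    }

    byte[] merged = new byte[old.length + data.length];
    System.arraycopy(old, 0, merged, 0, old.length);
    System.arraycopy(data, 0, merged, old.length, data.length);
    objects.put(key, merged);
  }
}

/** Buffers bytes in memory; the first upload uses put, later uploads use append. */
class AppendOutputStreamSketch extends OutputStream {
  private final AppendableStore store;
  private final String key;
  private final int bufferSize;
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private boolean created = false;
  private boolean closed = false;

  AppendOutputStreamSketch(AppendableStore store, String key, int bufferSize) {
    this.store = store;
    this.key = key;
    this.bufferSize = bufferSize;
  }

  @Override
  public void write(int b) {
    if (closed) {
      throw new IllegalStateException("Stream is closed");
    }

    buffer.write(b);
    if (buffer.size() >= bufferSize) {
      upload();
    }
  }

  @Override
  public void close() {
    if (closed) {
      return;
    }

    closed = true;
    // Upload the remaining bytes, or create an empty object if nothing was written.
    if (buffer.size() > 0 || !created) {
      upload();
    }
  }

  private void upload() {
    byte[] chunk = buffer.toByteArray();
    if (created) {
      store.append(key, chunk);
    } else {
      store.put(key, chunk);
      created = true;
    }

    buffer.reset();
  }
}
```

Only one buffer's worth of data is held in memory at a time, which is the property the description refers to as "without a local file cache".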

Locations use the form ecs://bucket/object_name, to be consistent with other vendors.

Because Dell EMC ECS is an S3-compatible storage, the location is also compatible with the schemes "s3", "s3a", and "s3n".
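
As a rough illustration of the location format, the sketch below parses a location into a bucket and an object name and accepts the four schemes listed above. `EcsLocationSketch` is a hypothetical class; the EcsURI in this PR may name things and validate differently.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical location parser, not the EcsURI class from this PR. */
final class EcsLocationSketch {
  private static final Set<String> VALID_SCHEMES =
      new HashSet<>(Arrays.asList("ecs", "s3", "s3a", "s3n"));

  final String bucket;
  final String name;

  private EcsLocationSketch(String bucket, String name) {
    this.bucket = bucket;
    this.name = name;
  }

  static EcsLocationSketch parse(String location) {
    URI uri = URI.create(location);
    String scheme = uri.getScheme();
    if (scheme == null || !VALID_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT))) {
      throw new IllegalArgumentException("Invalid scheme in location: " + location);
    }

    // getAuthority() is used instead of getHost() so that bucket names
    // that are not valid host names (for example, with underscores) still work.
    String bucket = uri.getAuthority();
    if (bucket == null || bucket.isEmpty()) {
      throw new IllegalArgumentException("Missing bucket in location: " + location);
    }

    String path = uri.getPath();
    String name = path.startsWith("/") ? path.substring(1) : path;
    return new EcsLocationSketch(bucket, name);
  }
}
```

For example, `EcsLocationSketch.parse("ecs://bucket/object_name")` yields the bucket `bucket` and the name `object_name`.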

@wang-x-xia
Contributor Author

The tests use a client-side mock, which is easier for this repo.
We plan to publish an ECS mock in the future, after more internal discussion and process.

Contributor

@jackye1995 jackye1995 left a comment

Thanks for separating the PR out! Here are some general comments I have:

  1. For EcsFile, I think it's better to have separate EcsInputFile and EcsOutputFile classes with a common BaseEcsFile, which follows the pattern of S3 and Hadoop and is clearer to code readers.
  2. Some classes, such as the stream classes, can be made package-private instead of public; please make changes accordingly.
  3. For tests, please use more expressive test names, and provide a message for each Assert. For testing errors, we prefer AssertHelpers.assertThrows over @Test(expected=xxx) or a try-catch block to verify error messages.
  4. Please add a newline after each control statement like if, for, try, etc.
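
To make point 3 concrete, here is a minimal, self-contained sketch of an assertThrows-style helper that checks both the exception type and its message. `AssertThrowsSketch` and its signature are assumptions made for illustration; it is loosely modeled on the helper mentioned above and is not Iceberg's actual test utility.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper, written only to illustrate the testing style. */
final class AssertThrowsSketch {
  private AssertThrowsSketch() {
  }

  static void assertThrows(
      String message,
      Class<? extends Exception> expected,
      String containedInMessage,
      Callable<?> callable) {
    try {
      callable.call();
    } catch (Exception e) {
      if (!expected.isInstance(e)) {
        throw new AssertionError(
            message + ": expected " + expected.getName() + " but got " + e.getClass().getName(), e);
      }

      String actual = e.getMessage();
      if (containedInMessage != null && (actual == null || !actual.contains(containedInMessage))) {
        throw new AssertionError(
            message + ": exception message [" + actual + "] does not contain [" + containedInMessage + "]", e);
      }

      return;
    }

    throw new AssertionError(message + ": no exception was thrown");
  }
}
```

Compared with @Test(expected=xxx), this style pins the failure to one specific call and also verifies the error message, which is the motivation given in the comment.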

2. Move EcsAppendOutputStream and EcsSeekableInputStream to package private.
3. Fix tests that used @Test(expected).
4. Add a newline after all control block statements.
5. Add ecs prefix and property keys in EcsClientProperties.
6. Inline LocationUtils.ECS_SCHEMA.
@wang-x-xia
Contributor Author

Following @jackye1995's suggestion, EcsFile is now split into EcsInputFile and EcsOutputFile.

/**
* A {@link java.io.Externalizable} FileIO of ECS S3 object client.
*/
public class EcsFileIO implements FileIO, Externalizable, AutoCloseable {
Contributor

@jackye1995 jackye1995 Nov 3, 2021

I think we follow the pattern of (1) directly serializing an object like the client as a function (example: https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java#L42-L45), and (2) if absolutely necessary, directly overriding the private serialization methods instead of using Externalizable (example: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/hadoop/SerializableConfiguration.java#L39-L48). Is it possible for you to use the same pattern here?
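
For readers unfamiliar with pattern (2), the sketch below shows a Serializable wrapper that defines the private writeObject and readObject hooks to carry a non-serializable member. `PlainSettings` and `SerializableSettingsSketch` are hypothetical names chosen for illustration; this is not the linked SerializableConfiguration code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical settings holder that is deliberately not Serializable. */
class PlainSettings {
  final Map<String, String> values = new HashMap<>();
}

/** Serializable wrapper that writes and restores the settings by hand. */
class SerializableSettingsSketch implements Serializable {
  private static final long serialVersionUID = 1L;

  private transient PlainSettings settings;

  SerializableSettingsSketch(PlainSettings settings) {
    this.settings = settings;
  }

  PlainSettings get() {
    return settings;
  }

  // Called by Java serialization instead of the default field-by-field write.
  private void writeObject(ObjectOutputStream out) throws IOException {
    out.defaultWriteObject();
    out.writeObject(new HashMap<>(settings.values));
  }

  // Called by Java serialization to rebuild the transient member.
  @SuppressWarnings("unchecked")
  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    settings = new PlainSettings();
    settings.values.putAll((Map<String, String>) in.readObject());
  }

  /** Serializes and deserializes the wrapper, for demonstration only. */
  static SerializableSettingsSketch roundTrip(SerializableSettingsSketch original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(original);
      }

      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (SerializableSettingsSketch) in.readObject();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }
}
```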

Contributor

I took a look at the Javadoc for Externalizable; it seems to be a valid usage here, so I will leave it as is for now and let other reviewers decide.

2. Move BaseEcsFile, EcsInputFile, EcsOutputFile to package private.
3. Split tests from a single method into multiple methods. Move the related test rule to class level.
4. Remove all static imports.
@jackye1995
Contributor

@rdblue I think this looks good to me overall after a few rounds of review. Could you also take a look?

@mechgouki

@rdblue Ryan, after several rounds of review with @jackye1995, I think we are good to merge this PR. However, this is a new feature for Iceberg to support an on-prem product. Would you please take a look and give the green light? I think the 0.13 window is still open, and we would like to get this ready soon for our customers.

@jackye1995 jackye1995 requested a review from rdblue November 17, 2021 23:38
Contributor

@jackye1995 jackye1995 left a comment

I added a few new comments based on the recent addition of GCSFileIO and on some discussions around aligning the properties of different vendors. Please let me know if you have any concerns.

/**
* A {@link java.io.Externalizable} FileIO of ECS S3 object client.
*/
public class EcsFileIO implements FileIO, Externalizable, AutoCloseable {
Contributor

I think we need to directly serialize an object like the client as a function instead of using Externalizable. I understand this method is likely correct, but given that all the other implementations use that approach, it's better to be consistent so that the code is easier to maintain in the future.

examples:

https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java#L42-L45

https://github.com/apache/iceberg/blob/master/gcp/src/main/java/org/apache/iceberg/gcp/gcs/GCSFileIO.java#L47
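
A minimal sketch of the "serialize the client as a function" idea: the FileIO keeps a serializable supplier and rebuilds the non-serializable client lazily after deserialization. All names here (`SerializableSupplierSketch`, `FakeClient`, `SupplierBackedFileIOSketch`) are hypothetical and exist only for illustration; the linked S3FileIO and GCSFileIO classes remain the reference for the real pattern.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

/** A supplier that can itself be serialized (hypothetical name). */
interface SerializableSupplierSketch<T> extends Supplier<T>, Serializable {
}

/** Hypothetical client that is deliberately not Serializable. */
class FakeClient {
  final String endpoint;

  FakeClient(String endpoint) {
    this.endpoint = endpoint;
  }
}

/** Serializes only the supplier; the client is transient and created on demand. */
class SupplierBackedFileIOSketch implements Serializable {
  private static final long serialVersionUID = 1L;

  private final SerializableSupplierSketch<FakeClient> clientSupplier;
  private transient FakeClient client;

  SupplierBackedFileIOSketch(SerializableSupplierSketch<FakeClient> clientSupplier) {
    this.clientSupplier = clientSupplier;
  }

  static SupplierBackedFileIOSketch forEndpoint(String endpoint) {
    // The lambda captures only a String, so it serializes cleanly.
    return new SupplierBackedFileIOSketch(() -> new FakeClient(endpoint));
  }

  FakeClient client() {
    if (client == null) {
      client = clientSupplier.get();
    }

    return client;
  }

  /** Serializes and deserializes the object, for demonstration only. */
  static SupplierBackedFileIOSketch roundTrip(SupplierBackedFileIOSketch original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(original);
      }

      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (SupplierBackedFileIOSketch) in.readObject();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

With this shape no custom readExternal or writeExternal logic is needed, which is the consistency argument made in the comment.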

/**
* S3 Access key id of Dell EMC ECS
*/
public static final String ACCESS_KEY_ID = "ecs.s3.access-key-id";
Contributor

Should this be ECS_S3_ACCESS_KEY_ID? The same comment applies to the other variable names.
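
A small sketch of what the suggested naming could look like, with each constant name mirroring its property key. Only the access-key-id key comes from the snippet under review; the other two keys and the `EcsPropertiesSketch` class with its `require` helper are assumptions added for illustration.

```java
import java.util.Map;

/** Hypothetical properties holder; constant names mirror the property keys. */
final class EcsPropertiesSketch {
  private EcsPropertiesSketch() {
  }

  /** Key from the snippet under review, with the suggested constant name. */
  static final String ECS_S3_ACCESS_KEY_ID = "ecs.s3.access-key-id";

  /** Assumed sibling keys, shown only to illustrate the naming scheme. */
  static final String ECS_S3_SECRET_ACCESS_KEY = "ecs.s3.secret-access-key";
  static final String ECS_S3_ENDPOINT = "ecs.s3.endpoint";

  /** Returns the value for a key, failing with a clear message when it is absent. */
  static String require(Map<String, String> properties, String key) {
    String value = properties.get(key);
    if (value == null || value.isEmpty()) {
      throw new IllegalArgumentException("Missing required property: " + key);
    }

    return value;
  }
}
```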

@jackye1995
Contributor

@rdblue any comments on this PR?

Member

@openinx openinx left a comment

This almost looks good to me; I just left a few comments.

Member

@openinx openinx left a comment

This PR looks good to me now. I will get this merged. Thanks @wang-x-xia for the contribution, and thanks @jackye1995 for reviewing!

@openinx openinx changed the title from "[2806] Dell EMC ECS file IO" to "Dell: Add Dell EMC EcsFileIO." Feb 18, 2022
@openinx openinx merged commit 85dd3b1 into apache:master Feb 18, 2022
@openinx
Member

openinx commented Feb 18, 2022

@wang-x-xia Would you also like to add a new doc page showing people how to access Apache Iceberg tables backed by Dell EMC ECS storage in the https://github.com/apache/iceberg-docs project? (The rendered page will then be displayed on the official Apache site.)

@mechgouki

> @wang-x-xia Would you also like to add a new doc page showing people how to access Apache Iceberg tables backed by Dell EMC ECS storage in the https://github.com/apache/iceberg-docs project? (The rendered page will then be displayed on the official Apache site.)

Sure, we will work on that soon.

@wang-x-xia
Contributor Author

@openinx Fine! I'll finish the doc and also the catalog implementation!
