@singhpk234 singhpk234 commented Mar 15, 2022

This PR adds support for S3 access points.
ref : https://aws.amazon.com/s3/features/access-points/

    spark-shell \
        --conf spark.sql.catalog.test=org.apache.iceberg.spark.SparkCatalog \
        --conf spark.sql.catalog.test.warehouse=s3://my-bucket \
        --conf spark.sql.catalog.test.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
        --conf spark.sql.catalog.test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
        --conf spark.sql.catalog.test.s3.access-points.my-bucket=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap

The s3.access-points.<bucket> properties map a bucket name to an access point. When a mapping is configured for the bucket in a path stored in Iceberg, S3FileIO sends the request to that access point instead of the bucket.
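
As a rough illustration of the lookup described above (not the actual code in this PR), the mapping could be sketched in plain Java. The class and method names here are hypothetical; only the s3.access-points. property prefix comes from the configuration example, and the sketch assumes the catalog prefix has already been stripped from the property keys.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the actual S3FileIO implementation.
final class AccessPointMappingSketch {
  // Property prefix taken from the spark-shell example (after the catalog prefix is stripped).
  static final String PREFIX = "s3.access-points.";

  private AccessPointMappingSketch() {
  }

  // Collects every "s3.access-points.<bucket>" property into a bucket -> access point map.
  static Map<String, String> bucketToAccessPoint(Map<String, String> properties) {
    Map<String, String> mapping = new HashMap<>();
    for (Map.Entry<String, String> entry : properties.entrySet()) {
      String key = entry.getKey();
      if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
        mapping.put(key.substring(PREFIX.length()), entry.getValue());
      }
    }
    return mapping;
  }

  // Returns the access point configured for the bucket, or the bucket itself when none is configured.
  static String resolveBucket(String bucket, Map<String, String> mapping) {
    return mapping.getOrDefault(bucket, bucket);
  }
}
```

With the configuration above, resolveBucket("my-bucket", mapping) would return the access point ARN, while any bucket without a mapping is returned unchanged.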


cc: @jackye1995 @rajarshisarkar @arminnajafi @amogh-jahagirdar @xiaoxuandev @yyanyy

@singhpk234 singhpk234 marked this pull request as draft March 15, 2022 12:21
@github-actions github-actions bot added the AWS label Mar 15, 2022
@singhpk234 singhpk234 force-pushed the feature/access-points branch from 2985f72 to 8e85276 Compare March 15, 2022 12:27
@jackye1995

FYI @flyrain @anuragmantri, this is based on our discussion in https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1645066803099319

@anuragmantri anuragmantri left a comment

Changes look good to me, pending integration tests requested by @jackye1995.

Thanks for adding this @singhpk234

@singhpk234 singhpk234 marked this pull request as ready for review March 15, 2022 20:14
@singhpk234 singhpk234 changed the title [WIP] AWS : Add support for s3 access points AWS : Add support for s3 access points Mar 15, 2022
@rdblue rdblue changed the title AWS : Add support for s3 access points AWS: Add support for s3 access points Mar 16, 2022
@singhpk234 singhpk234 force-pushed the feature/access-points branch 3 times, most recently from 4af17c5 to d123c94 Compare March 16, 2022 18:53
@github-actions github-actions bot added the build label Mar 16, 2022
@singhpk234 singhpk234 force-pushed the feature/access-points branch 2 times, most recently from 80a82d8 to cf04f67 Compare March 17, 2022 12:51
}

@Test
public void testNewInputStreamWithCrossRegionAccessPoint() throws Exception {

Just a side note that in aws/aws-sdk-java-v2@51632cb, a new config multiRegionEnabled is added, which defaults to true and controls cross-region access when using MRAP. The default works for us, so we probably don't need to add it for now.

@singhpk234 singhpk234 force-pushed the feature/access-points branch from 4209080 to 8535e82 Compare March 18, 2022 21:41
@singhpk234 singhpk234 requested a review from jackye1995 March 19, 2022 04:40

@jackye1995 jackye1995 left a comment

looks good to me!

@jackye1995 jackye1995 requested a review from rdblue March 20, 2022 17:52
);
S3URI uri1 = new S3URI(p1, bucketToAccessPointMapping);

assertEquals("access-point", uri1.bucket());

Should there be any validation that the access point is in a valid format? Or is that handled by the SDK and we don't want to duplicate logic?

There could be 2 possibilities:

  1. an ARN, something like arn:aws:s3:region:111122223333:accesspoint/my-access-point
  2. an alias, something like my-access-point-aqfqprnstn7aefdfbarligizwgyfouse1a-s3alias

Because of 2, the string can really be anything that satisfies the bucket regex, so it's hard to validate on our side, especially since we do not check normal bucket names for correctness either. I think it should be fine to delegate name validation to the S3 service.
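
For illustration only, the two forms could be told apart roughly as below. This is a hypothetical sketch, not something this PR does: the class name and the checks are assumptions based on the two examples above, and the result is a heuristic rather than a validation, since an alias is just a bucket-name-shaped string.

```java
// Hypothetical helper for illustration; the PR does not validate access point values.
final class AccessPointFormSketch {
  enum Form { ARN, ALIAS, OTHER }

  private AccessPointFormSketch() {
  }

  // Heuristic only: assumes ARNs start with "arn:" and aliases end with "-s3alias",
  // as in the two examples above. Anything else is reported as OTHER, not as invalid.
  static Form classify(String value) {
    if (value.startsWith("arn:")) {
      return Form.ARN;
    }
    if (value.endsWith("-s3alias")) {
      return Form.ALIAS;
    }
    return Form.OTHER;
  }
}
```

Even with such a check, a plain bucket name and a mistyped alias are indistinguishable, which is why leaving correctness to the S3 service seems reasonable.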

@singhpk234 singhpk234 force-pushed the feature/access-points branch 2 times, most recently from 2eea753 to 63d1e8a Compare March 21, 2022 06:14

@jackye1995 jackye1995 left a comment

Thanks for addressing all the comments. @rdblue, I think everything has been addressed; let us know if you have any further concerns. Thanks!

@singhpk234 singhpk234 force-pushed the feature/access-points branch from 5f3ad24 to bb31773 Compare March 24, 2022 03:56
@jackye1995

This looks good to me, and there have been no comments for a few days. I will merge it to unblock other feature contributions and to let people try out access points. @rdblue @rajarshisarkar, let us know if you have any additional concerns; we can address them before the upcoming release.

@jackye1995 jackye1995 merged commit ea8bbe7 into apache:master Mar 24, 2022
@singhpk234 singhpk234 deleted the feature/access-points branch March 24, 2022 09:30