S3 transfer within AWS #517
Hello everyone,
I would like to propose and discuss an addition to the S3 data plane to enable the possibility of executing data transfers between two S3 buckets within the AWS infrastructure, i.e. without going through the data plane. In the following, I'll try to outline the problem, how it can be tackled in AWS, and how this approach can be implemented in the EDC. I've also created a PoC for this feature. The links can be found at the bottom of this discussion.
1. Problem
With the way the existing S3 data plane is implemented, it reads the data from the source bucket first and then writes it to the destination bucket. This is great for decoupling source and destination and for allowing transfers between S3 and other storage types. But when both source and destination are S3 buckets, it causes higher-than-necessary transfer costs, as moving data out of the AWS infrastructure is typically rather expensive. By copying data between the buckets within AWS, transfer costs can be reduced.
2. Approach in AWS
In AWS, any direct S3 transfer between different accounts (regardless of whether a simple S3 copy is executed or a service like DataSync is used) requires a specific permissions setup between the two accounts. The recommended approach for facilitating transfers between S3 buckets in different AWS accounts A & B is to create a role in one account that may be assumed by a user of that account, grant this role read access to the source object and write access to the destination object, and allow the role access in the destination bucket's policy. Data is then copied from one bucket to the other across AWS accounts, without ever leaving the AWS infrastructure.
In the recommended approach linked above, the role is created in the destination account, allowing the consumer to read from the provider bucket. For the PoC implementation, these roles have been reversed.
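For illustration, the statement that ends up in the destination bucket's policy could look roughly like the following sketch. All identifiers (account ID, role name, bucket name, and `Sid`) are placeholders, not the PoC's actual values:

```java
// Hypothetical bucket policy statement in account B, allowing a role created in
// account A to write into the destination bucket. All identifiers are placeholders.
var bucketPolicyStatement = """
        {
          "Sid": "EdcTransferSid",
          "Effect": "Allow",
          "Principal": { "AWS": "arn:aws:iam::ACCOUNT_A_ID:role/edc-s3-copy-role" },
          "Action": "s3:PutObject",
          "Resource": "arn:aws:s3:::destination-bucket/*"
        }""";
```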
3. Approach for PoC
Implementing this approach in the EDC consists of mainly two parts: a dedicated `TransferService`, which will initiate an S3 copy, and provisioning classes to set up the required role & permission structure. Since the `TransferService` will simply need to create an S3 client and call the copy operation, the main part of the PoC implementation is the provisioning. The PoC implementation of the provisioner has been aligned with the implementation of the existing S3 provisioning extension (using a `ProvisionPipeline` and `DeprovisionPipeline`).
Provision & transfer sequence
The sequence for the transfer (including provisioning) is as follows:
1. A transfer is initiated (transfer type `AmazonS3-PUSH`).
2. Since both source and destination are of type `AmazonS3`, the provider provisioning goes through the following steps (a sketch of the central SDK calls follows this list):
   2.1 Get the current AWS user
   2.2 Create a role with a trust policy that allows the current user to assume the role (role tags: `created-by: EDC`, `edc:component-id: [component-id]`, `edc:transfer-process-id: [transfer-process-id]`)
   2.3 Add a role policy to the role that allows reading the source object and writing the destination object
   2.4 Get the destination bucket policy
   2.5 Update the destination bucket policy with a statement that allows the previously created role to write to the bucket
   2.6 Assume the role and return the credentials as part of the `ProvisionedResource`
3. The transfer is then executed:
   3.1 Select the dedicated `TransferService`
   3.2 Create an S3 client using the credentials of the assumed role
   3.3 Invoke the copy operation to copy the source object to the destination bucket
4. During deprovisioning, the permission setup is removed again:
   4.1 Get the destination bucket policy
   4.2 Filter for the statement added during provisioning by `Sid` and remove it
   4.3 Update the destination bucket policy (if there are no other statements left, the bucket policy is deleted instead, as the SDK does not allow setting a bucket policy without statements, even though this works fine in the AWS console)
   4.4 Delete the role policy of the created role
   4.5 Delete the role
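To make the sequence more concrete, here is a rough sketch of the central SDK calls behind steps 2.1-2.6 and 3.2/3.3, assuming the AWS SDK for Java v2. Role name, policy name, and the inlined policy documents are made up for illustration; the PoC itself uses the `S3AsyncClient` (see open points) and proper error handling, both omitted here for brevity:

```java
import software.amazon.awssdk.auth.credentials.AwsSessionCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.iam.IamClient;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.sts.StsClient;

public class S3CopyProvisionSketch {

    public void provisionAndCopy(String transferProcessId, String sourceBucket, String sourceKey,
                                 String destinationBucket, String destinationKey) {
        var iam = IamClient.builder().region(Region.AWS_GLOBAL).build();
        var sts = StsClient.create();

        // 2.1: identify the current AWS user, which will be allowed to assume the role
        var currentUserArn = sts.getCallerIdentity().arn();

        // 2.2: create the role with a trust policy for the current user
        var trustPolicy = """
                {"Version": "2012-10-17", "Statement": [{"Effect": "Allow",
                 "Principal": {"AWS": "%s"}, "Action": "sts:AssumeRole"}]}
                """.formatted(currentUserArn);
        var role = iam.createRole(r -> r.roleName("edc-s3-copy-role")
                .assumeRolePolicyDocument(trustPolicy)
                .tags(t -> t.key("created-by").value("EDC"),
                      t -> t.key("edc:transfer-process-id").value(transferProcessId))).role();

        // 2.3: allow the role to read the source object and write the destination object
        var rolePolicy = """
                {"Version": "2012-10-17", "Statement": [
                 {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::%s/%s"},
                 {"Effect": "Allow", "Action": "s3:PutObject", "Resource": "arn:aws:s3:::%s/%s"}]}
                """.formatted(sourceBucket, sourceKey, destinationBucket, destinationKey);
        iam.putRolePolicy(p -> p.roleName(role.roleName())
                .policyName("edc-s3-copy-policy").policyDocument(rolePolicy));

        // 2.4/2.5: add a statement like the one shown in section 2 (identified by its Sid)
        // to the destination bucket policy via getBucketPolicy/putBucketPolicy; the
        // inverse of this update is shown in the deprovisioning sketch below.

        // 2.6: assume the role; these credentials are returned in the ProvisionedResource
        var credentials = sts.assumeRole(r -> r.roleArn(role.arn())
                .roleSessionName("edc-transfer")).credentials();

        // 3.2/3.3: create an S3 client with the assumed-role credentials and copy the object
        var s3 = S3Client.builder()
                .credentialsProvider(StaticCredentialsProvider.create(AwsSessionCredentials.create(
                        credentials.accessKeyId(), credentials.secretAccessKey(), credentials.sessionToken())))
                .build();
        s3.copyObject(c -> c.sourceBucket(sourceBucket).sourceKey(sourceKey)
                .destinationBucket(destinationBucket).destinationKey(destinationKey));
    }
}
```

Note that newly created IAM roles and policies can take a few seconds to propagate, so a real implementation needs retries around the `assumeRole` and copy calls.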
After deprovisioning finishes, the AWS accounts are left in a clean state.
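For completeness, the Sid-based cleanup of steps 4.1-4.5 could look roughly as follows; again a sketch assuming the AWS SDK for Java v2 plus Jackson for the policy JSON, reusing the made-up Sid and policy name from the sketch above:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.iam.IamClient;
import software.amazon.awssdk.services.s3.S3Client;

public class S3CopyDeprovisionSketch {

    public void deprovision(String destinationBucket, String roleName) throws Exception {
        var s3 = S3Client.create();
        var iam = IamClient.builder().region(Region.AWS_GLOBAL).build();

        // 4.1: fetch the current destination bucket policy
        var policyJson = s3.getBucketPolicy(b -> b.bucket(destinationBucket)).policy();

        // 4.2: remove the statement added during provisioning, identified by its Sid
        var mapper = new ObjectMapper();
        var policy = (ObjectNode) mapper.readTree(policyJson);
        var statements = (ArrayNode) policy.get("Statement");
        for (var i = statements.size() - 1; i >= 0; i--) {
            if ("EdcTransferSid".equals(statements.get(i).path("Sid").asText())) {
                statements.remove(i);
            }
        }

        // 4.3: write the policy back; the API rejects a policy without statements,
        // so the bucket policy is deleted instead if nothing else remains
        if (statements.isEmpty()) {
            s3.deleteBucketPolicy(b -> b.bucket(destinationBucket));
        } else {
            s3.putBucketPolicy(b -> b.bucket(destinationBucket)
                    .policy(mapper.writeValueAsString(policy)));
        }

        // 4.4/4.5: delete the role policy, then the role itself
        iam.deleteRolePolicy(r -> r.roleName(roleName).policyName("edc-s3-copy-policy"));
        iam.deleteRole(r -> r.roleName(roleName));
    }
}
```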
Note: all provisioning is done by the provider. This also includes updating the destination bucket policy. While this is not ideal, there is currently no way for the consumer to take care of this, as both provider and consumer need information from the other party (source account role needs to reference destination bucket, destination bucket policy needs to reference source account role).
Selection of transfer service
It needs to be ensured that the transfer is handled by the dedicated
TransferService
. To support S3 transfers disregarding of the destination type, both S3 data plane extensions need to be supported in parallel, meaning also the defaultPipelineService
would be applicable to handle an S3-to-S3 transfer. Therefore, a newTransferServiceSelectionStrategy
has been added in a separate extension. It tries to find aTransferService
that can handle the transfer and is NOT a PipelineService first. If this does not yield a service, it defaults to the default strategy. If there are better ways to ensure the correctTransferService
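For illustration, the selection logic could look roughly like this. The `TransferServiceSelectionStrategy` and `TransferService` signatures below are assumptions for the sketch and may not match the EDC SPI verbatim across versions:

```java
import java.util.stream.Stream;

// Sketch: prefer a dedicated TransferService over the generic PipelineService.
// The interface/method shape is an assumption, not EDC's verbatim SPI.
public class NonPipelineServiceFirstStrategy implements TransferServiceSelectionStrategy {

    @Override
    public TransferService chooseTransferService(DataFlowStartMessage request,
                                                 Stream<TransferService> services) {
        var candidates = services.filter(s -> s.canHandle(request)).toList();
        // pick any applicable service that is NOT the generic PipelineService first ...
        return candidates.stream()
                .filter(s -> !(s instanceof PipelineService))
                .findFirst()
                // ... and fall back to the default behaviour (first match) otherwise
                .orElse(candidates.isEmpty() ? null : candidates.get(0));
    }
}
```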
4. Testing
Both software and manual tests have been executed to verify the PoC functionality, namely tests covering:
- `S3CopyResourceDefinition`
- `S3CopyProvisionedResource`
- `S3CopyResourceDefinitionGenerator`
- `S3CopyProvisioner`
- `AwsS3CopyTransferService`
- an end-to-end transfer (`system-tests` module)
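The PoC's actual tests are not shown here, but as an illustration of what an automated check of the copy step could look like, here is a sketch using Testcontainers and LocalStack; the image tag and bucket/object names are made up:

```java
import org.testcontainers.containers.localstack.LocalStackContainer;
import org.testcontainers.containers.localstack.LocalStackContainer.Service;
import org.testcontainers.utility.DockerImageName;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class S3CopyLocalStackSketch {

    public static void main(String[] args) {
        try (var localstack = new LocalStackContainer(DockerImageName.parse("localstack/localstack:3.0"))
                .withServices(Service.S3)) {
            localstack.start();

            var s3 = S3Client.builder()
                    .endpointOverride(localstack.getEndpointOverride(Service.S3))
                    .credentialsProvider(StaticCredentialsProvider.create(AwsBasicCredentials
                            .create(localstack.getAccessKey(), localstack.getSecretKey())))
                    .region(Region.of(localstack.getRegion()))
                    .build();

            s3.createBucket(b -> b.bucket("source"));
            s3.createBucket(b -> b.bucket("destination"));
            s3.putObject(b -> b.bucket("source").key("test.txt"), RequestBody.fromString("hello"));

            // the copy the dedicated TransferService would perform
            s3.copyObject(c -> c.sourceBucket("source").sourceKey("test.txt")
                    .destinationBucket("destination").destinationKey("test.txt"));

            var copied = s3.getObjectAsBytes(b -> b.bucket("destination").key("test.txt")).asUtf8String();
            System.out.println("copied content: " + copied); // expected: hello
        }
    }
}
```

A setup like this could also be a path toward replacing MinIO in the existing tests, as suggested in the replies.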
5. Open points
As the first implementation is a PoC, there are still some open points/already known room for improvement:
- The selection of the correct `TransferService` via the custom `TransferServiceSelectionStrategy` could likely be improved.
- The `TransferService` currently copies just one file (an `objectPrefix` on the source is not yet supported), but will be extended to also support the copying of multiple files.
- The copy is currently executed with the default settings of the `S3AsyncClient`, but this will be updated to make part size & threshold configurable (see the sketch at the end of this section).
- Managing the destination bucket policy requires additional permissions (`s3:GetBucketPolicy`, `s3:PutBucketPolicy`, `s3:DeleteBucketPolicy`). I did not want to mix things up here, so this has not been addressed further for now, but it could easily be achieved.

The PoC is currently located in the Cofinity-X fork of this repository. Please find the links below. I also opened a draft PR in the fork for better visualization of the additions and changes made for the implementation of this feature.
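Regarding the part size & threshold point above, the SDK's multipart settings for the `S3AsyncClient` look roughly like the following sketch; availability depends on the SDK version, and the values are placeholders:

```java
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.multipart.MultipartConfiguration;

// Sketch: enabling and tuning multipart operations on the S3AsyncClient;
// the threshold and part-size values here are illustrative only.
var s3Async = S3AsyncClient.builder()
        .multipartEnabled(true)
        .multipartConfiguration(MultipartConfiguration.builder()
                .thresholdInBytes(16L * 1024 * 1024)        // switch to multipart above 16 MiB
                .minimumPartSizeInBytes(8L * 1024 * 1024)   // 8 MiB parts
                .build())
        .build();
```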
As mentioned above, there is already known room for improvement, but I think it would make sense to provide a base version of this feature first and improve it step-by-step afterwards. Depending on the result of this discussion, I would update the PoC to incorporate feedback, and eventually open a PR in this repository to provide the feature. Happy to hear your feedback and discuss.
Links
Replies (1 comment, 1 reply)

Looks good overall! 🚀 My only note: I don't think it's possible to inject and provide services at the same time. Oh, and also 👍 for the LocalStack thing (hope that could replace MinIO in the long term).