Conversation

@nastra
Contributor

@nastra nastra commented Apr 17, 2025

This uses a custom ExecutionInterceptor to set the correct headers when storage credentials are configured for S3FileIO. This is an alternative approach to #12799 and is currently WIP (I haven't actually tested this yet to see whether it works properly and some additional tests are missing)

@nastra nastra marked this pull request as draft April 17, 2025 10:58
@nastra nastra force-pushed the s3fileio-multi-prefix-support-interceptor branch from 0288511 to fed52dd Compare April 17, 2025 11:31
@danielcweeks
Contributor

danielcweeks commented Apr 17, 2025

@nastra I'm a little concerned about this approach because we're doing a lot of hand-crafting/manipulation of the request as opposed to using features of the SDK. There are two alternatives I can think of:

  1. Update S3FileIO::client() to take a path: S3FileIO::client(<path>) then fetch or create a new S3Client with the best matched StorageCredential. This does create more clients, but they're light-weight and we can even share the underlying http client.
  2. Use the request-level credentials provider override. Each request (e.g. s3.getObject(...)) supports a request-level override that includes setting a credential provider for that specific request (r -> r.overrideConfiguration(o -> o.credentialsProvider(storageCredentialProvider)))

While the ExecutionInterceptor may work, it's manipulating the request at a much lower level, which makes me concerned that we'll run into issues.

I think I prefer #1 above because it is an approach we can replicate for the GCSFileIO and ADLSFileIO. The more similar the implementations, the less likely we'll have inconsistent handling.
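The "best matched StorageCredential" lookup at the heart of option 1 is essentially a longest-prefix match over the configured credential prefixes. A minimal sketch of that logic, in Python for brevity (the actual implementation would be Java inside S3FileIO, and all names here are hypothetical):

```python
# Hypothetical sketch of option 1's "best matched StorageCredential"
# lookup: pick the credential whose prefix is the longest one covering
# the requested path. Names are illustrative, not the Iceberg API.

def best_matched_credential(path, credentials_by_prefix):
    """Return the credential for the longest prefix covering `path`,
    or None when no configured prefix matches."""
    best_prefix = None
    for prefix in credentials_by_prefix:
        if path.startswith(prefix):
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_prefix = prefix
    return credentials_by_prefix.get(best_prefix)

creds = {
    "s3://bucket/warehouse/": "cred-wide",
    "s3://bucket/warehouse/db/table/": "cred-narrow",
}

# the narrower prefix wins for paths under the table location;
# paths outside any configured prefix resolve to None
print(best_matched_credential("s3://bucket/warehouse/db/table/f.parquet", creds))
```

A client cache keyed by the matched prefix would then hand back (or lazily create) the S3Client configured with that credential.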

@singhpk234
Contributor

Thanks @danielcweeks, my understanding of the approaches is the following:
For 1, my concern is that each client maintains its own connection pool, which would not scale with a large number of prefixes IMHO. If we could share the pool across all the S3 clients, that would address this concern; I am not sure whether that is something we do today.

For 2, the prefix-level credProvider: are we thinking of creating a cred provider per path, where each cred provider runs its own refresh logic and fetches the cred for the path it corresponds to? Wouldn't this trigger a lot of requests to the creds endpoint?

I understand the ExecutionInterceptor is very specific to AWS and touches the request at a very low level, but it avoids having to manage a lot of connections to S3 (one per prefix) or making a lot of calls to the creds refresh endpoint. It is nevertheless something S3A supports, though for audit logging rather than cred vending: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/auditing.html

Sorry for hopping into this discussion; I wanted to contribute based on what I have seen solving these cred vending issues in the past!
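The refresh-traffic concern raised for option 2 can be softened by caching: a per-prefix provider only re-fetches when its cached credential has expired, so the steady-state call volume to the credentials endpoint stays low regardless of request count. A rough sketch of that idea in Python (hypothetical names; the real providers would be AWS SDK AwsCredentialsProvider implementations in Java):

```python
import time

class CachingCredentialProvider:
    """Hypothetical per-prefix provider that only calls the credentials
    endpoint when the cached credential has expired."""

    def __init__(self, fetch, ttl_seconds=900):
        self._fetch = fetch          # callable that hits the creds endpoint
        self._ttl = ttl_seconds
        self._cached = None
        self._expires_at = 0.0
        self.fetch_count = 0         # for illustration only

    def resolve(self):
        now = time.monotonic()
        if self._cached is None or now >= self._expires_at:
            self._cached = self._fetch()
            self._expires_at = now + self._ttl
            self.fetch_count += 1
        return self._cached

provider = CachingCredentialProvider(lambda: "temporary-cred")
for _ in range(100):   # 100 requests, but only one endpoint call
    provider.resolve()
print(provider.fetch_count)
```

Under this assumption, many requests against the same prefix share one cached credential until it expires.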

@danielcweeks danielcweeks self-requested a review April 18, 2025 18:52
@danielcweeks
Contributor

@singhpk234

My only concern with the following was each client maintaining its own connection pool, with large number of prefixes, which would not be scalable IMHO
and each cred provider will do a refresh logic and get the cred for the path it corresponds to

I don't think we would expect that there are going to be large numbers of prefixes. A table will typically have just a single prefix. We're trying to support cases where there are multiple but also cannot express the policy within the size limits of a single credential. Reusing the client would address these concerns, but I don't think we even need to address that case unless it's trivial to do as part of this support.

I feel like the ExecutionInterceptor approach is optimizing for the wrong goals and we can address those in a different way. I'd rather be consistent than clever here since we need to support it across multiple implementations.
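Reusing clients as suggested above could be sketched as a small cache that hands back an existing client for a previously seen prefix, with all clients sharing one underlying HTTP connection pool. A toy Python illustration of that design (all names hypothetical; the real objects would be AWS SDK v2 S3Clients sharing an SdkHttpClient):

```python
class ClientCache:
    """Hypothetical cache for option 1: one client per storage prefix,
    all sharing a single underlying HTTP connection pool."""

    def __init__(self):
        self._shared_http = object()   # stand-in for a shared HTTP client
        self._clients = {}

    def client_for(self, prefix, credential):
        # reuse the client for a known prefix; otherwise build one that
        # carries its own credential but the shared connection pool
        if prefix not in self._clients:
            self._clients[prefix] = {"credential": credential,
                                     "http": self._shared_http}
        return self._clients[prefix]

cache = ClientCache()
a = cache.client_for("s3://bucket/t1/", "cred-1")
b = cache.client_for("s3://bucket/t2/", "cred-2")
c = cache.client_for("s3://bucket/t1/", "cred-1")
print(a is c, a["http"] is b["http"])
```

With few prefixes per table, the cache stays small, which matches the expectation stated above.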

@singhpk234
Contributor

singhpk234 commented Apr 18, 2025

I see, thanks for the explanation @danielcweeks. If we are sure that we don't want to have a large number of prefixes, and this is only there to support cases where there are multiple prefixes but the policy cannot be expressed within the size limits of a single credential,
then it's definitely not worth the complexity. Though we do use things like credProvider to attach creds, my understanding was that they are replaceable, hence I suggested this approach!

@nastra
Contributor Author

nastra commented Apr 22, 2025

@nastra I'm a little concerned about this approach because we're doing a lot of hand-crafting/manipulation of the request as opposed to using features of the SDK. There are two alternatives I can think of:

  1. Update S3FileIO::client() to take a path: S3FileIO::client(<path>) then fetch or create a new S3Client with the best matched StorageCredential. This does create more clients, but they're light-weight and we can even share the underlying http client.
  2. Use the request-level credentials provider override. Each request (e.g. s3.getObject(...)) supports a request-level override that includes setting a credential provider for that specific request (r -> r.overrideConfiguration(o -> o.credentialsProvider(storageCredentialProvider)))

While the ExecutionInterceptor may work, it's manipulating the request at a much lower level, which makes me concerned that we'll run into issues.

I think I prefer #1 above because it is an approach we can replicate for the GCSFileIO and ADLSFileIO. The more similar the implementations, the less likely we'll have inconsistent handling.

The idea of this PR was just to explore what an approach based on an ExecutionInterceptor would look like. While the approach requires less code and seems simpler, I do agree that we're modifying things at a lower level, which could introduce unwanted side effects/bugs. There's also #12799, which effectively does what you described in Option 1.
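For context on why the interceptor operates "at a lower level": it hooks into the SDK's request pipeline and swaps the signing credential just before the request is signed, based on the target path. The real code would implement AWS SDK v2's ExecutionInterceptor in Java; the following is only a toy Python pipeline illustrating the idea, with every name hypothetical:

```python
# Toy illustration of the interceptor idea: a request pipeline where an
# interceptor rewrites the request's credential (chosen by longest
# matching prefix) before the pretend "sign" step runs. All names are
# hypothetical; the real code implements AWS SDK v2's ExecutionInterceptor.

class Request:
    def __init__(self, path, credential="default-cred"):
        self.path = path
        self.credential = credential

def credential_interceptor(request, credentials_by_prefix):
    """Before signing, swap in the credential for the longest matching prefix."""
    best = None
    for prefix, cred in credentials_by_prefix.items():
        if request.path.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, cred)
    if best is not None:
        request.credential = best[1]
    return request

def send(request, interceptors):
    for interceptor in interceptors:   # runs before the (pretend) signing step
        request = interceptor(request)
    return f"signed with {request.credential}"

creds = {"s3://bucket/table/": "vended-cred"}
print(send(Request("s3://bucket/table/data.parquet"),
           [lambda r: credential_interceptor(r, creds)]))
```

The appeal is a single client with one connection pool; the risk, as discussed above, is mutating requests below the SDK's public API surface.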

@danielcweeks danielcweeks removed their request for review May 2, 2025 22:39
@nastra nastra closed this May 6, 2025
