HADOOP-17851 Support user specified content encoding for S3A #3312
…cts they are putting. This is useful for people loading the data into other tools in the AWS ecosystem which don't use file extensions to infer compression type (e.g. serving compressed files from S3 or importing into RDS)
code looks OK at a quick glance, tests nice too.
Hadoop 3.3.1+ serves up the headers in the getXAttr call ... you can use that to test without asking for the AWS S3 client.
regarding generic setting of arbitrary headers: that's something I could imagine the createFile() builder letting you set, with a set of .opt("s3a.header.x-something", "value") builder options which would get picked up and passed in.
not done any implementation for that createFile call yet tho'
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AContentEncoding.java
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AContentEncoding.java
...adoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AEncryptionSSEKMSUserDefinedKey.java
I need to re-run the tests after the changes but figured I'd push to see if I'm doing the right thing around getXattr since I haven't used it before.
Force-pushed from 32d4895 to bb6cdfe.
not compiling, I'm afraid. other than that, looks good. we need a test to verify that directory markers are still tagged as this. Actually, looking at that existing test suite, I think you could tune those tests to check the content option too. The current tests look for the default content encoding... if you set it to gzip in the test setup, all the probes in that test could be changed to look for it, and they would then be validating your code. Your new suite then only needs to cover new behaviours, like rename. (Yes, I know, mixing things is a key forbidden aspect of JUnit tests, but I'm looking at saving $ as well as execution time.)
mehakmeet left a comment
Interesting patch, and it looks useful as well. I would have to check the AWS SDK code to better understand how encoding and decoding actually happen based on the property being set; or is it happening on the AWS side? 🤔
Seems like one of the imports was a static one, so it wasn't building; fix that and yetus might be happy.
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AContentEncoding.java
    StoreContext storeContext = fs.createStoreContext();
    String key = storeContext.pathToKey(path);
    String encoding = fs.getXAttrs(XA_CONTENT_ENCODING);
Not sure about how this test would work. File attributes would relate to a path, which would return Map<String, byte[]>. Then we would need to get XA_CONTENT_ENCODING's value from the map and decode the byte[] to get the actual value. I recently stumbled upon its use too, a little confusing to say the least, you can check ITestS3AClientSideEncryptionKms#assertEncrypted(path) where I used file attributes as well in a test for reference.
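The point above, that file attributes come back as a map of byte arrays rather than a single string, can be sketched without any Hadoop dependency. This is a minimal illustration, not the helper from the patch or from ITestS3AClientSideEncryptionKms: `decodeXAttr` is a hypothetical name, the key `header.Content-Encoding` is an assumption about what XA_CONTENT_ENCODING resolves to, and UTF-8 is assumed as the byte encoding. In a real test the map would come from `fs.getXAttrs(path)`.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class XAttrDecodeSketch {

  // Assumed value of XA_CONTENT_ENCODING; check the constant in hadoop-aws.
  static final String XA_CONTENT_ENCODING = "header.Content-Encoding";

  /** Hypothetical helper: look up one attribute and decode it as UTF-8. */
  static String decodeXAttr(Map<String, byte[]> xattrs, String name) {
    byte[] raw = xattrs.get(name);
    return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Stand-in for the Map<String, byte[]> that fs.getXAttrs(path) would return.
    Map<String, byte[]> xattrs = new HashMap<>();
    xattrs.put(XA_CONTENT_ENCODING, "gzip".getBytes(StandardCharsets.UTF_8));
    System.out.println(decodeXAttr(xattrs, XA_CONTENT_ENCODING));
  }
}
```

Returning null for a missing attribute keeps the "header not set" case distinguishable from an empty value, so an assertion on the result can report which of the two happened.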
...adoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AEncryptionSSEKMSUserDefinedKey.java
    initCannedAcls(getConf());

    // Any encoding type
    String contentEncoding = getConf().get(CONTENT_ENCODING, DEFAULT_CONTENT_ENCODING);
maybe use getTrimmed()
What would happen if I pass any random value via the property? Should there be a check of any kind that verifies if the passed value is valid?
any content type is valid, even application/x-shockwave-flash. there are probably some rules, but it's probably not worth trying to validate, as that will only create maintenance. This option should go into storediag for a bit of extra detail... I'll add it once this patch is in
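To show why getTrimmed() was suggested for reading the option, here is a small sketch of the difference, using a plain `Map` as a stand-in for Hadoop's `Configuration`. The `get` and `getTrimmed` methods below are simplified stand-ins written for this illustration, not the Hadoop implementations; the assumption is that the trimmed variant strips surrounding whitespace and falls back to the default only when the key is unset.

```java
import java.util.HashMap;
import java.util.Map;

final class TrimmedOptionSketch {

  static final String CONTENT_ENCODING = "fs.s3a.content.encoding";

  /** Stand-in for a plain lookup: returns the raw value or the default. */
  static String get(Map<String, String> conf, String key, String defVal) {
    String v = conf.get(key);
    return v == null ? defVal : v;
  }

  /** Stand-in for a trimmed lookup: same, but strips surrounding whitespace. */
  static String getTrimmed(Map<String, String> conf, String key, String defVal) {
    String v = conf.get(key);
    return v == null ? defVal : v.trim();
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    // Whitespace like this is easy to pick up from a hand-edited config file.
    conf.put(CONTENT_ENCODING, " gzip\n");
    System.out.println("[" + get(conf, CONTENT_ENCODING, null) + "]");
    System.out.println("[" + getTrimmed(conf, CONTENT_ENCODING, null) + "]");
  }
}
```

Since the value is passed through as a header without validation, trimming is the one cheap normalisation that avoids sending stray whitespace in it.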
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
    public class ITestS3AContentEncoding extends AbstractS3ATestBase {

      private static final Logger LOG =
          LoggerFactory.getLogger(ITestS3ACannedACLs.class);
Logger never used, and not the right class.
    String key = storeContext.pathToKey(path);
    String encoding = fs.getXAttrs(XA_CONTENT_ENCODING);
    Assertions.assertThat(encoding)
        .describedAs("Encoding of object %s is gzip", path)
the description here would represent what the error message would say in case of an assert failure. So, maybe something like "Mismatch in the encoding of object %s"?
    assertObjectHasEncoding(path);
    Path path2 = new Path(dir, "2");
    fs.rename(path, path2);
    assertObjectHasEncoding(path);
Did you mean to assert here for path2?
    if (StringUtils.isBlank(kmsKey) || !c.get(SERVER_SIDE_ENCRYPTION_ALGORITHM)
        .equals(S3AEncryptionMethods.CSE_KMS.name())) {
    String encryptionAlgorithm = c.get(SERVER_SIDE_ENCRYPTION_ALGORITHM);
    if (kmsKey == null || StringUtils.isBlank(kmsKey) || encryptionAlgorithm == null ||
Same as stated above.
    private boolean requesterPays = false;

    /** Content Encoding. */
    private String contentEncoding = null;
no need to set null.
@holdenk - we're looking at a 3.3.2 release very soon...can we get this in? There's only some minor nits left
Sure, sorry about that I've been moving and just sort of dropped all of my tasks on the floor in the same way the movers did with my computers :p But I've got some cycles this weekend I can hack on this now that my desktop boots again.
ouch
🎊 +1 overall
This message was automatically generated.
re-ran the content encoding tests since those changed during code review and they pass :)
🎊 +1 overall
This message was automatically generated.
@holdenk afraid other changes have broken the merge. Can you resync?
With #3498 merged, should this be closed?
you are right. will close this one
Adds support for user specified content encoding for S3A in the option
fs.s3a.content.encoding