Enable Select pushdown on uncompressed files#12633
arhimondr merged 2 commits into trinodb:master from preethiratnam:master
Conversation
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla.
pettyjamesm
left a comment
Overall LGTM, one question about the use of the default value in non-CSV contexts. @findepi can you approve the unit test run and trigger the specific test that causes this failure?
String getFieldDelimiter(Properties schema)
protected String getFieldDelimiter(Properties schema)
{
    return schema.getProperty(FIELD_DELIM, schema.getProperty(SERIALIZATION_FORMAT));
Out of curiosity, does it ever make sense to pass the default value of schema.getProperty(SERIALIZATION_FORMAT)? I'm not sure how that would work in this context.
Good point. I think as Select pushdown only supports CSV files at the moment, this method will always be overridden. It's possibly useful for other file formats in the future, so I left it as is.
What's an example value of schema.getProperty(SERIALIZATION_FORMAT)?
schema.getProperty(SERIALIZATION_FORMAT) holds Hive's default separators, which are single-byte control characters, e.g. byte 1 and byte 2 (code).
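For illustration, here is a minimal standalone sketch (not code from this PR) of how Hive's "1" default denotes the control character byte 1 rather than the literal digit:

```java
public class HiveDefaultDelimiter {
    public static void main(String[] args) {
        // Hive stores its default separators as decimal byte values in string form:
        // serialization.format = "1" means byte 1 (\u0001, Ctrl-A), and byte 2
        // (\u0002) is the default collection-item delimiter.
        String serializationFormat = "1";
        char fieldDelim = (char) Integer.parseInt(serializationFormat);
        System.out.println((int) fieldDelim); // prints 1, i.e. \u0001, not the character '1'
    }
}
```

The point of the bug discussed below is that the string "1" was forwarded as-is instead of being converted to the control character.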
@preethiratnam can you please check CI results?
Thank you, I've fixed the failing test now and all checks are passing. @findepi can you please review?
I emailed my signed CLA on 1-Jun; how long does it take to get updated? Do I need to raise a PR on https://github.com/trinodb/cla/blob/master/contributors? @findepi
@preethiratnam i posted #12702 to run the tests with secrets. please let me know if this is green too
Test PR with secrets: #12702
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectCsvRecordReader.java
@Override
protected String getFieldDelimiter(Properties schema)
{
    // Use the field delimiter only if it is specified in the schema. If not, use null (defaults to ',' in S3 Select).
What does it mean that null defaults to a comma?
I updated the description in the latest commit. Basically, when you send the field delimiter as null on the SelectObjectContent request to S3 Select, the S3 Select service uses comma as the field delimiter. Since this class only deals with CSV file formats, it looks to be a reasonable default to me.
// Use the field delimiter only if it is specified in the schema. If not, use null (defaults to ',' in S3 Select).
return schema.getProperty(FIELD_DELIM);
It sounds like we should have S3 Select tests for csv files with various delimiters, not only the default one.
Added in latest revision - we now test with an explicit comma (default) delimiter and pipe (non-default) delimiter on both compressed and uncompressed files.
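As a rough sketch of the contract under discussion (a hypothetical standalone helper, not the PR's actual class): return the schema's field.delim only when explicitly set, otherwise return null so S3 Select falls back to its own comma default.

```java
import java.util.Properties;

public class CsvDelimiterSketch {
    // Hypothetical mirror of the overridden method: no fallback to
    // serialization.format; a null result tells S3 Select to use its default ','.
    static String getFieldDelimiter(Properties schema) {
        return schema.getProperty("field.delim");
    }

    public static void main(String[] args) {
        Properties explicit = new Properties();
        explicit.setProperty("field.delim", "|");
        System.out.println(getFieldDelimiter(explicit)); // |

        Properties unset = new Properties();
        System.out.println(getFieldDelimiter(unset));    // null
    }
}
```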
The change looks input-delimiter-related, not compression-related, but it supposedly fixes S3 Select on uncompressed data. also, please make sure there is test coverage for {compressed, uncompressed} x {default delimiter, explicit default delimiter, non-default delimiter}.
Hi @martint I emailed the CLA on 1-June, but do not see myself listed on the Contributors yet. Can you please check? Thank you.
The issue here was that we saw incorrect results because the field delimiter was passed incorrectly to S3 Select. In the uncompressed file test input (test_table.csv), the data contained "1", which exposed the bug. Since I fixed the issue with the input delimiter, we are now seeing correct results for uncompressed files, so I have enabled S3 Select pushdown for uncompressed files as well. The latest commit has test coverage for all cases; the default delimiter case is already covered by the existing tests. @findepi could you please re-run the integration tests? Thank you!
@preethiratnam sorry, i cannot help with the CLA process.
Hi @findepi could you please run this PR with tests? I added a new commit yesterday. Thank you!
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectCsvRecordReader.java
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectPushdown.java
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectPushdown.java
@findepi I've addressed your review comments, can you please re-run the tests?
this should be one commit, right?
This should be done right already in the Enable Select pushdown on uncompressed files commit
Fix test as S3 Select now supports pushdown on uncompressed files
Add more tests for S3 Select with different field delimiters
Hi @findepi I've squashed the commits into one. Can you please review again? Thank you!
return getCompressionCodec((TextInputFormat) inputFormat, path)
        .map(codec -> (codec instanceof GzipCodec) || (codec instanceof BZip2Codec))
        .orElse(false); // TODO (https://github.com/trinodb/trino/issues/2475) fix S3 Select when file not compressed
        .orElse(true);
question: What if a file is compressed, but with a codec that is not supported? Maybe something like
Optional<Codec> codec = getCompressionCodec((TextInputFormat) inputFormat, path);
if (codec.isEmpty()) {
    // assume uncompressed
    return true;
}
Also, I wonder how safe it is to assume that a file is uncompressed if a codec is not found for a given extension? (I guess it is as expected, as I see similar assumptions being made in other parts of the code, but want to clarify.)
Hi Andrii, if a file is compressed with a codec that isn't supported, the codec would be different and this would return false:
codec -> (codec instanceof GzipCodec) || (codec instanceof BZip2Codec)
The default orElse(true) is only effective when the codec is null. So the method returns true only when codec is null (uncompressed), Gzip or Bzip2. We also have unit tests for the isCompressionCodecSupported method with different codec inputs.
Good point about the null codec assumption, though. I think it's reasonable: if there is no codec defined when a Hive table is created, it doesn't depend on codecs and is expected to have uncompressed files. This method internally uses Hadoop's CompressionCodecFactory, which I think is the standard.
Thank you so much for looking into this PR!
The default orElse(true) is only effective when the codec is null.
Oh, right. I totally misread it. Sounds good.
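The codec check discussed in this thread can be sketched with stand-in codec types (hypothetical classes, not Hadoop's real ones): an empty Optional means no codec was found for the file extension, so the file is treated as uncompressed and pushdown is allowed.

```java
import java.util.Optional;

public class CodecCheckSketch {
    // Stand-ins for Hadoop's codec classes, for illustration only.
    interface CompressionCodec {}
    static class GzipCodec implements CompressionCodec {}
    static class BZip2Codec implements CompressionCodec {}
    static class LzoCodec implements CompressionCodec {}

    static boolean isCompressionCodecSupported(Optional<CompressionCodec> codec) {
        return codec
                .map(c -> (c instanceof GzipCodec) || (c instanceof BZip2Codec))
                .orElse(true); // no codec found: assume uncompressed, which is now supported
    }

    public static void main(String[] args) {
        System.out.println(isCompressionCodecSupported(Optional.empty()));             // true
        System.out.println(isCompressionCodecSupported(Optional.of(new GzipCodec()))); // true
        System.out.println(isCompressionCodecSupported(Optional.of(new LzoCodec())));  // false
    }
}
```

An unsupported codec (the LzoCodec stand-in here) falls through the map to false, so only no-codec, Gzip, and Bzip2 files qualify.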
Description
This PR fixes issue #2475. We have identified the root cause to be a problem with the field delimiter passed in the S3 Select query. The Hive connector supplies a default field delimiter of byte 1 when it is not explicitly set in the Hive table. This is configured in LazySerDeParameters. The byte 1 value was getting passed as the literal String "1" in S3SelectCsvRecordReader.
In this case, the query to S3 Select used "1" as the field delimiter and caused unexpected results. This bug is present in both uncompressed and compressed file formats. It was detected in uncompressed files because the test input (test_table.csv) contained "1". However, the same issue can be reproduced with compressed files as well.
To fix this, I have removed the default field delimiter from S3SelectCsvRecordReader. We will now use the field delimiter from the schema only if it is explicitly set. In other cases, we will not pass any field delimiter, which defaults to a comma in S3 Select. This is a reasonable default for CSV file formats.
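To illustrate the failure mode described above (a hedged sketch with made-up data, not the actual test file): splitting on the literal string "1" instead of the control character \u0001 corrupts any row containing the digit 1.

```java
public class DelimiterBugDemo {
    public static void main(String[] args) {
        // A made-up row using Hive's real default delimiter \u0001:
        String row = "10\u0001cat";

        // Correct: split on the control character byte 1.
        String[] correct = row.split("\u0001");
        System.out.println(correct[0] + " / " + correct[1]); // 10 / cat

        // Buggy behavior from the description: the delimiter arrived as the
        // literal string "1", so the digit inside "10" was treated as a separator.
        String[] buggy = row.split("1");
        System.out.println(buggy.length); // 2, but the fields are wrong
        System.out.println("[" + buggy[0] + "][" + buggy[1] + "]");
    }
}
```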
Fix.
Hive S3 Select connector
Enables S3 Select for uncompressed files. Fixes a bug in S3 Select queries.
Related issues, pull requests, and links
Documentation
( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text: