Enable Select pushdown on uncompressed files #12633

Merged
arhimondr merged 2 commits into trinodb:master from preethiratnam:master
Jul 12, 2022

@preethiratnam
Contributor

@preethiratnam preethiratnam commented Jun 1, 2022

Description

This PR fixes issue #2475. We have identified the root cause to be a problem with the field delimiter passed in the S3 Select query. The Hive connector supplies a default field delimiter of byte 1 when it is not explicitly set in the Hive table. This is configured in LazySerDeParameters. The byte 1 value was being passed as the String "1" in S3SelectCsvRecordReader.

In this case, the query to S3 Select used "1" as the field delimiter and caused unexpected results. This bug is present in both uncompressed and compressed file formats. It was detected in uncompressed files because the test input (test_table.csv) contained "1". However, the same issue can be reproduced with compressed files as well.

To fix this, I have removed the default field delimiter from S3SelectCsvRecordReader. We will now use the field delimiter from the schema only if it is explicitly set. In other cases, we will not pass any field delimiter, which defaults to a comma in S3 Select. This is a reasonable default for CSV file formats.
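
A minimal sketch of the failure mode described above, assuming the table schema carries a `serialization.format=1` entry and no explicit `field.delim`. The class name and the `split` helper are hypothetical and only imitate how a CSV reader would tokenize a row for a given delimiter; they are not Trino or S3 Select code. The two `delimiter*` methods follow the `getFieldDelimiter` bodies shown in the diff.

```java
import java.util.Arrays;
import java.util.Properties;
import java.util.regex.Pattern;

class FieldDelimiterSketch
{
    // Property keys as named in Hive's serdeConstants
    static final String FIELD_DELIM = "field.delim";
    static final String SERIALIZATION_FORMAT = "serialization.format";

    // Before the fix: falls back to serialization.format, here the string "1"
    static String delimiterBeforeFix(Properties schema)
    {
        return schema.getProperty(FIELD_DELIM, schema.getProperty(SERIALIZATION_FORMAT));
    }

    // After the fix: only an explicitly configured delimiter is used, otherwise null
    static String delimiterAfterFix(Properties schema)
    {
        return schema.getProperty(FIELD_DELIM);
    }

    // Hypothetical stand-in for the service side: a null delimiter is treated as ','
    static String[] split(String row, String delimiter)
    {
        String effective = (delimiter == null) ? "," : delimiter;
        return row.split(Pattern.quote(effective), -1);
    }

    public static void main(String[] args)
    {
        Properties schema = new Properties();
        schema.setProperty(SERIALIZATION_FORMAT, "1"); // assumed default table metadata

        String row = "10,alice,21";
        // With "1" as the delimiter the row is cut at every digit 1
        System.out.println(Arrays.toString(split(row, delimiterBeforeFix(schema))));
        // With no delimiter passed, the comma default yields the three expected fields
        System.out.println(Arrays.toString(split(row, delimiterAfterFix(schema))));
    }
}
```

Rows that contain no "1" character come back as a single unsplit field in the first case, which is consistent with the bug going unnoticed until a test input happened to contain "1".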

Is this change a fix, improvement, new feature, refactoring, or other?

Fix.

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Hive S3 Select connector

How would you describe this change to a non-technical end user or system administrator?

Enables S3 Select for uncompressed files. Fixes a bug in S3 Select queries.

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
* Fix and reenable S3 Select for uncompressed files.

@cla-bot

cla-bot bot commented Jun 1, 2022

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla.

Member

@pettyjamesm pettyjamesm left a comment

Overall LGTM, one question about the use of the default value in non-CSV contexts. @findepi can you approve the unit test run and trigger the specific test that causes this failure?

- String getFieldDelimiter(Properties schema)
+ protected String getFieldDelimiter(Properties schema)
  {
      return schema.getProperty(FIELD_DELIM, schema.getProperty(SERIALIZATION_FORMAT));

Member

Out of curiosity, does it ever make sense to pass the default value of schema.getProperty(SERIALIZATION_FORMAT)? I'm not sure how that would work in this context.

Contributor Author

Good point. Since Select pushdown only supports CSV files at the moment, I think this method will always be overridden. It may be useful for other file formats in the future, so I left it as is.

Member

What's an example value of schema.getProperty(SERIALIZATION_FORMAT)?

Contributor Author

schema.getProperty(SERIALIZATION_FORMAT) holds the default separators, which are single-byte characters, e.g. byte 1, byte 2 (code).

@pettyjamesm pettyjamesm requested a review from findepi June 2, 2022 15:57
@findepi
Member

findepi commented Jun 4, 2022

@preethiratnam can you please check CI results?

@preethiratnam
Contributor Author

preethiratnam commented Jun 6, 2022

> @preethiratnam can you please check CI results?

Thank you, I've fixed the failing test now and all checks are passing. @findepi can you please review?

@preethiratnam
Contributor Author

> Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla.

I emailed my signed CLA on 1-Jun. How long does it take to get updated? Do I need to raise a PR on https://github.com/trinodb/cla/blob/master/contributors? @findepi

@findepi
Member

findepi commented Jun 6, 2022

@preethiratnam i posted #12702 to run the tests with secrets. please let me know if this is green too

@findepi
Member

findepi commented Jun 6, 2022

Test PR with secrets: #12702
(cc @ilfrin @nineinchnick for potentially streamlining this)

+ @Override
+ protected String getFieldDelimiter(Properties schema)
+ {
+     // Use the field delimiter only if it is specified in the schema. If not, use null (defaults to ',' in S3 Select).

Member

What does it mean that null defaults to a comma?

Contributor Author

I updated the description in the latest commit. Basically, when you send the field delimiter as null on the SelectObjectContent request to S3 Select, the S3 Select service uses comma as the field delimiter. Since this class only deals with CSV file formats, it looks to be a reasonable default to me.

Comment on lines +99 to +100
+     // Use the field delimiter only if it is specified in the schema. If not, use null (defaults to ',' in S3 Select).
+     return schema.getProperty(FIELD_DELIM);

Member

It sounds like we should have S3 Select tests for csv files with various delimiters, not only the default one.

Contributor Author

Added in latest revision - we now test with an explicit comma (default) delimiter and pipe (non-default) delimiter on both compressed and uncompressed files.

@findepi
Member

findepi commented Jun 7, 2022

The change looks input-delimiter-related, not compression-related, but it supposedly fixes S3 Select on uncompressed data.
What do the two (field delimiter and compression) have in common?

Also, please make sure there is test coverage for {compressed, uncompressed} x {default delimiter, explicit default delimiter, non-default delimiter}.

@preethiratnam
Contributor Author

Hi @martint I emailed the CLA on 1-June, but do not see myself listed on the Contributors yet. Can you please check? Thank you.

@preethiratnam
Contributor Author

> The change looks input-delimiter-related, not compression-related, but it supposedly fixes S3 Select on uncompressed data. What do the two (field delimiter and compression) have in common?
>
> Also, please make sure there is test coverage for {compressed, uncompressed} x {default delimiter, explicit default delimiter, non-default delimiter}.

The issue here was that we saw incorrect results as the field delimiter was passed incorrectly to S3 Select. In the uncompressed file test input test_table.csv, we had an input containing "1". This inadvertently helped catch the issue for uncompressed files. The same issue exists in compressed files as well, but we hadn't detected it so far, as none of the compressed file test inputs contained "1".

Since I fixed the issue with input delimiter, we are now seeing correct results for uncompressed files. So I have enabled S3 Select pushdown for uncompressed files as well.

The latest commit has test coverage for all cases. The default delimiter case is already covered by the testGetRecords method that was failing previously. I have added tests for explicit default delimiter and non-default delimiter.

@findepi could you please re-run the integration tests? Thank you!

@preethiratnam
Contributor Author

> Hi @martint I emailed the CLA on 1-June, but do not see myself listed on the Contributors yet. Can you please check? Thank you.

Also reaching out to @findepi, can you please help with this?

@findepi
Member

findepi commented Jun 9, 2022

@preethiratnam sorry, i cannot help with the CLA process.

@preethiratnam
Contributor Author

Hi @findepi could you please run this PR with tests? I added a new commit yesterday. Thank you!

@cla-bot cla-bot bot added the cla-signed label Jun 10, 2022
@preethiratnam
Contributor Author

@findepi I've addressed your review comments, can you please re-run the tests?

@martint martint requested a review from findepi June 14, 2022 18:53
@findepi
Member

findepi commented Jun 15, 2022

This should be done right already in the Enable Select pushdown on uncompressed files commit

Fix test as S3 Select now supports pushdown on uncompressed files

Add more tests for S3 Select with different field delimiters
@preethiratnam
Contributor Author

> Enable Select pushdown on uncompressed files
> Fix test as S3 Select now supports pushdown on uncompressed files
>
> this should be one commit, right?

Hi @findepi I've squashed the commits into one. Can you please review again? Thank you!

Contributor

@arhimondr arhimondr left a comment

LGTM % comment

  return getCompressionCodec((TextInputFormat) inputFormat, path)
          .map(codec -> (codec instanceof GzipCodec) || (codec instanceof BZip2Codec))
-         .orElse(false); // TODO (https://github.com/trinodb/trino/issues/2475) fix S3 Select when file not compressed
+         .orElse(true);

Contributor

Question: what if a file is compressed, but with a codec that is not supported? Maybe something like

Optional<Codec> codec = getCompressionCodec((TextInputFormat) inputFormat, path);
if (codec.isEmpty()) {
    // assume uncompressed
    return true;
}

Also, I wonder how safe it is to assume that a file is uncompressed if a codec is not found for a given extension. (I guess it is as expected, as I see similar assumptions being made in other parts of the code, but I want to clarify.)

Contributor Author

Hi Andrii, if a file is compressed with a codec that isn't supported, the codec would be different and this would return false:

codec -> (codec instanceof GzipCodec) || (codec instanceof BZip2Codec)

The default orElse(true) is only effective when the codec is null. So the method returns true only when codec is null (uncompressed), Gzip or Bzip2. We also have unit tests for the isCompressionCodecSupported method with different codec inputs.

Good point about the null codec assumption, though. I think it's reasonable: if no codec is defined when a Hive table is created, the table doesn't depend on codecs and is expected to have uncompressed files. This method internally uses Hadoop's CompressionCodecFactory, which I think is the standard.

Thank you so much for looking into this PR!

Contributor

> The default orElse(true) is only effective when the codec is null.

Oh, right. I totally misread it. Sounds good.
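
A minimal sketch of the codec check discussed in this thread, for readers who want to confirm the three outcomes. `Codec`, `GzipCodec`, `BZip2Codec`, and `SnappyCodec` below are hypothetical stand-ins for the Hadoop classes so that the example compiles without Hadoop on the classpath, and the method takes the already-resolved `Optional` rather than an input format and path. The method body follows the expression in the diff.

```java
import java.util.Optional;

class CodecSupportSketch
{
    // Stand-in types; the real code uses Hadoop's compression codec classes
    interface Codec {}
    static final class GzipCodec implements Codec {}
    static final class BZip2Codec implements Codec {}
    static final class SnappyCodec implements Codec {}

    // Present codec: supported only if Gzip or BZip2.
    // Absent codec: treated as an uncompressed file, which is supported after this change.
    static boolean isCompressionCodecSupported(Optional<Codec> codec)
    {
        return codec
                .map(c -> (c instanceof GzipCodec) || (c instanceof BZip2Codec))
                .orElse(true);
    }

    public static void main(String[] args)
    {
        System.out.println(isCompressionCodecSupported(Optional.empty()));
        System.out.println(isCompressionCodecSupported(Optional.of(new GzipCodec())));
        System.out.println(isCompressionCodecSupported(Optional.of(new SnappyCodec())));
    }
}
```

Because `map` runs only when a codec is present, an unsupported codec produces `Optional.of(false)` and `orElse(true)` is never consulted for it.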

Development

Successfully merging this pull request may close these issues.

Fix and reenable S3 Select for uncompressed files