Enable Select pushdown on uncompressed files by preethiratnam · Pull Request #12633 · trinodb/trino

preethiratnam · 2022-06-01T12:25:23Z

Description

This PR fixes issue #2475. The root cause is a problem with the field delimiter passed in the S3 Select query. When a Hive table does not set a field delimiter explicitly, the Hive connector supplies a default of byte 1, as configured in LazySerDeParameters. In S3SelectCsvRecordReader, that byte 1 value was being passed as the String "1".

In this case, the query to S3 Select used "1" as the field delimiter and caused unexpected results. This bug is present in both uncompressed and compressed file formats. It was detected in uncompressed files because the test input (test_table.csv) contained "1". However, the same issue can be reproduced with compressed files as well.

To fix this, I have removed the default field delimiter from S3SelectCsvRecordReader. We will now use the field delimiter from the schema only if it is explicitly set. In other cases, we will not pass any field delimiter, which defaults to a comma in S3 Select. This is a reasonable default for CSV file formats.
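The difference between the old and new delimiter lookup can be illustrated with a minimal, self-contained sketch. The class name and the property key strings below are assumptions made for illustration (the keys are assumed to match Hive's `field.delim` and `serialization.format`); this is not the actual Trino reader code, and it only reflects the behavior as described above.

```java
import java.util.Properties;

// Hypothetical stand-in for the delimiter lookup in the S3 Select record readers.
class FieldDelimiterSketch
{
    // Assumed property keys; in the real code these come from Hive's serde constants.
    static final String FIELD_DELIM = "field.delim";
    static final String SERIALIZATION_FORMAT = "serialization.format";

    // Old lookup: falls back to serialization.format when field.delim is not set.
    static String legacyFieldDelimiter(Properties schema)
    {
        return schema.getProperty(FIELD_DELIM, schema.getProperty(SERIALIZATION_FORMAT));
    }

    // Lookup described in this PR for CSV: only an explicitly set delimiter is returned.
    static String csvFieldDelimiter(Properties schema)
    {
        return schema.getProperty(FIELD_DELIM);
    }

    public static void main(String[] args)
    {
        Properties schema = new Properties();
        // The table sets no delimiter, so only the default (byte 1, seen as "1") is present.
        schema.setProperty(SERIALIZATION_FORMAT, "1");

        System.out.println(legacyFieldDelimiter(schema)); // the string "1"
        System.out.println(csvFieldDelimiter(schema));    // null, so no delimiter is sent
    }
}
```

With an explicit `field.delim` in the schema, both lookups return the same value, so tables that configure a delimiter are unaffected in this sketch.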

Is this change a fix, improvement, new feature, refactoring, or other?

Fix.

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Hive S3 Select connector

How would you describe this change to a non-technical end user or system administrator?

Enables S3 Select for uncompressed files. Fixes a bug in S3 Select queries.

Related issues, pull requests, and links

Fixes #2475 (Fix and reenable S3 Select for uncompressed files)

Documentation

( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
* Fix and reenable S3 Select for uncompressed files.

cla-bot · 2022-06-01T12:25:26Z

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla.

pettyjamesm

Overall LGTM, one question about the use of the default value in non-CSV contexts. @findepi can you approve the unit test run and trigger the specific test that causes this failure?

pettyjamesm · 2022-06-01T14:41:15Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectLineRecordReader.java

-    String getFieldDelimiter(Properties schema)
+    protected String getFieldDelimiter(Properties schema)
    {
        return schema.getProperty(FIELD_DELIM, schema.getProperty(SERIALIZATION_FORMAT));


Out of curiosity, does it ever make sense to pass the default value of schema.getProperty(SERIALIZATION_FORMAT)? I'm not sure how that would work in this context.

Good point. Since Select pushdown only supports CSV files at the moment, this method will always be overridden. It may be useful for other file formats in the future, so I left it as is.

What's an example value of the schema.getProperty(SERIALIZATION_FORMAT) ?

schema.getProperty(SERIALIZATION_FORMAT) holds the default separators, which are single-byte characters, e.g. byte 1, byte 2 (code).

findepi · 2022-06-04T19:13:46Z

@preethiratnam can you please check CI results?

preethiratnam · 2022-06-06T09:48:16Z

@preethiratnam can you please check CI results?

Thank you, I've fixed the failing test now and all checks are passing. @findepi can you please review?

preethiratnam · 2022-06-06T09:49:56Z

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla.

I emailed my signed CLA on 1-Jun, how long does it take to get updated? Do I need to raise a PR on https://github.com/trinodb/cla/blob/master/contributors? @findepi

findepi · 2022-06-06T14:11:17Z

@preethiratnam i posted #12702 to run the tests with secrets. please let me know if this is green too

findepi · 2022-06-06T20:42:10Z

Test PR with secrets: #12702
(cc @ilfrin @nineinchnick for potentially streamlining this)

findepi · 2022-06-07T14:23:22Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectCsvRecordReader.java

+    @Override
+    protected String getFieldDelimiter(Properties schema)
+    {
+        // Use the field delimiter only if it is specified in the schema. If not, use null (defaults to ',' in S3 Select).


What does it mean that null defaults to a comma?

I updated the description in the latest commit. Basically, when you send the field delimiter as null on the SelectObjectContent request to S3 Select, the S3 Select service uses comma as the field delimiter. Since this class only deals with CSV file formats, it looks to be a reasonable default to me.

findepi · 2022-06-07T14:23:55Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectCsvRecordReader.java

+        // Use the field delimiter only if it is specified in the schema. If not, use null (defaults to ',' in S3 Select).
+        return schema.getProperty(FIELD_DELIM);


It sounds like we should have S3 Select tests for csv files with various delimiters, not only the default one.

Added in latest revision - we now test with an explicit comma (default) delimiter and pipe (non-default) delimiter on both compressed and uncompressed files.

findepi · 2022-06-07T14:24:51Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectLineRecordReader.java

-    String getFieldDelimiter(Properties schema)
+    protected String getFieldDelimiter(Properties schema)
    {
        return schema.getProperty(FIELD_DELIM, schema.getProperty(SERIALIZATION_FORMAT));


What's an example value of the schema.getProperty(SERIALIZATION_FORMAT) ?

findepi · 2022-06-07T14:27:30Z

The change looks input-delimiter-related, not compression-related, but it supposedly fixes S3 Select on uncompressed data.
What do the two (field delimiter and compression) have in common?

Also, please make sure there is test coverage for {compressed, uncompressed} x {default delimiter, explicit default delimiter, non-default delimiter}.

preethiratnam · 2022-06-08T16:13:37Z

Hi @martint I emailed the CLA on 1-June, but do not see myself listed on the Contributors yet. Can you please check? Thank you.

preethiratnam · 2022-06-09T15:58:44Z

The change looks input-delimiter-related, not compression-related, but it supposedly fixes S3 Select on uncompressed data. What do the two (field delimiter and compression) have in common?

Also, please make sure there is test coverage for {compressed, uncompressed} x {default delimiter, explicit default delimiter, non-default delimiter}.

The issue here was that we saw incorrect results as the field delimiter was passed incorrectly to S3 Select. In the uncompressed file test input test_table.csv, we had an input containing "1". This inadvertently helped catch the issue for uncompressed files. The same issue exists in compressed files as well, but we hadn't detected it so far, as none of the compressed file test inputs contained "1".
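To make the effect concrete, here is a small local sketch of what happens when a row containing "1" is split on "1" instead of on a comma. The row value and the class name are hypothetical, and plain string splitting is only an approximation of how S3 Select parses CSV input.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical illustration; this does not call S3 Select.
class DelimiterEffectSketch
{
    // Splits a line on a literal delimiter, keeping empty leading and trailing fields.
    static String[] split(String line, String delimiter)
    {
        return line.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args)
    {
        String row = "1,apple,10"; // made-up row that contains the character '1'

        // Intended parse with a comma delimiter: three clean fields.
        System.out.println(Arrays.toString(split(row, ",")));

        // Parse with "1" as the delimiter: the row is cut at each '1' instead.
        System.out.println(Arrays.toString(split(row, "1")));
    }
}
```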

Since I fixed the issue with input delimiter, we are now seeing correct results for uncompressed files. So I have enabled S3 Select pushdown for uncompressed files as well.

The latest commit has test coverage for all cases. The default delimiter case is already covered by the testGetRecords method that was failing previously. I have added tests for explicit default delimiter and non-default delimiter.

@findepi could you please re-run the integration tests? Thank you!

preethiratnam · 2022-06-09T16:00:02Z

Hi @martint I emailed the CLA on 1-June, but do not see myself listed on the Contributors yet. Can you please check? Thank you.

Also reaching out to @findepi , can you please help with this?

findepi · 2022-06-09T20:37:44Z

@preethiratnam sorry, i cannot help with the CLA process.

preethiratnam · 2022-06-10T08:46:06Z

Hi @findepi could you please run this PR with tests? I added a new commit yesterday. Thank you!

preethiratnam · 2022-06-10T13:54:10Z

@findepi I've addressed your review comments, can you please re-run the tests?

findepi · 2022-06-15T10:10:01Z

Enable Select pushdown on uncompressed files

Fix test as S3 Select now supports pushdown on uncompressed files

this should be one commit, right?

findepi · 2022-06-15T10:10:15Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectCsvRecordReader.java

This should be done right already in the Enable Select pushdown on uncompressed files commit

findepi · 2022-06-15T10:10:23Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectPushdown.java

This should be done right already in the Enable Select pushdown on uncompressed files commit

Fix test as S3 Select now supports pushdown on uncompressed files Add more tests for S3 Select with different field delimiters

preethiratnam · 2022-06-30T13:07:37Z

Enable Select pushdown on uncompressed files
Fix test as S3 Select now supports pushdown on uncompressed files

this should be one commit, right?

Hi @findepi I've squashed the commits into one. Can you please review again? Thank you!

arhimondr

LGTM % comment

arhimondr · 2022-07-08T18:09:18Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectPushdown.java

            return getCompressionCodec((TextInputFormat) inputFormat, path)
                    .map(codec -> (codec instanceof GzipCodec) || (codec instanceof BZip2Codec))
-                    .orElse(false); // TODO (https://github.com/trinodb/trino/issues/2475) fix S3 Select when file not compressed
+                    .orElse(true);


question: What if a file is compressed, but with a codec that is not supported? Maybe something like

    Optional<Codec> codec = getCompressionCodec((TextInputFormat) inputFormat, path);
    if (codec.isEmpty()) {
        // assume uncompressed
        return true;
    }

Also, I wonder how safe it is to assume that a file is uncompressed if no codec is found for a given extension. (I guess it is as expected, since I see similar assumptions being made in other parts of the code, but I want to clarify.)

Hi Andrii, if a file is compressed with a codec that isn't supported, the codec would be different and this would return false:

codec -> (codec instanceof GzipCodec) || (codec instanceof BZip2Codec)

The default orElse(true) is only effective when the codec is null. So the method returns true only when codec is null (uncompressed), Gzip or Bzip2. We also have unit tests for the isCompressionCodecSupported method with different codec inputs.
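The `map(...).orElse(true)` behavior described here can be checked with a short standalone sketch. The codec types below are hypothetical stand-ins for Hadoop's codec classes, and `isSupported` is an illustrative name rather than the method in S3SelectPushdown.

```java
import java.util.Optional;

// Hypothetical stand-ins; the real code uses Hadoop's CompressionCodec implementations.
class CodecSupportSketch
{
    interface Codec {}

    static final class GzipCodec implements Codec {}

    static final class BZip2Codec implements Codec {}

    static final class SnappyCodec implements Codec {}

    // The lambda runs only when a codec is present; orElse(true) applies only when it is absent.
    static boolean isSupported(Optional<Codec> codec)
    {
        return codec
                .map(c -> (c instanceof GzipCodec) || (c instanceof BZip2Codec))
                .orElse(true);
    }

    public static void main(String[] args)
    {
        Codec gzip = new GzipCodec();
        Codec snappy = new SnappyCodec();

        System.out.println(isSupported(Optional.empty()));    // no codec: treated as uncompressed
        System.out.println(isSupported(Optional.of(gzip)));   // supported codec
        System.out.println(isSupported(Optional.of(snappy))); // present but unsupported codec
    }
}
```

A present but unsupported codec maps to `Optional.of(false)`, so `orElse(true)` never replaces it.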

Good point about the null codec assumption, though. I think it's reasonable: if no codec is defined when a Hive table is created, the table doesn't depend on codecs and is expected to contain uncompressed files. This method internally uses Hadoop's CompressionCodecFactory, which I think is the standard.

Thank you so much for looking into this PR!

The default orElse(true) is only effective when the codec is null.

Oh, right. I totally misread it. Sounds good.

github-actions bot added the tests:hive label Jun 1, 2022

pettyjamesm approved these changes Jun 1, 2022

pettyjamesm requested a review from findepi June 2, 2022 15:57

findepi mentioned this pull request Jun 6, 2022

Test: Enable Select pushdown on uncompressed files #12702

Closed

findepi reviewed Jun 7, 2022

findepi reviewed Jun 10, 2022

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectCsvRecordReader.java (outdated)

findepi reviewed Jun 10, 2022

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectPushdown.java (outdated)

findepi reviewed Jun 10, 2022

plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectPushdown.java (outdated)

cla-bot bot added the cla-signed label Jun 10, 2022

martint requested a review from findepi June 14, 2022 18:53

findepi reviewed Jun 15, 2022

preethiratnam added 2 commits June 30, 2022 12:19

Enable Select pushdown on uncompressed files

d5c1472

Fix test as S3 Select now supports pushdown on uncompressed files Add more tests for S3 Select with different field delimiters

Supply default field delimiter when invoking S3 Select

db3fd91

pettyjamesm assigned arhimondr Jul 8, 2022

arhimondr reviewed Jul 8, 2022

arhimondr approved these changes Jul 12, 2022

arhimondr merged commit 0cb9524 into trinodb:master Jul 12, 2022

github-actions bot added this to the 390 milestone Jul 12, 2022

colebow mentioned this pull request Jul 12, 2022

Add Trino 390 release notes #13130

Merged

dnanuti mentioned this pull request Nov 15, 2022

Hive Connector with Amazon S3 documentation updates #15035

Merged

dnanuti mentioned this pull request Jun 7, 2023

Correctness issues in S3 Select Pushdown #17775

Closed
