Add S3 Select scan range support for uncompressed input #18946

Merged
pettyjamesm merged 1 commit into prestodb:master from dnanuti:s3-select-scan-range
Mar 1, 2023

@dnanuti dnanuti commented Jan 19, 2023

Test plan - Locally tested (including Select logs) and added Docker tests.

Scan range allows S3 Select to query uncompressed files at a finer granularity than the entire object, by providing a byte range to SelectObjectContent requests. This change enables Hive internal splits for S3 Select pushdown on uncompressed input using this feature.
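
To illustrate the idea only, here is a minimal sketch of how an uncompressed object might be cut into byte ranges, one per split, while compressed input stays a single whole-object range. The class `ScanRangeSketch`, the `ByteRange` type, and the `splitObject` method are hypothetical names invented for this example and are not taken from the PR. How each range is mapped onto the start and end fields of a SelectObjectContent request is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the code added by this PR.
final class ScanRangeSketch {
    /** A byte range within one object: begins at {@code start} and covers {@code length} bytes. */
    public static final class ByteRange {
        public final long start;
        public final long length;

        ByteRange(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Cuts an object into consecutive ranges of at most targetSplitSize bytes.
     * Compressed input is never cut: it is returned as one whole-object range.
     */
    public static List<ByteRange> splitObject(long objectSize, long targetSplitSize, boolean compressed) {
        if (objectSize < 0 || targetSplitSize <= 0) {
            throw new IllegalArgumentException("objectSize must be >= 0 and targetSplitSize > 0");
        }
        List<ByteRange> ranges = new ArrayList<>();
        if (objectSize == 0) {
            return ranges; // nothing to scan
        }
        if (compressed) {
            ranges.add(new ByteRange(0, objectSize));
            return ranges;
        }
        for (long start = 0; start < objectSize; start += targetSplitSize) {
            long length = Math.min(targetSplitSize, objectSize - start);
            ranges.add(new ByteRange(start, length));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 250-byte uncompressed object with a 100-byte target split size.
        for (ByteRange range : splitObject(250, 100, false)) {
            System.out.println("start=" + range.start + " length=" + range.length);
        }
    }
}
```

In this sketch the ranges are contiguous and non-overlapping, so every byte of the object belongs to exactly one split.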

JSON Tests:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running TestSuite
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.fs.HadoopExtendedFileSystemCache (file:/Users/<user>/.m2/repository/com/facebook/presto/presto-hive-common/0.280-SNAPSHOT/presto-hive-common-0.280-SNAPSHOT.jar) to field java.lang.reflect.Field.modifiers
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.fs.HadoopExtendedFileSystemCache
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2023-02-01T13:23:25.090-0600 WARNING Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-02-01T13:23:25.209-0600 INFO Successfully loaded & initialized native-bzip2 library system-native
2023-02-01T13:23:25.215-0600 INFO Successfully loaded & initialized native-zlib library
2023-02-01T13:23:27.874-0600 WARNING NoSuchMethodException was thrown when disabling normalizeUri. This indicates you are using an old version (< 4.5.8) of Apache http client. It is recommended to use http client version >= 4.5.9 to avoid the breaking change introduced in apache client 4.5.7 and the latency in exception handling. See https://github.com/aws/aws-sdk-java/issues/1919 for more information
2023-02-01T13:23:29.263-0600 INFO io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2023-02-01T13:23:30.543-0600 INFO Got brand-new decompressor [.bz2]
2023-02-01T13:23:30.739-0600 INFO Got brand-new decompressor [.gz]
[INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 14.742 s - in TestSuite
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  42.725 s
[INFO] Finished at: 2023-02-01T19:23:36Z
[INFO] ------------------------------------------------------------------------

CSV tests:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running TestSuite
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.fs.HadoopExtendedFileSystemCache (file:/Users/<user>/.m2/repository/com/facebook/presto/presto-hive-common/0.280-SNAPSHOT/presto-hive-common-0.280-SNAPSHOT.jar) to field java.lang.reflect.Field.modifiers
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.fs.HadoopExtendedFileSystemCache
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2023-02-01T14:13:33.821-0600 WARNING Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-02-01T14:13:33.954-0600 INFO Successfully loaded & initialized native-bzip2 library system-native
2023-02-01T14:13:33.959-0600 INFO Successfully loaded & initialized native-zlib library
2023-02-01T14:13:35.687-0600 INFO io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2023-02-01T14:13:36.494-0600 WARNING NoSuchMethodException was thrown when disabling normalizeUri. This indicates you are using an old version (< 4.5.8) of Apache http client. It is recommended to use http client version >= 4.5.9 to avoid the breaking change introduced in apache client 4.5.7 and the latency in exception handling. See https://github.com/aws/aws-sdk-java/issues/1919 for more information
2023-02-01T14:13:38.336-0600 INFO Got brand-new decompressor [.bz2]
2023-02-01T14:13:38.770-0600 INFO Got brand-new decompressor [.gz]
2023-02-01T14:13:38.861-0600 INFO Got brand-new decompressor [.lz4]
2023-02-01T14:13:39.867-0600 INFO mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2023-02-01T14:13:39.872-0600 INFO mapred.output.compression.codec is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.codec
2023-02-01T14:13:52.815-0600 INFO mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2023-02-01T14:13:53.147-0600 INFO Got brand-new compressor [.gz]
2023-02-01T14:14:19.276-0600 INFO Got brand-new decompressor [.gz]
2023-02-01T14:14:19.277-0600 INFO Got brand-new decompressor [.gz]
2023-02-01T14:14:19.277-0600 INFO Got brand-new decompressor [.gz]
2023-02-01T14:14:32.543-0600 INFO Test com.facebook.presto.hive.s3select.TestHiveFileSystemS3SelectCsvPushdown::testTableCreation took 53.58s
2023-02-01T14:14:32.548-0600 WARNING Tests from com.facebook.presto.hive.s3select.TestHiveFileSystemS3SelectCsvPushdown took 1.04m
2023-02-01T14:14:46.758-0600 INFO Test com.facebook.presto.hive.TestHiveFileSystemS3::testTableCreation took 36.97s
[INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 77.221 s - in TestSuite
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:45 min
[INFO] Finished at: 2023-02-01T20:14:47Z
[INFO] ------------------------------------------------------------------------
== RELEASE NOTES ==

Hive Changes
* Enable Hive splits for uncompressed input with S3 Select pushdown by using the scan range feature of the service

@dnanuti dnanuti requested a review from a team as a code owner January 19, 2023 13:29
@dnanuti dnanuti requested a review from presto-oss January 19, 2023 13:29
@dnanuti dnanuti force-pushed the s3-select-scan-range branch 3 times, most recently from dbbdbaa to 60aaa22 Compare January 26, 2023 09:48
@dnanuti dnanuti changed the title from "WIP: S3 select scan range" to "S3 select scan range" Jan 26, 2023
@dnanuti dnanuti changed the title from "S3 select scan range" to "Add S3 Select scan range support for uncompressed input" Jan 26, 2023
@dnanuti dnanuti changed the title from "Add S3 Select scan range support for uncompressed input" to "Enable Hive splits with Select pushdown for uncompressed input" Jan 26, 2023
@dnanuti dnanuti changed the title from "Enable Hive splits with Select pushdown for uncompressed input" to "Add S3 Select scan range support for uncompressed input" Jan 26, 2023
@dnanuti dnanuti force-pushed the s3-select-scan-range branch 2 times, most recently from e922dcd to 1cf1a75 Compare January 26, 2023 10:26
Contributor

@pettyjamesm pettyjamesm left a comment

A few issues I spotted in addition to the specific comments here:

  • We should add a test for the splits generated for a bzip2-compressed JSON file (Bzip2Compressor): the InternalHiveSplitFactory logic would have marked it as "splittable", but as far as I understand the S3 Select limitations it should not be.
  • Not added in this or the earlier PRs, but S3SelectLineRecordReader is marked @ThreadSafe when it definitely is not: it mutates many internal fields without synchronization. We should remove that annotation.
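
The first point can be pictured with a small sketch. In Hadoop, bzip2 is a splittable codec (BZip2Codec implements SplittableCompressionCodec), so a generic splittability check can say yes for a .bz2 file even though scan ranges apply only to uncompressed input. The class and method names below are hypothetical, and detecting compression by file extension is a simplification assumed for this example; it is not how InternalHiveSplitFactory decides.

```java
import java.util.Locale;

// Hypothetical illustration of the mismatch described in the review; not Presto code.
final class SplittabilitySketch {
    /** Simplified assumption: compression is detected from the file extension alone. */
    static boolean isCompressed(String path) {
        String lower = path.toLowerCase(Locale.ROOT);
        return lower.endsWith(".gz") || lower.endsWith(".bz2") || lower.endsWith(".lz4");
    }

    /** Generic rule: uncompressed files and bzip2 files can be read from an offset. */
    static boolean splittableForPlainRead(String path) {
        return !isCompressed(path) || path.toLowerCase(Locale.ROOT).endsWith(".bz2");
    }

    /** Rule for S3 Select pushdown: only uncompressed input may be split into scan ranges. */
    static boolean splittableForSelectPushdown(String path) {
        return !isCompressed(path);
    }

    public static void main(String[] args) {
        String path = "data.json.bz2";
        System.out.println("plain read splittable: " + splittableForPlainRead(path));
        System.out.println("select pushdown splittable: " + splittableForSelectPushdown(path));
    }
}
```

A test along the lines requested in the review would assert that a bzip2-compressed JSON file yields a single split when pushdown is enabled.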

@rohanpednekar
Contributor

@dnanuti Thanks for your contribution. Do you think we can address the feedback? We can target this feature for the next release.

@dnanuti
Author

dnanuti commented Feb 1, 2023

@dnanuti Thanks for your contribution. Do you think we can address the feedback? We can target this feature for the next release.

Hey! Thanks for the update. This was de-prioritized on our side, but I'll try to update it by the end of the week. Hope this fits your schedule.

@dnanuti dnanuti force-pushed the s3-select-scan-range branch 2 times, most recently from 7261a29 to 416f89f Compare February 1, 2023 20:07
@dnanuti dnanuti force-pushed the s3-select-scan-range branch 8 times, most recently from e4d28ba to b79cf19 Compare February 28, 2023 15:37
@dnanuti dnanuti force-pushed the s3-select-scan-range branch from ef1c118 to e606131 Compare February 28, 2023 17:37
Scan range allows S3 Select to query uncompressed files
at a finer granularity than the entire object, by
providing a byte range to SelectObjectContent requests.
This change enables Hive internal splits for S3 Select
pushdown on uncompressed input using this feature.
@dnanuti dnanuti force-pushed the s3-select-scan-range branch from 27ae270 to b6f2e45 Compare February 28, 2023 18:05
Contributor

@pettyjamesm pettyjamesm left a comment

LGTM with the latest changes

@pettyjamesm pettyjamesm merged commit f3adc70 into prestodb:master Mar 1, 2023
@pettyjamesm
Contributor

Merged, thanks @dnanuti for the contribution!
