Enabled hive splits for uncompressed CSV files with S3 Select pushdown #13754

Merged
arhimondr merged 1 commit into trinodb:master from dnanuti:master
Aug 30, 2022

@dnanuti dnanuti commented Aug 19, 2022

Description

Scan range allows S3 Select to query uncompressed files at a finer granularity than the entire object by providing a byte range to SelectObjectContent requests. This change enables Hive internal splits for S3 Select by sending scan range requests for uncompressed CSV files.
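
For illustration, the shape of a scan range request can be sketched as below. This is a minimal Python sketch, not the Java code from this PR: the helper name, bucket, key, and SQL expression are hypothetical, and the parameter names follow boto3's `select_object_content` call. Treating `End` as an inclusive byte offset is an assumption based on the S3 Select documentation, not something this PR states.

```python
def build_scan_range_request(bucket, key, start, length,
                             expression="SELECT * FROM S3Object s"):
    """Build keyword arguments for boto3's S3 select_object_content call.

    Hypothetical helper for illustration only. It restricts the query to the
    byte range [start, start + length) of an uncompressed CSV object.
    """
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Scan ranges apply to uncompressed input, hence CompressionType NONE.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "NONE"},
            "CompressionType": "NONE",
        },
        "OutputSerialization": {"CSV": {}},
        # Assumption: End is treated as an inclusive byte offset.
        "ScanRange": {"Start": start, "End": start + length - 1},
    }


# Example: request parameters for the first 5 MB of a hypothetical object.
params = build_scan_range_request(
    "example-bucket", "data/example.csv", 0, 5 * 1024 * 1024
)
# With AWS credentials configured, the call would look like:
#   boto3.client("s3").select_object_content(**params)
```

Each split of a file would map to one such request with a different byte range.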

Is this change a fix, improvement, new feature, refactoring, or other?

This PR is a performance optimization for the Hive S3 Select connector with uncompressed CSV input, leveraging the scan range feature of the service. JSON support will be added in a separate PR.
File splitting is configurable on the client side through the existing session properties, such as:

set SESSION hive.max_initial_split_size='5MB';
set SESSION hive.max_split_size='7MB';
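
To give a rough sense of how these two sizes translate into byte ranges, here is a simplified Python sketch. It is not Trino's split planner, which is more involved: the function name and the `initial_split_count` parameter are hypothetical, and the sketch simply gives the first split(s) the initial size and every later split the maximum size.

```python
MB = 1024 * 1024


def plan_splits(file_size, max_initial_split_size, max_split_size,
                initial_split_count=1):
    """Return (start, length) byte ranges that cover a file of file_size bytes.

    Simplified illustration: the first initial_split_count splits are capped
    at max_initial_split_size, all later splits at max_split_size.
    """
    if file_size < 0:
        raise ValueError("file_size must be >= 0")
    if max_initial_split_size <= 0 or max_split_size <= 0:
        raise ValueError("split sizes must be > 0")
    splits = []
    start = 0
    while start < file_size:
        if len(splits) < initial_split_count:
            limit = max_initial_split_size
        else:
            limit = max_split_size
        # The last split takes whatever is left of the file.
        length = min(limit, file_size - start)
        splits.append((start, length))
        start += length
    return splits


# A hypothetical 20 MB file with the 5 MB / 7 MB settings shown above.
for start, length in plan_splits(20 * MB, 5 * MB, 7 * MB):
    print(f"start={start} length={length}")
```

Under these assumptions the 20 MB file yields four ranges of 5 MB, 7 MB, 7 MB, and 1 MB, each of which could be sent as its own scan range request.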

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Hive S3 Select connector

How would you describe this change to a non-technical end user or system administrator?

The Trino client returns results faster when S3 Select pushdown is enabled for uncompressed CSV files:
set SESSION hive.s3_select_pushdown_enabled=true;

Related issues, pull requests, and links

The previous PR, #13417, was accidentally closed by a wrong fork sync.

Documentation

( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
* Enabled Hive splits for S3 Select connector by leveraging the scan range feature of the service


@arhimondr arhimondr left a comment

LGTM % comments

@dnanuti dnanuti requested a review from findinpath August 24, 2022 13:50
@findinpath

nit: Please keep the number of chars per line in the commit detail less than 80 (as described in https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages)

dnanuti commented Aug 24, 2022

> nit: Please keep the number of chars per line in the commit detail less than 80 (as described in https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages)

Totally missed that, thanks a lot for flagging this, updated!

Scan range allows S3 Select to query uncompressed files at a finer granularity
than the entire object, by providing a byte range to SelectObjectContent
requests. This change enables hive internal splits for S3 Select by sending scan
range requests for uncompressed CSV files.
@arhimondr arhimondr merged commit 0b8d11c into trinodb:master Aug 30, 2022
@github-actions github-actions bot added this to the 395 milestone Aug 30, 2022