Skip to content

Add S3 Select pushdown for JSON files#13354

Merged
arhimondr merged 1 commit intotrinodb:masterfrom
preethiratnam:json-support
Aug 1, 2022
Merged

Add S3 Select pushdown for JSON files#13354
arhimondr merged 1 commit intotrinodb:masterfrom
preethiratnam:json-support

Conversation

@preethiratnam
Copy link
Copy Markdown
Contributor

Enable S3 Select pushdown for JSON files.

This change includes some refactoring of the IonSqlQueryBuilder to support query generation of both CSV and JSON files. Also upgrades the Hadoop testing image to 3.1 as it contains the JSONSerDe class.

The pushdown logic is restricted to only base columns, similar to CSV. S3 Select does support nested column filtering on JSON files, which we plan to enable in a later PR (to keep this PR's scope limited).

Description

Is this change a fix, improvement, new feature, refactoring, or other?

New feature

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Trino Hive Connector (S3 Select)

How would you describe this change to a non-technical end user or system administrator?

Enable S3 Select pushdown for JSON files.

Related issues, pull requests, and links

None

Documentation

( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
*  Add Amazon S3 Select pushdown for JSON files.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here. I wonder if this method should simply be static (or maybe even a private method of S3SelectRecordCursorProvider)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I made it a static method in the new commit. Did not want to make it a part of the S3SelectRecordCursorProvider as I wanted to keep the RecordReader separate from it.

@preethiratnam preethiratnam requested a review from arhimondr July 29, 2022 14:57
Copy link
Copy Markdown
Contributor

@arhimondr arhimondr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % nits

@arhimondr
Copy link
Copy Markdown
Contributor

Please update the commit messages according to the guildelines: https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages

@preethiratnam preethiratnam requested a review from arhimondr July 29, 2022 18:30
@github-actions github-actions bot added docs jdbc Relates to Trino JDBC driver release-notes labels Aug 1, 2022
@arhimondr
Copy link
Copy Markdown
Contributor

@preethiratnam We usually don't terminate a commit message with a dot. Could you please remove it? Once done I should be able to merge this.

@preethiratnam
Copy link
Copy Markdown
Contributor Author

@arhimondr Updated the commit message, thank you for reviewing!

@arhimondr arhimondr merged commit b97e7fa into trinodb:master Aug 1, 2022
@arhimondr
Copy link
Copy Markdown
Contributor

@preethiratnam Merged, thanks!

@github-actions github-actions bot added this to the 392 milestone Aug 1, 2022
@colebow
Copy link
Copy Markdown
Member

colebow commented Aug 1, 2022

Do we need user-facing docs for how to leverage this on this page?

@arhimondr

@arhimondr
Copy link
Copy Markdown
Contributor

@preethiratnam From what I understand it's an optimization and should be transparent to the end user, right?

@preethiratnam
Copy link
Copy Markdown
Contributor Author

Yes, that's right. JSON files will be pushed down to Select whenever hive.s3select-pushdown.enabled is set to true. @arhimondr

cleanup_hadoop_docker_containers

# Use Hadoop version 3.1 for S3 tests as the JSON SerDe class is not available in lower versions.
export HADOOP_BASE_IMAGE="ghcr.io/trinodb/testing/hdp3.1-hive"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @findepi (in case you weren't already aware).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hashhar

If I am reading this correctly, this removes the point of running these tests in a matrix, which we still do

source plugin/trino-hive-hadoop2/conf/hive-tests-${{ matrix.config }}.sh &&
plugin/trino-hive-hadoop2/bin/run_hive_s3_tests.sh

@arhimondr was it a conscious decision?
did you want to remove matrix from that CI job?

cc @nineinchnick @alexjo2144

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should either make testGetRecordsJson run only on HDP3 (via surefire config probably) or remove matrix (we probably shouldn't do this to maintain coverage across distros).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we should maintain the matrix. The Hive version may matter for what exactly gets created on S3.

@arhimondr @preethiratnam
can you please remove export HADOOP_BASE_IMAGE=... hack from run_hive_s3_tests.sh?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @findepi @hashhar I want to use the HADOOP_BASE_IMAGE 3.1 to run Hive S3 tests. This is because the JSONSerDe class is not available in the 2.6 version.

Is there a better way of configuring the JSON tests to use 3.1 version?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@preethiratnam We need to use test exclusions via maven profiles for that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll raise a new PR to fix this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi No, unfortunately it wasn't a conscious decision. I think I misread and assumed that it will only change the default value for S3 tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #13486 to fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed docs jdbc Relates to Trino JDBC driver release-notes

Development

Successfully merging this pull request may close these issues.

6 participants