Add S3 Select pushdown for JSON files #13354
arhimondr merged 1 commit into trinodb:master from preethiratnam:json-support
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectSerDeDataTypeMapper.java
Similar here. I wonder if this method should simply be static (or maybe even a private method of S3SelectRecordCursorProvider)
Yes, I made it a static method in the new commit. Did not want to make it a part of the S3SelectRecordCursorProvider as I wanted to keep the RecordReader separate from it.
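A minimal sketch of what a static SerDe-to-data-type mapper along these lines could look like. The class, enum, and method names here are hypothetical, and the SerDe class names in the map are assumptions for illustration; this is not the code from the PR.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical enum of the data formats the pushdown path is assumed to handle.
enum SelectDataType { CSV, JSON }

// Hypothetical utility holder: only static members, never instantiated.
final class SerDeDataTypeMapperSketch
{
    // Assumed SerDe class names; the real mapping may differ.
    private static final Map<String, SelectDataType> SERDE_TO_DATA_TYPE = Map.of(
            "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe", SelectDataType.CSV,
            "org.apache.hadoop.hive.serde2.JsonSerDe", SelectDataType.JSON);

    private SerDeDataTypeMapperSketch() {}

    // Returns the data type for a SerDe class name, or empty when the SerDe is not in the map.
    static Optional<SelectDataType> getDataType(String serdeName)
    {
        return Optional.ofNullable(SERDE_TO_DATA_TYPE.get(serdeName));
    }
}
```

Keeping the lookup static means callers such as a cursor provider need no mapper instance, while the mapping itself stays out of the record reader code.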
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/csv/CsvIonSqlQueryBuilder.java
plugin/trino-hive/src/test/java/io/trino/plugin/hive/s3select/TestIonSqlQueryBuilder.java
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/IonSqlQueryBuilder.java
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3select/S3SelectRecordCursorProvider.java
Please update the commit messages according to the guidelines: https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages
@preethiratnam We usually don't terminate a commit message with a dot. Could you please remove it? Once done I should be able to merge this.
@arhimondr Updated the commit message, thank you for reviewing!
@preethiratnam Merged, thanks!
Do we need user-facing docs for how to leverage this on this page?
@preethiratnam From what I understand it's an optimization and should be transparent to the end user, right?
Yes, that's right. JSON files will be pushed down to Select whenever
cleanup_hadoop_docker_containers

# Use Hadoop version 3.1 for S3 tests as the JSON SerDe class is not available in lower versions.
export HADOOP_BASE_IMAGE="ghcr.io/trinodb/testing/hdp3.1-hive"
Thanks @hashhar
If I am reading this correctly, this removes the point of running these tests in a matrix, which we still do:
trino/.github/workflows/ci.yml, lines 207 to 208 in b646676
@arhimondr was it a conscious decision? Did you want to remove the matrix from that CI job?
Yes, we should either make testGetRecordsJson run only on HDP3 (via surefire config probably) or remove matrix (we probably shouldn't do this to maintain coverage across distros).
Yeah, I think we should maintain the matrix. The Hive version may matter for what exactly gets created on S3.
@arhimondr @preethiratnam can you please remove the export HADOOP_BASE_IMAGE=... hack from run_hive_s3_tests.sh?
@preethiratnam We need to use test exclusions via maven profiles for that.
Thanks, I'll raise a new PR to fix this.
@findepi No, unfortunately it wasn't a conscious decision. I think I misread it and assumed that it would only change the default value for S3 tests.
Enable S3 Select pushdown for JSON files.
This change includes some refactoring of IonSqlQueryBuilder to support query generation for both CSV and JSON files. It also upgrades the Hadoop testing image to 3.1, as it contains the JSONSerDe class.
The pushdown logic is restricted to only base columns, similar to CSV. S3 Select does support nested column filtering on JSON files, which we plan to enable in a later PR (to keep this PR's scope limited).
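To illustrate the base-column restriction, here is a minimal sketch of how one query builder could generate S3 Select SQL for both formats. It assumes CSV columns are referenced by position (s._1, s._2, ...) and JSON columns by name (s.column_name). All class and method names are hypothetical, predicate pushdown (the WHERE clause) is left out for brevity, and this is not the refactored IonSqlQueryBuilder from the PR.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical base class: the SELECT skeleton is shared, only the column reference differs.
abstract class SelectQueryBuilderSketch
{
    // A projected base (top-level) column: its name and zero-based position in the table.
    static final class Column
    {
        final String name;
        final int position;

        Column(String name, int position)
        {
            this.name = name;
            this.position = position;
        }
    }

    // Format-specific way of referring to a base column.
    abstract String columnReference(Column column);

    final String buildSql(List<Column> projected)
    {
        if (projected.isEmpty()) {
            throw new IllegalArgumentException("at least one projected column is required in this sketch");
        }
        String columns = projected.stream()
                .map(this::columnReference)
                .collect(Collectors.joining(", "));
        return "SELECT " + columns + " FROM S3Object s";
    }
}

// CSV without a header row: columns are addressed by one-based position.
final class CsvQueryBuilderSketch extends SelectQueryBuilderSketch
{
    @Override
    String columnReference(Column column)
    {
        return "s._" + (column.position + 1);
    }
}

// JSON: base columns are addressed by field name; nested fields are out of scope here.
final class JsonQueryBuilderSketch extends SelectQueryBuilderSketch
{
    @Override
    String columnReference(Column column)
    {
        return "s." + column.name;
    }
}
```

For columns n_name (position 1) and n_regionkey (position 2), the CSV builder in this sketch yields SELECT s._2, s._3 FROM S3Object s and the JSON builder yields SELECT s.n_name, s.n_regionkey FROM S3Object s.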
Description
New feature
Trino Hive Connector (S3 Select)
Enable S3 Select pushdown for JSON files.
Related issues, pull requests, and links
None
Documentation
( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text: