Read parquet column chunks in small steps#15374
Merged
raunaqmorarka merged 6 commits intotrinodb:masterfrom Dec 23, 2022
Merged
Read parquet column chunks in small steps#15374raunaqmorarka merged 6 commits intotrinodb:masterfrom
raunaqmorarka merged 6 commits intotrinodb:masterfrom
Conversation
fb2380e to
8c1f314
Compare
8c1f314 to
5972765
Compare
lib/trino-parquet/src/main/java/io/trino/parquet/reader/PageReader.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetColumnChunkIterator.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetColumnChunkIterator.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetColumnChunkIterator.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetColumnChunkIterator.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/AbstractParquetDataSource.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/PageReader.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/PageReader.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/PrimitiveColumnReader.java
Outdated
Show resolved
Hide resolved
5972765 to
24af6be
Compare
1b732cf to
973e7e0
Compare
Member
Author
|
I added tests for |
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetColumnChunkIterator.java
Outdated
Show resolved
Hide resolved
lukasz-stec
commented
Dec 15, 2022
lib/trino-parquet/src/main/java/io/trino/parquet/AbstractParquetDataSource.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetColumnChunkIterator.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/AbstractParquetDataSource.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/AbstractParquetDataSource.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
973e7e0 to
21de362
Compare
lukasz-stec
commented
Dec 15, 2022
Member
Author
lukasz-stec
left a comment
There was a problem hiding this comment.
most comments addressed
lib/trino-parquet/src/main/java/io/trino/parquet/AbstractParquetDataSource.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetColumnChunkIterator.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetColumnChunkIterator.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetColumnChunkIterator.java
Outdated
Show resolved
Hide resolved
21de362 to
90dbd07
Compare
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetColumnChunkIterator.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java
Outdated
Show resolved
Hide resolved
90dbd07 to
20e0a82
Compare
lukasz-stec
commented
Dec 16, 2022
Member
Author
lukasz-stec
left a comment
There was a problem hiding this comment.
most comments addressed (javadoc for ChunkedInputStream still pending)
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/parquet/ParquetReaderConfig.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetColumnChunkIterator.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetColumnChunkIterator.java
Outdated
Show resolved
Hide resolved
b90c5b1 to
8ae0089
Compare
lukasz-stec
commented
Dec 20, 2022
Member
Author
lukasz-stec
left a comment
There was a problem hiding this comment.
comments addressed. Test of memory accounting are pending
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/AbstractParquetDataSource.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java
Outdated
Show resolved
Hide resolved
8ae0089 to
d28cd54
Compare
Member
Author
|
memory accounting test added in |
raunaqmorarka
approved these changes
Dec 20, 2022
TestParquetDataSource tests only a generic AbstractParquetDataSource logic so it should be in trino-parquet.
Before this change, parquet column chunks were read in one go, copying everything into one big Slice. This had two issues. One, for limit queries, we potentially don't need to read entire column chunk to finish the query as first page may satisfy the limit. Second, for files with big row group size the allocated Slice can exceed the jvm limits for native byte array, and even if it doesn't, it makes memory usage not efficient due to how humungous allocations are implemented in the jvm.
d28cd54 to
7be2201
Compare
Dith3r
reviewed
Dec 21, 2022
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Show resolved
Hide resolved
Dith3r
reviewed
Dec 21, 2022
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Show resolved
Hide resolved
Dith3r
reviewed
Dec 21, 2022
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ChunkedInputStream.java
Show resolved
Hide resolved
raunaqmorarka
approved these changes
Dec 21, 2022
lib/trino-parquet/src/main/java/io/trino/parquet/AbstractParquetDataSource.java
Outdated
Show resolved
Hide resolved
We can allow for a big DiskRange to be passed to the ParquetDataSource.planRead, since it's going to split the ranges into small chunks anyway.
5914833 to
b581c52
Compare
Dith3r
approved these changes
Dec 21, 2022
Comments were addressed. Thanks @lukasz-stec, @raunaqmorarka
Member
Author
|
Benchmark results for tpch/tpcds parquet sf1k partitioned with default |
This was referenced Dec 23, 2022
Closed
This was referenced Dec 29, 2022
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Description
Currently,
ParquetReaderwill read the entire row group at once. This means that:limit Nqueries will take longer time than needed because the reader might read hundreds of megabytes before returning a few rows.Reader might allocate and read hundreds of megabytes at once. This is problematic because:
The issue can especially be seen when Parquet row group size is large.
This PR fixes this by reading parquet column chunks in small (8MB by default) pieces.
Additional context and related issues
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( X) Release notes are required, with the following suggested text: