Skip to content

Improve native parquet writer performance for flat types#13030

Merged
raunaqmorarka merged 3 commits intotrinodb:masterfrom
raunaqmorarka:pq-iter
Jul 4, 2022
Merged

Improve native parquet writer performance for flat types#13030
raunaqmorarka merged 3 commits intotrinodb:masterfrom
raunaqmorarka:pq-iter

Conversation

@raunaqmorarka
Copy link
Copy Markdown
Member

@raunaqmorarka raunaqmorarka commented Jun 29, 2022

Description

Improve native parquet writer performance for flat types

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

native parquet writer

How would you describe this change to a non-technical end user or system administrator?

Improve native parquet writer performance for flat types

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Iceberg, Delta Lake and Hive
* Improve native parquet writer performance for flat types. ({issue}`13030`)

@cla-bot cla-bot bot added the cla-signed label Jun 29, 2022
@raunaqmorarka raunaqmorarka marked this pull request as ready for review June 29, 2022 09:55
Copy link
Copy Markdown
Member

@alexjo2144 alexjo2144 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked the spec, but this implies that rep level is optional, but def level is not?

@raunaqmorarka
Copy link
Copy Markdown
Member Author

I haven't checked the spec, but this implies that rep level is optional, but def level is not?

Definition level can also be skipped when it's a required primitive (max definition level is 0). However, I don't see a way to practically encounter that in Trino parquet writer right now as we don't have a way to define always non-null column in Trino (please correct if I'm wrong). If we do have such a case, the ValuesWriter will be a DevNullValuesWriter and we'll still benefit avoiding the complexity of DefLevelIterables.

Copy link
Copy Markdown
Member

@skrzypo987 skrzypo987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % comment about definition level = 0

@raunaqmorarka raunaqmorarka requested a review from sopel39 July 1, 2022 07:37
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add a test?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParquetTester#testRoundTrip already tests with and without nulls
AbstractTestParquetReader also has tests with flat as well as nested types

Copy link
Copy Markdown
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comments

Before
Benchmark (benchmarkFileFormat) (compression) (dataSet)     Score        Error
write     LINEITEM              NONE          TRINO_PARQUET  62.4MB/s ±  2731.4kB/s ( 4.27%) (N = 45, α = 99.9%)
write     BIGINT_SEQUENTIAL     NONE          TRINO_PARQUET 114.3MB/s ±  4368.2kB/s ( 3.73%) (N = 45, α = 99.9%)

After
write     LINEITEM              NONE          TRINO_PARQUET  85.0MB/s ±  2937.1kB/s ( 3.37%) (N = 45, α = 99.9%)
write     BIGINT_SEQUENTIAL     NONE          TRINO_PARQUET 159.7MB/s ±  2377.8kB/s ( 1.45%) (N = 45, α = 99.9%)
Benchmark (benchmarkFileFormat) (compression) (dataSet)     Score        Error
write     LINEITEM              NONE          TRINO_PARQUET 117.1MB/s ±  5061.0kB/s ( 4.22%) (N = 45, α = 99.9%)
write     BIGINT_SEQUENTIAL     NONE          TRINO_PARQUET 305.5MB/s ±  6889.1kB/s ( 2.20%) (N = 45, α = 99.9%)
@raunaqmorarka raunaqmorarka merged commit fc7d92a into trinodb:master Jul 4, 2022
@raunaqmorarka raunaqmorarka deleted the pq-iter branch July 4, 2022 13:46
@github-actions github-actions bot added this to the 389 milestone Jul 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants