Skip to content

Refactor DefLevelIterables to improve optimized parquet writer#13714

Merged
raunaqmorarka merged 2 commits intotrinodb:masterfrom
raunaqmorarka:pqw-def
Sep 8, 2022
Merged

Refactor DefLevelIterables to improve optimized parquet writer#13714
raunaqmorarka merged 2 commits intotrinodb:masterfrom
raunaqmorarka:pqw-def

Conversation

@raunaqmorarka
Copy link
Copy Markdown
Member

@raunaqmorarka raunaqmorarka commented Aug 17, 2022

Description

Avoid iterators, streams and optionals when writing definition levels
to improve performance.

Benchmark HiveFileFormat#write (compression NONE, benchmarkFileFormat TRINO_PARQUET)
Before
MAP_VARCHAR_DOUBLE              80.6MB/s ±  1964.5kB/s ( 2.38%) (N = 45, α = 99.9%)
LARGE_MAP_VARCHAR_DOUBLE       108.4MB/s ±  3725.3kB/s ( 3.36%) (N = 45, α = 99.9%)
MAP_INT_DOUBLE                  90.6MB/s ±  1461.7kB/s ( 1.58%) (N = 45, α = 99.9%)
LARGE_MAP_INT_DOUBLE            94.4MB/s ±  1490.0kB/s ( 1.54%) (N = 45, α = 99.9%)
LARGE_ARRAY_VARCHAR             91.9MB/s ±  1458.6kB/s ( 1.55%) (N = 45, α = 99.9%)

After
MAP_VARCHAR_DOUBLE             114.9MB/s ±  5665.1kB/s ( 4.82%) (N = 45, α = 99.9%)
LARGE_MAP_VARCHAR_DOUBLE       136.8MB/s ±  3532.4kB/s ( 2.52%) (N = 45, α = 99.9%)
MAP_INT_DOUBLE                 114.9MB/s ±  3012.9kB/s ( 2.56%) (N = 45, α = 99.9%)
LARGE_MAP_INT_DOUBLE           124.3MB/s ±  3292.7kB/s ( 2.59%) (N = 45, α = 99.9%)
LARGE_ARRAY_VARCHAR            102.9MB/s ±  2475.0kB/s ( 2.35%) (N = 45, α = 99.9%)

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

optimized parquet writer

How would you describe this change to a non-technical end user or system administrator?

Improves performance of writes through optimized parquet writer for nested data types.

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Iceberg
* Improve performance when writing Parquet files with [structural data types](https://trino.io/docs/current/language/types.html#structural-data-types). ({issue}`13714`)

# Delta Lake, Hive
* Improve optimized Parquet writer performance for [structural data types](https://trino.io/docs/current/language/types.html#structural-data-types). ({issue}`13714`)

Copy link
Copy Markdown
Member

@skrzypo987 skrzypo987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind adding a commit where DefLevelIterables are renamed to DefLevelWriterProviders only for the sake of easier reviewing and then squashing it before the merge.
Right now most of the code disappears and reemmeger in a different place which is difficult to review

@raunaqmorarka
Copy link
Copy Markdown
Member Author

Do you mind adding a commit where DefLevelIterables are renamed to DefLevelWriterProviders only for the sake of easier reviewing and then squashing it before the merge.
Right now most of the code disappears and reemmeger in a different place which is difficult to review

Done

@skrzypo987
Copy link
Copy Markdown
Member

Do you mind adding a commit where DefLevelIterables are renamed to DefLevelWriterProviders only for the sake of easier reviewing and then squashing it before the merge.
Right now most of the code disappears and reemmeger in a different place which is difficult to review

Done

Wow. That was fast

Copy link
Copy Markdown
Member

@skrzypo987 skrzypo987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks legit % Someone that knows ORC standard more should also take a look.
Did you run macrobenchmarks?

@raunaqmorarka
Copy link
Copy Markdown
Member Author

Looks legit % Someone that knows ORC standard more should also take a look.
Did you run macrobenchmarks?

Yes, I saw about 13% reduction in CPU time in our in-house benchmarks
Screenshot 2022-08-19 at 12 27 16 PM

@skrzypo987
Copy link
Copy Markdown
Member

Yes, I saw about 13% reduction in CPU time in our in-house benchmarks <img alt="Screenshot 2022-08-19 at 12 27 16 PM"

#dancingBananaEmoji

Copy link
Copy Markdown
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add more testing for interleaved data (UT for io.trino.parquet.writer.PrimitiveColumnWriter#writeBlock), e.g:

            // 1 -> non-null, 2 -> non-null, 3 -> null

            //row1: a: {b: {c: null}}}   :head+2 writes 3
            //row2: a: {b: null}}        :head+1 writes 2
            //row3: a: null              :head writes 1
            //row4: a: ...
  • a maybe not have nulls, but only child has nulls
  • a, b have null interleaved
  • no rows

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: separate commit

@raunaqmorarka raunaqmorarka force-pushed the pqw-def branch 3 times, most recently from 47c9e04 to a5a4ad7 Compare September 6, 2022 05:51
Avoid iterators, streams and optionals when writing definition levels
to improve performance.

Before
Benchmark HiveFileFormat#write (compression NONE, benchmarkFileFormat TRINO_PARQUET)
MAP_VARCHAR_DOUBLE              80.6MB/s ±  1964.5kB/s ( 2.38%) (N = 45, α = 99.9%)
LARGE_MAP_VARCHAR_DOUBLE       108.4MB/s ±  3725.3kB/s ( 3.36%) (N = 45, α = 99.9%)
MAP_INT_DOUBLE                  90.6MB/s ±  1461.7kB/s ( 1.58%) (N = 45, α = 99.9%)
LARGE_MAP_INT_DOUBLE            94.4MB/s ±  1490.0kB/s ( 1.54%) (N = 45, α = 99.9%)
LARGE_ARRAY_VARCHAR             91.9MB/s ±  1458.6kB/s ( 1.55%) (N = 45, α = 99.9%)

After
MAP_VARCHAR_DOUBLE             114.9MB/s ±  5665.1kB/s ( 4.82%) (N = 45, α = 99.9%)
LARGE_MAP_VARCHAR_DOUBLE       136.8MB/s ±  3532.4kB/s ( 2.52%) (N = 45, α = 99.9%)
MAP_INT_DOUBLE                 114.9MB/s ±  3012.9kB/s ( 2.56%) (N = 45, α = 99.9%)
LARGE_MAP_INT_DOUBLE           124.3MB/s ±  3292.7kB/s ( 2.59%) (N = 45, α = 99.9%)
LARGE_ARRAY_VARCHAR            102.9MB/s ±  2475.0kB/s ( 2.35%) (N = 45, α = 99.9%)
@raunaqmorarka raunaqmorarka merged commit 44178f7 into trinodb:master Sep 8, 2022
@raunaqmorarka raunaqmorarka deleted the pqw-def branch September 8, 2022 04:07
@github-actions github-actions bot added this to the 396 milestone Sep 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants