Refactor DefLevelIterables to improve optimized parquet writer #13714
raunaqmorarka merged 2 commits into trinodb:master
skrzypo987
left a comment
Do you mind adding a commit where DefLevelIterables is renamed to DefLevelWriterProviders, only for the sake of easier reviewing, and then squashing it before the merge?
Right now most of the code disappears and reemerges in a different place, which is difficult to review.
Done
Wow. That was fast
skrzypo987
left a comment
Looks legit % someone who knows the ORC standard better should also take a look.
Did you run macrobenchmarks?
#dancingBananaEmoji
Can we add more tests for interleaved data (a unit test for io.trino.parquet.writer.PrimitiveColumnWriter#writeBlock), e.g.:
// 1 -> non-null, 2 -> non-null, 3 -> null
// row1: a: {b: {c: null}} : head+2 writes 3
// row2: a: {b: null} : head+1 writes 2
// row3: a: null : head writes 1
// row4: a: ...
- a may have no nulls while only its child has nulls
- a and b have nulls interleaved
- no rows
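To make the requested cases concrete, here is a minimal sketch of how definition levels could be derived for a three-level optional path a.b.c. It is not the writer's actual code: the class name, the method, and the per-level null-flag arrays are hypothetical. The numbering follows the usual Parquet convention (the level counts how many optional fields along the path are non-null) and is not meant to match the shorthand in the comment above.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Trino: derives definition levels for
// optional group a { optional group b { optional c } } from null flags.
final class NestedDefLevelSketch
{
    // aNull[i], bNull[i], cNull[i] say whether a, b, c are null in row i.
    // Flags of deeper fields are ignored once an ancestor is null.
    static int[] definitionLevels(boolean[] aNull, boolean[] bNull, boolean[] cNull)
    {
        int[] levels = new int[aNull.length];
        for (int i = 0; i < aNull.length; i++) {
            if (aNull[i]) {
                levels[i] = 0; // a is null
            }
            else if (bNull[i]) {
                levels[i] = 1; // a defined, b null
            }
            else if (cNull[i]) {
                levels[i] = 2; // a and b defined, c null
            }
            else {
                levels[i] = 3; // value present
            }
        }
        return levels;
    }

    public static void main(String[] args)
    {
        // row1: a: {b: {c: null}}, row2: a: {b: null}, row3: a: null, row4: a: {b: {c: 7}}
        boolean[] aNull = {false, false, true, false};
        boolean[] bNull = {false, true, true, false};
        boolean[] cNull = {true, true, true, false};
        System.out.println(Arrays.toString(definitionLevels(aNull, bNull, cNull))); // prints [2, 1, 0, 3]
    }
}
```

A unit test along these lines would feed interleaved flags, a case where only the innermost field has nulls, and empty arrays for the "no rows" case.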
lib/trino-parquet/src/main/java/io/trino/parquet/writer/PrimitiveColumnWriter.java
47c9e04 to a5a4ad7
a5a4ad7 to 858baf3
Avoid iterators, streams and optionals when writing definition levels
to improve performance.

Before
Benchmark HiveFileFormat#write
(compression NONE, benchmarkFileFormat TRINO_PARQUET)
MAP_VARCHAR_DOUBLE         80.6MB/s ± 1964.5kB/s ( 2.38%) (N = 45, α = 99.9%)
LARGE_MAP_VARCHAR_DOUBLE  108.4MB/s ± 3725.3kB/s ( 3.36%) (N = 45, α = 99.9%)
MAP_INT_DOUBLE             90.6MB/s ± 1461.7kB/s ( 1.58%) (N = 45, α = 99.9%)
LARGE_MAP_INT_DOUBLE       94.4MB/s ± 1490.0kB/s ( 1.54%) (N = 45, α = 99.9%)
LARGE_ARRAY_VARCHAR        91.9MB/s ± 1458.6kB/s ( 1.55%) (N = 45, α = 99.9%)

After
MAP_VARCHAR_DOUBLE        114.9MB/s ± 5665.1kB/s ( 4.82%) (N = 45, α = 99.9%)
LARGE_MAP_VARCHAR_DOUBLE  136.8MB/s ± 3532.4kB/s ( 2.52%) (N = 45, α = 99.9%)
MAP_INT_DOUBLE            114.9MB/s ± 3012.9kB/s ( 2.56%) (N = 45, α = 99.9%)
LARGE_MAP_INT_DOUBLE      124.3MB/s ± 3292.7kB/s ( 2.59%) (N = 45, α = 99.9%)
LARGE_ARRAY_VARCHAR       102.9MB/s ± 2475.0kB/s ( 2.35%) (N = 45, α = 99.9%)
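For readers who want the relative change rather than raw throughput, the following small sketch computes the percentage gain from the mean MB/s figures quoted in the commit message. The class and method names are made up for illustration, and the calculation ignores the reported error margins, so treat the results as rough indications only.

```java
// Hypothetical helper: relative write-throughput gain from the mean
// values in the commit message (error margins are not accounted for).
final class WriteThroughputGain
{
    static double percentGain(double beforeMbPerSec, double afterMbPerSec)
    {
        return (afterMbPerSec / beforeMbPerSec - 1.0) * 100.0;
    }

    public static void main(String[] args)
    {
        String[] dataSets = {
                "MAP_VARCHAR_DOUBLE",
                "LARGE_MAP_VARCHAR_DOUBLE",
                "MAP_INT_DOUBLE",
                "LARGE_MAP_INT_DOUBLE",
                "LARGE_ARRAY_VARCHAR"};
        double[] before = {80.6, 108.4, 90.6, 94.4, 91.9};
        double[] after = {114.9, 136.8, 114.9, 124.3, 102.9};
        for (int i = 0; i < dataSets.length; i++) {
            System.out.printf("%-26s %+.1f%%%n", dataSets[i], percentGain(before[i], after[i]));
        }
    }
}
```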
858baf3 to 390448e

Description
Avoid iterators, streams and optionals when writing definition levels
to improve performance.
improvement
optimized parquet writer
Improves performance of writes through optimized parquet writer for nested data types.
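As a rough illustration of the idea in the description, the sketch below contrasts an iterator-and-Optional style of producing definition levels with a plain loop over a null-flag array. It is only a sketch under assumptions: the class, the method names, and the flat single-column model are hypothetical and do not reproduce the DefLevelIterables or DefLevelWriterProviders code in this PR.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Optional;

// Hypothetical comparison of two ways to emit definition levels for one
// optional column; names and shapes are illustrative, not Trino's API.
final class DefLevelStyles
{
    // Iterator style: every position allocates an Optional and boxes an Integer.
    // An empty Optional means "value present, use the maximum level".
    static Iterator<Optional<Integer>> iteratorStyle(final boolean[] isNull, final int nullLevel)
    {
        return new Iterator<Optional<Integer>>()
        {
            private int position;

            @Override
            public boolean hasNext()
            {
                return position < isNull.length;
            }

            @Override
            public Optional<Integer> next()
            {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return isNull[position++] ? Optional.of(nullLevel) : Optional.<Integer>empty();
            }
        };
    }

    static int[] writeWithIterator(boolean[] isNull, int maxLevel)
    {
        int[] out = new int[isNull.length];
        Iterator<Optional<Integer>> levels = iteratorStyle(isNull, maxLevel - 1);
        for (int i = 0; levels.hasNext(); i++) {
            out[i] = levels.next().orElse(maxLevel);
        }
        return out;
    }

    // Direct style: one primitive loop, no per-position allocation or boxing.
    static int[] writeDirect(boolean[] isNull, int maxLevel)
    {
        int[] out = new int[isNull.length];
        for (int i = 0; i < isNull.length; i++) {
            out[i] = isNull[i] ? maxLevel - 1 : maxLevel;
        }
        return out;
    }
}
```

Both methods return the same levels for the same input; the direct version simply avoids the intermediate objects, which is the kind of overhead the description refers to.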
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text: