
Fix parquet reader batch size calculation#14094

Merged
raunaqmorarka merged 1 commit into trinodb:master from kabunchi:fix-parquet-reader-max-bytes
Sep 13, 2022

Conversation

@kabunchi

@kabunchi kabunchi commented Sep 11, 2022

Description

Fixed regression in calculation of maxBytesPerCell that
caused maxBatchSize to be small and degrade performance

Non-technical explanation

Fixes possible perf regression in parquet reader introduced by a change in #13757

Release notes

( ) This is not user-visible and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Iceberg, Delta
* Fix regression in performance of reading parquet files. ({issue}`14094`)

@sopel39
Member

sopel39 commented Sep 11, 2022

Could you elaborate on the regression that this fixes?

@kabunchi
Author

The issue here is that since we never kept the maxBytesPerCell value, we always assumed we had hit the max and called this line:
maxCombinedBytesPerRow = maxCombinedBytesPerRow - maxBytesPerCell.getOrDefault(fieldId, 0L) + bytesPerCell;
over and over, while maxBytesPerCell.getOrDefault(fieldId, 0L) always returned zero.
As a result, we miscalculated maxBatchSize to be lower than needed, which causes a significantly larger number of getNextPage calls from this pageSource.
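The mechanism described above can be illustrated with a minimal standalone sketch (this is not the actual Trino code; the `accumulate` helper, its guard condition, and the single-field setup are simplifications for illustration). The original PR title suggests the bug was calling `Map.replace` where `Map.put` was needed: `replace` is a no-op when the key is absent, so the per-cell maximum was never recorded, and the subtraction term stayed zero on every batch.

```java
import java.util.HashMap;
import java.util.Map;

public class MaxBytesPerCellDemo
{
    // Simulates the per-batch accumulation for a single field (fieldId 0).
    // With the buggy `replace`, the key is never inserted, so
    // getOrDefault(fieldId, 0L) always returns 0 and maxCombinedBytesPerRow
    // grows on every batch instead of tracking the current per-cell maximum.
    static long accumulate(Map<Integer, Long> maxBytesPerCell, boolean useBuggyReplace, long[] observedBytesPerCell)
    {
        int fieldId = 0;
        long maxCombinedBytesPerRow = 0;
        for (long bytesPerCell : observedBytesPerCell) {
            if (bytesPerCell > maxBytesPerCell.getOrDefault(fieldId, 0L)) {
                maxCombinedBytesPerRow = maxCombinedBytesPerRow - maxBytesPerCell.getOrDefault(fieldId, 0L) + bytesPerCell;
                if (useBuggyReplace) {
                    maxBytesPerCell.replace(fieldId, bytesPerCell); // no-op: key was never inserted
                }
                else {
                    maxBytesPerCell.put(fieldId, bytesPerCell); // fix: insert or update the per-cell max
                }
            }
        }
        return maxCombinedBytesPerRow;
    }
}
```

With three batches each observing 100 bytes per cell, the buggy path adds 100 three times (300), while the fixed path correctly stabilizes at 100; an inflated maxCombinedBytesPerRow then shrinks maxBatchSize, producing many more, smaller pages.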

@skrzypo987
Member

Is it possible to write a unit test that shows the regression?

@skrzypo987 skrzypo987 self-requested a review September 11, 2022 17:38
@sopel39 sopel39 added the bug label Sep 12, 2022
@raunaqmorarka
Member

Is it possible to write a unit test that shows the regression?

@kabunchi could you try testing this by checking size of pages produced by ParquetReader#nextPage ?

Fixed regression in calculation of maxBytesPerCell that
caused maxBatchSize to be small and degrade performance
@kabunchi
Author

Took a look at testing; it's not that simple, as the batch size is neither fixed nor exposed, so predicting the page sizes or the number of pages is not trivial.

@raunaqmorarka raunaqmorarka changed the title ParquetReader Put in the maxBytesPerCell HashMap instead of replace t… Fix parquet reader batch size calculation Sep 13, 2022
@raunaqmorarka raunaqmorarka merged commit 33c69b8 into trinodb:master Sep 13, 2022
@github-actions github-actions bot added this to the 396 milestone Sep 13, 2022
@kabunchi kabunchi deleted the fix-parquet-reader-max-bytes branch September 13, 2022 18:33
