
Fix parquet reader batch size calculation#14094

Merged
raunaqmorarka merged 1 commit into trinodb:master from kabunchi:fix-parquet-reader-max-bytes
Sep 13, 2022

Conversation

@kabunchi

@kabunchi kabunchi commented Sep 11, 2022

Description

Fixed regression in calculation of maxBytesPerCell that
caused maxBatchSize to be small and degrade performance

Non-technical explanation

Fixes possible perf regression in parquet reader introduced by a change in #13757

Release notes

( ) This is not user-visible and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Iceberg, Delta
* Fix regression in performance of reading parquet files. ({issue}`14094`)

@sopel39
Member

sopel39 commented Sep 11, 2022

Could you elaborate on the regression that this fixes?

@kabunchi
Author

The issue here is that since we never kept the maxBytesPerCell value, we always assumed we had hit the max and called this line:
maxCombinedBytesPerRow = maxCombinedBytesPerRow - maxBytesPerCell.getOrDefault(fieldId, 0L) + bytesPerCell;
over and over, while maxBytesPerCell.getOrDefault(fieldId, 0L) always returned zero.
As a result, we miscalculated maxBatchSize to be lower than needed, which causes a significantly larger number of getNextPage calls from this pageSource.
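The mechanism described above can be illustrated with a minimal standalone sketch (this is not the actual Trino code; the `accumulate` helper, its guard condition, and the single-field setup are simplifications for illustration). The original PR title suggests the bug was calling `Map.replace` where `Map.put` was needed: `replace` is a no-op when the key is absent, so the per-cell maximum was never recorded, and the subtraction term stayed zero on every batch.

```java
import java.util.HashMap;
import java.util.Map;

public class MaxBytesPerCellDemo
{
    // Simulates the per-batch accumulation for a single field (fieldId 0).
    // With the buggy `replace`, the key is never inserted, so
    // getOrDefault(fieldId, 0L) always returns 0 and maxCombinedBytesPerRow
    // grows on every batch instead of tracking the current per-cell maximum.
    static long accumulate(Map<Integer, Long> maxBytesPerCell, boolean useBuggyReplace, long[] observedBytesPerCell)
    {
        int fieldId = 0;
        long maxCombinedBytesPerRow = 0;
        for (long bytesPerCell : observedBytesPerCell) {
            if (bytesPerCell > maxBytesPerCell.getOrDefault(fieldId, 0L)) {
                maxCombinedBytesPerRow = maxCombinedBytesPerRow - maxBytesPerCell.getOrDefault(fieldId, 0L) + bytesPerCell;
                if (useBuggyReplace) {
                    maxBytesPerCell.replace(fieldId, bytesPerCell); // no-op: key was never inserted
                }
                else {
                    maxBytesPerCell.put(fieldId, bytesPerCell); // fix: insert or update the per-cell max
                }
            }
        }
        return maxCombinedBytesPerRow;
    }
}
```

With three batches each observing 100 bytes per cell, the buggy path adds 100 three times (300), while the fixed path correctly stabilizes at 100; an inflated maxCombinedBytesPerRow then shrinks maxBatchSize, producing many more, smaller pages.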

@skrzypo987
Member

Is it possible to write a unit test that shows the regression?

@skrzypo987 skrzypo987 self-requested a review September 11, 2022 17:38
@sopel39 sopel39 added the bug label Sep 12, 2022
@raunaqmorarka
Member

Is it possible to write a unit test that shows the regression?

@kabunchi could you try testing this by checking size of pages produced by ParquetReader#nextPage ?

Fixed regression in calculation of maxBytesPerCell that
caused maxBatchSize to be small and degrade performance
@kabunchi
Author

Took a look at testing; it's not that simple, as the batch size is neither fixed nor exposed, so predicting the page sizes or the number of pages is not trivial.

@raunaqmorarka raunaqmorarka changed the title ParquetReader Put in the maxBytesPerCell HashMap instead of replace t… Fix parquet reader batch size calculation Sep 13, 2022
@raunaqmorarka raunaqmorarka merged commit 33c69b8 into trinodb:master Sep 13, 2022
@github-actions github-actions bot added this to the 396 milestone Sep 13, 2022
@kabunchi kabunchi deleted the fix-parquet-reader-max-bytes branch September 13, 2022 18:33
