Group small columns together in parquet files #17404

Merged
raunaqmorarka merged 1 commit into trinodb:master from raunaqmorarka:pqw-reorder
May 9, 2023

@raunaqmorarka raunaqmorarka commented May 9, 2023

Description

Modified the Parquet writer to store columns in order of their size within each row group,
so that the reader can fetch small columns in fewer filesystem requests.
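
As a rough illustration of the ordering described above, the sketch below sorts buffered column chunks by size before they would be written into a row group. It is not the code from this PR: `ColumnOrderSketch`, `BufferedColumn`, and `orderBySize` are hypothetical names, and smallest-first order is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; names and ordering direction are assumptions, not Trino's writer code.
final class ColumnOrderSketch
{
    // One buffered column chunk of a row group.
    static final class BufferedColumn
    {
        final String name;
        final long sizeInBytes;

        BufferedColumn(String name, long sizeInBytes)
        {
            this.name = name;
            this.sizeInBytes = sizeInBytes;
        }
    }

    private ColumnOrderSketch() {}

    // Returns a copy ordered from smallest to largest chunk.
    // List.sort is stable, so equally sized chunks keep their schema order.
    static List<BufferedColumn> orderBySize(List<BufferedColumn> columns)
    {
        List<BufferedColumn> ordered = new ArrayList<>(columns);
        ordered.sort(Comparator.comparingLong((BufferedColumn column) -> column.sizeInBytes));
        return ordered;
    }
}
```

For chunks of 64,000,000, 1,000, and 2,000 bytes given in schema order, `orderBySize` returns the two small chunks first, so they would end up adjacent in the file.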

Additional context and related issues

Based on similar logic in the ORC writer at

Collections.sort(dataStreams);

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Improve the layout of data in Parquet files produced by the optimized Parquet writer for faster reads. ({issue}`17404`)

# Hudi, Iceberg, Delta
* Improve the layout of data in Parquet files for faster reads. ({issue}`17404`)

@cla-bot cla-bot bot added the cla-signed label May 9, 2023

@sopel39 sopel39 left a comment

Do you think it might drop performance in some scenarios?

E.g. it seems optimally it should group columns that are mostly used together

@raunaqmorarka

> Do you think it might drop performance in some scenarios?
>
> E.g. it seems optimally it should group columns that are mostly used together

This change matters only for columns smaller than `parquet.max-buffer-size` (default 8MB). Since we're only affecting smaller columns, the penalty for a bad decision shouldn't be high, and the current heuristic is the best we can do without knowing usage patterns. I'm also relying on this having been a successful optimization (or at least one that drew no complaints) in ORC.
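
To make the trade-off concrete, here is a simplified model of a reader that merges adjacent small column chunks into one read, as long as the merged read stays within the buffer limit. It is a sketch under assumptions, not Trino's reader: `ReadCoalescingSketch` and `countRequests` are hypothetical names, every chunk in the row group is assumed to be requested, and a chunk at or above the limit is counted as a single request for simplicity.

```java
// Hypothetical model of read coalescing; not Trino's actual reader logic.
final class ReadCoalescingSketch
{
    private ReadCoalescingSketch() {}

    // chunkSizes lists the column chunk sizes (in bytes) in the order they appear in the file.
    // Adjacent chunks below maxBufferSize are merged into one request while the
    // merged size stays within maxBufferSize; larger chunks get their own request.
    static int countRequests(long[] chunkSizes, long maxBufferSize)
    {
        int requests = 0;
        long pending = 0; // bytes accumulated in the current merged read of small chunks
        for (long size : chunkSizes) {
            if (size >= maxBufferSize) {
                // A large chunk ends any merged read in progress and is read on its own.
                if (pending > 0) {
                    requests++;
                    pending = 0;
                }
                requests++;
            }
            else if (pending + size > maxBufferSize) {
                // The merged read is full: issue it and start a new one with this chunk.
                requests++;
                pending = size;
            }
            else {
                pending += size;
            }
        }
        if (pending > 0) {
            requests++;
        }
        return requests;
    }
}
```

With an 8MB limit, chunks of 1, 20, 2, 30, and 1 MB laid out in that order need 5 requests in this model, while the same chunks sorted by size (1, 1, 2, 20, 30) need 3.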
