Avoid reading Iceberg column stats when they are not needed #14504

Merged
findepi merged 1 commit into trinodb:master from alexjo2144:iceberg/scan-include-column-statistics on Oct 13, 2022

Conversation

@alexjo2144
Member

Description

Fixes: #14004

The column statistics in Iceberg can be large when there are many files or many columns with statistics collected. Avoid reading these stats if they are not used.
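
As a rough illustration of the idea, the sketch below uses hypothetical stand-in types (`FileEntry`, `planFiles`, `needsColumnStats`); none of them are Trino or Iceberg classes. In Iceberg itself, a scan opts into per-file column statistics with `TableScan.includeColumnStats()`. The sketch models only the effect of making that opt-in conditional, and its decision rule is an assumption rather than the logic of this PR.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Illustrative model only: these types are hypothetical, not Trino or Iceberg APIs.
public class ColumnStatsSketch {
    // A data file entry; columnBounds maps column name to {min, max} and can be large.
    public record FileEntry(String path, long recordCount, Optional<Map<String, long[]>> columnBounds) {}

    // Assumed rule: stats matter only if something downstream will read them.
    public static boolean needsColumnStats(boolean predicateUsesBounds, boolean statsColumnsSelected) {
        return predicateUsesBounds || statsColumnsSelected;
    }

    // Return file entries, dropping per-column bounds unless they were requested.
    public static List<FileEntry> planFiles(List<FileEntry> manifest, boolean includeColumnStats) {
        return manifest.stream()
                .map(e -> includeColumnStats ? e : new FileEntry(e.path(), e.recordCount(), Optional.empty()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<FileEntry> manifest = List.of(
                new FileEntry("a.parquet", 100, Optional.of(Map.of("id", new long[] {1, 50}))),
                new FileEntry("b.parquet", 200, Optional.of(Map.of("id", new long[] {51, 90}))));
        // Neither the predicate nor the selected columns need bounds, so skip them.
        boolean include = needsColumnStats(false, false);
        for (FileEntry entry : planFiles(manifest, include)) {
            System.out.println(entry.path() + " hasBounds=" + entry.columnBounds().isPresent());
        }
    }
}
```

With many files and many columns, the bounds maps dominate the size of each entry, so dropping them when nothing reads them is where the memory and I/O saving would come from.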

Non-technical explanation

Reduce memory and I/O of Iceberg metadata reads.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Reduce overhead of some Iceberg metadata reads by selectively reading data file statistics. ({issue}`14004`)

@cla-bot cla-bot bot added the cla-signed label Oct 6, 2022
Member Author

@raunaqmorarka can you sanity check for me that isComplete() means that the filter we get from getCurrentPredicate() will not change later in the query?

Member

Yes, that is right.

@findepi
Member

findepi commented Oct 7, 2022

@alexjo2144 please check the CI

@alexjo2144
Member Author

I forgot that the $partitions table has min/max values in it. Let me see if I can avoid reading the stats when that column isn't read.

@findepi
Member

findepi commented Oct 7, 2022

@alexjo2144 I wouldn't be concerned about $partitions performance. Let's focus on the mainstream use cases.

@alexjo2144 alexjo2144 force-pushed the iceberg/scan-include-column-statistics branch from aac0464 to 9273735 Compare October 7, 2022 14:17
@alexjo2144
Member Author

Sounds good. I was hoping that information would be readily available, but it would take some refactoring of SystemTable. I've reverted that change.

@findepi
Member

findepi commented Oct 7, 2022

@sopel39 @raunaqmorarka do you want to take a look?

@sopel39
Member

sopel39 commented Oct 13, 2022

Feel free to continue

@findepi findepi merged commit 21a9290 into trinodb:master Oct 13, 2022
@findepi findepi mentioned this pull request Oct 13, 2022
Development

Successfully merging this pull request may close these issues.

It seems system will behave the same for iceberg connector when planTasks without include column stats and it will deplete memory consumed

4 participants