Prune unused stats columns when reading Delta checkpoint #19848
findepi merged 2 commits into trinodb:master
These changes were necessary to be able to showcase the effectiveness of the stats_parsed projection functionality.
jkylling
left a comment
Thank you for adding this!
I mostly have nitpicks for you :)
We could perhaps just pass the underlying set for addStatsColumnFilter around, instead of a function. Then it could be renamed projectedColumns. Maybe we could even pass the projectedColumns to the CheckpointEntryIterator. That would set us up for supporting statistics projection of non-base columns in the future.
Perhaps unrelated to this PR, but it might be good to have a test case which tests performance of checkpoint reads from tables with a very wide schema (e.g., internally we have a table with 276 columns).
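To make the suggestion about passing the underlying set concrete, here is a minimal sketch, not code from this PR. The class and method names are hypothetical; only addStatsColumnFilter and projectedColumns come from the comment above. It shows that a set can always be turned into the filter function, while also staying enumerable for callers that need the column list itself.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical illustration; none of these names exist in Trino.
final class ProjectedColumnsSketch
{
    private ProjectedColumnsSketch() {}

    // A set of projected column names can be adapted to a filter function on demand.
    static Predicate<String> asStatsColumnFilter(Set<String> projectedColumns)
    {
        return projectedColumns::contains;
    }

    // Unlike an opaque function, the set can also be listed, for example to build a pruned schema.
    static List<String> sortedProjectedColumns(Set<String> projectedColumns)
    {
        return List.copyOf(new TreeSet<>(projectedColumns));
    }
}
```

The reverse conversion is not possible: a bare filter function cannot be enumerated without knowing every candidate column up front.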
My initial version of the code contained See in It seems weird to pass |
Normalized (lowercased) or not?
We are using the original column names here, e.g., a_NuMbEr instead of a_number.
See corresponding test io.trino.plugin.deltalake.TestDeltaLakeBasic#testCheckpointFilteringForParsedStatsWithCaseSensitiveColumnNames and test resource: databricks133/parsed_stats_case_sensitive/_delta_log/00000000000000000002.checkpoint.parquet
optional group stats_parsed {
  optional int64 numRecords;
  optional group minValues {
    optional int32 a_NuMbEr;
    optional binary a_StRiNg (STRING);
  }
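As a rough illustration of why the original names matter, the sketch below maps lowercased names back to the case-preserved names found in the checkpoint schema. It is an assumption-laden example, not the PR's implementation: the class and method names are hypothetical, and it assumes no two columns differ only by case.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; not part of Trino.
final class OriginalColumnNamesSketch
{
    private OriginalColumnNamesSketch() {}

    // Maps each lowercased name to the original, case-preserved name.
    // Assumes the lowercased names are unique; a later duplicate would overwrite an earlier one.
    static Map<String, String> originalNameByLowercase(List<String> originalNames)
    {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : originalNames) {
            result.put(name.toLowerCase(Locale.ENGLISH), name);
        }
        return result;
    }
}
```

With the fields from the schema excerpt above, looking up a_number would yield a_NuMbEr, which is the name the Parquet field actually carries.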
Add support for stats projection in Delta checkpoint iterator
Description
Follow-up to #19588: retrieve from the checkpoint file only the add file statistics for the columns projected in the query.
This change can save the CPU and I/O time otherwise spent deserializing unnecessary add.stats_parsed.... columns in a Delta Lake query.
I tested this feature with a multi-part checkpoint file (25 parts, each around 12 MB, about 300 MB in total) stored in a local MinIO and came up with the following results:
As can be seen in the listing above, compared to the partition pruning change from #19588, the change here does not provide a dramatic improvement in efficiency, but it shaves a noticeable amount of wait time off processing the checkpoint.
Fixes #19733
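The pruning described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: the class, method, and dotted-path representation are hypothetical, and the per-column groups minValues, maxValues and nullCount follow the Delta statistics layout (only numRecords and minValues appear in the schema excerpt quoted in this conversation).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; the real reader works on Parquet/Trino types, not strings.
final class StatsParsedPathsSketch
{
    // Per-column statistics groups, assumed from the Delta statistics layout.
    private static final List<String> PER_COLUMN_GROUPS = List.of("minValues", "maxValues", "nullCount");

    private StatsParsedPathsSketch() {}

    // Returns the stats_parsed leaf paths to read: numRecords plus the
    // per-column statistics of projected columns only, in table column order.
    static List<String> prunedPaths(List<String> tableColumns, Set<String> projectedColumns)
    {
        List<String> paths = new ArrayList<>();
        paths.add("add.stats_parsed.numRecords");
        for (String group : PER_COLUMN_GROUPS) {
            for (String column : tableColumns) {
                if (projectedColumns.contains(column)) {
                    paths.add("add.stats_parsed." + group + "." + column);
                }
            }
        }
        return paths;
    }
}
```

In this sketch a table with N columns and a query projecting one of them reads 4 leaf paths instead of 1 + 3N, which is where the savings on wide tables would come from.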
Additional context and related issues
As in #19588, the add stats projection functionality is effective only when the session setting checkpoint_filtering_enabled is set to true.
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: