Populate stats when missing in transaction log #16743
ebyhr merged 5 commits into trinodb:master from pajaks:pajaks/stats_empty_transaction_log
general question: Is this comment still valid?
First push: comments addressed, plus partition handling and handling of various types.
findepi left a comment:
"Add handling for grouped statistics in delta lake"
After filtering, this is FILE_MODIFIED_TIME_COLUMN_NAME; verify that singleStatistics.getGroupingColumns() is empty.

Grouping is defined for the whole table, so every column will have grouping (including FILE_MODIFIED_TIME_COLUMN_NAME). When grouping by $path in the following commit, we receive $file_modified_time for each file and calculate the max value.
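
To illustrate the reply above: a minimal Java sketch, not the PR's code, of reducing per-file `$file_modified_time` values (as they would arrive when statistics are grouped by `$path`) to a single max. `FileModifiedTimeSketch` and `maxModifiedTime` are hypothetical names, and the plain map only stands in for the grouped statistics.

```java
import java.util.Map;
import java.util.OptionalLong;

final class FileModifiedTimeSketch
{
    private FileModifiedTimeSketch() {}

    // Hypothetical helper: keys are file paths ($path), values are the
    // $file_modified_time reported for each file (epoch millis assumed).
    static OptionalLong maxModifiedTime(Map<String, Long> modifiedTimeByPath)
    {
        return modifiedTimeByPath.values().stream()
                .mapToLong(Long::longValue)
                .max();
    }

    public static void main(String[] args)
    {
        Map<String, Long> perFile = Map.of(
                "part-0000.parquet", 1_000L,
                "part-0001.parquet", 3_000L,
                "part-0002.parquet", 2_000L);
        System.out.println(maxModifiedTime(perFile));
    }
}
```

An empty input yields an empty `OptionalLong`, so the caller has to decide what "no files" means.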

Why?
Someone could have extended statistics (created by Trino 413) and want to ANALYZE the table to also collect file-level stats.
Let's discuss and improve the explanation in the code.

The idea in this PR was to collect file-level statistics only for the initial ANALYZE. Checking whether extended statistics are empty is currently how we determine that this is the initial statistics collection.

For the mentioned case, maybe running drop_extended_stats before ANALYZE (or force_recalculate_statistics with #16634) would be the easiest solution?
findepi left a comment:
didn't review the main commit yet

Why the condition?
With incremental analyze we could do this as well. It's just that we would fill min/max for a subset of files only.
(we need to revisit
that code assumes ANALYZE covers data only but now it became aware of file boundaries)

I would like to handle incremental ANALYZE in a separate PR, if that's ok.

We collect those stats for all columns in the table and then write them back to the transaction log.
This will inflate metadata for wide tables and affect query planning times and coordinator memory. I think we should follow Databricks's approach, where only some initial columns are analyzed.
cc @alexjo2144

How can we know which columns are initial? Is it related to the property delta.dataSkippingNumIndexedCols?
https://docs.delta.io/latest/optimizations-oss.html#data-skipping
The idea of this PR was to generate stats regardless of this property #15135
I also cannot find any check for this property in the code, so Trino collects statistics regardless of it during writes.

This is a preexisting issue, as Trino currently analyzes all columns during writes. Issue for improvement: #17057
@findinpath We use |
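
For reference, a rough Java sketch of what "analyze only some initial columns" could look like if it were driven by delta.dataSkippingNumIndexedCols. This is not Trino code (the thread notes Trino has no such check); `IndexedColumnsSketch` and `columnsToAnalyze` are hypothetical names. It assumes the property counts leading columns in schema order and that a negative value means all columns, as the Delta documentation describes.

```java
import java.util.List;

final class IndexedColumnsSketch
{
    private IndexedColumnsSketch() {}

    // Hypothetical helper: returns the leading columns that would get
    // file-level statistics. numIndexedCols < 0 is treated as "all columns".
    static List<String> columnsToAnalyze(List<String> schemaColumns, int numIndexedCols)
    {
        if (numIndexedCols < 0) {
            return List.copyOf(schemaColumns);
        }
        int limit = Math.min(numIndexedCols, schemaColumns.size());
        return List.copyOf(schemaColumns.subList(0, limit));
    }

    public static void main(String[] args)
    {
        List<String> columns = List.of("id", "name", "created_at", "payload");
        System.out.println(columnsToAnalyze(columns, 2));
        System.out.println(columnsToAnalyze(columns, -1));
    }
}
```

Capping the column count bounds the size of each file's stats entry, which is the metadata-inflation concern raised above for wide tables.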
ebyhr left a comment:
Still reviewing the last commit.
rebase to resolve conflicts
First push is a rebase; the second addresses comments.
Could you rebase on master to resolve conflicts?

First push to resolve conflicts, second with addressed comments.
@alexjo2144 can you ptal?

First push -> rebase
/test-with-secrets sha=1422b6a4102e9cc424432fcc37e4ab4698af8aa9
The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/5820602635
Description
Relates to #15967
When the transaction log does not have statistics for some files, we want to add this information.
After this change, statistics are collected per file during ANALYZE and a new transaction log entry is created with the results.
For now collection includes:
Right now it works only for the initial ANALYZE.
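
As a rough illustration of what per-file statistics look like, here is a small Java sketch (not the PR's implementation) that computes stats for one column of one file and renders them in the shape the Delta protocol uses for an `add` entry's `stats` field (`numRecords`, `minValues`, `maxValues`, `nullCount`). `FileStatsSketch` and `statsJson` are hypothetical names; the sketch handles a single BIGINT-like column and does no JSON escaping. Which of these fields this PR actually populates is not stated here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

final class FileStatsSketch
{
    private FileStatsSketch() {}

    // Hypothetical helper: values are one column's values from a single data file.
    // minValues/maxValues are omitted when the column holds only nulls.
    static String statsJson(String column, List<Long> values)
    {
        long nullCount = values.stream().filter(Objects::isNull).count();
        Optional<Long> min = values.stream().filter(Objects::nonNull).min(Long::compare);
        Optional<Long> max = values.stream().filter(Objects::nonNull).max(Long::compare);

        StringBuilder json = new StringBuilder("{\"numRecords\":").append(values.size());
        min.ifPresent(v -> json.append(",\"minValues\":{\"").append(column).append("\":").append(v).append("}"));
        max.ifPresent(v -> json.append(",\"maxValues\":{\"").append(column).append("\":").append(v).append("}"));
        json.append(",\"nullCount\":{\"").append(column).append("\":").append(nullCount).append("}}");
        return json.toString();
    }

    public static void main(String[] args)
    {
        // Arrays.asList is used because List.of rejects null elements.
        System.out.println(statsJson("id", Arrays.asList(5L, null, 2L, 9L)));
    }
}
```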
Additional context and related issues
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text: