-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Collect Delta extended statistics during insert #16026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
findepi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maxFileModificationTime probably doesn't need to be optional, since for empty insert there should be no stats collected.
@findinpath does empty insert still create a transaction log entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked and insert creates transaction log entry for that case.
I will add check for that case, but I would leave Optional here to meet updateTableStatistics arguments, as in general maxFileModificationTime can be empty (for example during ANALYZE when it can't be retrieved from computed statistics).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findinpath does empty insert still create a transaction log entry?
I remember doing something like this for Iceberg:
trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Lines 812 to 815 in 607decc
| if (commitTasks.isEmpty()) { | |
| transaction = null; | |
| return Optional.empty(); | |
| } |
I don't think we did this for Delta Lake.
We should probably add an issue to handle this aspect as well on Delta.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separate commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have coverage for ANALYZE filling in the NDV information if it was not present before?
should we run the above INSERT with stats collection disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think NDV information update by ANALYZE is covered in testStatisticsOnInsertWhenCollectionOnWriteDisabled.
Filling previously empty NVD with ANALYZE is covered in testCreateTableStatisticsWhenCollectionOnWriteDisabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we put all arguments on one line, or each on separate line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we put all arguments on one line, or each on separate line
|
Excluding a column when the existing stats don't have an entry for it seems like the right thing to do, but we might want to try to do better. We can't tell the difference right now between two things:
If we know the column is new and that's why we don't have stats for it, we could include it in the stats during the next insert. |
In ANALYZE WITH user specify columns to be analyzed. So it’s more include than exclude. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you pls create a ticket for tracking this issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separate commit pls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separate commit -unrelated to the current commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findinpath does empty insert still create a transaction log entry?
I remember doing something like this for Iceberg:
trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Lines 812 to 815 in 607decc
| if (commitTasks.isEmpty()) { | |
| transaction = null; | |
| return Optional.empty(); | |
| } |
I don't think we did this for Delta Lake.
We should probably add an issue to handle this aspect as well on Delta.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find that disabling the collection stats is a pragmatical choice here, given that the table had no stats collected (it is being registered) before doing operations on it . No change requested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we update the assertions to include stats, that seems like a better way to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to verify on the file level whether the extended_stats.json file remains untouched?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pre-existing:
Running
assertUpdate(format("ANALYZE %s WITH(columns = ARRAY['nationkey', 'regionkey'])", tableName));
succeeds (as expected).
Now running any of the commands:
assertUpdate(format("ANALYZE %s WITH(columns = ARRAY['nationkey'])", tableName));
or
assertUpdate(format("ANALYZE %s", tableName));
fails with the message:
List of columns to be analyzed must be a subset of previously used. To extend list of analyzed columns drop table statistics
Now the user needs to know what columns were actually analyzed before to match them exactly in the ANALYZE statement.
Maybe we should add in the exception message (separate PR) more information about the columns currently having stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment change should move to Remove split count verification for ANALYZE commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should move to Remove split count verification for ANALYZE commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is check is added to test functionality, so I think it should stay in this commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
verifySplitCount is unused. can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getOperatorStats can also be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can name the table name prefix as per the test name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: follow the same in other test methods
alexjo2144
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. We mentioned offline that a test which adds a column and then does an insert should include stats for the new column. Would be nice to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit TODO: https://github.com/trinodb/trino/issues/16088 -> TODO (https://github.com/trinodb/trino/issues/16088)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess for finishInsert is not relevant, but eventually for finishMerge (follow-up PR) do take into account that you'll need to filter out only data files dataFile.getDataFileType() == DATA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can be inlined in the assignment for analyzedColumns
It turned out that added column is not added to statistics even for legacy ANALYZE. I think it should be handled by different PR (proposed solution #16109). |
Sounds good to me. We do need to decide if we merge this before support for column mapping and adding/dropping columns. Thinking, if a column is dropped and added back with the same name our stats should be reset instead of reviving the old data. |
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java
Outdated
Show resolved
Hide resolved
|
Rebased to resolve conflict with #16108. |
|
CI hit: #11131 |
|
Merged, thanks! |
Description
Collect delta lake statistics for INSERT.
Additional context and related issues
Release notes
(x) Release notes are required, with the following suggested text: