Collect Delta extended statistics during insert #16026

pajaks · 2023-02-08T13:11:37Z

Description

Collect Delta Lake extended statistics during INSERT.

Additional context and related issues

Release notes

(x) Release notes are required, with the following suggested text:

# Delta Lake
* Improve query performance on tables written by Trino with `INSERT`. ({issue}`16026`)

findepi

lgtm!

findepi · 2023-02-09T15:48:17Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

maxFileModificationTime probably doesn't need to be optional, since for empty insert there should be no stats collected.
@findinpath does empty insert still create a transaction log entry?

I checked, and an empty insert does create a transaction log entry.
I will add a check for that case, but I would keep the Optional here to match the updateTableStatistics arguments, since in general maxFileModificationTime can be empty (for example, during ANALYZE, when it can't be retrieved from the computed statistics).

@findinpath does empty insert still create a transaction log entry?

I remember doing something like this for Iceberg:

trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

Lines 812 to 815 in 607decc

    if (commitTasks.isEmpty()) {
        transaction = null;
        return Optional.empty();
    }

I don't think we did this for Delta Lake.
We should probably add an issue to handle this aspect as well on Delta.
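
The Iceberg guard quoted above skips the commit when there is nothing to write. A minimal Java sketch of the same idea for an insert path could look like the following. The class, the `commitInsert` helper, its parameters, and the version arithmetic are hypothetical and only illustrate the control flow; this is not the actual DeltaLakeMetadata code.

```java
import java.util.List;
import java.util.OptionalLong;

class EmptyInsertGuardSketch
{
    // Hypothetical helper: returns the new table version, or empty when nothing was written.
    static OptionalLong commitInsert(List<String> newDataFiles, long currentVersion)
    {
        if (newDataFiles.isEmpty()) {
            // Nothing was inserted: skip the transaction log entry and the statistics update
            return OptionalLong.empty();
        }
        // A real implementation would write the log entry here; the sketch only bumps the version
        return OptionalLong.of(currentVersion + 1);
    }
}
```

Returning an empty result lets the caller tell "no commit happened" apart from a successful commit without resorting to a sentinel version number.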

#16125 (comment)

findepi · 2023-02-09T15:49:03Z

plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeAnalyze.java

separate commit

findepi · 2023-02-09T15:50:18Z

plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeAnalyze.java

do we have coverage for ANALYZE filling in the NDV information if it was not present before?
should we run the above INSERT with stats collection disabled?

I think updating NDV information with ANALYZE is covered in testStatisticsOnInsertWhenCollectionOnWriteDisabled.
Filling a previously empty NDV with ANALYZE is covered in testCreateTableStatisticsWhenCollectionOnWriteDisabled.

findepi · 2023-02-09T15:51:09Z

plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeAnalyze.java

nit: we put all arguments on one line, or each on separate line

findepi · 2023-02-09T15:51:14Z

plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeAnalyze.java

nit: we put all arguments on one line, or each on separate line

alexjo2144 · 2023-02-10T18:12:13Z

Excluding a column when the existing stats don't have an entry for it seems like the right thing to do, but we might want to try to do better.

We can't tell the difference right now between two things:

A table was previously analyzed and a column was excluded
A table was previously analyzed and then a new column was added

If we know the column is new and that's why we don't have stats for it, we could include it in the stats during the next insert.

pajaks · 2023-02-13T10:54:55Z

Excluding a column when the existing stats don't have an entry for it seems like the right thing to do, but we might want to try to do better.

We can't tell the difference right now between two things:

A table was previously analyzed and a column was excluded

A table was previously analyzed and then a new column was added

If we know the column is new and that's why we don't have stats for it, we could include it in the stats during the next insert.

With ANALYZE ... WITH, the user specifies the columns to be analyzed, so it is an include list rather than an exclude list.
If we expanded this list with a newly added column, the change would be hidden from the user and probably misleading.
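
The behavior discussed in this exchange (collect statistics on insert only for columns that already have an entry in the existing statistics) can be sketched as below. This is an illustration under assumptions: the class and method names are hypothetical, and treating "no recorded column list" as "all columns" is an assumption of the sketch, not something the thread confirms.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

class InsertStatisticsColumnsSketch
{
    // Hypothetical helper: picks the columns whose statistics are updated during an insert.
    static Set<String> columnsToCollect(List<String> tableColumns, Optional<Set<String>> previouslyAnalyzed)
    {
        // Start from the table's columns, keeping their declaration order
        Set<String> result = new LinkedHashSet<>(tableColumns);
        // If an explicit column list was recorded, keep only those columns;
        // a column added later is therefore excluded until statistics are reset
        previouslyAnalyzed.ifPresent(result::retainAll);
        return result;
    }
}
```

With this shape, a column added after the last ANALYZE is indistinguishable from one the user left out, which is exactly the ambiguity raised above.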

findinpath · 2023-02-13T10:59:56Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

Can you pls create a ticket for tracking this issue?

findinpath · 2023-02-13T11:04:26Z

...rino-delta-lake/src/test/java/io/trino/plugin/deltalake/BaseDeltaLakeConnectorSmokeTest.java

separate commit pls

findinpath · 2023-02-13T11:09:13Z

...rino-delta-lake/src/test/java/io/trino/plugin/deltalake/BaseDeltaLakeConnectorSmokeTest.java

separate commit -unrelated to the current commit

findinpath · 2023-02-13T11:37:46Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

@findinpath does empty insert still create a transaction log entry?

I remember doing something like this for Iceberg:

trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

Lines 812 to 815 in 607decc

    if (commitTasks.isEmpty()) {
        transaction = null;
        return Optional.empty();
    }

I don't think we did this for Delta Lake.
We should probably add an issue to handle this aspect as well on Delta.

findinpath · 2023-02-13T11:45:44Z

...rino-delta-lake/src/test/java/io/trino/plugin/deltalake/BaseDeltaLakeConnectorSmokeTest.java

I find that disabling stats collection is a pragmatic choice here, given that the table had no stats collected (it is being registered) before any operations were run on it. No change requested.

Or we update the assertions to include stats, that seems like a better way to me.

findinpath · 2023-02-13T11:49:44Z

plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeAnalyze.java

Would it make sense to verify on the file level whether the extended_stats.json file remains untouched?

findinpath · 2023-02-13T12:01:11Z

plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeAnalyze.java

pre-existing:

Running

assertUpdate(format("ANALYZE %s WITH(columns = ARRAY['nationkey', 'regionkey'])", tableName));

succeeds (as expected).

Now running any of the commands:

assertUpdate(format("ANALYZE %s WITH(columns = ARRAY['nationkey'])", tableName));

or

assertUpdate(format("ANALYZE %s", tableName));

fails with the message:

List of columns to be analyzed must be a subset of previously used. To extend list of analyzed columns drop table statistics

The user now needs to know which columns were analyzed before in order to match them exactly in the ANALYZE statement.
Maybe the exception message should include more information about the columns that currently have stats (separate PR).
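
One possible shape for the more informative message suggested above is sketched here. The class and method names are hypothetical, and the exact wording is only an illustration built on the existing error text; it is not what Trino emits.

```java
import java.util.Set;
import java.util.TreeSet;

class AnalyzeColumnsMessageSketch
{
    // Hypothetical helper: builds an error message that names the columns currently having stats.
    static String mismatchMessage(Set<String> currentlyAnalyzed)
    {
        // Sort the names so the message is stable regardless of the set implementation
        return "List of columns to be analyzed must be a subset of previously used: "
                + new TreeSet<>(currentlyAnalyzed)
                + ". To extend list of analyzed columns drop table statistics";
    }
}
```

Listing the columns in the message would let the user copy them straight into the next ANALYZE statement.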

krvikash · 2023-02-13T11:20:11Z

plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeAnalyze.java

comment change should move to Remove split count verification for ANALYZE commit

krvikash · 2023-02-13T11:22:08Z

plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeAnalyze.java

should move to Remove split count verification for ANALYZE commit?

This check is added to test the functionality, so I think it should stay in this commit.

krvikash · 2023-02-13T11:45:13Z

plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeAnalyze.java

verifySplitCount is unused and can be removed.

getOperatorStats can also be removed.

krvikash · 2023-02-13T11:54:25Z

plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeAnalyze.java

nit: we can name the table name prefix as per the test name?

nit: follow the same in other test methods

alexjo2144

Overall LGTM. We mentioned offline that a test which adds a column and then does an insert should include stats for the new column. Would be nice to have.

findinpath · 2023-02-14T11:52:06Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

nit TODO: https://github.com/trinodb/trino/issues/16088 -> TODO (https://github.com/trinodb/trino/issues/16088)

findinpath · 2023-02-14T11:56:01Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

I guess this is not relevant for finishInsert, but for finishMerge (follow-up PR) keep in mind that you'll need to keep only the data files, that is, those with dataFile.getDataFileType() == DATA.
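
The filter mentioned in this comment can be sketched as follows. The `DataFile` class and the enum below are simplified stand-ins written for this example (the second enum constant is an assumption), not the connector's own types.

```java
import java.util.List;
import java.util.stream.Collectors;

class DataFileFilterSketch
{
    // Simplified stand-in for the connector's file type; CHANGE_DATA_FEED is assumed for illustration
    enum DataFileType { DATA, CHANGE_DATA_FEED }

    // Simplified stand-in for the per-file information reported by writers
    static final class DataFile
    {
        private final String path;
        private final DataFileType type;

        DataFile(String path, DataFileType type)
        {
            this.path = path;
            this.type = type;
        }

        String getPath()
        {
            return path;
        }

        DataFileType getDataFileType()
        {
            return type;
        }
    }

    // Keep only regular data files so that other file types do not feed into the statistics
    static List<String> pathsForStatistics(List<DataFile> files)
    {
        return files.stream()
                .filter(file -> file.getDataFileType() == DataFileType.DATA)
                .map(DataFile::getPath)
                .collect(Collectors.toList());
    }
}
```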

findinpath · 2023-02-14T12:06:41Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

nit: can be inlined in the assignment for analyzedColumns

pajaks · 2023-02-14T13:11:19Z

Overall LGTM. We mentioned offline that a test which adds a column and then does an insert should include stats for the new column. Would be nice to have.

It turned out that a newly added column is not added to the statistics even by the legacy ANALYZE. I think this should be handled in a different PR (proposed solution: #16109).

alexjo2144 · 2023-02-14T15:45:39Z

It turned out that added column is not added to statistics even for legacy ANALYZE. I think it should be handled by different PR

Sounds good to me.

We do need to decide if we merge this before support for column mapping and adding/dropping columns. Thinking, if a column is dropped and added back with the same name our stats should be reset instead of reviving the old data.

@findepi @ebyhr

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

pajaks · 2023-02-16T10:56:05Z

Rebased to resolve conflict with #16108.
Also small refactoring to reduce storage access calls.

pajaks · 2023-02-16T13:23:26Z

CI hit: #11131

ebyhr · 2023-02-17T05:10:48Z

Merged, thanks!

cla-bot bot added the cla-signed label Feb 8, 2023

pajaks requested review from alexjo2144, ebyhr, findepi and findinpath February 9, 2023 13:47

findepi reviewed Feb 9, 2023

pajaks marked this pull request as ready for review February 10, 2023 13:02

findinpath reviewed Feb 13, 2023

krvikash reviewed Feb 13, 2023

alexjo2144 approved these changes Feb 13, 2023

findinpath reviewed Feb 14, 2023

pajaks mentioned this pull request Feb 15, 2023

Prevent delta lake transaction log entry creation on empty insert #16125

Closed

ebyhr approved these changes Feb 16, 2023

pajaks added 3 commits February 16, 2023 11:32

Refactor statistics session property handling

8ed240d

Collect Delta extended statistics during insert

9457268

Remove split count verification for ANALYZE

ab82da8

empty

d3a8070

ebyhr merged commit 7bf9aea into trinodb:master Feb 17, 2023

ebyhr mentioned this pull request Feb 17, 2023

Release notes for 408 #16147

Closed

github-actions bot added this to the 408 milestone Feb 17, 2023

colebow mentioned this pull request Feb 21, 2023

Add Trino 408 release notes #16209

Merged

findepi mentioned this pull request Mar 29, 2023

Collect Delta extended statistics during writes #14575

Closed
