Skip to content

Conversation

@pajaks
Copy link
Member

@pajaks pajaks commented Feb 8, 2023

Description

Collect delta lake statistics for INSERT.

Additional context and related issues

Release notes

(x) Release notes are required, with the following suggested text:

# Delta Lake
* Improve query performance on tables written by Trino with `INSERT`. ({issue}`16026 `)

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxFileModificationTime probably doesn't need to be optional, since for empty insert there should be no stats collected.
@findinpath does empty insert still create a transaction log entry?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and insert creates transaction log entry for that case.
I will add check for that case, but I would leave Optional here to meet updateTableStatistics arguments, as in general maxFileModificationTime can be empty (for example during ANALYZE when it can't be retrieved from computed statistics).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findinpath does empty insert still create a transaction log entry?

I remember doing something like this for Iceberg:

if (commitTasks.isEmpty()) {
transaction = null;
return Optional.empty();
}

I don't think we did this for Delta Lake.
We should probably add an issue to handle this aspect as well on Delta.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate commit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have coverage for ANALYZE filling in the NDV information if it was not present before?
should we run the above INSERT with stats collection disabled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think NDV information update by ANALYZE is covered in testStatisticsOnInsertWhenCollectionOnWriteDisabled.
Filling previously empty NVD with ANALYZE is covered in testCreateTableStatisticsWhenCollectionOnWriteDisabled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we put all arguments on one line, or each on separate line

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we put all arguments on one line, or each on separate line

@pajaks pajaks marked this pull request as ready for review February 10, 2023 13:02
@alexjo2144
Copy link
Member

Excluding a column when the existing stats don't have an entry for it seems like the right thing to do, but we might want to try to do better.

We can't tell the difference right now between two things:

  • A table was previously analyzed and a column was excluded
  • A table was previously analyzed and then a new column was added

If we know the column is new and that's why we don't have stats for it, we could include it in the stats during the next insert.

@pajaks
Copy link
Member Author

pajaks commented Feb 13, 2023

Excluding a column when the existing stats don't have an entry for it seems like the right thing to do, but we might want to try to do better.

We can't tell the difference right now between two things:

  • A table was previously analyzed and a column was excluded
  • A table was previously analyzed and then a new column was added

If we know the column is new and that's why we don't have stats for it, we could include it in the stats during the next insert.

In ANALYZE WITH user specify columns to be analyzed. So it’s more include than exclude.
If we would expand this list with newly added column it would be hidden from user and probably misleading.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you pls create a ticket for tracking this issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate commit pls

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate commit -unrelated to the current commit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findinpath does empty insert still create a transaction log entry?

I remember doing something like this for Iceberg:

if (commitTasks.isEmpty()) {
transaction = null;
return Optional.empty();
}

I don't think we did this for Delta Lake.
We should probably add an issue to handle this aspect as well on Delta.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that disabling the collection stats is a pragmatical choice here, given that the table had no stats collected (it is being registered) before doing operations on it . No change requested.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we update the assertions to include stats, that seems like a better way to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to verify on the file level whether the extended_stats.json file remains untouched?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pre-existing:

Running

assertUpdate(format("ANALYZE %s WITH(columns = ARRAY['nationkey', 'regionkey'])", tableName));

succeeds (as expected).

Now running any of the commands:

assertUpdate(format("ANALYZE %s WITH(columns = ARRAY['nationkey'])", tableName));

or

assertUpdate(format("ANALYZE %s", tableName));

fails with the message:

List of columns to be analyzed must be a subset of previously used. To extend list of analyzed columns drop table statistics

Now the user needs to know what columns were actually analyzed before to match them exactly in the ANALYZE statement.
Maybe we should add in the exception message (separate PR) more information about the columns currently having stats.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment change should move to Remove split count verification for ANALYZE commit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should move to Remove split count verification for ANALYZE commit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is check is added to test functionality, so I think it should stay in this commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verifySplitCount is unused. can be removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getOperatorStats can also be removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can name the table name prefix as per the test name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: follow the same in other test methods

Copy link
Member

@alexjo2144 alexjo2144 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. We mentioned offline that a test which adds a column and then does an insert should include stats for the new column. Would be nice to have.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit TODO: https://github.com/trinodb/trino/issues/16088 -> TODO (https://github.com/trinodb/trino/issues/16088)

Copy link
Contributor

@findinpath findinpath Feb 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess for finishInsert is not relevant, but eventually for finishMerge (follow-up PR) do take into account that you'll need to filter out only data files dataFile.getDataFileType() == DATA

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can be inlined in the assignment for analyzedColumns

@pajaks
Copy link
Member Author

pajaks commented Feb 14, 2023

Overall LGTM. We mentioned offline that a test which adds a column and then does an insert should include stats for the new column. Would be nice to have.

It turned out that added column is not added to statistics even for legacy ANALYZE. I think it should be handled by different PR (proposed solution #16109).

@alexjo2144
Copy link
Member

It turned out that added column is not added to statistics even for legacy ANALYZE. I think it should be handled by different PR

Sounds good to me.

We do need to decide if we merge this before support for column mapping and adding/dropping columns. Thinking, if a column is dropped and added back with the same name our stats should be reset instead of reviving the old data.

@findepi @ebyhr

@pajaks
Copy link
Member Author

pajaks commented Feb 16, 2023

Rebased to resolve conflict with #16108.
Also small refactoring to reduce storage access calls.

@pajaks
Copy link
Member Author

pajaks commented Feb 16, 2023

CI hit: #11131

@ebyhr ebyhr merged commit 7bf9aea into trinodb:master Feb 17, 2023
@ebyhr
Copy link
Member

ebyhr commented Feb 17, 2023

Merged, thanks!

@ebyhr ebyhr mentioned this pull request Feb 17, 2023
@github-actions github-actions bot added this to the 408 milestone Feb 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

6 participants