Collect Delta extended statistics when creating table #15878
rebase on master to use CI fix #15879
alexjo2144
left a comment
Looks pretty good overall. Couple questions/nitpicks
maybe just collectExtendedColumnStatisticsOnWrite ?
Save the result of extractColumnMetadata so that you don't have to call it again at the bottom of this method.
- Set<String> allColumnNames = extractColumnMetadata(metadata, typeManager).stream()
-         .map(ColumnMetadata::getName)
-         .collect(toImmutableSet());
+ List<ColumnMetadata> columnMetadata = extractColumnMetadata(metadata, typeManager);
+ Set<String> allColumnNames = columnMetadata.stream()
+         .map(ColumnMetadata::getName)
+         .collect(toImmutableSet());
Per other comment, don't have to call extractColumnMetadata again.
Why not includeMaxFileModifiedTime in this situation?
Statistic aggregation during table creation does not have information about file_modified_time yet.
Right, right. Then if the modified time isn't present we just use the current time when the collection is done. Makes sense.
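The fallback described here can be sketched roughly as follows. This is only an illustration of the idea, not the code from the PR: the class name, method name, and the `Supplier<Instant>` clock parameter are all hypothetical.

```java
import java.time.Instant;
import java.util.Optional;
import java.util.function.Supplier;

final class AnalyzedTimeFallback
{
    private AnalyzedTimeFallback() {}

    // Hypothetical helper (not from the PR): use the maximum file modified time
    // when the collected statistics carry one, otherwise fall back to the time
    // at which the statistics collection finished.
    static Instant resolveAnalyzedTime(Optional<Instant> maxFileModifiedTime, Supplier<Instant> collectionTime)
    {
        return maxFileModifiedTime.orElseGet(collectionTime);
    }

    public static void main(String[] args)
    {
        // During CREATE TABLE AS no file modified time is available yet,
        // so the collection time is used instead.
        System.out.println(resolveAnalyzedTime(Optional.empty(), Instant::now));
    }
}
```

Passing the clock as a supplier keeps the fallback easy to exercise in a test with a fixed instant.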
Can you please add a code comment explaining this consideration?
What do we need to have this information available?
We should still test the old thing too
agreed
Please do create a compatibility test with Spark to verify that, after a CTAS, DESC EXTENDED works as intended on Databricks
Never mind. Trino Delta Lake (on the storage layer) & Databricks (on the metastore properties) have outputs in different places.
Do consider documenting this new property in delta-lake.rst - either in this PR or a follow-up PR
I would wait with documentation until other write operations are implemented if that's ok.
alexjo2144
left a comment
👍
nit: There's no need to change this line. I would revert.
Reduce map iterations and lookups to minimum, while also simplifying the code flow.
nit: cosmetic changes like this can be done in a separate commit.
separate commit
This change makes sense only with this commit, as it allows the collection to have 0 elements. Before this commit, an empty collection should throw an exception.
Can you please add a code comment explaining this consideration?
What do we need to have this information available?
Time is added during statistics update.
Do you mean the maximum file modified time?
updateTableStatistics(
session,
this line is now over the line length limit, so either put all arguments on one line, or each on a separate line
that's minimal change, but that's not how you'd write the code if you were writing the code anew.
.flatMap(entry -> {
    ....
    if (....) {
        return Stream.of();
    }
    return Stream.of(Instant.ofEpochMilli(....));
})
.collect(toOptional());
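A self-contained sketch of the shape suggested above, for illustration only. The elided parts of the reviewer's snippet are not filled in; instead this uses a hypothetical input (a map from statistic name to an optional epoch-millis value), a hypothetical key name, and a small stand-in for `toOptional()` built from JDK collectors, since which `toOptional()` the PR imports is an assumption here.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class MaxModifiedTimeExtraction
{
    // Hypothetical statistic name, used only for this example
    static final String MAX_MODIFIED_TIME_KEY = "max_file_modified_time";

    private MaxModifiedTimeExtraction() {}

    // Stand-in collector: yields empty for no elements, the element for exactly one,
    // and fails when more than one element reaches it.
    static <T> Collector<T, ?, Optional<T>> toOptional()
    {
        return Collectors.collectingAndThen(Collectors.toList(), list -> {
            if (list.size() > 1) {
                throw new IllegalStateException("expected at most one element");
            }
            return list.stream().findFirst();
        });
    }

    static Optional<Instant> extractMaxModifiedTime(Map<String, Optional<Long>> statistics)
    {
        return statistics.entrySet().stream()
                .filter(entry -> entry.getKey().equals(MAX_MODIFIED_TIME_KEY))
                // Explicit type witness keeps the element type unambiguous for the empty branch
                .<Instant>flatMap(entry -> {
                    if (!entry.getValue().isPresent()) {
                        // No value collected: contribute nothing to the stream
                        return Stream.of();
                    }
                    return Stream.of(Instant.ofEpochMilli(entry.getValue().get()));
                })
                .collect(toOptional());
    }

    public static void main(String[] args)
    {
        System.out.println(extractMaxModifiedTime(Map.of(MAX_MODIFIED_TIME_KEY, Optional.of(1000L))));
    }
}
```

Returning an empty stream from the `flatMap` lambda lets the "value missing" case and the "entry missing" case both end up as `Optional.empty()` without a separate null check.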
It sounds like a problem and a workaround, but there isn't a problem
// File modified time does not need to be collected as a statistic because it gets derived directly from files being written
false);
test_analyze_ -> test_ctats_stats_
can you paste this method contents into testCreateTableAsStatistics above?
testCreateTableAsStatistics has good name and a javadoc, just the contents are worse
nit: unrelated fmt change
nit: unrelated fmt change
nit: each arg on separate line
i know it's preexisting but i don't think we need to assert split count in every test method here. It blurs the test's intent
(perhaps, we don't need it in any test, i don't know, but i am not requesting any change to existing tests)
this would be better:
assertUpdate("ANALYZE " + tableName);
@pajaks @findinpath @alexjo2144 thank you, this is awesome!
In particular this improves Delta query performance on data sets created in the connector using CTAS.
Description
Collect Delta Lake extended statistics for CREATE TABLE AS.
Additional context and related issues
Release notes
(x) Release notes are required, with the following suggested text: