Skip to content

Conversation

@pajaks
Copy link
Member

@pajaks pajaks commented Jan 27, 2023

Description

Collect delta lake statistics for CREATE TABLE AS.

Additional context and related issues

Release notes

(x) Release notes are required, with the following suggested text:

# Delta Lake
* Collect statistics for CREATE TABLE AS

@cla-bot cla-bot bot added the cla-signed label Jan 27, 2023
@pajaks pajaks marked this pull request as ready for review January 30, 2023 12:45
@pajaks
Copy link
Member Author

pajaks commented Jan 30, 2023

rebase on master to use CI fix #15879

Copy link
Member

@alexjo2144 alexjo2144 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good overall. Couple questions/nitpicks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just collectExtendedColumnStatisticsOnWrite ?

Comment on lines 2095 to 2104
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Save the result of extractColumnMetadata so that you don't have to call it again at the bottom of this method.

Suggested change
Set<String> allColumnNames = extractColumnMetadata(metadata, typeManager).stream()
.map(ColumnMetadata::getName)
.collect(toImmutableSet());
List<ColumnMetadata> columnMetadata = extractColumnMetadata(metadata, typeManager);
Set<String> allColumnNames = columnMetadata.stream()
.map(ColumnMetadata::getName)
.collect(toImmutableSet());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per other comment, don't have to call extractColumnMetadata again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not includeMaxFileModifiedTime in this situation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Statistic aggregation during table creation does not have information about file_modified_time yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, right. Then if the modified time isn't present we just use the current time when the collection is done. Makes sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a code comment explaining this consideration?
What do we need to have this information available?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should still test the old thing too

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do create a compatibility test with spark to verify that after a CTAS DESC EXTENDED works as intended on Databricks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind. Trino Delta Lake (on the storage layer) & Databricks (on the metastore properties) have outputs in different places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do consider documenting this new property in delta-lake.rst - either in this PR or a follow-up PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would wait with documentation until other write operations are implemented if that's ok.

Copy link
Member

@alexjo2144 alexjo2144 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: There's no need to change this line. I would revert.

Reduce map iterations and lookups to minimum, while also simplifying the
code flow.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this kind of cosmetic changes can be done in a separate commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate commit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change make sense only with this commit as it allows collection to have 0 elements. It should throw exception before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind. Trino Delta Lake (on the storage layer) & Databricks (on the metastore properties) have outputs in different places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a code comment explaining this consideration?
What do we need to have this information available?

@pajaks
Copy link
Member Author

pajaks commented Feb 2, 2023

CI #12535, #15809

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Time is added during statistics update.

Do you mean Maximum File modified time ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updateTableStatistics(
   session,

Comment on lines 2240 to 2247
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line is now over line length limit, so --

we put all arguments on one line, or each on separate line

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's minimal change, but that's not how you'd write the code if you were writing the code anew.

.flatMap(entry -> {
    ....
    if (....) {
        return Stream.of();
    }
    return Stream.of(Instant.ofEpochMilli(....));
})
.collect(toOptional());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like a problem and a workaround, but there isn't a problem

// File modified time does not need to be collected as a statistics because it gets derived directly from files being written
                false);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_analyze_ -> test_ctats_stats_

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you paste this method contents into testCreateTableAsStatistics above?

testCreateTableAsStatistics has good name and a javadoc, just the contents are worse

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unrelated fmt change

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unrelated fmt change

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: each arg on separate line

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know it's preexisting but i don't think we need to assert split count in every test method here. It blurs the test's intent
(perhaps, we don't need it in any test, i don't know, but i am not requesting any change to existing tests)

this would be better:

assertUpdate("ANALYZE " + tableName);

@findepi
Copy link
Member

findepi commented Feb 3, 2023

@pajaks @findinpath @alexjo2144 thank you, this is awesome!

findepi and others added 2 commits February 6, 2023 10:41
In particular this improves Delta query performance on data sets created
in the connector using CTAS.
@pajaks
Copy link
Member Author

pajaks commented Feb 7, 2023

CI:
suite-7-non-generic #14441
suite-iceberg #16013

@findepi findepi merged commit 6ed0ad5 into trinodb:master Feb 7, 2023
@findepi findepi mentioned this pull request Feb 7, 2023
@github-actions github-actions bot added this to the 407 milestone Feb 7, 2023
@pajaks pajaks deleted the findepi/delta-analyze-on-write branch February 8, 2023 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants