Support converting column stats on row type to json in Delta Lake #14314

Merged
ebyhr merged 3 commits into master from
ebi/delta-json-stats-row-type
Oct 11, 2022

Member

@ebyhr ebyhr commented Sep 27, 2022

Description

Fixes #13996

Release notes

(x) This is not user-visible or docs only and no release notes are required.

@cla-bot cla-bot bot added the cla-signed label Sep 27, 2022
@ebyhr ebyhr force-pushed the ebi/delta-json-stats-row-type branch from a5a1264 to 497e650 October 4, 2022 08:26
@ebyhr ebyhr marked this pull request as ready for review October 4, 2022 08:29
@ebyhr ebyhr force-pushed the ebi/delta-json-stats-row-type branch from 497e650 to 8c61387 October 5, 2022 00:25
Member Author

ebyhr commented Oct 5, 2022

CI hit #14391 at

  • TestDeltaLakeWriteDatabricksCompatibility.testCaseUpdatePartitionColumnFails
  • TestDeltaLakeDatabricksPartitioningCompatibility.testTrinoCanReadFromTablePartitionChangedByDatabricks

Member

Rather than getChildren I think you want to convert the rowBlock to a ColumnarRow

Member Author

The argument is SingleRowBlock which is unsupported in ColumnarRow#toColumnarRow.

Member

That's surprising, toColumnarRow checks that the input is an instance of AbstractRowBlock, which SingleRowBlock extends. Seems like it should work.

Where does the error come from?

Member Author

> toColumnarRow checks that the input is an instance of AbstractRowBlock, which SingleRowBlock extends. Seems like it should work.

SingleRowBlock extends AbstractSingleRowBlock, not AbstractRowBlock.
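
The distinction can be reproduced with stand-in classes. The sketch below does not use the real Trino SPI: the classes are hypothetical and mirror only the names and inheritance relationships mentioned in this thread (`RowBlock` and the `acceptedByGuard` method are added purely for illustration), assuming the guard in `toColumnarRow` is an `instanceof AbstractRowBlock` check as described above.

```java
// Stand-in hierarchy, NOT the real Trino SPI classes: only the names and the
// "extends" relationships discussed in this thread are mirrored here.
public class RowBlockHierarchySketch
{
    interface Block {}

    abstract static class AbstractRowBlock implements Block {}

    abstract static class AbstractSingleRowBlock implements Block {}

    // Hypothetical block type that does pass the guard
    static final class RowBlock extends AbstractRowBlock {}

    // Extends AbstractSingleRowBlock, so it is unrelated to AbstractRowBlock
    static final class SingleRowBlock extends AbstractSingleRowBlock {}

    // Hypothetical guard modelled on the instanceof check described for toColumnarRow
    static boolean acceptedByGuard(Block block)
    {
        return block instanceof AbstractRowBlock;
    }

    public static void main(String[] args)
    {
        System.out.println(acceptedByGuard(new RowBlock()));       // true
        System.out.println(acceptedByGuard(new SingleRowBlock())); // false
    }
}
```

Because the two abstract classes share only the `Block` interface in this sketch, the similar names are the sole connection between them, which is why the guard rejects the single-row variant.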

Member

Ah, sorry I can't read

Member Author

ebyhr commented Oct 6, 2022

CI hit #14391

Contributor

The assertions for addFileEntries.get(0) and addFileEntries.get(1) are not relevant. The stats already existed there before running the test.

Member Author

No, they're relevant. Those two assertions fail if we don't copy the statistics.

import static io.trino.plugin.hive.HiveTestUtils.HDFS_ENVIRONMENT;
import static io.trino.testing.TestingConnectorSession.SESSION;

public final class TestDeltaLakeUtils
Member

Test -> Testing

{
    private TestDeltaLakeUtils() {}

    public static List<AddFileEntry> getAddFileEntries(SchemaTableName table, String tableLocation)
Member

The table has no impact on the result of this method, so you can remove this parameter and use eg new SchemaTableName("dummy_schema_placeholder", "dummy_table_placeholder") below

Member

include the key sets in the message

also, would be nice to add a comment why this is expected. it's not obvious to me

Member

btw instead of this check here, i'd rather have a non-null check on type after Type type = columnTypeMapping.get(value.getKey()); line
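
A minimal sketch of that suggestion, under loud assumptions: `String` stands in for Trino's `Type`, the `requireType` helper is hypothetical, and the exception type is an arbitrary choice. It looks the type up, fails fast when the lookup returns null, and puts both the missing key and the known key set in the message, as asked for in the earlier comment.

```java
import java.util.Map;

public class TypeLookupSketch
{
    // Hypothetical helper; String is a stand-in for Trino's Type.
    // Fails fast right after the lookup instead of comparing key sets up front.
    static String requireType(Map<String, String> columnTypeMapping, String columnName)
    {
        String type = columnTypeMapping.get(columnName);
        if (type == null) {
            // Include the missing key and the known keys to make the failure debuggable
            throw new IllegalStateException(
                    "No type found for column '" + columnName + "'; known columns: " + columnTypeMapping.keySet());
        }
        return type;
    }

    public static void main(String[] args)
    {
        Map<String, String> columnTypeMapping = Map.of("id", "bigint");
        System.out.println(requireType(columnTypeMapping, "id"));
        try {
            requireType(columnTypeMapping, "missing_column");
        }
        catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Checking at the point of use reports exactly which entry had no matching type, whereas an up-front key-set comparison can only say that the two sets differ.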

Member

a "verify ..." should verify, i.e. ensure something is true

as a follow-up we could rename this to eg skipUnlessInsertsSupported
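
The naming point can be illustrated with a hedged sketch. Nothing here is taken from the PR: the method bodies, the `supportsInserts` flag, and the `TestSkipped` exception (a stand-in for a test framework's skip signal, such as TestNG's `SkipException`) are all hypothetical.

```java
public class SkipNamingSketch
{
    // Stand-in for a test framework's "skip" signal; hypothetical, not a real framework class
    static final class TestSkipped extends RuntimeException
    {
        TestSkipped(String message)
        {
            super(message);
        }
    }

    // "verify..." asserts: an unmet condition is treated as a failure
    static void verifyInsertsSupported(boolean supportsInserts)
    {
        if (!supportsInserts) {
            throw new IllegalStateException("Inserts are expected to be supported");
        }
    }

    // "skipUnless..." does not assert: an unmet condition skips the test instead
    static void skipUnlessInsertsSupported(boolean supportsInserts)
    {
        if (!supportsInserts) {
            throw new TestSkipped("Inserts are not supported, skipping");
        }
    }

    public static void main(String[] args)
    {
        try {
            skipUnlessInsertsSupported(false);
        }
        catch (TestSkipped e) {
            System.out.println("skipped: " + e.getMessage());
        }
    }
}
```

With the second name, a reader of the test body can tell that an unsupported feature leads to a skip rather than a failure.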

Member Author

I will send a follow-up PR.

Member

for the test, do we need transaction json files before the checkpoint (0 and 1) ?

Member Author

Those files aren't required. Removed.

Member

do the getAddFileEntries come from a new snapshot that we just created, or from previous snapshot + transaction log files?

i think the intention is that we create transaction 4 and a checkpoint, so let's verify that happened

Comment on lines 1 to 21
Member

❤️

@ebyhr ebyhr force-pushed the ebi/delta-json-stats-row-type branch from 2312115 to bcbbc9f October 11, 2022 01:44
@ebyhr ebyhr merged commit a9480bd into master Oct 11, 2022
@ebyhr ebyhr deleted the ebi/delta-json-stats-row-type branch October 11, 2022 04:58
@github-actions github-actions bot added this to the 400 milestone Oct 11, 2022
Development

Successfully merging this pull request may close these issues.

Support converting column stats on ROW type to JSON from Parquet in Delta Lake connector

4 participants