Support DML operations on Delta tables with name column mapping #15837

Closed
mx123 wants to merge 2 commits into trinodb:master from mx123:delta-brx-dml-column-mapping
Conversation

@mx123
Contributor

@mx123 mx123 commented Jan 25, 2023

Description

This functionality corresponds to writer version 5 for DML operations:
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#writer-version-requirements

Relates to #12638

Release notes

(x) Release notes are required, with the following suggested text:

# Delta Lake
* Support DML operations on Delta tables with `name` column mapping. ({issue}`12638`)

@findinpath
Contributor

Please address the conflicts with master.

Please also consider relying on Trino to perform the DML statements in io.trino.tests.product.deltalake.TestDeltaLakeColumnMappingMode, instead of running them through Databricks.

@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch from c6d6b65 to 0a94f95 Compare January 26, 2023 13:56
@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch 3 times, most recently from 95c6be2 to 1f2d377 Compare January 27, 2023 11:23
@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch from 1f2d377 to 1e1d87e Compare January 27, 2023 13:31
@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch 2 times, most recently from 9095186 to 30bf882 Compare January 27, 2023 14:44
Contributor

Can we have a test for nested types as well?

Contributor Author

@mx123 mx123 Jan 30, 2023

I guess that's unrelated, since this PR contains no SQL syntax change.

Member

+1 to adding a nested column type case, because we sometimes hit row-type-specific issues.

Contributor Author

@mx123 mx123 Feb 3, 2023

Added tests for nested types, but they fail on tables with id column mapping. Looking into it...

Contributor

@findinpath findinpath Feb 3, 2023

We seem to have an issue with nested types when creating the Parquet schema for tables with delta.columnMapping.mode set to id.

Contributor

This limitation is probably related to d468fb7
I'm not sure though why it works on the read side.

cc @ebyhr

Contributor

@findinpath findinpath Feb 3, 2023

One possibility I see is to add to DeltaLakeColumnMetadata and DeltaLakeColumnHandle a mapping of <physical name String, field id OptionalInt>, so that all the field ids are available in the ParquetSchemaConverter.

This construct would replace

@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch from 30bf882 to dbe553f Compare January 30, 2023 12:00
@findinpath
Contributor

Tests are failing

Error:  Failures: 
Error:    TestDeltaLakeConnectorSmokeTest>BaseDeltaLakeConnectorSmokeTest.testInsertIntoPartitionedNonLowercaseColumnTable:1029->AbstractTestQueryFramework.assertUpdate:370->AbstractTestQueryFramework.assertUpdate:375 » QueryFailed
Error:    TestDeltaLakeLegacyWriterConnectorSmokeTest>BaseDeltaLakeConnectorSmokeTest.testInsertIntoPartitionedNonLowercaseColumnTable:1029->AbstractTestQueryFramework.assertUpdate:370->AbstractTestQueryFramework.assertUpdate:375 » QueryFailed

@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch from dbe553f to ad5d1d9 Compare January 31, 2023 10:22
@mx123
Contributor Author

mx123 commented Jan 31, 2023

> Tests are failing
>
> Error:  Failures: 
> Error:    TestDeltaLakeConnectorSmokeTest>BaseDeltaLakeConnectorSmokeTest.testInsertIntoPartitionedNonLowercaseColumnTable:1029->AbstractTestQueryFramework.assertUpdate:370->AbstractTestQueryFramework.assertUpdate:375 » QueryFailed
> Error:    TestDeltaLakeLegacyWriterConnectorSmokeTest>BaseDeltaLakeConnectorSmokeTest.testInsertIntoPartitionedNonLowercaseColumnTable:1029->AbstractTestQueryFramework.assertUpdate:370->AbstractTestQueryFramework.assertUpdate:375 » QueryFailed

fixed by ad5d1d9

@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch from ad5d1d9 to 1490323 Compare January 31, 2023 15:14
@mx123 mx123 marked this pull request as ready for review January 31, 2023 16:54
@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch from 1490323 to a6e5dfb Compare February 1, 2023 08:04
@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch 2 times, most recently from db22c34 to 3ee29d9 Compare February 3, 2023 10:28
@findepi
Member

findepi commented Feb 8, 2023

Per offline discussion, please split the work into separate name and id mapping changes.
Let's focus on name first, since this is the portion that works already, right?

@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch from 3ee29d9 to b7f8872 Compare February 9, 2023 14:43
@ebyhr ebyhr changed the title from "Support DML operations on Delta tables with id / name column mapping" to "Support DML operations on Delta tables with name column mapping" Feb 10, 2023
Member

@ebyhr ebyhr left a comment

Please check CI failure.

@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch 2 times, most recently from 7932ff5 to 9c45512 Compare February 10, 2023 12:02
@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch from 9c45512 to a53a7e8 Compare February 10, 2023 14:56
@findinpath
Contributor

@mx123 there seem to be some issues discovered by the tests (about collecting stats):

see https://github.com/trinodb/trino/pull/15837/checks?check_run_id=11266685083

@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch from a53a7e8 to 3716ee9 Compare February 13, 2023 12:11
@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch from 3716ee9 to 8536585 Compare February 13, 2023 16:12
Contributor

@findinpath findinpath Feb 13, 2023

Do the stats settings play any role?

Contributor

@findinpath findinpath Feb 13, 2023

Does this apply only to column mapping NONE?
If yes, please pass the column mapping as a parameter to the method and use it in the if statement.

Member

Agreed, if the behavior is very different for name mapping I'd rewrite this method to only do the name mapping and call it getPartitionColumnsForNameMapping

Member

In the case of none mapping, you're just returning the input handle.getMetadataEntry().getOriginalPartitionColumns(), right?

Contributor

nit: there is no need for the "Note:" prefix in the statement. The comment implies that this is a developer note.

If we add the column mapping as a parameter to the method, we may not need this comment anymore.
If the column mapping is NONE, perform the mapping of the column names as in the original code; for NAME, use the physical column name; for ID, throw an IllegalArgumentException.
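
A minimal sketch of the dispatch suggested above, not the connector's actual code. The class, the `ColumnMappingMode` enum and the `logicalToPhysical` map are hypothetical stand-ins; returning the input unchanged for NONE is an assumption taken from the question raised earlier in this thread.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class PartitionColumnsSketch
{
    // Hypothetical stand-in for the connector's column mapping mode
    enum ColumnMappingMode { NONE, NAME, ID }

    private PartitionColumnsSketch() {}

    // Resolves the partition column names to use when writing.
    // logicalToPhysical is an assumed lookup from logical to physical column name.
    static List<String> partitionColumns(
            List<String> originalPartitionColumns,
            Map<String, String> logicalToPhysical,
            ColumnMappingMode mode)
    {
        switch (mode) {
            case NONE:
                // Assumption: without column mapping the names are used as they are
                return originalPartitionColumns;
            case NAME:
                // With name mapping, write under the physical column name
                return originalPartitionColumns.stream()
                        .map(name -> {
                            String physicalName = logicalToPhysical.get(name);
                            if (physicalName == null) {
                                throw new IllegalArgumentException("No physical name for column: " + name);
                            }
                            return physicalName;
                        })
                        .collect(Collectors.toList());
            case ID:
                // id mapping is out of scope here, so fail loudly
                throw new IllegalArgumentException("Unsupported column mapping mode: " + mode);
        }
        throw new IllegalArgumentException("Unknown column mapping mode: " + mode);
    }
}
```

Taking the mode as an explicit parameter keeps the three behaviors visible in one place, which is what would make the explanatory comment unnecessary.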

return new DeltaLakeColumnHandle(FILE_MODIFIED_TIME_COLUMN_NAME, FILE_MODIFIED_TIME_TYPE, OptionalInt.empty(), FILE_MODIFIED_TIME_COLUMN_NAME, FILE_MODIFIED_TIME_TYPE, SYNTHESIZED);
}

public Type getSupportedType()
Member

Needs an @JsonIgnore annotation

Member

I'd also call this getPhysicalType

Comment on lines 355 to 356
Member

Let's try to avoid iterating over the column list twice

Suggested change
- List<String> dataColumnNames = dataColumns.stream().map(DeltaLakeColumnHandle::getPhysicalName).collect(toImmutableList());
- List<Type> parquetTypes = dataColumns.stream().map(DeltaLakeColumnHandle::getSupportedType).collect(toImmutableList());
+ ImmutableList.Builder<String> dataColumnNames = ImmutableList.builder();
+ ImmutableList.Builder<Type> parquetTypes = ImmutableList.builder();
+ for (DeltaLakeColumnHandle column : dataColumns) {
+     dataColumnNames.add(..);
+     parquetTypes.add(...);
+ }
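
For illustration, here is a self-contained version of the single-pass idea. It is only a sketch: `Column` is a hypothetical, simplified stand-in for DeltaLakeColumnHandle, types are plain strings, and JDK collections are used in place of Guava's ImmutableList.Builder.

```java
import java.util.ArrayList;
import java.util.List;

final class SinglePassColumnsSketch
{
    // Hypothetical, simplified stand-in for DeltaLakeColumnHandle
    static final class Column
    {
        final String physicalName;
        final String physicalType;

        Column(String physicalName, String physicalType)
        {
            this.physicalName = physicalName;
            this.physicalType = physicalType;
        }
    }

    // Holder for the two lists produced by one traversal
    static final class NamesAndTypes
    {
        final List<String> names;
        final List<String> types;

        NamesAndTypes(List<String> names, List<String> types)
        {
            this.names = names;
            this.types = types;
        }
    }

    private SinglePassColumnsSketch() {}

    // Collects names and types together, iterating over the columns only once
    static NamesAndTypes collect(List<Column> dataColumns)
    {
        List<String> names = new ArrayList<>(dataColumns.size());
        List<String> types = new ArrayList<>(dataColumns.size());
        for (Column column : dataColumns) {
            names.add(column.physicalName);
            types.add(column.physicalType);
        }
        // Return immutable copies, mirroring the intent of ImmutableList
        return new NamesAndTypes(List.copyOf(names), List.copyOf(types));
    }
}
```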

Comment on lines 1420 to 1422
Member

This is O(n^2) in the size of the column list: not terrible, but not great. Can we do better by pre-generating a Map to replace the list traversal you're doing here?
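
As a rough sketch of the Map lookup proposed here, assuming the nested traversal resolves each partition column against the full column list by name. `Column` and `physicalNames` are hypothetical names for illustration, not the code under review.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ColumnLookupSketch
{
    // Hypothetical, simplified column description
    static final class Column
    {
        final String name;
        final String physicalName;

        Column(String name, String physicalName)
        {
            this.name = name;
            this.physicalName = physicalName;
        }
    }

    private ColumnLookupSketch() {}

    // Resolves partition columns to physical names in O(n + m)
    // instead of scanning allColumns once per partition column
    static List<String> physicalNames(List<String> partitionColumns, List<Column> allColumns)
    {
        // Build the lookup once
        Map<String, Column> columnsByName = new HashMap<>();
        for (Column column : allColumns) {
            columnsByName.put(column.name, column);
        }

        List<String> result = new ArrayList<>(partitionColumns.size());
        for (String partitionColumn : partitionColumns) {
            Column column = columnsByName.get(partitionColumn);
            if (column == null) {
                throw new IllegalArgumentException("Unknown partition column: " + partitionColumn);
            }
            result.add(column.physicalName);
        }
        return result;
    }
}
```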

@ebyhr
Member

ebyhr commented Feb 13, 2023

/test-with-secrets sha=853658543ad1a765ba3c89b557fbdad664f4236d

@ebyhr
Member

ebyhr commented Feb 13, 2023

The CheckpointWriterManager#writeCheckpoint method seems to be failing internally:

spark> CREATE TABLE default.test (c1 int) using delta LOCATION 's3://trino-ci-test/test' TBLPROPERTIES('delta.columnMapping.mode'='name', 'delta.checkpointInterval' = 1);

trino> INSERT INTO delta.default.test VALUES (1);
INSERT: 1 row
2023-02-14T08:13:42.832+0900	ERROR	20230213_231339_00001_vs9gg.0.0.0-8-131	io.trino.plugin.deltalake.DeltaLakeMetadata	Failed to write checkpoint for table default.test for version 1
java.lang.IllegalArgumentException: Error: : expected at the position 495 of 'struct<id:string,name:string,description:string,format:struct<provider:string,options:map<string,string>>,schemaString:string,partitionColumns:array<string>,configuration:map<string,string>,createdTime:bigint>:struct<minReaderVersion:int,minWriterVersion:int>:struct<appId:string,version:bigint,lastUpdated:bigint>:struct<path:string,partitionValues:map<string,string>,size:bigint,modificationTime:bigint,dataChange:boolean,stats:string,stats_parsed:struct<numRecords:bigint,minValues:struct<col-f6f46225-adf5-4d70-ba3c-27f55e909663:int>,maxValues:struct<col-f6f46225-adf5-4d70-ba3c-27f55e909663:int>,nullCount:struct<col-f6f46225-adf5-4d70-ba3c-27f55e909663:bigint>>,tags:map<string,string>>:struct<path:string,deletionTimestamp:bigint,dataChange:boolean>' but '-' is found.
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:413)
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:384)
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:541)
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:542)
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:542)
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:358)
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:848)
	at io.trino.plugin.hive.HiveType.toHiveTypes(HiveType.java:206)
	at io.trino.plugin.hive.util.HiveUtil.getColumnTypes(HiveUtil.java:1085)
	at io.trino.plugin.hive.RecordFileWriter.<init>(RecordFileWriter.java:97)
	at io.trino.plugin.deltalake.transactionlog.checkpoint.CheckpointWriter.write(CheckpointWriter.java:136)
	at io.trino.plugin.deltalake.transactionlog.checkpoint.CheckpointWriterManager.writeCheckpoint(CheckpointWriterManager.java:135)
	at io.trino.plugin.deltalake.DeltaLakeMetadata.writeCheckpointIfNeeded(DeltaLakeMetadata.java:1866)
	at io.trino.plugin.deltalake.DeltaLakeMetadata.finishInsert(DeltaLakeMetadata.java:1403)
	at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorMetadata.finishInsert(ClassLoaderSafeConnectorMetadata.java:519)
	at io.trino.metadata.MetadataManager.finishInsert(MetadataManager.java:911)
	at io.trino.sql.planner.LocalExecutionPlanner.lambda$createTableFinisher$4(LocalExecutionPlanner.java:4039)
	at io.trino.operator.TableFinishOperator.getOutput(TableFinishOperator.java:319)
	at io.trino.operator.Driver.processInternal(Driver.java:394)
	at io.trino.operator.Driver.lambda$process$8(Driver.java:297)
	at io.trino.operator.Driver.tryWithLock(Driver.java:689)
	at io.trino.operator.Driver.process(Driver.java:289)
	at io.trino.operator.Driver.processForDuration(Driver.java:260)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:773)
	at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:165)
	at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:523)
	at io.trino.$gen.Trino_dev____20230213_231142_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
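
The message above reports that `:` was expected but `-` was found, which suggests the type-string parser does not accept `-` inside a struct field name such as `col-f6f46225-...`. The sketch below is a deliberately simplified, hypothetical tokenizer, not Hive's TypeInfoParser; the rule that only letters, digits and `_` form a name is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TypeStringTokenSketch
{
    private TypeStringTokenSketch() {}

    // Assumption: only letters, digits and '_' can be part of a name token
    static boolean isNameChar(char c)
    {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Splits a type string into name tokens and single-character symbol tokens
    static List<String> tokenize(String typeString)
    {
        List<String> tokens = new ArrayList<>();
        int position = 0;
        while (position < typeString.length()) {
            int start = position;
            if (isNameChar(typeString.charAt(position))) {
                while (position < typeString.length() && isNameChar(typeString.charAt(position))) {
                    position++;
                }
            }
            else {
                position++;
            }
            tokens.add(typeString.substring(start, position));
        }
        return tokens;
    }
}
```

Under that assumption, `struct<col-1:int>` splits into `struct`, `<`, `col`, `-`, `1`, `:`, `int`, `>`, so a parser that expects `:` right after the field name `col` would see `-` instead.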

@github-actions

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4168182591

@findinpath
Contributor

@mx123 regarding #15837 (comment)
please verify the checkpoint creation:

@mx123 mx123 force-pushed the delta-brx-dml-column-mapping branch from 8536585 to d469d0c Compare February 14, 2023 09:15
@ebyhr
Member

ebyhr commented Feb 15, 2023

I'm looking into the checkpoint creation bug above now and taking over this PR.

@mx123
Contributor Author

mx123 commented Feb 20, 2023

closed due to #15837 (comment)

@mx123 mx123 closed this Feb 20, 2023
5 participants