Support Iceberg row-level delete and update #10075
jackye1995 wants to merge 1 commit into trinodb:master
Conversation
Added support for users to directly create a v2 table using the format_version table property.
Does this mean that I can define the Iceberg table spec version in the WITH clause by specifying format_version?
My understanding is that all deletes received from the updatable page source will always come from the same split, which means the same task and therefore the same partition. So there is no need to use the fanout writer for writing deletes. Please let me know if that is not the case.
As we discussed, Iceberg is adding full Jackson support for its object models. This wrapper will be updated to use Jackson serialization once that is completed. Meanwhile, I delegate everything to the Java serialization that Iceberg provides. Because of that, some related classes have been given Serializable or Externalizable implementations.
What is the timeline for that? Would it be feasible to wait for Jackson serialization before merging this PR?
Yeah, it's up to the community; I am just posting this for review first while working on that. The 0.13 release train is going out soon, so we will target 0.14 for this support, likely around Jan 2022.
Java serialization is a hard no for Trino. It is a huge, ugly mess with lots of security problems. Instead, for temporary storage you could do something like annotating the Iceberg objects with Jackson annotations, or use Jackson features to serialize third-party objects. For long-term storage, you should write your own serialization objects and copy in and out of them.
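The long-term suggestion above, dedicated serialization objects that copy values in and out of the third-party type, can be sketched roughly as follows. All class names here are illustrative stand-ins, not Trino or Iceberg classes:

```java
public class SerializationDtoSketch
{
    // Stand-in for a third-party class we cannot annotate or modify
    static final class ThirdPartySpec
    {
        private final String column;
        private final int fieldId;

        ThirdPartySpec(String column, int fieldId)
        {
            this.column = column;
            this.fieldId = fieldId;
        }

        String column()
        {
            return column;
        }

        int fieldId()
        {
            return fieldId;
        }
    }

    // Our own stable wire-format object: copy in, copy out.
    // This record is what would actually be serialized (e.g. with Jackson).
    record SpecDto(String column, int fieldId)
    {
        static SpecDto fromSpec(ThirdPartySpec spec)
        {
            return new SpecDto(spec.column(), spec.fieldId());
        }

        ThirdPartySpec toSpec()
        {
            return new ThirdPartySpec(column, fieldId);
        }
    }

    public static void main(String[] args)
    {
        ThirdPartySpec original = new ThirdPartySpec("ts_day", 1000);
        // Round-trip through the DTO; the third-party class never touches the wire
        ThirdPartySpec restored = SpecDto.fromSpec(original).toSpec();
        System.out.println(restored.column() + ":" + restored.fieldId());  // prints "ts_day:1000"
    }
}
```

The benefit of this pattern is that the wire format is owned by the connector and stays stable even if the third-party class changes shape between library versions.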
By adding row-level deletes, some behaviors of DELETE become unclear to users: some queries run as metadata deletes, some as row-level deletes. We might want to introduce the following configs:
- whether metadata delete should be used when possible
- for metadata delete, whether it should use the DeleteFiles API or a global equality delete
- for row-level delete, whether it should use position deletes or equality deletes

We know the Trino architecture fits position deletes better than equality deletes, but please read the document in apache/iceberg#3432 for the tradeoff between the two. There might be use cases where people want to use equality deletes instead.
We will probably add these configs when the feature requests come.
I think it's better to avoid new settings if possible. So, questions:
- If metadata delete is possible, you'd always want that, right?
- If so, why would anybody want the equality delete when DeleteFiles works?
- Not clear here. That seems to be the only one we want to configure, and probably on a query-by-query basis.
- Yes; we should extend the metadata delete feature to also consider Iceberg hidden partitions. As of today it only works for direct partition columns, and deleting based on a partition transform predicate results in a row-level delete. That is the "unclear behavior to users" I meant.
- Because it does not touch data files, it will not cause commit conflicts. However, I also do not want to add use cases for global equality deletes, for performance reasons; I am just listing the theoretical use cases here.
- Agree, I think we will need a session config and ideally a per-query config for this (if there is a need from the community). But for now, position deletes definitely work best with Trino's row-level delete design.
We don't need a config:
- When "metadata delete" is possible, we should just do it. This is at most the same work as split generation. We determine that whole files should 'disappear' from the table, and we should just do that, without any delete files.
- We should never use "equality deletes". We discussed this somewhere in the Iceberg community already. Trino can do the work to produce position-based delete files, so subsequent read queries are better handled; also, Trino wants to return the number of deleted rows to the user, so processing data files is desirable anyway.
- So: "metadata delete" when possible, position-based deletes otherwise.
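That decision rule can be sketched as a tiny function. The names and the predicate check below are illustrative only, not the actual Trino code path:

```java
public class DeleteStrategySketch
{
    enum DeleteStrategy
    {
        METADATA_DELETE,   // drop whole files via table metadata; no delete files written
        POSITION_DELETE    // write position-based delete files for the affected rows
    }

    // Hypothetical check: true when the delete predicate aligns with partition
    // boundaries, so entire data files are known to match and can be dropped
    static DeleteStrategy choose(boolean predicateCoversWholeFiles)
    {
        return predicateCoversWholeFiles
                ? DeleteStrategy.METADATA_DELETE
                : DeleteStrategy.POSITION_DELETE;
    }

    public static void main(String[] args)
    {
        System.out.println(choose(true));   // prints METADATA_DELETE
        System.out.println(choose(false));  // prints POSITION_DELETE
    }
}
```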
Force-pushed from 02907c3 to cc52d49
BIGINT,
Optional.empty());

// use Integer.MIN_VALUE as $row_id field ID, which is currently not reserved by Iceberg
Would we know if Iceberg starts using this ID internally? Would something break?
Because an Iceberg column name and type can change, comparing column IDs is needed for equality checks. Using a duplicated column ID would cause issues in that area.
All the Iceberg metadata columns have IDs counting down from Integer.MAX_VALUE. That's why I am choosing Integer.MIN_VALUE here. If there is any risk of conflicting IDs, Iceberg will inform the Trino community, and switching to another ID is fully backwards compatible.
public static final int FORMAT_VERSION_SUPPORT_MIN = 1;
public static final int FORMAT_VERSION_SUPPORT_MAX = 2;
nit: MIN/MAX_SUPPORTED_VERSION
}
}

Optional<ReaderColumns> readerColumns = Optional.of(new ReaderColumns(projectedColumns.build(), outputColumnMapping.build()));
boolean isDeleteOrUpdateQuery = false;
for (int idx = 0; idx < queriedColumns.size(); idx++) {
    IcebergColumnHandle column = queriedColumns.get(idx);
    if (column.isTrinoRowIdColumn()) {
Should we be more explicit and encode fact that we are doing DELETE/UPDATE in IcebergTableHandle instead of inferring it from list of columns?
        .collect(toImmutableList());

Map<Integer, Optional<String>> partitionKeys = split.getPartitionKeys();
Optional<StructLike> partition = task.spec().isUnpartitioned() ? Optional.empty() : Optional.of(task.file().partition());
The whole createPageSource is hard to follow: a block of code with lots of raw index operations. It would be great to add some higher-level structure to it.
Yes, I agree. I can explore that a bit more to see what the best way is to refactor this to be more readable.
        IcebergPageSourceProvider::applyProjection));
return new IcebergPageSource(icebergColumns, partitionKeys, dataPageSource.get(), projectionsAdapter);

return new IcebergPageSource(
IcebergColumnHandle column,
List<Integer> fileReadColumnIds,
List<IcebergColumnHandle> fileReadColumns,
Object[] prefillValues,
if (column.isTrinoRowIdColumn()) {
    // TODO: it's a bit late to fail here, but failing earlier would cause metadata delete to also fail
    if (ORC == getFileFormat(table.getTable())) {
        throw new TrinoException(GENERIC_USER_ERROR, "Row level delete and update are not supported for ORC type");
- throw new TrinoException(GENERIC_USER_ERROR, "Row level delete and update are not supported for ORC type");
+ throw new TrinoException(GENERIC_USER_ERROR, "Row-level delete and update are not supported for ORC");
Block[] queriedColumnPrefillValues = new Block[queriedColumns.size()];
int[] queriedColumnFileReadChannels = new int[queriedColumns.size()];
boolean isDeleteOrUpdateQuery = false;
for (int idx = 0; idx < queriedColumns.size(); idx++) {
public void writeExternal(ObjectOutput out)
        throws IOException
{
}

@Override
public void readExternal(ObjectInput in)
        throws IOException, ClassNotFoundException
{
}
That's unlikely to be a correct serialization of HdfsEnvironment state, since it doesn't seem to store anything.
In any case, we should not make this class Serializable (or Externalizable) at all.
public void writeExternal(ObjectOutput out)
        throws IOException
{
}

@Override
public void readExternal(ObjectInput in)
        throws IOException, ClassNotFoundException
{
}
As above: this is neither correct nor desired.
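To illustrate why the empty Externalizable methods are a problem: a class with no-op writeExternal/readExternal silently loses all of its state on a serialization round-trip. This is a self-contained demo class, unrelated to the actual classes in the PR:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

public class EmptyExternalizableDemo
{
    public static class Holder
            implements Externalizable
    {
        public String value;

        public Holder() {}  // public no-arg constructor required by Externalizable

        public Holder(String value)
        {
            this.value = value;
        }

        @Override
        public void writeExternal(ObjectOutput out)
        {
            // writes nothing: the state is dropped on serialization
        }

        @Override
        public void readExternal(ObjectInput in)
        {
            // reads nothing: the field keeps its default value (null)
        }
    }

    public static void main(String[] args)
            throws Exception
    {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Holder("important state"));
        }
        Holder copy;
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            copy = (Holder) in.readObject();
        }
        System.out.println("after round-trip: " + copy.value);  // prints "after round-trip: null"
    }
}
```

Nothing fails at runtime, which is exactly why this pattern is dangerous: the corruption only shows up later, wherever the deserialized object is used.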
FORMAT_VERSION_PROPERTY,
"Iceberg table format version",
icebergConfig.getFormatVersion(),
false))
The validation should happen here, not in getFormatVersion below; see the available integerProperty overload.
Block dictionary = ((DictionaryBlock) rowIds).getDictionary();
if (dictionary instanceof RowBlock) {
    rows = (RowBlock) dictionary;
The semantics of resolveRowIdBlock are ill-defined: you get "some block, with some values". The underlying DictionaryBlock.getDictionary may contain bogus data, or pretty much anything.
Remove this method. You probably only need it because you didn't use the proper class; see the other comment about Block.getChildren.
{
    try {
        Collection<Slice> slices = new ArrayList<>();
        if (posDeleteSink != null) {
If the field is nullable, annotate the field and constructor param with @Nullable.
if (posDeleteSink != null) {
    slices.addAll(posDeleteSink.finish().get());
}
if (updateRowSink != null) {
If the field is nullable, annotate the field and constructor param with @Nullable.
posDeleteSink.abort();
updateRowSink.abort();
finish() assumes the fields are nullable.
}

@Override
public void close()
Close posDeleteSink and updateRowSink here too.
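The review points above (nullable sink fields, null-guarded abort, closing the auxiliary sinks) combine into one lifecycle pattern, sketched here with an illustrative minimal Sink interface rather than the PR's actual page-sink types:

```java
import java.util.ArrayList;
import java.util.List;

public class NullableSinkLifecycle
{
    // Hypothetical minimal sink interface standing in for the PR's page sinks
    interface Sink
    {
        List<String> finish();

        void abort();

        void close();
    }

    static class DeleteAwareWriter
            implements AutoCloseable
    {
        // Both sinks may be absent; in real code the fields and constructor
        // parameters would carry @Nullable annotations to document that
        private final Sink posDeleteSink;
        private final Sink updateRowSink;

        DeleteAwareWriter(Sink posDeleteSink, Sink updateRowSink)
        {
            this.posDeleteSink = posDeleteSink;
            this.updateRowSink = updateRowSink;
        }

        List<String> finish()
        {
            List<String> fragments = new ArrayList<>();
            if (posDeleteSink != null) {
                fragments.addAll(posDeleteSink.finish());
            }
            if (updateRowSink != null) {
                fragments.addAll(updateRowSink.finish());
            }
            return fragments;
        }

        void abort()
        {
            // mirror finish(): guard every access, since the fields are nullable
            if (posDeleteSink != null) {
                posDeleteSink.abort();
            }
            if (updateRowSink != null) {
                updateRowSink.abort();
            }
        }

        @Override
        public void close()
        {
            // close the auxiliary sinks too, not just the main writer
            if (posDeleteSink != null) {
                posDeleteSink.close();
            }
            if (updateRowSink != null) {
                updateRowSink.close();
            }
        }
    }

    public static void main(String[] args)
    {
        // No sinks configured: every lifecycle method must tolerate the nulls
        DeleteAwareWriter writer = new DeleteAwareWriter(null, null);
        System.out.println(writer.finish());  // prints "[]"
        writer.abort();
        writer.close();
    }
}
```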
Hi, recently I tested this PR on Flink CDC data in Iceberg, and found that when there is no update/delete data it works fine and queries are very fast. But it does not work when there is a lot of update/delete data.

Thanks for the information @littleWT! This PR focuses on getting the feature in; I would expect the performance to not be great, because vectorization of reading deletes is still a work in progress in Iceberg. Position delete vectorized read will be out in 0.13, and equality delete vectorized read will be out in 0.14. It would be great if you could provide some information about your test scale for other people to refer to.

@jackye1995 Haha, we had talked about the problem in Dingding!
List<Types.NestedField> icebergFields = new ArrayList<>();
List<RowType.Field> trinoFields = new ArrayList<>();
Usually ImmutableList.Builder is preferred
        .withFileSizeInBytes(task.getFileSizeInBytes())
        .withMetrics(task.getMetrics().metrics());

if (!icebergTable.spec().fields().isEmpty()) {
Maybe
- if (!icebergTable.spec().fields().isEmpty()) {
+ if (partitionColumnTypes.length > 0) {
    delegateIndex++;
}
outputIndex++;
this.queriedColumnPrefillValues = requireNonNull(queriedColumnPrefillValues, "queriedColumnPrefillValues is null");
Are there any verify checks we can put in for these array sizes?
if (fileReadChannel == -1) {
    blocks[i] = new RunLengthEncodedBlock(prefillValues[i], batchSize);
}
else if (fileReadChannel == -2) {
Or maybe an enum rather than integer sentinel values.
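A sketch of that suggestion: the sentinel integers become named enum cases. From the diff, -1 appears to mean "prefilled (RLE) value"; the meaning of -2 is not shown, so the ROW_ID case below is an assumption, and all names are illustrative:

```java
public class ChannelSourceSketch
{
    // Replaces the -1 / -2 sentinel integers with self-documenting cases
    enum ChannelSource
    {
        FILE_READ,   // value comes from a channel actually read from the data file
        PREFILLED,   // value is a constant, emitted as a run-length encoded block
        ROW_ID       // value is a synthesized row-id column (assumed meaning of -2)
    }

    static String describe(ChannelSource source, int fileReadChannel)
    {
        switch (source) {
            case FILE_READ:
                return "read from file channel " + fileReadChannel;
            case PREFILLED:
                return "run-length encoded prefill value";
            case ROW_ID:
                return "synthesized row id";
        }
        throw new AssertionError("unreachable");
    }

    public static void main(String[] args)
    {
        System.out.println(describe(ChannelSource.PREFILLED, -1));  // prints "run-length encoded prefill value"
    }
}
```

Besides readability, the enum lets the compiler flag any switch that forgets to handle a newly added source kind.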
@Override
public ConnectorTableHandle beginUpdate(ConnectorSession session, ConnectorTableHandle tableHandle, List<ColumnHandle> updatedColumns)
{
    IcebergTableHandle table = (IcebergTableHandle) tableHandle;
This should throw an unsupported exception if the table is v1, right? Same for beginDelete.
Please add a test with OPTIMIZE after some rows are modified with DELETE and UPDATE.

@jackye1995 @findepi I'm going to take a look at the conflicts and see if I can split some of this out into smaller chunks. It looks like some of the serialization work for the REST catalog should be useful here too.

Thanks @alexjo2144!

@jackye1995 thanks for your work here. @alexjo2144 can this be closed now, or is there additional code to be merged?

So far I've only pulled out the read support pieces from here; I'm working on the merge conflicts for write support now. I have a copy of Jack's code though, so I think we can close this PR.

Superseded by #11886
This PR continues the effort of #8534 and #8565 to provide full support for reading Iceberg position and equality deletes, and for writing Iceberg position deletes in Parquet.
I have added some tests to ensure the correctness of the implementation, and I will continue to add more tests in the following days. I will leave some comments in the code as discussion points.
This is a big PR and we can separate it into multiple PRs for the actual contribution, but anyone interested can also try this patch out; I have made sure all related tests pass.