Support Iceberg row-level delete and update #10075
jackye1995 wants to merge 1 commit into trinodb:master
Conversation
Added support for users to directly create a v2 table using the format_version table property.
Does this mean that I can define the Iceberg table spec version in the WITH clause by specifying format_version?
My understanding is that all deletes received from the updatable page source will always come from the same split, which means the same task and therefore the same partition. So there is no need to use the fanout writer for writing deletes. Please let me know if that is not the case.
As we discussed, Iceberg is adding full Jackson support for its object models. This wrapper will be updated to use Jackson serialization once that is completed. Meanwhile, I delegate everything to the Java serialization that Iceberg provides. Because of that, some related classes have been given Serializable or Externalizable implementations.
What is the timeline for that? Would it be feasible to wait for Jackson serialization before merging this PR?
Yeah, it's up to the community; I am just posting this for review first while working on that. The 0.13 release train is going out soon, so we will target 0.14 for this support, likely around Jan 2022.
Java serialization is a hard no for Trino. It is a huge, ugly mess with lots of security problems. Instead, for temporary storage you could do something like annotating the Iceberg objects with Jackson annotations, or use Jackson features to serialize third-party objects. For long-term storage, you should write your own serialization objects and copy in and out of them.
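The long-term suggestion above, dedicated serialization objects that copy values in and out of the third-party type, can be sketched roughly as follows. All class names here are illustrative stand-ins, not Trino or Iceberg classes:

```java
public class SerializationDtoSketch
{
    // Stand-in for a third-party class we cannot annotate or modify
    static final class ThirdPartySpec
    {
        private final String column;
        private final int fieldId;

        ThirdPartySpec(String column, int fieldId)
        {
            this.column = column;
            this.fieldId = fieldId;
        }

        String column()
        {
            return column;
        }

        int fieldId()
        {
            return fieldId;
        }
    }

    // Our own stable wire-format object: copy in, copy out.
    // This record is what would actually be serialized (e.g. with Jackson).
    record SpecDto(String column, int fieldId)
    {
        static SpecDto fromSpec(ThirdPartySpec spec)
        {
            return new SpecDto(spec.column(), spec.fieldId());
        }

        ThirdPartySpec toSpec()
        {
            return new ThirdPartySpec(column, fieldId);
        }
    }

    public static void main(String[] args)
    {
        ThirdPartySpec original = new ThirdPartySpec("ts_day", 1000);
        // Round-trip through the DTO; the third-party class never touches the wire
        ThirdPartySpec restored = SpecDto.fromSpec(original).toSpec();
        System.out.println(restored.column() + ":" + restored.fieldId());  // prints "ts_day:1000"
    }
}
```

The benefit of this pattern is that the wire format is owned by the connector and stays stable even if the third-party class changes shape between library versions.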
By adding row-level deletes, some behaviors of DELETE become unclear to users: some queries run as metadata deletes, some as row-level deletes. We might want to introduce the following configs:
- whether metadata delete should be used when possible
- for metadata delete, whether it should use the DeleteFiles API or a global equality delete
- for row-level delete, whether it should use position deletes or equality deletes

We know the Trino architecture fits position deletes better than equality deletes, but please read the document in apache/iceberg#3432 for the tradeoff between the two. There might be use cases where people want to use equality deletes instead.
We will probably add these configs when the feature requests come.
I think it's better to avoid new settings if possible. So, questions:
- If metadata delete is possible, you'd always want that, right?
- If so, why would anybody want the equality delete when DeleteFiles works?
- Not clear here. That seems to be the only one we want to configure, and probably on a query-by-query basis.
- Yes; we should extend the metadata delete feature to also consider Iceberg hidden partitions. As of today it only works for direct partition columns, and deleting based on a partition transform predicate results in a row-level delete. That is the "unclear behavior to users" I meant.
- Because it does not touch data files, it will not cause commit conflicts. However, I also do not want to add use cases for global equality deletes, for performance reasons; I am just listing the theoretical use cases here.
- Agree, I think we will need a session config and ideally a per-query config for this (if there is a need from the community). But for now, position deletes definitely work best with Trino's row-level delete design.
We don't need a config:
- When "metadata delete" is possible, we should just do it. This is at most the same work as split generation. We determine that whole files should 'disappear' from the table, and we should just do that, without any delete files.
- We should never use "equality deletes". We discussed this somewhere in the Iceberg community already. Trino can do the work to produce position-based delete files, so subsequent read queries are better handled; also, Trino wants to return the number of deleted rows to the user, so processing data files is desirable anyway.
- So: "metadata delete" when possible, position-based deletes otherwise.
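That decision rule can be sketched as a tiny function. The names and the predicate check below are illustrative only, not the actual Trino code path:

```java
public class DeleteStrategySketch
{
    enum DeleteStrategy
    {
        METADATA_DELETE,   // drop whole files via table metadata; no delete files written
        POSITION_DELETE    // write position-based delete files for the affected rows
    }

    // Hypothetical check: true when the delete predicate aligns with partition
    // boundaries, so entire data files are known to match and can be dropped
    static DeleteStrategy choose(boolean predicateCoversWholeFiles)
    {
        return predicateCoversWholeFiles
                ? DeleteStrategy.METADATA_DELETE
                : DeleteStrategy.POSITION_DELETE;
    }

    public static void main(String[] args)
    {
        System.out.println(choose(true));   // prints METADATA_DELETE
        System.out.println(choose(false));  // prints POSITION_DELETE
    }
}
```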
Force-pushed from 02907c3 to cc52d49
BIGINT,
Optional.empty());

// use Integer.MIN_VALUE as $row_id field ID, which is currently not reserved by Iceberg
Would we know if Iceberg starts using this ID internally? Would something break?
Because an Iceberg column name and type can change, comparing column IDs is needed for equality checks. Using a duplicated column ID would cause issues in that area.
All the Iceberg metadata columns have IDs counting down from Integer.MAX_VALUE. That's why I am choosing Integer.MIN_VALUE here. If there is any risk of conflicting IDs, Iceberg will inform the Trino community, and switching to another ID is fully backwards compatible.
public static final int FORMAT_VERSION_SUPPORT_MIN = 1;
public static final int FORMAT_VERSION_SUPPORT_MAX = 2;
nit: MIN/MAX_SUPPORTED_VERSION
}
}

Optional<ReaderColumns> readerColumns = Optional.of(new ReaderColumns(projectedColumns.build(), outputColumnMapping.build()));
boolean isDeleteOrUpdateQuery = false;
for (int idx = 0; idx < queriedColumns.size(); idx++) {
    IcebergColumnHandle column = queriedColumns.get(idx);
    if (column.isTrinoRowIdColumn()) {
Should we be more explicit and encode fact that we are doing DELETE/UPDATE in IcebergTableHandle instead of inferring it from list of columns?
        .collect(toImmutableList());

Map<Integer, Optional<String>> partitionKeys = split.getPartitionKeys();
Optional<StructLike> partition = task.spec().isUnpartitioned() ? Optional.empty() : Optional.of(task.file().partition());
The whole createPageSource is hard to follow: a block of code with lots of raw index operations. It would be great to add some higher-level structure to it.
Yes, I agree. I can explore that a bit more to see what the best way is to refactor this to be more readable.
        IcebergPageSourceProvider::applyProjection));
return new IcebergPageSource(icebergColumns, partitionKeys, dataPageSource.get(), projectionsAdapter);

return new IcebergPageSource(
IcebergColumnHandle column,
List<Integer> fileReadColumnIds,
List<IcebergColumnHandle> fileReadColumns,
Object[] prefillValues,
if (column.isTrinoRowIdColumn()) {
    // TODO: it's a bit late to fail here, but failing earlier would cause metadata delete to also fail
    if (ORC == getFileFormat(table.getTable())) {
        throw new TrinoException(GENERIC_USER_ERROR, "Row level delete and update are not supported for ORC type");
- throw new TrinoException(GENERIC_USER_ERROR, "Row level delete and update are not supported for ORC type");
+ throw new TrinoException(GENERIC_USER_ERROR, "Row-level delete and update are not supported for ORC");
Block[] queriedColumnPrefillValues = new Block[queriedColumns.size()];
int[] queriedColumnFileReadChannels = new int[queriedColumns.size()];
boolean isDeleteOrUpdateQuery = false;
for (int idx = 0; idx < queriedColumns.size(); idx++) {
public void writeExternal(ObjectOutput out)
        throws IOException
{
}

@Override
public void readExternal(ObjectInput in)
        throws IOException, ClassNotFoundException
{
}
That's unlikely to be a correct serialization of HdfsEnvironment state, since it doesn't seem to store anything.
In any case, we should not make this class Serializable (or Externalizable) at all.
public void writeExternal(ObjectOutput out)
        throws IOException
{
}

@Override
public void readExternal(ObjectInput in)
        throws IOException, ClassNotFoundException
{
}
As above: this is neither correct nor desired.
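To illustrate why the empty Externalizable methods are a problem: a class with no-op writeExternal/readExternal silently loses all of its state on a serialization round-trip. This is a self-contained demo class, unrelated to the actual classes in the PR:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

public class EmptyExternalizableDemo
{
    public static class Holder
            implements Externalizable
    {
        public String value;

        public Holder() {}  // public no-arg constructor required by Externalizable

        public Holder(String value)
        {
            this.value = value;
        }

        @Override
        public void writeExternal(ObjectOutput out)
        {
            // writes nothing: the state is dropped on serialization
        }

        @Override
        public void readExternal(ObjectInput in)
        {
            // reads nothing: the field keeps its default value (null)
        }
    }

    public static void main(String[] args)
            throws Exception
    {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Holder("important state"));
        }
        Holder copy;
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            copy = (Holder) in.readObject();
        }
        System.out.println("after round-trip: " + copy.value);  // prints "after round-trip: null"
    }
}
```

Nothing fails at runtime, which is exactly why this pattern is dangerous: the corruption only shows up later, wherever the deserialized object is used.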
FORMAT_VERSION_PROPERTY,
"Iceberg table format version",
icebergConfig.getFormatVersion(),
false))
The validation should happen here, not in getFormatVersion below; see the available integerProperty overload.
Block dictionary = ((DictionaryBlock) rowIds).getDictionary();
if (dictionary instanceof RowBlock) {
    rows = (RowBlock) dictionary;
The semantics of resolveRowIdBlock are ill-defined: you get "some block, with some values". The underlying DictionaryBlock.getDictionary may contain bogus data, or pretty much anything.
Remove this method. You probably only need it because you didn't use the proper class; see the other comment about Block.getChildren.
{
    try {
        Collection<Slice> slices = new ArrayList<>();
        if (posDeleteSink != null) {
If the field is nullable, annotate the field and constructor param with @Nullable.
if (posDeleteSink != null) {
    slices.addAll(posDeleteSink.finish().get());
}
if (updateRowSink != null) {
If the field is nullable, annotate the field and constructor param with @Nullable.
posDeleteSink.abort();
updateRowSink.abort();
finish() assumes the fields are nullable.
}

@Override
public void close()
Close posDeleteSink and updateRowSink here too.
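The review points above (nullable sink fields, null-guarded abort, closing the auxiliary sinks) combine into one lifecycle pattern, sketched here with an illustrative minimal Sink interface rather than the PR's actual page-sink types:

```java
import java.util.ArrayList;
import java.util.List;

public class NullableSinkLifecycle
{
    // Hypothetical minimal sink interface standing in for the PR's page sinks
    interface Sink
    {
        List<String> finish();

        void abort();

        void close();
    }

    static class DeleteAwareWriter
            implements AutoCloseable
    {
        // Both sinks may be absent; in real code the fields and constructor
        // parameters would carry @Nullable annotations to document that
        private final Sink posDeleteSink;
        private final Sink updateRowSink;

        DeleteAwareWriter(Sink posDeleteSink, Sink updateRowSink)
        {
            this.posDeleteSink = posDeleteSink;
            this.updateRowSink = updateRowSink;
        }

        List<String> finish()
        {
            List<String> fragments = new ArrayList<>();
            if (posDeleteSink != null) {
                fragments.addAll(posDeleteSink.finish());
            }
            if (updateRowSink != null) {
                fragments.addAll(updateRowSink.finish());
            }
            return fragments;
        }

        void abort()
        {
            // mirror finish(): guard every access, since the fields are nullable
            if (posDeleteSink != null) {
                posDeleteSink.abort();
            }
            if (updateRowSink != null) {
                updateRowSink.abort();
            }
        }

        @Override
        public void close()
        {
            // close the auxiliary sinks too, not just the main writer
            if (posDeleteSink != null) {
                posDeleteSink.close();
            }
            if (updateRowSink != null) {
                updateRowSink.close();
            }
        }
    }

    public static void main(String[] args)
    {
        // No sinks configured: every lifecycle method must tolerate the nulls
        DeleteAwareWriter writer = new DeleteAwareWriter(null, null);
        System.out.println(writer.finish());  // prints "[]"
        writer.abort();
        writer.close();
    }
}
```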
Hi, recently I tested this PR on Flink CDC data in Iceberg, and found that when there is no update/delete data it works fine and queries are very fast. But it does not work when there is a lot of update/delete data.

Thanks for the information @littleWT! This PR focuses on getting the feature in; I would expect the performance to not be great, because vectorization of reading deletes is still a work in progress in Iceberg. Position delete vectorized read will be out in 0.13, and equality delete vectorized read will be out in 0.14. It would be great if you could provide some information about your test scale for other people to refer to.

@jackye1995 Haha, we had talked about the problem in Dingding!
List<Types.NestedField> icebergFields = new ArrayList<>();
List<RowType.Field> trinoFields = new ArrayList<>();
Usually ImmutableList.Builder is preferred
        .withFileSizeInBytes(task.getFileSizeInBytes())
        .withMetrics(task.getMetrics().metrics());

if (!icebergTable.spec().fields().isEmpty()) {
Maybe
- if (!icebergTable.spec().fields().isEmpty()) {
+ if (partitionColumnTypes.length > 0) {
    delegateIndex++;
}
outputIndex++;
this.queriedColumnPrefillValues = requireNonNull(queriedColumnPrefillValues, "queriedColumnPrefillValues is null");
Are there any verify checks we can put in for these array sizes?
if (fileReadChannel == -1) {
    blocks[i] = new RunLengthEncodedBlock(prefillValues[i], batchSize);
}
else if (fileReadChannel == -2) {
Or maybe an enum rather than integer sentinel values.
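A sketch of that suggestion: the sentinel integers become named enum cases. From the diff, -1 appears to mean "prefilled (RLE) value"; the meaning of -2 is not shown, so the ROW_ID case below is an assumption, and all names are illustrative:

```java
public class ChannelSourceSketch
{
    // Replaces the -1 / -2 sentinel integers with self-documenting cases
    enum ChannelSource
    {
        FILE_READ,   // value comes from a channel actually read from the data file
        PREFILLED,   // value is a constant, emitted as a run-length encoded block
        ROW_ID       // value is a synthesized row-id column (assumed meaning of -2)
    }

    static String describe(ChannelSource source, int fileReadChannel)
    {
        switch (source) {
            case FILE_READ:
                return "read from file channel " + fileReadChannel;
            case PREFILLED:
                return "run-length encoded prefill value";
            case ROW_ID:
                return "synthesized row id";
        }
        throw new AssertionError("unreachable");
    }

    public static void main(String[] args)
    {
        System.out.println(describe(ChannelSource.PREFILLED, -1));  // prints "run-length encoded prefill value"
    }
}
```

Besides readability, the enum lets the compiler flag any switch that forgets to handle a newly added source kind.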
@Override
public ConnectorTableHandle beginUpdate(ConnectorSession session, ConnectorTableHandle tableHandle, List<ColumnHandle> updatedColumns)
{
    IcebergTableHandle table = (IcebergTableHandle) tableHandle;
This should throw an unsupported exception if the table is v1, right? Same for beginDelete.
Please add a test with OPTIMIZE after some rows are modified with DELETE and UPDATE.

@jackye1995 @findepi I'm going to take a look at the conflicts and see if I can split some of this out into smaller chunks. It looks like some of the serialization work for the REST catalog should be useful here too.

Thanks @alexjo2144!

@jackye1995 thanks for your work here. @alexjo2144 can this be closed now, or is there additional code to be merged?

So far I've only pulled out the read support pieces from here; I'm working on the merge conflicts for write support now. I have a copy of Jack's code though, so I think we can close this PR.

Superseded by #11886
This PR continues the effort of #8534 and #8565 to provide full support for reading Iceberg position and equality deletes, and for writing Iceberg position deletes in Parquet.
I have added some tests to ensure the correctness of the implementation, and I will continue to add more tests in the following days. I will leave some comments in the code as discussion points.
This is a big PR and we can separate it into multiple PRs for the actual contribution, but anyone interested can also try this patch out; I have made sure all related tests pass.