Core: add delete row reader #2320
Conversation
import org.apache.spark.rdd.InputFileBlockHolder;
import org.apache.spark.sql.catalyst.InternalRow;

public class DeleteRowReader extends RowDataReader {
To be more precise, this reader finds all rows that have been deleted by equality deletes. How about using EqualityDeleteRowReader as the class name?
Agreed.
  public DeleteRowReader(CombinedScanTask task, Schema schema, Schema expectedSchema, String nameMapping,
                         FileIO io, EncryptionManager encryptionManager, boolean caseSensitive) {
    super(task, schema, schema, nameMapping, io, encryptionManager,
        caseSensitive);
Nit: no need to start a new line here.
    return matches.matchEqDeletes(open(task, requiredSchema, idToConstant)).iterator();
  }

  protected class SparkDeleteMatcher extends DeleteFilter<InternalRow> {
Could we actually share the same DeleteFilter for the Spark engine? Then we wouldn't have to introduce the same SparkDeleteFilter for both RowDataReader and DeleteRowReader.
Correct.
  @Test
  public void testReadDeleteRow() throws IOException {
    String tableName = "testDeleteRowRead";
I think we need a unit test covering the case where there are multiple versions of the equality delete schema, to verify that the predicates concatenated by OR work as expected.
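For illustration, here is a self-contained sketch of the OR semantics such a test would pin down (plain Java, outside the Iceberg test harness; the row type, delete sets, and values are made up): a row counts as deleted if any equality delete schema matches it.

```java
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Stream;

public class OrDeletePredicatesExample {
  static class Row {
    final long id;
    final String data;

    Row(long id, String data) {
      this.id = id;
      this.data = data;
    }
  }

  public static void main(String[] args) {
    // Delete sets produced by two different equality delete schemas.
    Set<Long> deletedIds = Set.of(29L, 89L);      // delete schema keyed on (id)
    Set<String> deletedData = Set.of("a", "d");   // delete schema keyed on (data)

    Predicate<Row> deletedById = row -> deletedIds.contains(row.id);
    Predicate<Row> deletedByData = row -> deletedData.contains(row.data);

    // OR-concatenate: a row is deleted if any delete schema matches it.
    Predicate<Row> isDeleted = Stream.of(deletedById, deletedByData)
        .reduce(Predicate::or)
        .orElse(row -> false);

    System.out.println(isDeleted.test(new Row(29L, "x")));  // true, matches the id schema
    System.out.println(isDeleted.test(new Row(1L, "d")));   // true, matches the data schema
    System.out.println(isDeleted.test(new Row(1L, "x")));   // false, matches neither
  }
}
```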
  }

  @Test
  public void testReadDeleteRow() throws IOException {
Another point: I think we'd better move these newly added unit tests into the abstract DeleteReadTests as much as possible. Then we don't have to introduce the same tests again for other engines.
Makes sense to me. I plan to add an abstract function for reading delete rows that will be implemented in the engine-specific test classes. Since the engine-specific read logic isn't ready yet, I think we can refactor once it is.
Force-pushed from 3bf38c8 to e41e4ba
openinx left a comment
LGTM, left a minor comment.
  public EqualityDeleteRowReader(CombinedScanTask task, Schema schema, Schema expectedSchema, String nameMapping,
                                 FileIO io, EncryptionManager encryptionManager, boolean caseSensitive) {
    super(task, schema, schema, nameMapping, io, encryptionManager, caseSensitive);
    this.tableSchema = schema;
Nit: this tableSchema is actually defined in its parent class RowDataReader, right? We usually introduce a protected tableSchema() method in RowDataReader to access the schema, rather than defining an extra private member in the subclass.
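A minimal sketch of that accessor pattern (class bodies and constructor signatures are heavily simplified here, not the actual Iceberg classes):

```java
import org.apache.iceberg.Schema;

class RowDataReader {
  private final Schema tableSchema;

  RowDataReader(Schema tableSchema) {
    this.tableSchema = tableSchema;
  }

  // Subclasses read the schema through this accessor instead of keeping a duplicate field.
  protected Schema tableSchema() {
    return tableSchema;
  }
}

class EqualityDeleteRowReader extends RowDataReader {
  EqualityDeleteRowReader(Schema schema) {
    super(schema);
    // No extra `this.tableSchema = schema` here; call tableSchema() wherever the schema is needed.
  }
}
```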
Got this merged. Thanks @chenjunjiedada for contributing!
    return isInDeleteSets;
  }

  public CloseableIterable<T> findEqualityDeleteRows(CloseableIterable<T> records) {
Does this method need to be public? I'm surprised that it is, given that applyEqDeletes is not.
I also think it would be good to have a better name for it, if it does need to be public. This is really just applying a filter, so I think something like keepDeletedRows would be more descriptive.
  }

  private CloseableIterable<T> applyEqDeletes(CloseableIterable<T> records) {
  private List<Predicate<T>> applyEqDeletes() {
Looks like this wasn't renamed. It doesn't apply equality deletes, so I'd rather use a more descriptive name, like buildEqDeletePredicates.
    filteredRecords = Deletes.filter(filteredRecords,
        record -> projectRow.wrap(asStructLike(record)), deleteSet);
    Predicate<T> isInDeleteSet = record -> deleteSet.contains(projectRow.wrap(asStructLike(record)));
    isInDeleteSets.add(isInDeleteSet);
Why doesn't this return a single predicate, isDeleted? Both findEqualityDeleteRows and applyEqDeletes end up producing a Predicate that determines whether a row is deleted. The only difference is that deletedRows and remainingRows are negations of each other. But those methods could just as easily use isDeleted and negate in shouldKeep.
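A minimal sketch of that suggestion, assuming a hypothetical eqDeletePredicates() helper standing in for whatever collects the per-schema delete predicates, and Iceberg's CloseableIterable.filter as the filtering helper (this is not the PR's actual code):

```java
// One predicate per equality delete schema, OR-ed into a single isDeleted test.
private Predicate<T> isDeleted() {
  return eqDeletePredicates().stream()
      .reduce(Predicate::or)
      .orElse(record -> false);   // no equality deletes: nothing is deleted
}

// The two entry points then differ only by negating the same predicate in shouldKeep.
private CloseableIterable<T> applyEqDeletes(CloseableIterable<T> records) {
  return CloseableIterable.filter(records, isDeleted().negate());   // keep live rows
}

public CloseableIterable<T> findEqualityDeleteRows(CloseableIterable<T> records) {
  return CloseableIterable.filter(records, isDeleted());            // keep deleted rows
}
```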
  private CloseableIterable<T> applyEqDeletes(CloseableIterable<T> records) {
    // Predicate to test whether a row should be visible to user after applying equality deletions.
    Predicate<T> remainingRows = applyEqDeletes().stream()
If there are no delete predicates, then this should return the original iterable.
        return deletedRows.test(item);
      }
    };

    return deletedRowsFilter.filter(records);
If there are no delete predicates, this should return CloseableIterable.empty.
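Extending the sketch above, the two empty-case short-circuits from the last two comments could look roughly like this (again assuming the hypothetical eqDeletePredicates() helper):

```java
private CloseableIterable<T> applyEqDeletes(CloseableIterable<T> records) {
  if (eqDeletePredicates().isEmpty()) {
    return records;   // no equality deletes: every row survives, return the input unchanged
  }
  return CloseableIterable.filter(records, isDeleted().negate());
}

public CloseableIterable<T> findEqualityDeleteRows(CloseableIterable<T> records) {
  if (eqDeletePredicates().isEmpty()) {
    return CloseableIterable.empty();   // no equality deletes: there are no deleted rows
  }
  return CloseableIterable.filter(records, isDeleted());
}
```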
    return set;
  }

  protected StructLikeSet rowSetWitIds(int... idsToRetain) {
Typo: should be rowSetWithIds
@chenjunjiedada and @openinx, thanks for making progress on this! I wanted to catch up on what's happening in this area, so I went ahead and did a round of review as well. I think there are a few minor things to improve in a follow-up. I'm also wondering why this is only focused on equality deletes. Why not return all deleted rows? Is that because we only want equality deletes where this is used?
Thanks @rdblue! I will address your comments in a follow-up PR. The reason for reading deleted rows only from equality deletes is that we want to handle equality deletes and position deletes separately, since the filtering logic and cost differ between them. That way we can choose the proper rewrite actions when streaming CDC data. I'm also working on a position delete rewrite action that clusters the position deletes inside a partition, which will include a position delete row reader. Does this make sense to you? These two actions are minor compactions, and @openinx has a PR that removes all deleted rows, which I think is major compaction.
This adds a row reader to read matched deleted rows on the Spark side. The next step is to implement the reader in other engines.