Core: Support _deleted metadata column in vectorized read #4888

flyrain · 2022-05-28T00:22:10Z

The vectorized version of #4683.

cc @aokolnychyi @szehon-ho @RussellSpitzer @chenjunjiedada @stevenzwu @Reo-LEI @hameizi @singhpk234 @rajarshisarkar @kbendick @rdblue

Benchmarks for pos delete and eq delete

flyrain · 2022-05-28T00:27:12Z

...4/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java

      required(101, "data", Types.StringType.get()),
-      MetadataColumns.ROW_POSITION,
-      MetadataColumns.IS_DELETED
+      MetadataColumns.ROW_POSITION


We need this change since the class VectorizedReaderBuilder is shared by all spark versions. The change in line 94 of VectorizedReaderBuilder changes the type of the reader as the following code shows. Then, the read throws exception in the method IcebergArrowColumnVector.forHolder() of the old Spark version. This change should be fine due to the old Spark doesn't really support _deleted metadata column.

reorderedFields.add(new VectorizedArrowReader.DeletedVectorReader());

RussellSpitzer · 2022-06-15T15:02:13Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java

+        numRowsUndeleted = applyEqDelete(newColumnarBatch);
+      }
+
+      if (hasColumnIsDeleted) {


This is a nit but, i think this makes more sense read a hasIsDeletedColumn

RussellSpitzer · 2022-06-15T20:19:32Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java

    return new ConstantVectorHolder(numRows, constantValue);
  }

+  public static <T> VectorHolder isDeletedHolder(int numRows) {


This is another kinda on the fence for me, while the return type isn't boolean, this does seem like a boolean method. Maybe it should just be deletedHolder ? Just skip the "is" since it's a bit confusing in this context?

I think the class name is fine, just this method seems a little confusing to me, but maybe it's just me :)

We may keep it as is to keep the naming consistent since we don't change the class name.

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java

RussellSpitzer · 2022-06-15T22:19:22Z

....2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/DeletedMetaColumnVector.java

+  }
+
+  @Override
+  public byte getByte(int rowId) {


Not sure if we did this in the others, but IMHO all the accessors should throw UnsupportedOperationException except for getBoolean

Make sense. We did in class RowPositionColumnVector.

RussellSpitzer · 2022-06-15T22:22:10Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java

    private ColumnarBatch columnarBatch;
+    private final int numRowsToRead;
+    private int[] rowIdMapping; // the rowId mapping to skip deleted rows for all column vectors inside a batch
+    private boolean[] isDeleted; // the array to indicate if a row is deleted or not


For my confusion below can we indicate here for these two arrays to describe when these two can be null?

...2/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java

RussellSpitzer · 2022-06-15T22:24:50Z

spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java


  @Test
  public void testPosDeletesWithDeletedColumn() throws IOException {
-    Assume.assumeFalse(vectorized);


RussellSpitzer

I think this is pretty close, I just have a few questions about how we propagate around null rowIdMapping and isDeletedColumns and a few naming nits.

flyrain · 2022-06-17T23:28:05Z

Thanks @RussellSpitzer for the review. Refactor class ColumnBatchReader a bit to remove two loops on isDeleted array. My benchmark shows a bit perf gain.

flyrain · 2022-06-22T23:37:38Z

Hi @aokolnychyi and @RussellSpitzer, vectorized read is enabled by default several months ago. But the benchmark still assumes it false by default. I have set it false explicitly, and run the benchmark again. Now we can see the big performance gain between vectorized and non-vectorized read, as the following diagram shows.

I also profile the benchmarks. Here is the flame graph for vectorized read with 25% row deleted. It looks normal to me. The program spent majority time to read pos delete file and the data file. The read of position delete file is still non-vectorized, that's why it takes a big portion. Would suggest to enable the vectorized read on delete files to improve the overall perf. That's probably next step we can do.

RussellSpitzer

I'm good to go on this, @aokolnychyi are you ready as well?

aokolnychyi · 2022-07-22T01:53:58Z

Sorry for the delay. Let me see.

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

...4/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java

aokolnychyi · 2022-07-22T02:23:30Z

spark/v3.2/spark/src/jmh/java/org/apache/iceberg/spark/source/IcebergSourceDeleteBenchmark.java

+  public void readIceberg(Blackhole blackhole) {
    Map<String, String> tableProperties = Maps.newHashMap();
    tableProperties.put(SPLIT_OPEN_FILE_COST, Integer.toString(128 * 1024 * 1024));
+    tableProperties.put(TableProperties.PARQUET_VECTORIZATION_ENABLED, "false");


nit: Let's add a static import like we have for SPLIT_OPEN_FILE_COST for consistency

Applies to all places in this class.

aokolnychyi · 2022-07-22T02:29:01Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java

+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.vectorized.ColumnVector;
+
+public class ColumnVectorBuilder {


Do this class and its constructors/methods have to be public?

It's not necessary. Let me change it to package-level.

aokolnychyi · 2022-07-22T15:48:42Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java

+      if (hasIsDeletedColumn && rowIdMapping != null) {
+        // reset the row id mapping array, so that it doesn't filter out the deleted rows
+        for (int i = 0; i < numRowsToRead; i++) {
+          rowIdMapping[i] = i;


Question: do we have to populate the row ID mapping initially if we know we have _deleted metadata column?

That's a good question. In short, I'm using row ID mapping to improve eq deletes perf when we have both pos deletes and eq deletes. I think it is worth to do that since applying eq deletes is expensive, it has to go row by row. Here is an example, after the pos deletes, we will only need to iterate 6 rows instead of 8 rows for applying eq delete.

* Filter out the equality deleted rows. Here is an example, * [0,1,2,3,4,5,6,7] -- Original status of the row id mapping array * [F,F,F,F,F,F,F,F] -- Original status of the isDeleted array * Position delete 2, 6 * [0,1,3,4,5,7,-,-] -- After applying position deletes [Set Num records to 6] * [F,F,T,F,F,F,T,F] -- After applying position deletes * Equality delete 1 <= x <= 3 * [0,4,5,7,-,-,-,-] -- After applying equality deletes [Set Num records to 4] * [F,T,T,T,F,F,T,F] -- After applying equality deletes

Sounds good to me.

aokolnychyi · 2022-07-22T15:56:06Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java

-        arrowColumnVectors[i] = hasDeletes() ?
-            ColumnVectorWithFilter.forHolder(vectorHolders[i], rowIdMapping, numRows) :
-            IcebergArrowColumnVector.forHolder(vectorHolders[i], numRowsInVector);
+        arrowColumnVectors[i] = new ColumnVectorBuilder(vectorHolders[i], numRowsInVector)


Do we have to construct a column vector builder for every column? What about having a constructor accepting the row ID mapping and is deleted array and making build(VectorHolder holder, int numRows)? That way you can init the builder outside of the for loop and call build inside the loop for a particular vectorHolder.

ColumnVectorBuilder columnVectorBuilder = new ColumnVectorBuilder(rowIdMapping, isDeleted); for (int i = 0; i < readers.length; i += 1) { ... arrowColumnVectors[i] = columnVectorBuilder.build(vectorHolders[i], numRowsInVector); }

Nice suggestion. Made the change.

aokolnychyi · 2022-07-22T15:57:49Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java

+        return new DeletedMetaColumnVector(Types.BooleanType.get(), isDeleted);
+      } else if (holder instanceof ConstantVectorHolder) {
+        return new ConstantColumnVector(Types.IntegerType.get(), numRows,
+            ((ConstantVectorHolder) holder).getConstant());


nit: ConstantVectorHolder -> ConstantVectorHolder<?>.

aokolnychyi · 2022-07-22T15:58:07Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java

+        return new DeletedMetaColumnVector(Types.BooleanType.get(), isDeleted);
+      } else if (holder instanceof ConstantVectorHolder) {
+        return new ConstantColumnVector(Types.IntegerType.get(), numRows,
+            ((ConstantVectorHolder) holder).getConstant());


nit: I think this should fit on a single line

it cannot with the <?>

aokolnychyi · 2022-07-22T16:00:31Z

....2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/DeletedMetaColumnVector.java

+import org.apache.spark.sql.vectorized.ColumnarMap;
+import org.apache.spark.unsafe.types.UTF8String;
+
+public class DeletedMetaColumnVector extends ColumnVector {


The naming in new classes is a bit inconsistent. Can we align that?

IsDeletedVectorHolder DeletedMetaColumnVector DeletedVectorReader

Made the following changes

IsDeletedVectorHolder -> DeletedVectorHolder DeletedMetaColumnVector -> DeletedColumnVector DeletedVectorReader

aokolnychyi · 2022-07-22T16:04:14Z

This seems correct to me. I had only a few questions/comments.

flyrain · 2022-07-22T23:19:50Z

Hi @aokolnychyi, this is ready for review. I have to apply the same changes to Spark 3.3, otherwise unit test won't pass.

aokolnychyi

LGTM. I had only one nit (same in 3.2 and 3.3).

aokolnychyi · 2022-07-25T23:36:43Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java

+  private boolean[] isDeleted;
+  private int[] rowIdMapping;
+
+  public ColumnVectorBuilder withDeletedRows(int[] rowIdMappingArray, boolean[] isDeletedArray) {


nit: I feel we better make this a constructor and pass these arrays only once during the construction.

I am trying to make the builder more generic so that it can also be used for creation of vectors without deletes.

Okay, I see now. Then it is fine.

aokolnychyi · 2022-07-25T23:40:39Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java

+    ColumnVector[] readDataToColumnVectors() {
      ColumnVector[] arrowColumnVectors = new ColumnVector[readers.length];

+      ColumnVectorBuilder columnVectorBuilder = new ColumnVectorBuilder();


nit: This is probably where you can pass rowIdMapping and isDeleted as those two don't change.

aokolnychyi · 2022-07-25T23:41:42Z

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java

+      if (hasIsDeletedColumn && rowIdMapping != null) {
+        // reset the row id mapping array, so that it doesn't filter out the deleted rows
+        for (int i = 0; i < numRowsToRead; i++) {
+          rowIdMapping[i] = i;


Sounds good to me.

aokolnychyi · 2022-07-25T23:44:16Z

...k/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java

+  private boolean[] isDeleted;
+  private int[] rowIdMapping;
+
+  public ColumnVectorBuilder withDeletedRows(int[] rowIdMappingArray, boolean[] isDeletedArray) {


nit: Same for 3.3

aokolnychyi · 2022-07-25T23:53:10Z

Thanks, @flyrain! Great to have this done. Thanks for reviewing, @RussellSpitzer!

flyrain · 2022-07-25T23:57:53Z

Thanks for the review, @aokolnychyi @RussellSpitzer. Per discussion with @aokolnychyi, I will file a followup to throw an exception when _deleted column is used in spark 2.4/3.0/3.1. It always return false no matter whether a row is deleted.

flyrain · 2022-07-28T00:55:17Z

@aokolnychyi, checked the module Spark2.4/3.0/3.1. Metadata column _deleted isn't supported. For example, I've added the following tests into class TestSelect.

@Test
public void testSelectDeletedMetaColumn() {
  List<Object[]> expected = ImmutableList.of(
      row(1L, "a", 1.0F), row(2L, "b", 2.0F), row(3L, "c", Float.NaN));

  assertEquals("Should return all expected rows", expected, sql("SELECT * FROM %s where _deleted=false", tableName));

  expected = ImmutableList.of();
  assertEquals("Should return all expected rows", expected, sql("SELECT * FROM %s where _deleted=true", tableName));
}

It reports the following errors. We don't need to change anything in that sense.

cannot resolve '`_deleted`' given input columns: [table.id, table.data, table.doubleVal]; line 1 pos 26;
'Project [*]
+- 'Filter ('_deleted = false)
   +- SubqueryAlias `table`
      +- RelationV2 iceberg[id#12, data#13, doubleVal#14] (Options: [path=file:/var/folders/69/j9m_r8gj69753xfnsjnlsl_m0000gn/T/junit2794346723320750539/junit3760903...)

org.apache.spark.sql.AnalysisException: cannot resolve '`_deleted`' given input columns: [table.id, table.data, table.doubleVal]; line 1 pos 26;
'Project [*]
+- 'Filter ('_deleted = false)
   +- SubqueryAlias `table`
      +- RelationV2 iceberg[id#12, data#13, doubleVal#14] (Options: [path=file:/var/folders/69/j9m_r8gj69753xfnsjnlsl_m0000gn/T/junit2794346723320750539/junit3760903...)

And the change I did in class TestSparkParquetReadMetadataColumns.java, they are unit tests for internal classes. User cannot use them directly.
In conclusion, we are OK to leave it as is.

github-actions bot added arrow spark labels May 28, 2022

flyrain commented May 28, 2022

View reviewed changes

RussellSpitzer reviewed Jun 15, 2022

View reviewed changes

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java Outdated Show resolved Hide resolved

RussellSpitzer reviewed Jun 15, 2022

View reviewed changes

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java Outdated Show resolved Hide resolved

RussellSpitzer reviewed Jun 15, 2022

View reviewed changes

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java Outdated Show resolved Hide resolved

RussellSpitzer reviewed Jun 15, 2022

View reviewed changes

...k/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java Outdated Show resolved Hide resolved

RussellSpitzer reviewed Jun 15, 2022

View reviewed changes

...2/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java Show resolved Hide resolved

RussellSpitzer reviewed Jun 15, 2022

View reviewed changes

RussellSpitzer approved these changes Jul 12, 2022

View reviewed changes

aokolnychyi reviewed Jul 22, 2022

View reviewed changes

flyrain added 8 commits July 22, 2022 13:19

Core: Support _deleted metadata column in vectorized read

f8889b7

Add benchmark for vectorized read

15d8b72

Resolve comments

f1dfb30

Refactor for the perf.

aa919f6

Add Java doc

685f94e

Use Blackhole in benchmark.

e3cbd97

Need to set non-vectorized read explicitly

a40d6b1

Resolve comments.

d4e90bd

flyrain force-pushed the vr-deletedColumn branch from 3487b88 to 411bdda Compare July 22, 2022 20:37

Apply the same change to Spark3.3

a81ebc6

flyrain force-pushed the vr-deletedColumn branch from 411bdda to a81ebc6 Compare July 22, 2022 21:48

aokolnychyi approved these changes Jul 25, 2022

View reviewed changes

aokolnychyi merged commit 2a6d17f into apache:master Jul 25, 2022

Core: Support _deleted metadata column in vectorized read #4888

Core: Support _deleted metadata column in vectorized read #4888

Uh oh!

Conversation

flyrain commented May 28, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

flyrain May 28, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

RussellSpitzer Jun 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RussellSpitzer left a comment

Choose a reason for hiding this comment

Uh oh!

flyrain commented Jun 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

flyrain commented Jun 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RussellSpitzer left a comment

Choose a reason for hiding this comment

Uh oh!

aokolnychyi commented Jul 22, 2022

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aokolnychyi commented Jul 22, 2022

flyrain commented May 28, 2022 •

edited

Loading

flyrain May 28, 2022 •

edited

Loading

RussellSpitzer Jun 15, 2022 •

edited

Loading

flyrain commented Jun 17, 2022 •

edited

Loading

flyrain commented Jun 22, 2022 •

edited

Loading