
Conversation

flyrain (Contributor) commented May 28, 2022

required(101, "data", Types.StringType.get()),
MetadataColumns.ROW_POSITION,
MetadataColumns.IS_DELETED
MetadataColumns.ROW_POSITION
flyrain (Contributor Author) commented May 28, 2022:

We need this change because the class VectorizedReaderBuilder is shared by all Spark versions. The change at line 94 of VectorizedReaderBuilder changes the type of the reader, as the following code shows. The read then throws an exception in the method IcebergArrowColumnVector.forHolder() of the old Spark versions. This should be fine, since the old Spark versions don't really support the _deleted metadata column.

        reorderedFields.add(new VectorizedArrowReader.DeletedVectorReader());
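For context, a hedged sketch of the shared builder logic being discussed (names come from the diff above; the surrounding structure is illustrative, not the exact Iceberg code):

    // When the projected schema contains the _deleted metadata column, the shared
    // builder emits a DeletedVectorReader; the resulting VectorHolder then has to
    // be handled by each Spark version's forHolder() dispatch.
    if (field.fieldId() == MetadataColumns.IS_DELETED.fieldId()) {
      reorderedFields.add(new VectorizedArrowReader.DeletedVectorReader());
    } else if (field.fieldId() == MetadataColumns.ROW_POSITION.fieldId()) {
      reorderedFields.add(VectorizedArrowReader.positions());
    }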

numRowsUndeleted = applyEqDelete(newColumnarBatch);
}

if (hasColumnIsDeleted) {
Member commented:

This is a nit, but I think this reads better as hasIsDeletedColumn.

return new ConstantVectorHolder(numRows, constantValue);
}

public static <T> VectorHolder isDeletedHolder(int numRows) {
Member commented:

This is another one I'm kind of on the fence about: while the return type isn't boolean, this does read like a boolean method. Maybe it should just be deletedHolder? Just skip the "is", since it's a bit confusing in this context.

Member commented:

I think the class name is fine; it's just this method that seems a little confusing to me, but maybe it's just me :)

flyrain (Contributor Author) replied:

We may keep it as is for naming consistency, since we don't change the class name.

}

@Override
public byte getByte(int rowId) {
RussellSpitzer (Member) commented Jun 15, 2022:

Not sure if we did this in the others, but IMHO all the accessors should throw UnsupportedOperationException except for getBoolean

flyrain (Contributor Author) replied:

Makes sense. We did that in class RowPositionColumnVector.
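For illustration, a minimal sketch of that pattern (assuming a boolean-only vector like the one under review; abbreviated, not the exact merged code):

    // Only getBoolean() is meaningful for the _deleted column vector; every other
    // accessor fails fast, mirroring what RowPositionColumnVector does.
    @Override
    public boolean getBoolean(int rowId) {
      return isDeleted[rowId];
    }

    @Override
    public byte getByte(int rowId) {
      throw new UnsupportedOperationException();
    }

    @Override
    public int getInt(int rowId) {
      throw new UnsupportedOperationException();
    }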

private ColumnarBatch columnarBatch;
private final int numRowsToRead;
private int[] rowIdMapping; // the rowId mapping to skip deleted rows for all column vectors inside a batch
private boolean[] isDeleted; // the array to indicate if a row is deleted or not
Member commented:

Related to my confusion below: can we add comments on these two arrays describing when each can be null?
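One possible shape for those comments (a suggestion, not the merged wording):

    // null unless the batch has deletes; when set, rowIdMapping[i] holds the
    // original position of the i-th live row
    private int[] rowIdMapping;
    // null unless the _deleted metadata column is projected; when set,
    // isDeleted[rowId] tells whether that row was deleted
    private boolean[] isDeleted;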


@Test
public void testPosDeletesWithDeletedColumn() throws IOException {
Assume.assumeFalse(vectorized);
Member commented:

+1

RussellSpitzer (Member) left a comment:

I think this is pretty close. I just have a few questions about how we propagate null rowIdMapping and isDeleted arrays around, and a few naming nits.

flyrain (Contributor Author) commented Jun 17, 2022:

Thanks @RussellSpitzer for the review. I refactored the class ColumnarBatchReader a bit to remove two loops over the isDeleted array. My benchmark shows a small perf gain.

flyrain (Contributor Author) commented Jun 22, 2022:

Hi @aokolnychyi and @RussellSpitzer, vectorized read was enabled by default several months ago, but the benchmark still assumed it was false by default. I have set it to false explicitly and run the benchmark again. Now we can see the big performance gain of vectorized over non-vectorized read, as the following diagram shows.

[Screenshot: benchmark results comparing vectorized and non-vectorized reads]

I also profiled the benchmarks. Here is the flame graph for vectorized read with 25% of rows deleted. It looks normal to me: the program spends the majority of its time reading the position delete file and the data file. The read of the position delete file is still non-vectorized, which is why it takes a big portion. I would suggest enabling vectorized reads on delete files to improve the overall perf; that's probably the next step we can take.

[Screenshot: flame graph for vectorized read with 25% of rows deleted]

RussellSpitzer (Member) left a comment:

I'm good to go on this. @aokolnychyi, are you ready as well?

aokolnychyi (Contributor) commented:

Sorry for the delay. Let me see.

public void readIceberg(Blackhole blackhole) {
  Map<String, String> tableProperties = Maps.newHashMap();
  tableProperties.put(SPLIT_OPEN_FILE_COST, Integer.toString(128 * 1024 * 1024));
  tableProperties.put(TableProperties.PARQUET_VECTORIZATION_ENABLED, "false");
Contributor commented:

nit: Let's add a static import like we have for SPLIT_OPEN_FILE_COST for consistency

Contributor commented:

Applies to all places in this class.
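Concretely, the nit amounts to something like this (a sketch; the constant lives in org.apache.iceberg.TableProperties):

    import static org.apache.iceberg.TableProperties.PARQUET_VECTORIZATION_ENABLED;

    // ...matching the existing static import of SPLIT_OPEN_FILE_COST
    tableProperties.put(PARQUET_VECTORIZATION_ENABLED, "false");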

import org.apache.iceberg.types.Types;
import org.apache.spark.sql.vectorized.ColumnVector;

public class ColumnVectorBuilder {
Contributor commented:

Do this class and its constructors/methods have to be public?

flyrain (Contributor Author) replied:

It's not necessary. Let me change it to package-private.

if (hasIsDeletedColumn && rowIdMapping != null) {
  // reset the row id mapping array, so that it doesn't filter out the deleted rows
  for (int i = 0; i < numRowsToRead; i++) {
    rowIdMapping[i] = i;
Contributor commented:

Question: do we have to populate the row ID mapping initially if we know we have the _deleted metadata column?

flyrain (Contributor Author) replied:

That's a good question. In short, I'm using the row ID mapping to improve eq-delete perf when we have both pos deletes and eq deletes. I think it is worth doing, since applying eq deletes is expensive: it has to go row by row. Here is an example (a code sketch follows it): after the pos deletes, we only need to iterate over 6 rows instead of 8 when applying the eq deletes.

     * Filter out the equality deleted rows. Here is an example,
     * [0,1,2,3,4,5,6,7] -- Original status of the row id mapping array
     * [F,F,F,F,F,F,F,F] -- Original status of the isDeleted array
     * Position delete 2, 6
     * [0,1,3,4,5,7,-,-] -- After applying position deletes [Set Num records to 6]
     * [F,F,T,F,F,F,T,F] -- After applying position deletes
     * Equality delete 1 <= x <= 3
     * [0,4,5,7,-,-,-,-] -- After applying equality deletes [Set Num records to 4]
     * [F,T,T,T,F,F,T,F] -- After applying equality deletes
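To make that concrete, a hedged sketch (simplified; the real applyEqDelete signature differs) of the compaction the example describes:

    import java.util.function.IntPredicate;

    // Equality deletes only visit rows that survived position deletes; survivors
    // are compacted to the front of rowIdMapping and the live-row count shrinks.
    static int applyEqDelete(int numLiveRows, int[] rowIdMapping, boolean[] isDeleted,
                             IntPredicate isEqDeleted) {
      int currentRowId = 0;
      for (int i = 0; i < numLiveRows; i++) {
        int rowId = rowIdMapping[i];
        if (isEqDeleted.test(rowId)) {
          isDeleted[rowId] = true;              // surfaced later via the _deleted column
        } else {
          rowIdMapping[currentRowId++] = rowId; // keep live rows contiguous
        }
      }
      return currentRowId;                      // new number of live rows
    }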

Contributor commented:

Sounds good to me.

arrowColumnVectors[i] = hasDeletes() ?
    ColumnVectorWithFilter.forHolder(vectorHolders[i], rowIdMapping, numRows) :
    IcebergArrowColumnVector.forHolder(vectorHolders[i], numRowsInVector);
arrowColumnVectors[i] = new ColumnVectorBuilder(vectorHolders[i], numRowsInVector)
Contributor commented:

Do we have to construct a column vector builder for every column? What about having a constructor that accepts the row ID mapping and the isDeleted array, and a build(VectorHolder holder, int numRows) method? That way you can init the builder outside the for loop and call build inside the loop for a particular vectorHolder:

ColumnVectorBuilder columnVectorBuilder = new ColumnVectorBuilder(rowIdMapping, isDeleted);
for (int i = 0; i < readers.length; i += 1) {
  ...
  arrowColumnVectors[i] = columnVectorBuilder.build(vectorHolders[i], numRowsInVector);
}

flyrain (Contributor Author) replied:

Nice suggestion. Made the change.

  return new DeletedMetaColumnVector(Types.BooleanType.get(), isDeleted);
} else if (holder instanceof ConstantVectorHolder) {
  return new ConstantColumnVector(Types.IntegerType.get(), numRows,
      ((ConstantVectorHolder) holder).getConstant());
Contributor commented:

nit: ConstantVectorHolder -> ConstantVectorHolder<?>.

  return new DeletedMetaColumnVector(Types.BooleanType.get(), isDeleted);
} else if (holder instanceof ConstantVectorHolder) {
  return new ConstantColumnVector(Types.IntegerType.get(), numRows,
      ((ConstantVectorHolder) holder).getConstant());
Contributor commented:

nit: I think this should fit on a single line

flyrain (Contributor Author) replied:

It cannot fit on one line with the <?>.
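Applied, the two nits combine roughly like this (a sketch; the wildcard is what forces the line break):

    } else if (holder instanceof ConstantVectorHolder) {
      return new ConstantColumnVector(Types.IntegerType.get(), numRows,
          ((ConstantVectorHolder<?>) holder).getConstant());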

import org.apache.spark.sql.vectorized.ColumnarMap;
import org.apache.spark.unsafe.types.UTF8String;

public class DeletedMetaColumnVector extends ColumnVector {
Contributor commented:

The naming in new classes is a bit inconsistent. Can we align that?

IsDeletedVectorHolder
DeletedMetaColumnVector
DeletedVectorReader

flyrain (Contributor Author) replied:

Made the following changes:

IsDeletedVectorHolder -> DeletedVectorHolder 
DeletedMetaColumnVector -> DeletedColumnVector
DeletedVectorReader

aokolnychyi (Contributor) commented:

This seems correct to me. I had only a few questions/comments.

flyrain force-pushed the vr-deletedColumn branch from 3487b88 to 411bdda on July 22, 2022 at 20:37
flyrain force-pushed the vr-deletedColumn branch from 411bdda to a81ebc6 on July 22, 2022 at 21:48
flyrain (Contributor Author) commented Jul 22, 2022:

Hi @aokolnychyi, this is ready for review. I had to apply the same changes to Spark 3.3; otherwise the unit tests won't pass.

aokolnychyi (Contributor) left a comment:

LGTM. I had only one nit (same in 3.2 and 3.3).

private boolean[] isDeleted;
private int[] rowIdMapping;

public ColumnVectorBuilder withDeletedRows(int[] rowIdMappingArray, boolean[] isDeletedArray) {
Contributor commented:

nit: I feel we'd better make this a constructor and pass these arrays only once, during construction.

flyrain (Contributor Author) replied:

I'm trying to make the builder more generic so that it can also be used to create vectors without deletes.
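For instance (hypothetical usage built from the method shown above):

    // With deletes: the builder carries the delete state.
    ColumnVector withDeletes = new ColumnVectorBuilder()
        .withDeletedRows(rowIdMapping, isDeleted)
        .build(vectorHolders[i], numRowsInVector);

    // Without deletes: the same builder, no delete state attached.
    ColumnVector plain = new ColumnVectorBuilder()
        .build(vectorHolders[i], numRowsInVector);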

Contributor replied:

Okay, I see now. Then it is fine.

ColumnVector[] readDataToColumnVectors() {
  ColumnVector[] arrowColumnVectors = new ColumnVector[readers.length];

  ColumnVectorBuilder columnVectorBuilder = new ColumnVectorBuilder();
Contributor commented:

nit: This is probably where you can pass rowIdMapping and isDeleted as those two don't change.

if (hasIsDeletedColumn && rowIdMapping != null) {
  // reset the row id mapping array, so that it doesn't filter out the deleted rows
  for (int i = 0; i < numRowsToRead; i++) {
    rowIdMapping[i] = i;
Contributor commented:

Sounds good to me.

private boolean[] isDeleted;
private int[] rowIdMapping;

public ColumnVectorBuilder withDeletedRows(int[] rowIdMappingArray, boolean[] isDeletedArray) {
Contributor commented:

nit: Same for 3.3

aokolnychyi merged commit 2a6d17f into apache:master on Jul 25, 2022
aokolnychyi (Contributor) commented:

Thanks, @flyrain! Great to have this done. Thanks for reviewing, @RussellSpitzer!

flyrain (Contributor Author) commented Jul 25, 2022:

Thanks for the review, @aokolnychyi @RussellSpitzer. Per discussion with @aokolnychyi, I will file a follow-up to throw an exception when the _deleted column is used in Spark 2.4/3.0/3.1; it always returns false there, regardless of whether a row is deleted.
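A hedged sketch of what that follow-up could look like (hypothetical; the actual follow-up PR may differ):

    // Fail fast instead of silently returning false for _deleted on old Spark versions.
    if (field.fieldId() == MetadataColumns.IS_DELETED.fieldId()) {
      throw new UnsupportedOperationException(
          "Metadata column _deleted is not supported in Spark 2.4/3.0/3.1");
    }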

flyrain (Contributor Author) commented Jul 28, 2022:

@aokolnychyi, I checked the Spark 2.4/3.0/3.1 modules; the metadata column _deleted isn't supported there. For example, I added the following test to class TestSelect:

@Test
public void testSelectDeletedMetaColumn() {
  List<Object[]> expected = ImmutableList.of(
      row(1L, "a", 1.0F), row(2L, "b", 2.0F), row(3L, "c", Float.NaN));

  assertEquals("Should return all expected rows", expected, sql("SELECT * FROM %s where _deleted=false", tableName));

  expected = ImmutableList.of();
  assertEquals("Should return all expected rows", expected, sql("SELECT * FROM %s where _deleted=true", tableName));
}

It reports the following error, so we don't need to change anything in that sense:

org.apache.spark.sql.AnalysisException: cannot resolve '`_deleted`' given input columns: [table.id, table.data, table.doubleVal]; line 1 pos 26;
'Project [*]
+- 'Filter ('_deleted = false)
   +- SubqueryAlias `table`
      +- RelationV2 iceberg[id#12, data#13, doubleVal#14] (Options: [path=file:/var/folders/69/j9m_r8gj69753xfnsjnlsl_m0000gn/T/junit2794346723320750539/junit3760903...)

As for the changes I made in class TestSparkParquetReadMetadataColumns.java: those are unit tests for internal classes, which users cannot call directly.
In conclusion, we are OK to leave it as is.
