Conversation

@wypoon wypoon commented Oct 21, 2022

There is a bug in Parquet vectorized reads reported in #5927.
This bug occurs when reading a Parquet data file (using the BatchDataReader) that is larger than the split size, when there are deletes that need to be applied to the data file. The cause of the bug is that ColumnarBatchReader#setRowGroupInfo is not called with the correct rowPosition, because in ReadConf, generateOffsetToStartPos(Schema) returns null due to an optimization. (When this happens, the startRowPositions array is populated entirely with 0s, so ColumnarBatchReader#setRowGroupInfo gets called with rowPosition 0 even when the rowPosition is that of the second or a subsequent row group. In ColumnarBatchReader, setRowGroupInfo initializes a rowStartPosInBatch field, which is used to determine where in the PositionDeleteIndex to start applying deletes. When rowStartPosInBatch is incorrectly initialized, the positions of the positional deletes are not correctly aligned with the rows in the data file.)
The fix is to ensure that when there are deletes, the Schema has the _pos metadata column in it. Then ReadConf#generateOffsetToStartPos(Schema) will generate the necessary Map that is used to compute the startRowPositions.
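To illustrate the mechanism (class and method names here are hypothetical stand-ins, not Iceberg's actual API): generateOffsetToStartPos effectively builds a map from each row group's file offset to the file-wide position of that group's first row, by accumulating the row counts of the preceding groups. A minimal sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the offset-to-start-position map that
// ReadConf#generateOffsetToStartPos conceptually produces: for each row group,
// the file-wide position of its first row is the sum of the row counts of all
// preceding row groups.
public class StartPosSketch {
  static Map<Long, Long> offsetToStartPos(long[] rowGroupOffsets, long[] rowGroupRowCounts) {
    Map<Long, Long> map = new HashMap<>();
    long startPos = 0;
    for (int i = 0; i < rowGroupOffsets.length; i++) {
      map.put(rowGroupOffsets[i], startPos); // position of this group's first row
      startPos += rowGroupRowCounts[i];      // advance past this group's rows
    }
    return map;
  }

  public static void main(String[] args) {
    // Two row groups of 100 and 97 rows at made-up file offsets 4 and 5000:
    // the second group's first row is at file-wide position 100.
    Map<Long, Long> m = offsetToStartPos(new long[] {4L, 5000L}, new long[] {100L, 97L});
    System.out.println(m.get(4L) + " " + m.get(5000L)); // prints: 0 100
  }
}
```

When this map is null, every entry of startRowPositions falls back to 0, which is exactly the misalignment described above.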

Added a unit test that reproduces the problem without this fix. Without this fix, the test passes for non-vectorized read and fails for vectorized read. With this fix, the test passes for both cases.


wypoon commented Oct 21, 2022

@flyrain I believe you implemented the support for row-level deletes in the vectorized reader. Can you please review this? Also @aokolnychyi @RussellSpitzer @chenjunjiedada.


@chenjunjiedada chenjunjiedada left a comment


+1. Could you please backport this to Spark 3.1 and 3.2 as well?


wypoon commented Oct 21, 2022

Thanks for reviewing, @chenjunjiedada. I do plan to port this to Spark 3.2 once this is approved by a committer. Let me look into 3.1 as well. I also think this bug needs to be fixed in the 1.0.x branch.


@flyrain flyrain left a comment


+1 for the solution. Thanks @wypoon for the fix!
Not a blocker, and we may do it in another PR, but I think we need more test coverage for reads with deletes across multiple Parquet row groups. We could add an option here to generate multi-row-group Parquet files and reuse the class TestSparkReaderDeletes.

I'm a bit confused by this behavior: ReadConf.startRowPositions is valid only if the _pos column exists in the expectedSchema, due to #1716. Are there use cases where _pos is absent and we still need ReadConf.startRowPositions? Looking at the classes VectorizedParquetReader and ParquetReader, which consume ReadConf.startRowPositions, it seems likely the schema doesn't have _pos.
cc @chenjunjiedada @aokolnychyi


wypoon commented Oct 22, 2022

I'm a bit confused by this behavior: ReadConf.startRowPositions is valid only if the _pos column exists in the expectedSchema, due to #1716. Are there use cases where _pos is absent and we still need ReadConf.startRowPositions? Looking at the classes VectorizedParquetReader and ParquetReader, which consume ReadConf.startRowPositions, it seems likely the schema doesn't have _pos.

I too was surprised by the behavior. In my example, before my fix, when the query

select count(*) from default.test_iceberg where e is null

is run after the update, the Schema that is passed to ReadConf#generateOffsetToStartPos(Schema) is

{
  5: e: optional double
}

so it did not have _pos.

@flyrain are you asking if there are other cases where _pos will still be absent after this fix and we need ReadConf#startRowPositions() to return a valid startRowPositions?

@chenjunjiedada

I'm a bit confused by this behavior: ReadConf.startRowPositions is valid only if the _pos column exists in the expectedSchema, due to #1716. Are there use cases where _pos is absent and we still need ReadConf.startRowPositions? Looking at the classes VectorizedParquetReader and ParquetReader, which consume ReadConf.startRowPositions, it seems likely the schema doesn't have _pos. cc @chenjunjiedada @aokolnychyi

The row group start positions are always computed but, right now, are only correct when _pos is projected. That's intended, because we don't want to read the Parquet footer one more time. But since the footer must be read at least once, we should be able to cache some content during the first access, avoid the current optimization logic, and thus simplify the logic that checks for the _pos column.


flyrain commented Oct 22, 2022

@flyrain are you asking if there are other cases where _pos will still be absent after this fix and we need ReadConf#startRowPositions() to return a valid startRowPositions?

Yes, I have concerns about this case. But I guess it is fine, since VectorizedArrowReader::setRowGroupInfo() doesn't consume the rowPosition, which means it still works fine even if rowPosition is off. The problem happens only if we want to rely on the row positions.

Digging a bit more, ReadConf.columnChunkMetadataForRowGroups keeps the row group metadata. I'm wondering if we can calculate the row positions while generating columnChunkMetadataForRowGroups, so that the row positions will always be correct and we don't have to read the metadata twice. What do you think, @chenjunjiedada? I guess that's what you meant by avoiding the optimization.

private List<Map<ColumnPath, ColumnChunkMetaData>> getColumnChunkMetadataForRowGroups() {
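A hedged sketch of the single-pass idea (stand-in types, not Parquet's real classes): while iterating the footer's row groups to collect per-group column chunk metadata, accumulate each group's start position in the same loop, so the footer only needs to be read once.

```java
import java.util.List;

// Hypothetical single-pass sketch: RowGroup stands in for Parquet's
// BlockMetaData; only the row count matters for computing start positions.
public class SinglePassSketch {
  static class RowGroup {
    final long rowCount;
    RowGroup(long rowCount) { this.rowCount = rowCount; }
  }

  // In the same pass that (in the real code) would collect each row group's
  // column chunk metadata, record the file-wide position of its first row.
  static long[] startPositions(List<RowGroup> rowGroups) {
    long[] starts = new long[rowGroups.size()];
    long pos = 0;
    for (int i = 0; i < rowGroups.size(); i++) {
      starts[i] = pos;
      pos += rowGroups.get(i).rowCount;
    }
    return starts;
  }
}
```

With row counts 100, 97, and 50, the start positions come out as 0, 100, and 197, and they would be correct regardless of whether _pos is projected.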

@chenjunjiedada

@flyrain, correct, that's it.


wypoon commented Oct 24, 2022

@flyrain for the test part, I followed your suggestion and added a test in TestSparkReaderDeletes instead (removing the earlier one).

.validateDataFilesExist(posDeletes.second())
.commit();

Assert.assertEquals(193, rowSet(tblName, tbl, "*").size());

@wypoon wypoon Oct 24, 2022


Without the fix, this assertion fails for the vectorized case.
There are 3 deletes applied to the first row group and 4 deletes applied to the second row group. Without the fix, the 3 deletes for the first row group are applied to the second as well (instead of the 4 that should be applied). Thus 6 rows are deleted (instead of 7) and the result is 194 rows, instead of the expected 193.
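The arithmetic can be reproduced with a toy model (a hypothetical helper, not the reader's actual code): a position delete index holds file-wide row positions, and each batch keeps only the rows whose file position (rowStartPosInBatch plus the index within the batch) is not deleted.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model of applying position deletes to a batch: rowStartPosInBatch is the
// file-wide position of the batch's first row, and a row survives if its
// file-wide position is not in the delete set.
public class DeleteAlignmentSketch {
  static int survivors(long rowStartPosInBatch, int batchSize, Set<Long> deletedPositions) {
    int kept = 0;
    for (int i = 0; i < batchSize; i++) {
      if (!deletedPositions.contains(rowStartPosInBatch + i)) {
        kept++;
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Hypothetical layout: two 100-row row groups, 3 deletes in the first
    // (positions 1-3) and 4 in the second (positions 101-104).
    Set<Long> deletes = new HashSet<>(Arrays.asList(1L, 2L, 3L, 101L, 102L, 103L, 104L));
    int correct = survivors(0, 100, deletes) + survivors(100, 100, deletes);
    // The bug: the second batch reuses rowStartPosInBatch 0, so the first
    // group's 3 deletes are applied again and the second group's 4 are missed.
    int buggy = survivors(0, 100, deletes) + survivors(0, 100, deletes);
    System.out.println(correct + " " + buggy); // prints: 193 194
  }
}
```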


@flyrain flyrain left a comment


+1 Thanks @wypoon for the fix.

wypoon added a commit to wypoon/iceberg that referenced this pull request Oct 24, 2022
@flyrain flyrain changed the title Spark: Ensure rowStartPosInBatch in ColumnarBatchReader is set correctly Spark3.3: Ensure rowStartPosInBatch in ColumnarBatchReader is set correctly Oct 24, 2022
@flyrain flyrain changed the title Spark3.3: Ensure rowStartPosInBatch in ColumnarBatchReader is set correctly Spark 3.3: Ensure rowStartPosInBatch in ColumnarBatchReader is set correctly Oct 24, 2022
@flyrain flyrain merged commit c8a25d4 into apache:master Oct 24, 2022

flyrain commented Oct 24, 2022

Merged. Thanks @wypoon. Thanks @chenjunjiedada for the review.


flyrain commented Oct 25, 2022

@flyrain, Correct, that is it.

Hi @chenjunjiedada, I filed PR #6056 based on the discussion above. Please take a look. Thanks.
