Spark 3.3: Ensure rowStartPosInBatch in ColumnarBatchReader is set correctly #6026
Conversation
@flyrain I believe you implemented the support for row-level deletes in the vectorized reader. Can you please review this? Also @aokolnychyi @RussellSpitzer @chenjunjiedada.
chenjunjiedada
left a comment
+1. Could you please help backport this to Spark 3.1 and 3.2 as well?
Thanks for reviewing, @chenjunjiedada. I do plan to port this to Spark 3.2 once this is approved by a committer. Let me look into 3.1 as well. I also think this bug needs to be fixed in the 1.0.x branch.
flyrain
left a comment
+1 for the solution. Thanks @wypoon for the fix!
Not a blocker, and we may do it in another PR, but I think we need more coverage for reading deletes across multiple Parquet row groups. We can add an option here to generate multi-row-group Parquet files, and reuse the class TestSparkReaderDeletes.
I'm a bit confused by this behavior: ReadConf.startRowPositions is valid only if the _pos column exists in the expectedSchema, due to #1716. Are there use cases where _pos is absent and we still need ReadConf.startRowPositions? Looking at the classes VectorizedParquetReader and ParquetReader, which consume ReadConf.startRowPositions, it seems likely the schema doesn't have _pos.
cc @chenjunjiedada @aokolnychyi
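For readers following along, here is a minimal sketch of the optimization under discussion, with simplified signatures and an illustrative class name (OffsetToStartPosSketch is not the actual ReadConf code): the offset-to-start-position map is only built when the projected schema contains the _pos metadata column.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.apache.iceberg.MetadataColumns;
import org.apache.iceberg.Schema;
import org.apache.parquet.hadoop.metadata.BlockMetaData;

public class OffsetToStartPosSketch {
  // Mirrors the optimization: return null when _pos is not projected, so the
  // caller falls back to a startRowPositions array filled with zeros.
  static Map<Long, Long> generateOffsetToStartPos(Schema schema, List<BlockMetaData> rowGroups) {
    if (schema.findField(MetadataColumns.ROW_POSITION.fieldId()) == null) {
      return null;
    }
    Map<Long, Long> offsetToStartPos = new LinkedHashMap<>();
    long curRowPos = 0L;
    for (BlockMetaData block : rowGroups) {
      // Key: the row group's starting file offset; value: its starting row position.
      offsetToStartPos.put(block.getStartingPos(), curRowPos);
      curRowPos += block.getRowCount();
    }
    return offsetToStartPos;
  }
}
```

When the map is null, the startRowPositions array is left as all zeros, which is exactly the state that triggers the bug fixed in this PR.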
I too was surprised by the behavior. In my example, before my fix, when the query is run after the update, the expectedSchema did not have _pos, so it did not have valid ReadConf.startRowPositions. @flyrain, are you asking if there are other cases where _pos is absent and ReadConf.startRowPositions is still needed?
The row group start positions are always computed but are only correct when _pos is projected right now. That's intended because we don't want to read the Parquet footer one more time. But since the footer must be read at least once, we should be able to cache some content during the first access, avoid the current optimization logic, and thus simplify the check.
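A rough sketch of that caching idea, with hypothetical names (FooterCache and its method are illustrative, not a concrete patch): compute the row-group start positions once when the footer is first read, and reuse them afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

class FooterCache {
  private List<Long> startRowPositions; // computed once, then reused

  // Derive each row group's starting row position by accumulating the row
  // counts of the groups before it; the footer is consulted only on first use.
  List<Long> startRowPositions(ParquetMetadata footer) {
    if (startRowPositions == null) {
      startRowPositions = new ArrayList<>();
      long pos = 0L;
      for (BlockMetaData block : footer.getBlocks()) {
        startRowPositions.add(pos);
        pos += block.getRowCount();
      }
    }
    return startRowPositions;
  }
}
```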
Yes, I have concerns for this case. But I guess it is fine given arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java, line 396 at commit 5688d59.
By digging a bit more,
@flyrain, correct, that is it.
@flyrain, for the test part, I followed your suggestion and added a test in TestSparkReaderDeletes.
```java
    .validateDataFilesExist(posDeletes.second())
    .commit();
...
Assert.assertEquals(193, rowSet(tblName, tbl, "*").size());
```
Without the fix, this assertion fails for the vectorized case.
There are 3 deletes applied to the first row group and 4 deletes applied to the second row group. Without the fix, the 3 deletes for the first row group are applied to the second as well (instead of the 4 that should be applied). Thus 6 rows are deleted (instead of 7) and the result is 194 rows, instead of the expected 193.
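To make the arithmetic concrete, here is a small self-contained illustration; the 200-row total, the two 100-row row groups, and the specific delete positions are assumptions made up for the example.

```java
import java.util.Set;

public class DeleteAlignmentDemo {
  // Hypothetical file-level delete positions: 3 fall in row group 1 ([0, 100))
  // and 4 fall in row group 2 ([100, 200)).
  static final Set<Long> DELETED = Set.of(10L, 20L, 30L, 110L, 120L, 130L, 140L);

  // Counts surviving rows in one 100-row row group, looking up deletes at
  // rowStartPosInBatch + index, as the vectorized reader effectively does.
  static long survivors(long rowStartPosInBatch) {
    long count = 0;
    for (long i = 0; i < 100; i++) {
      if (!DELETED.contains(rowStartPosInBatch + i)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Correct: the second group starts at file position 100 -> 97 + 96 = 193.
    System.out.println(survivors(0) + survivors(100));
    // Buggy: rowStartPosInBatch stays 0 for both groups, so group 1's three
    // delete positions are re-applied to group 2 -> 97 + 97 = 194.
    System.out.println(survivors(0) + survivors(0));
  }
}
```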
flyrain
left a comment
+1 Thanks @wypoon for the fix.
Merged. Thanks @wypoon. Thanks @chenjunjiedada for the review.
Hi @chenjunjiedada, filed PR #6056 based on the discussion. Please take a look. Thanks.
Spark 3.3: Ensure rowStartPosInBatch in ColumnarBatchReader is set correctly (apache#6026)
There is a bug in Parquet vectorized reads reported in #5927.

This bug happens when reading a Parquet data file (using the BatchDataReader) that is bigger than the split size, when there are deletes that need to be applied to the data file. The cause of the bug is that ColumnarBatchReader#setRowGroupInfo is not called with the correct rowPosition, and that is because in ReadConf, generateOffsetToStartPos(Schema) returns null due to an optimization. (When this happens, the startRowPositions array is always populated with 0s, and thus ColumnarBatchReader#setRowGroupInfo gets called with rowPosition 0 even when the rowPosition is that of the second or a subsequent row group. In ColumnarBatchReader, setRowGroupInfo initializes a rowStartPosInBatch field, which is used to determine where in the PositionDeleteIndex to start applying deletes from. When rowStartPosInBatch is incorrectly initialized, the indexes of positional deletes are not correctly aligned with the rows in the data file.)

The fix is to ensure that when there are deletes, the Schema has the _pos metadata column in it. Then ReadConf#generateOffsetToStartPos(Schema) will generate the necessary Map that is used to compute the startRowPositions.

Added a unit test that reproduces the problem. Without this fix, the test passes for non-vectorized reads and fails for vectorized reads. With this fix, the test passes in both cases.
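To summarize the mechanism, here is a minimal sketch with simplified names and structure (BatchDeleteSketch is illustrative, not the actual ColumnarBatchReader): the reader records each row group's file-level start position so that batch-relative row indexes can be translated into the file positions stored in the PositionDeleteIndex.

```java
import org.apache.iceberg.deletes.PositionDeleteIndex;

class BatchDeleteSketch {
  private long rowStartPosInBatch;

  // Must be called with the file-level position of the row group's first row.
  // The bug fixed here caused this to be called with 0 for every row group.
  void setRowGroupInfo(long rowPosition) {
    this.rowStartPosInBatch = rowPosition;
  }

  // Translate a batch-relative row index into a file position before
  // consulting the position delete index.
  boolean isDeleted(int rowIdxInBatch, PositionDeleteIndex deletes) {
    return deletes.isDeleted(rowStartPosInBatch + rowIdxInBatch);
  }
}
```

With rowPosition passed as 0 for every row group, the translation collapses and the first row group's delete positions are reused for later groups, which is exactly the symptom shown in the test above.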