Parquet: page skipping using filtered row groups (vectorized and non-vectorized read) #10399

Conversation
- When converting an Iceberg filter to a Parquet filter and then using the converted filter to filter row groups, Parquet validates that the type of a column in a filter predicate matches the type in the Parquet file before applying the filter. This validation fails in some cases, e.g., for INT96 timestamp columns. During the conversion, we therefore need to check that no such type mismatch occurs and fail the conversion if it does.
- When converting the Iceberg filter to a Parquet filter fails, we need to handle the failure, and in `ReadConf` we need to use the internally computed total number of rows instead of the value returned by `ParquetFileReader`'s `getFilteredRecordCount()`.
- In `ParquetReader.FileIterator`, since we have to handle both the case where a Parquet record filter is used and the case where it is not, we avoid the `skipNextRowGroup()` and `readNextFilteredRowGroup()` methods of `ParquetFileReader` and instead proceed row group by row group, calling `readFilteredRowGroup(int)` with the index of the row group (see the sketch after this list).
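A minimal sketch of that row-group-by-row-group loop. `ParquetFileReader#getRowGroups` and `readFilteredRowGroup(int)` are the actual Parquet APIs referenced above; `readRowGroups` and `shouldRead` are hypothetical names standing in for the PR's internal state:

```java
import java.io.IOException;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;

class RowGroupLoop {
  // `shouldRead` is an illustrative stand-in for Iceberg's own per-row-group
  // evaluation; it is not the PR's actual field name.
  void readRowGroups(ParquetFileReader reader, boolean[] shouldRead) throws IOException {
    for (int i = 0; i < reader.getRowGroups().size(); i++) {
      if (!shouldRead[i]) {
        continue; // row group pruned; nothing to read
      }
      // readFilteredRowGroup(int) applies the record/column-index filter when
      // one is set on the reader; otherwise it returns the full row group.
      PageReadStore pages = reader.readFilteredRowGroup(i);
      // ... hand `pages` to the model-specific reader ...
    }
  }
}
```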
Now that we track the offset into the Arrow vector using the `ReadState`, we no longer need to pass the number of values in the vector; that value is unused because it is no longer the actual offset, as it does not account for skipped rows.
@rdblue @aokolnychyi @flyrain can you please review?
For reference, the files that changed between #10228 and this PR are: `arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java`. In addition, I posted the diff at https://gist.github.com/wypoon/855ac077c518e729451b9a4cb06fa818.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
This resolves #193.
This is the third cumulative PR, building on top of #10228 and #10107.
#10228 implements page skipping for the non-vectorized read path; this PR picks up from there and implements it for the vectorized read path.
The implementation of the vectorized case is based on the implementation in Spark's Parquet reader (see apache/spark#32753), which shipped in Spark 3.2.
As already implemented in #10228, we convert an Iceberg filter to a Parquet filter and use the Parquet filter to filter row groups, calling `ParquetFileReader#readFilteredRowGroup`. We get row indexes from the returned `PageReadStore` by calling `PageReadStore#getRowIndexes`. Following the Spark implementation, we construct row ranges from the row indexes (a `PrimitiveIterator.OfLong`), using a simple `RowRange` class (not the internal Parquet `RowRange`) that captures start and end indexes; a sketch of this construction follows.
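A sketch of how the row ranges might be built from the row indexes. The `RowRange` record and `toRanges` helper here are illustrative stand-ins for the simple class the PR describes, not its actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PrimitiveIterator;

// Illustrative equivalent of the PR's simple RowRange class: a contiguous,
// inclusive [start, end] run of row indexes.
record RowRange(long start, long end) {}

class RowRanges {
  static List<RowRange> toRanges(PrimitiveIterator.OfLong rowIndexes) {
    List<RowRange> ranges = new ArrayList<>();
    long start = -1;
    long end = -1;
    while (rowIndexes.hasNext()) {
      long idx = rowIndexes.nextLong();
      if (start < 0) {
        start = end = idx;            // open the first range
      } else if (idx == end + 1) {
        end = idx;                    // extend the current contiguous run
      } else {
        ranges.add(new RowRange(start, end));
        start = end = idx;            // gap: close the run, open a new one
      }
    }
    if (start >= 0) {
      ranges.add(new RowRange(start, end));
    }
    return ranges;
  }
}
```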
In order to synchronize the reading of the columns (since different columns can have different numbers of pages, and thus pages in different columns start at different row indexes), we get the index of the first row in a page by calling `DataPage#getFirstRowIndex`, and we need to track the current row index in the column and the current offset in the Arrow vector we are reading the column values into. Previously, the read state only needed to account for the number of rows read so far and the number of rows left to read, but now "reading" includes both actual reading into the Arrow vector and skipping rows. We therefore use a `ReadState` object (following Spark's `ParquetReadState`) to track the offset and row index as well as the current row range to read, and to advance them; a minimal sketch is below.
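A minimal sketch of that state, loosely modeled on the description above (and on Spark's `ParquetReadState`). Field and method names are hypothetical, and `RowRange` is the record from the previous sketch:

```java
import java.util.Iterator;

class ReadState {
  private final Iterator<RowRange> ranges;
  RowRange currentRange;  // row range currently being read, null when exhausted
  long rowIndex;          // index of the next row in the column
  int offset;             // next write position in the Arrow vector

  // `firstRowIndex` would come from DataPage#getFirstRowIndex for the page.
  ReadState(Iterator<RowRange> ranges, long firstRowIndex) {
    this.ranges = ranges;
    this.rowIndex = firstRowIndex;
    this.currentRange = ranges.hasNext() ? ranges.next() : null;
  }

  // Number of rows before the current range that the caller must skip.
  long rowsToSkip() {
    return currentRange == null ? 0 : Math.max(0, currentRange.start() - rowIndex);
  }

  // Advance the row index by rows read or skipped; the vector offset only
  // moves for values actually written (skipped rows write nothing).
  void advance(long rowsReadOrSkipped, int valuesWritten) {
    rowIndex += rowsReadOrSkipped;
    offset += valuesWritten;
    if (currentRange != null && rowIndex > currentRange.end()) {
      currentRange = ranges.hasNext() ? ranges.next() : null;
    }
  }
}
```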
There is a lot of structurally similar code in several places in `VectorizedParquetDefinitionLevelReader`. I have refactored the similar code into one place in `VectorizedParquetDefinitionLevelReader`, so that I can modify the logic in this one place to perform the synchronization (skipping rows to the start of the row range); see the sketch below. The rest of the changes implement how to skip rows for different column types.
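For illustration, a sketch of the synchronization step that the refactoring centralizes: skip rows that fall before the current row range, then read rows within it. `skipValues` and `readValues` are hypothetical hooks standing in for the type-specific logic; `ReadState` comes from the previous sketch, and page boundaries are ignored for simplicity:

```java
abstract class SynchronizedColumnReader {
  // Per-type skip: consume n values from the page without writing to the vector.
  abstract void skipValues(long n);

  // Per-type read: write n values into the Arrow vector starting at vectorOffset.
  abstract void readValues(int n, int vectorOffset);

  void readBatch(ReadState state, int batchSize) {
    int remaining = batchSize;
    while (remaining > 0 && state.currentRange != null) {
      long toSkip = state.rowsToSkip();
      if (toSkip > 0) {
        skipValues(toSkip);
        state.advance(toSkip, 0);        // row index moves; vector offset does not
        continue;
      }
      long inRange = state.currentRange.end() - state.rowIndex + 1;
      int toRead = (int) Math.min(remaining, inRange);
      readValues(toRead, state.offset);  // write into the vector at the tracked offset
      state.advance(toRead, toRead);
      remaining -= toRead;
    }
  }
}
```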