Parquet: page skipping using filtered row groups for non-vectorized read #10228
Conversation
- When converting an Iceberg filter to a Parquet filter and then using the converted filter to filter row groups, Parquet validates that the type of each column in a filter predicate matches the column's type in the Parquet file before applying the filter. This validation fails in some cases, e.g., for INT96 timestamp columns. During conversion, we therefore need to check that no such type mismatch occurs, and fail the conversion if it does.
- When converting the Iceberg filter to a Parquet filter fails, we need to handle the failure; in `ReadConf`, we then use the internally computed total number of rows instead of the value returned by `ParquetFileReader`'s `getFilteredRecordCount()`.
- In `ParquetReader.FileIterator`, since we have to handle both the case where a Parquet record filter is used and the case where it is not, we avoid the `skipNextRowGroup()` and `readNextFilteredRowGroup()` methods of `ParquetFileReader` and instead proceed row group by row group, calling `readFilteredRowGroup(int)` with the index of each row group.
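The fallback behavior described above can be sketched as follows. This is a minimal illustration of the control flow only; the class and method names (`FilterConversion`, `convertToParquetFilter`, `rowCount`) are hypothetical stand-ins, not Iceberg's actual API.

```java
import java.util.Optional;

// Hypothetical sketch: conversion of an Iceberg filter to a Parquet filter
// returns empty when the Parquet column type would fail Parquet's predicate
// type validation (e.g., an INT96 timestamp column).
class FilterConversion {
  static Optional<String> convertToParquetFilter(String column, String parquetType) {
    if ("INT96".equals(parquetType)) {
      // Parquet's validation would reject this predicate, so fail the conversion.
      return Optional.empty();
    }
    return Optional.of("parquet-filter(" + column + ")");
  }

  // Mirrors the described ReadConf behavior: when conversion fails, fall back
  // to the internally computed total row count instead of the filtered count.
  static long rowCount(Optional<String> parquetFilter, long filteredCount, long totalCount) {
    return parquetFilter.isPresent() ? filteredCount : totalCount;
  }
}
```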
@zhongyujiang I would be happy to make you a co-author, but it was not easy to pull in commits from your PR directly. If you like, you can open a PR against my branch (even a dummy commit) and I can merge it and have you show up as a co-author.
@sunchao @chenjunjiedada you may be interested in this.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
This builds on #10107.
It borrows and adapts code and the test `TestSparkParquetPageSkipping` from @zhongyujiang's #6967. The difference in approach here is that we do not make use of any Parquet internal API. We simply convert the Iceberg filter to a Parquet filter and use `ParquetFileReader#readFilteredRowGroup(int)` and `PageReadStore#getRowIndexes()`. We borrow and adapt the code from #6967 for synchronizing the column readers (as each column might have a different number of pages, the columns might be at different row indexes when a filtered row group is read).
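The reader synchronization can be illustrated with a toy model. After page skipping, the kept rows of a row group are exposed as a sorted sequence of row indexes (as `PageReadStore#getRowIndexes()` returns); before reading each kept row, every column reader must skip the rows its current pages still contain but the filter dropped. The class and method names below (`RowIndexSync`, `skipCounts`) are illustrative assumptions, not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: given the sorted row indexes kept after page skipping, compute
// how many rows each column reader must skip before reading each kept row.
class RowIndexSync {
  static List<Long> skipCounts(long[] keptRowIndexes) {
    List<Long> skips = new ArrayList<>();
    long position = 0; // current row position within the row group
    for (long idx : keptRowIndexes) {
      skips.add(idx - position); // rows to skip before reading row idx
      position = idx + 1;        // reading the row advances the position
    }
    return skips;
  }
}
```

For example, if rows 0, 1, 5, and 6 survive filtering, a reader positioned at row 2 after reading rows 0 and 1 must skip three rows to reach row 5.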
There are some limitations:
- In this PR, we only implement page skipping for the non-vectorized read path; we plan to work on the vectorized read path separately. In `TestSparkParquetPageSkipping`, we test both vectorized and non-vectorized reads, and there one can see the difference in the rows that are read (as page skipping is not implemented for the vectorized path).
- Because Parquet validates, before it performs the filtering, that the column types in the filter's predicates match the types of the columns in the Parquet file, we have to skip Parquet filtering in some cases, e.g., when a column is an INT96 timestamp.
- Currently, `ParquetFilters.ConvertFilterToParquet` handles only a small set of operators, so, e.g., a filter with IN does not get converted. This can be improved independently.
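The limited-operator behavior amounts to a whitelist: supported operations are converted, and anything else aborts the conversion so the read falls back to unfiltered row groups. A minimal sketch of that pattern follows; the class name, operator strings, and method are hypothetical, not the actual `ParquetFilters.ConvertFilterToParquet` implementation.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of limited operator support: only a small whitelist of
// comparison operators converts; unsupported ones (e.g., IN) return empty,
// which the caller treats as "do not filter row groups".
class OperatorSupport {
  static final Set<String> SUPPORTED = Set.of("EQ", "NOT_EQ", "LT", "LT_EQ", "GT", "GT_EQ");

  static Optional<String> convert(String op, String column) {
    if (!SUPPORTED.contains(op)) {
      return Optional.empty(); // e.g., IN is not converted
    }
    return Optional.of(op + "(" + column + ")");
  }
}
```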