@raunaqmorarka (Member) commented Jun 25, 2025

Description

Certain Parquet writers, such as pyarrow, can set the row group offset
incorrectly. That can lead to the Parquet reader not identifying the
row groups within a split correctly. The reader now relies on the first
page offset in ColumnChunk instead.
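
A minimal sketch of the idea, not the actual Trino code: the start of a row group is taken from the first page of its first column chunk rather than from the offset recorded on the row group itself. The class and function names here (`ColumnChunkMeta`, `RowGroupMeta`, `row_groups_in_split`) are hypothetical, and both the "start offset falls inside the split" rule and the bogus `file_offset=0` values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnChunkMeta:
    # Hypothetical stand-in for the offsets stored in Parquet column chunk metadata.
    data_page_offset: int
    dictionary_page_offset: Optional[int] = None


@dataclass
class RowGroupMeta:
    # Offset recorded on the row group itself; assumed to be unreliable here.
    file_offset: int
    columns: List[ColumnChunkMeta]


def first_page_offset(column: ColumnChunkMeta) -> int:
    """Return the offset of the first page (dictionary or data) of a column chunk."""
    dict_offset = column.dictionary_page_offset
    # A dictionary page, when present, precedes the first data page.
    if dict_offset is not None and 0 < dict_offset < column.data_page_offset:
        return dict_offset
    return column.data_page_offset


def row_group_start(row_group: RowGroupMeta) -> int:
    """Derive the row group start from its first column chunk, ignoring file_offset."""
    return first_page_offset(row_group.columns[0])


def row_groups_in_split(row_groups: List[RowGroupMeta], split_start: int, split_length: int) -> List[int]:
    """Return indexes of row groups whose derived start lies in [split_start, split_start + split_length)."""
    split_end = split_start + split_length
    return [
        index
        for index, row_group in enumerate(row_groups)
        if split_start <= row_group_start(row_group) < split_end
    ]


# Two row groups whose recorded file_offset is wrong (both 0, a made-up bad value).
row_groups = [
    RowGroupMeta(file_offset=0, columns=[ColumnChunkMeta(data_page_offset=120, dictionary_page_offset=4)]),
    RowGroupMeta(file_offset=0, columns=[ColumnChunkMeta(data_page_offset=1000)]),
]

print(row_groups_in_split(row_groups, 0, 500))     # [0]
print(row_groups_in_split(row_groups, 500, 1000))  # [1]
```

If the recorded `file_offset` were used in this example, both row groups would appear to start at 0 and land in the first split, so the second split would have no row group assigned to it.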

Additional context and related issues

Fixes #26058
Related to #24618

Newer versions of pyarrow (tested with parquet-cpp-arrow version 20.0.0) don't have this problem; the file that exhibits it was written by parquet-cpp-arrow version 5.0.0.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Iceberg, Hive, Hudi, Delta Lake
* Fix incorrect results when reading from parquet files produced by old versions of pyarrow. ({issue}`26058`)

This change also removes logic in PredicateUtils#getFilteredRowGroups that is
now redundant.
@raunaqmorarka raunaqmorarka merged commit 7b6977b into master Jun 25, 2025
71 checks passed
@raunaqmorarka raunaqmorarka deleted the raunaq/pqr-offset-fix branch June 25, 2025 08:28
@github-actions github-actions bot added this to the 477 milestone Jun 25, 2025
Development

Successfully merging this pull request may close these issues.

Incorrect results on parquet files written by parquet-cpp-arrow version 5.0.0
