Skip to content

Conversation

ayushbansal07
Copy link
Contributor

@ayushbansal07 ayushbansal07 commented Oct 8, 2025

Rationale for this change

Support in pyarrow to read parquet files with page header statistics which contain values larger than 8MiB.

What changes are included in this PR?

  • Add max_page_header_size in ReaderOptions ReaderProperties.
  • Remove max_page_header_size_ from SerializedPageReader and started reading it from properties_. Had to make properties_ inside SerializedPageReader non-const for set_max_page_header_size() api.

Are these changes tested?

Yes, tried with a sample python code.

Are there any user-facing changes?

Yes

Copy link

github-actions bot commented Oct 8, 2025

⚠️ GitHub issue #46404 has been automatically assigned in GitHub to PR creator.

// data_page_filter_. Performs basic checks on values in the page header.
// Fills in data_page_statistics.
bool ShouldSkipPage(EncodedStatistics* data_page_statistics);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest keeping it as const. We can change the function set_max_page_header_size to do nothing after this PR and mark it as deprecated.

arrow::TimeUnit::type coerce_int96_timestamp_unit = arrow::TimeUnit::NANO;
Type::type binary_type = Type::BINARY;
Type::type list_type = Type::LIST;
uint32_t max_page_header_size = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 0 but not a meaningful default?

};

// 16 MB is the default maximum page header size
static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024;
constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also let's make this int32_t not uint32_t.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 9, 2025
/// Set the size of the buffered stream buffer in bytes.
void set_buffer_size(int64_t size) { buffer_size_ = size; }

/// Return the size of the buffered stream buffer. 0 means default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, can you make sure the docstrings actually describe the methods inside of blindly copy-pasting them?

Comment on lines +108 to +110
uint32_t max_page_header_size() const { return max_page_header_size_; }
/// Set the size of the buffered stream buffer in bytes. 0 means default
void set_max_page_header_size(uint32_t size) { max_page_header_size_ = size; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change these to take int32_t? We try to avoid unsigned integers in public API.

};

// 16 MB is the default maximum page header size
static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also let's make this int32_t not uint32_t.

the canonical `arrow.uuid` extension type).
max_page_header_size : int, default None
If not None, override the maximum size of a page header.
Deafults to 16MB, which should be sufficient for most Parquet files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Deafults to 16MB, which should be sufficient for most Parquet files.
Defaults to 16MB, which should be sufficient for most Parquet files.

thrift_container_size_limit=None, filesystem=None,
page_checksum_verification=False, arrow_extensions_enabled=True):
page_checksum_verification=False, arrow_extensions_enabled=True,
max_page_header_size=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok, but is the new argument also available on pq.read_table and friends?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants