-
Notifications
You must be signed in to change notification settings - Fork 3.9k
GH-46404: [C++][Python] add support for max_page_header_size for pyarrow #47758
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
GH-46404: [C++][Python] add support for max_page_header_size for pyarrow #47758
Conversation
|
// data_page_filter_. Performs basic checks on values in the page header. | ||
// Fills in data_page_statistics. | ||
bool ShouldSkipPage(EncodedStatistics* data_page_statistics); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest keeping it as const
. We can change the function set_max_page_header_size
to do nothing after this PR and mark it as deprecated.
arrow::TimeUnit::type coerce_int96_timestamp_unit = arrow::TimeUnit::NANO; | ||
Type::type binary_type = Type::BINARY; | ||
Type::type list_type = Type::LIST; | ||
uint32_t max_page_header_size = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 0 but not a meaningful default?
}; | ||
|
||
// 16 MB is the default maximum page header size | ||
static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024; | |
constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also let's make this int32_t
not uint32_t
.
/// Set the size of the buffered stream buffer in bytes. | ||
void set_buffer_size(int64_t size) { buffer_size_ = size; } | ||
|
||
/// Return the size of the buffered stream buffer. 0 means default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, can you make sure the docstrings actually describe the methods inside of blindly copy-pasting them?
uint32_t max_page_header_size() const { return max_page_header_size_; } | ||
/// Set the size of the buffered stream buffer in bytes. 0 means default | ||
void set_max_page_header_size(uint32_t size) { max_page_header_size_ = size; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change these to take int32_t
? We try to avoid unsigned integers in public API.
}; | ||
|
||
// 16 MB is the default maximum page header size | ||
static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also let's make this int32_t
not uint32_t
.
the canonical `arrow.uuid` extension type). | ||
max_page_header_size : int, default None | ||
If not None, override the maximum size of a page header. | ||
Deafults to 16MB, which should be sufficient for most Parquet files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deafults to 16MB, which should be sufficient for most Parquet files. | |
Defaults to 16MB, which should be sufficient for most Parquet files. |
thrift_container_size_limit=None, filesystem=None, | ||
page_checksum_verification=False, arrow_extensions_enabled=True): | ||
page_checksum_verification=False, arrow_extensions_enabled=True, | ||
max_page_header_size=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ok, but is the new argument also available on pq.read_table
and friends?
Rationale for this change
Support in pyarrow to read parquet files with page header statistics which contain values larger than 8MiB.
What changes are included in this PR?
Are these changes tested?
Yes, tried with a sample python code.
Are there any user-facing changes?
Yes
pyarrow
#46404