-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Avoid reading unusually large parquet footers #25973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds a safeguard against excessively large Parquet footers by introducing a configurable max-footer-read-size limit.
- Introduce
maxFooterReadSizeinParquetReaderOptionsand expose it viaParquetReaderConfig. - Implement a guard in
MetadataReaderto throw aParquetCorruptionExceptionwhen the footer exceeds the configured size. - Propagate the new option to all Parquet reader entry points (Iceberg, Hudi, Hive, Delta Lake) and update tests and documentation accordingly.
Reviewed Changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| plugin/trino-hive/src/main/java/io/trino/plugin/hive/parquet/ParquetReaderConfig.java | Add setter/getter for parquet.max-footer-read-size and remove legacy mapping |
| lib/trino-parquet/src/main/java/io/trino/parquet/reader/MetadataReader.java | Add overloads and guard in readFooter to enforce the footer size limit |
| lib/trino-parquet/src/main/java/io/trino/parquet/ParquetReaderOptions.java | Extend options and builder with maxFooterReadSize, defaulting to 15 MB |
| plugin/**/IcebergPageSourceProvider.java, HudiPageSourceProvider.java, DeltaLakePageSourceProvider.java, ParquetPageSourceFactory.java | Pass options.getMaxFooterReadSize() into MetadataReader.readFooter |
| docs/src/main/sphinx/object-storage/file-formats.md | Document the new parquet.max-footer-read-size session property |
Comments suppressed due to low confidence (3)
lib/trino-parquet/src/main/java/io/trino/parquet/reader/MetadataReader.java:96
- There’s no test covering the new exception path when a footer exceeds the configured limit. Consider adding a unit test that simulates a large footer and asserts that
ParquetCorruptionExceptionis thrown.
if (maxFooterReadSize.isPresent() && completeFooterSize > maxFooterReadSize.get().toBytes()) {
lib/trino-parquet/src/main/java/io/trino/parquet/ParquetReaderOptions.java:157
- [nitpick] The builder field
maxFooterSizeis inconsistent with the rest of the API which usesmaxFooterReadSize. Rename it tomaxFooterReadSizefor clarity and consistency.
private DataSize maxFooterSize;
docs/src/main/sphinx/object-storage/file-formats.md:104
- [nitpick] The list item formatting in the Sphinx document is incorrect. Change
* -to-(or align with the surrounding list style) so the new property renders properly.
* - `parquet.max-footer-read-size`
f1bd9b6 to
16971bf
Compare
Praveen2112
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - Two minor questions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need something for ORC footer as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't seen this situation in ORC yet, even in parquet you need really bad configuration to reach this situation
lib/trino-parquet/src/main/java/io/trino/parquet/reader/MetadataReader.java
Outdated
Show resolved
Hide resolved
Reading unusually large parquet footers can lead to workers going into full GC and crashing when decoding the footer in org.apache.parquet.format.Util#readFileMetaData This is usually caused by misconfigured parquet writers producing too many row groups per file. This change adds a guard rail to fail reads of such files gracefully.
16971bf to
29a4ef3
Compare
Description
Reading unusually large parquet footers can lead to workers
going into full GC and crashing when decoding the footer in
org.apache.parquet.format.Util#readFileMetaData
This is usually caused by misconfigured parquet writers producing
too many row groups per file. This change adds a guard rail to
fail reads of such files gracefully.
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: