Conversation

@raunaqmorarka (Member) commented on Jun 10, 2025

Description

Reading unusually large Parquet footers can lead to workers going into full GC and crashing while decoding the footer in `org.apache.parquet.format.Util#readFileMetaData`. This is usually caused by misconfigured Parquet writers producing too many row groups per file. This change adds a guard rail that fails reads of such files gracefully.
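For context, a Parquet file ends with a 4-byte little-endian metadata length followed by the `PAR1` magic, so a reader knows the footer size from these last 8 bytes before decoding anything. The sketch below shows where such a guard rail fits; it is a standalone illustration, not Trino's `MetadataReader` code, and it throws `IllegalArgumentException` where Trino throws `ParquetCorruptionException`, to stay self-contained:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class ParquetTailGuard
{
    static final int TAIL_LENGTH = 8; // 4-byte footer length + 4-byte "PAR1" magic

    // Returns the declared footer (metadata) size from the last 8 bytes of a
    // Parquet file, rejecting it if it exceeds the configured cap.
    static int footerSize(byte[] tail, long maxFooterReadSizeBytes)
    {
        ByteBuffer buffer = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);
        int metadataLength = buffer.getInt(0);
        String magic = new String(tail, 4, 4, StandardCharsets.US_ASCII);
        if (!"PAR1".equals(magic)) {
            throw new IllegalArgumentException("Not a Parquet file tail: " + magic);
        }
        if (metadataLength > maxFooterReadSizeBytes) {
            // Fail fast here, before allocating a buffer or deserializing the
            // Thrift metadata, so a pathological footer cannot drive the
            // worker into full GC
            throw new IllegalArgumentException(
                    "Parquet footer size " + metadataLength + " exceeds limit " + maxFooterReadSizeBytes);
        }
        return metadataLength;
    }

    public static void main(String[] args)
    {
        byte[] tail = ByteBuffer.allocate(TAIL_LENGTH)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(4096) // declared footer length
                .put("PAR1".getBytes(StandardCharsets.US_ASCII))
                .array();
        System.out.println(footerSize(tail, 15L * 1024 * 1024)); // prints 4096
        try {
            footerSize(tail, 1024); // cap below the declared size
        }
        catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of checking the declared length against the cap before reading is that the crash happens during decoding, so the limit must be enforced on the declared size, not the decoded result.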

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Hive, Delta Lake, Iceberg, Hudi
* Prevent workers from going into full GC or crashing when decoding unusually large parquet footers. ({issue}`25973`)

@cla-bot cla-bot bot added the cla-signed label Jun 10, 2025
@github-actions github-actions bot added the docs, hudi (Hudi connector), iceberg (Iceberg connector), delta-lake (Delta Lake connector), hive (Hive connector), and redshift (Redshift connector) labels Jun 10, 2025
Copilot AI left a comment

Pull Request Overview

Adds a safeguard against excessively large Parquet footers by introducing a configurable max-footer-read-size limit.

  • Introduce maxFooterReadSize in ParquetReaderOptions and expose it via ParquetReaderConfig.
  • Implement a guard in MetadataReader to throw a ParquetCorruptionException when the footer exceeds the configured size.
  • Propagate the new option to all Parquet reader entry points (Iceberg, Hudi, Hive, Delta Lake) and update tests and documentation accordingly.
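The option plumbing in the bullets above can be pictured with a stripped-down options class. This is a simplified stand-in for `ParquetReaderOptions`, not its real shape: it uses plain bytes instead of Airlift's `DataSize`, and only the field name `maxFooterReadSize` and the 15 MB default come from the PR.

```java
// Simplified stand-in for ParquetReaderOptions: an immutable options object
// built through a builder, carrying the new footer-size cap.
public final class ReaderOptionsSketch
{
    private static final long DEFAULT_MAX_FOOTER_READ_SIZE = 15L * 1024 * 1024; // 15 MB default, per the PR

    private final long maxFooterReadSizeBytes;

    private ReaderOptionsSketch(long maxFooterReadSizeBytes)
    {
        this.maxFooterReadSizeBytes = maxFooterReadSizeBytes;
    }

    public long getMaxFooterReadSize()
    {
        return maxFooterReadSizeBytes;
    }

    public static Builder builder()
    {
        return new Builder();
    }

    public static final class Builder
    {
        // field named maxFooterReadSize, consistent with the getter
        private long maxFooterReadSize = DEFAULT_MAX_FOOTER_READ_SIZE;

        public Builder withMaxFooterReadSize(long bytes)
        {
            this.maxFooterReadSize = bytes;
            return this;
        }

        public ReaderOptionsSketch build()
        {
            return new ReaderOptionsSketch(maxFooterReadSize);
        }
    }
}
```

Each page source provider would then read the cap from the options object it already receives and hand it to the footer reader, which is why the change touches all four connectors but adds no new wiring.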

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated no comments.

File — Description
plugin/trino-hive/src/main/java/io/trino/plugin/hive/parquet/ParquetReaderConfig.java — Add setter/getter for `parquet.max-footer-read-size` and remove legacy mapping
lib/trino-parquet/src/main/java/io/trino/parquet/reader/MetadataReader.java — Add overloads and a guard in `readFooter` to enforce the footer size limit
lib/trino-parquet/src/main/java/io/trino/parquet/ParquetReaderOptions.java — Extend options and builder with `maxFooterReadSize`, defaulting to 15 MB
plugin/**/IcebergPageSourceProvider.java, HudiPageSourceProvider.java, DeltaLakePageSourceProvider.java, ParquetPageSourceFactory.java — Pass `options.getMaxFooterReadSize()` into `MetadataReader.readFooter`
docs/src/main/sphinx/object-storage/file-formats.md — Document the new `parquet.max-footer-read-size` configuration property
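As a configuration property, the cap would be set per catalog. A hypothetical `hive.properties` entry, raising the limit above the 15 MB default (the property name comes from the PR; the value is only an example):

```properties
parquet.max-footer-read-size=30MB
```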
Comments suppressed due to low confidence (3)

lib/trino-parquet/src/main/java/io/trino/parquet/reader/MetadataReader.java:96

  • There’s no test covering the new exception path when a footer exceeds the configured limit. Consider adding a unit test that simulates a large footer and asserts that ParquetCorruptionException is thrown.
if (maxFooterReadSize.isPresent() && completeFooterSize > maxFooterReadSize.get().toBytes()) {

lib/trino-parquet/src/main/java/io/trino/parquet/ParquetReaderOptions.java:157

  • [nitpick] The builder field maxFooterSize is inconsistent with the rest of the API which uses maxFooterReadSize. Rename it to maxFooterReadSize for clarity and consistency.
private DataSize maxFooterSize;

docs/src/main/sphinx/object-storage/file-formats.md:104

  • [nitpick] The list item formatting in the Sphinx document is incorrect. Change * - to - (or align with the surrounding list style) so the new property renders properly.
* - `parquet.max-footer-read-size`

@Praveen2112 (Member) left a comment

LGTM - Two minor questions

Member

Do we need something for ORC footer as well ?

@raunaqmorarka (Member, Author)

I haven't seen this situation with ORC yet; even with Parquet you need a really bad configuration to reach it.

@raunaqmorarka raunaqmorarka merged commit 33f5659 into master Jun 11, 2025
70 of 73 checks passed
@raunaqmorarka raunaqmorarka deleted the raunaq/parq-footer branch June 11, 2025 07:56
@github-actions github-actions bot added this to the 477 milestone Jun 11, 2025

Labels

cla-signed, delta-lake (Delta Lake connector), docs, hive (Hive connector), hudi (Hudi connector), iceberg (Iceberg connector), redshift (Redshift connector)

3 participants