@raunaqmorarka
Member

@raunaqmorarka raunaqmorarka commented Sep 14, 2023

Description

Currently, the logic in org.apache.parquet.format.converter.ParquetMetadataConverter#toParquetStatistics skips writing min/max row-group statistics when they are longer than 4 KB. This change writes truncated statistics instead, so that readers can still prune row groups based on query predicates.
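
To illustrate the idea, here is a minimal sketch of how min/max values can be truncated while remaining valid bounds. It is not the code in this PR: the class and method names (StatsTruncationSketch, truncateMin, truncateMax) are hypothetical, and it assumes values are compared as unsigned bytes in lexicographic order. It also ignores UTF-8 character boundaries, which a real writer would need to respect for string columns.

```java
import java.util.Arrays;
import java.util.Optional;

public final class StatsTruncationSketch
{
    private StatsTruncationSketch() {}

    // Lower bound: any prefix of the real minimum sorts at or before it
    // under unsigned lexicographic byte order, so cutting is enough.
    public static byte[] truncateMin(byte[] min, int maxLength)
    {
        if (min.length <= maxLength) {
            return min;
        }
        return Arrays.copyOf(min, maxLength);
    }

    // Upper bound: a bare prefix would sort before the real maximum, so the
    // last byte of the prefix that is not 0xFF is incremented and everything
    // after it is dropped. Returns empty when every byte of the prefix is
    // 0xFF, because no larger value of that length exists.
    public static Optional<byte[]> truncateMax(byte[] max, int maxLength)
    {
        if (max.length <= maxLength) {
            return Optional.of(max);
        }
        for (int i = maxLength - 1; i >= 0; i--) {
            if (max[i] != (byte) 0xFF) {
                byte[] result = Arrays.copyOf(max, i + 1);
                result[i]++;
                return Optional.of(result);
            }
        }
        return Optional.empty();
    }
}
```

With a 3-byte limit, the ASCII value "apple pie" would yield "app" as the truncated min and "apq" as the truncated max. Both still bracket the original value, so a reader can safely skip a row group whose truncated range excludes the predicate.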

Additional context and related issues

Related issue #19052
Parquet spec updated at apache/parquet-format#216 to clarify that truncation of min/max row-group stats is allowed

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Delta, Iceberg
* Improve performance of filtering on columns with long strings stored in Parquet files. ({issue}`19038`)

@findinpath
Contributor

Build is 🔴

https://github.com/trinodb/trino/actions/runs/6189779177/job/16805119578?pr=19038

 src/main/java/io/trino/parquet/writer/ParquetMetadataUtils.java:[73,115] (blocks) LeftCurly: '{' at column 115 should be on a new line.

@findepi
Member

findepi commented Sep 15, 2023

Let's make sure @ebyhr reviews this

@findepi
Member

findepi commented Sep 20, 2023

Added a bunch of comments, the most important being

so you can start reading from these two

@raunaqmorarka raunaqmorarka force-pushed the pqw-truncate branch 4 times, most recently from a0895a2 to c2dcfba Compare September 26, 2023 08:26
The current logic skips writing min/max statistics if they are longer than 4Kb
This is changed to write stats with truncation to allow readers to perform filtering
@raunaqmorarka raunaqmorarka merged commit b9bc338 into trinodb:master Oct 28, 2023
@raunaqmorarka raunaqmorarka deleted the pqw-truncate branch October 28, 2023 03:06
@github-actions github-actions bot added this to the 432 milestone Oct 28, 2023