Respect Parquet Logical type when reading statistics #26203
Adds a new test on evaluating ShortDecimal with Int64 and pad zeros for binary statistics.
d4b7d8a to 148f3bf
Thanks for the review. AC
Pull Request Overview
This PR ensures that Parquet logical decimal types are respected when reading statistics, avoiding incorrect skipping of row groups when Trino and Parquet scales or precisions differ.
- Adds extensive tests covering short and long decimals with various annotations and underlying Parquet types.
- Extends TupleDomainParquetPredicate with helper methods (getShortDecimal, getLongDecimal) to correctly rescale statistics values according to logical annotations.
- Updates visibility of isDecimalRescaled and refactors statistics padding in tests.
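To make the statistics padding concrete, here is a minimal sketch of how a decimal's unscaled value can be padded to a fixed width and decoded again. It relies on the Parquet convention that binary decimals store the unscaled value as big-endian two's complement. The class and method names (BinaryDecimalPaddingSketch, padToLength, decode) are hypothetical and are not taken from this PR or from Trino.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical illustration only; not the code added in this PR.
final class BinaryDecimalPaddingSketch
{
    private BinaryDecimalPaddingSketch() {}

    // Pads the big-endian two's complement form of an unscaled decimal value
    // to the requested length. Non-negative values are padded with 0x00,
    // negative values with 0xFF (sign extension), so the value is unchanged.
    static byte[] padToLength(BigInteger unscaled, int length)
    {
        byte[] minimal = unscaled.toByteArray();
        if (minimal.length > length) {
            throw new IllegalArgumentException("value does not fit in " + length + " bytes");
        }
        byte[] padded = new byte[length];
        byte fill = (byte) (unscaled.signum() < 0 ? 0xFF : 0x00);
        int offset = length - minimal.length;
        Arrays.fill(padded, 0, offset, fill);
        System.arraycopy(minimal, 0, padded, offset, minimal.length);
        return padded;
    }

    // Decodes big-endian two's complement bytes back into the unscaled value.
    static BigInteger decode(byte[] bytes)
    {
        return new BigInteger(bytes);
    }
}
```

Padding this way keeps the decoded value identical regardless of the byte width, which is why a test can feed statistics of different lengths for the same logical decimal.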
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| lib/trino-parquet/src/test/java/io/trino/parquet/TestTupleDomainParquetPredicate.java | Adds new test cases for short/long decimal annotations and extends binaryColumnStats with padding |
| lib/trino-parquet/src/main/java/io/trino/parquet/reader/ColumnReaderFactory.java | Changes isDecimalRescaled from private to public to allow reuse |
| lib/trino-parquet/src/main/java/io/trino/parquet/predicate/TupleDomainParquetPredicate.java | Introduces getShortDecimal/getLongDecimal helpers and integrates logical type checks into statistics reading |
Comments suppressed due to low confidence (2)
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ColumnReaderFactory.java:333
- Consider reverting the visibility of isDecimalRescaled back to package-private (or moving it to a dedicated utility) to avoid unnecessarily expanding the public API surface.
public static boolean isDecimalRescaled(DecimalLogicalTypeAnnotation decimalAnnotation, DecimalType trinoType)
lib/trino-parquet/src/main/java/io/trino/parquet/predicate/TupleDomainParquetPredicate.java:496
- [nitpick] Rename the parameter columnType to something like trinoType or expectedDecimalType to clarify that it represents the target Trino decimal type, not the column descriptor’s primitive type.
private static long getShortDecimal(Object value, DecimalType columnType, ColumnDescriptor column)
Description
When reading statistics from FileMetadata, we respect only the Trino type. For example, when a Parquet file containing a long decimal column is mapped to a short decimal type in Trino, we read the statistics as a short decimal and end up incorrectly skipping row groups when reading the data.
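A rough sketch of why this matters for row-group pruning, using a scale mismatch as the example: if min/max statistics are taken as unscaled values in the Trino type's scale rather than the scale declared by the Parquet logical type, the resulting range can exclude values that are actually present. The class and method names below (DecimalStatsSketch, rescale, mayContain) are hypothetical and only illustrate the idea; they are not the helpers added in this PR.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical illustration only; not the Trino implementation.
final class DecimalStatsSketch
{
    private DecimalStatsSketch() {}

    // Interprets an unscaled statistic written with fileScale and re-expresses
    // it as an unscaled long at targetScale. Throws ArithmeticException if the
    // conversion would need rounding or does not fit in a long.
    static long rescale(long unscaledValue, int fileScale, int targetScale)
    {
        return new BigDecimal(BigInteger.valueOf(unscaledValue), fileScale)
                .setScale(targetScale) // exact conversion, no rounding mode
                .unscaledValue()
                .longValueExact();
    }

    // A row group can contain the value only if it lies within [min, max].
    static boolean mayContain(long min, long max, long value)
    {
        return min <= value && value <= max;
    }

    public static void main(String[] args)
    {
        // Assumed example: file column is decimal(10,2) with min 123.45, max 678.90;
        // the Trino column is decimal(10,4) and the predicate is x = 500.0000.
        long fileMin = 12345;
        long fileMax = 67890;
        long predicate = 5_000_000; // 500.0000 at scale 4

        // Ignoring the file's scale treats the range as [1.2345, 6.7890].
        System.out.println(mayContain(fileMin, fileMax, predicate));

        // Rescaling the statistics to scale 4 restores [123.4500, 678.9000].
        System.out.println(mayContain(rescale(fileMin, 2, 4), rescale(fileMax, 2, 4), predicate));
    }
}
```

In this assumed example the first check rejects the row group even though 500 lies between the real minimum and maximum, while the rescaled check keeps it.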
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: