Skip to content

Conversation

@Praveen2112
Copy link
Member

@Praveen2112 Praveen2112 commented Jul 15, 2025

Description

When reading statistics from FileMetadata we respect only the Trino type i.e when we map a parquet file having long decimal to a short decimal type in Trino we tend to read the statistics as short decimal type and we end up on skipping row groups when reading the data.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Hive, Iceberg, DeltaLake, Hudi
*  Fix skipping of row groups when the trino type is different from logical types in case of parquet files.({issue}`26203`)

Adds a new test on evaluating ShortDecimal with Int64 and
pad zeros for binary statistics.
@raunaqmorarka raunaqmorarka added bug Something isn't working correctness labels Jul 15, 2025
@Praveen2112 Praveen2112 force-pushed the praveen/parquet_predicate_fix branch from d4b7d8a to 148f3bf Compare July 15, 2025 13:00
@Praveen2112
Copy link
Member Author

Thanks for the review. AC

@Praveen2112 Praveen2112 requested a review from lukasz-stec July 16, 2025 11:33
@raunaqmorarka raunaqmorarka requested a review from Copilot July 16, 2025 21:39
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR ensures that Parquet logical decimal types are respected when reading statistics, avoiding incorrect skipping of row groups when Trino and Parquet scales or precisions differ.

  • Adds extensive tests covering short and long decimals with various annotations and underlying Parquet types.
  • Extends TupleDomainParquetPredicate with helper methods (getShortDecimal, getLongDecimal) to correctly rescale statistics values according to logical annotations.
  • Updates visibility of isDecimalRescaled and refactors statistics padding in tests.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
lib/trino-parquet/src/test/java/io/trino/parquet/TestTupleDomainParquetPredicate.java Adds new test cases for short/long decimal annotations and extends binaryColumnStats with padding
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ColumnReaderFactory.java Changes isDecimalRescaled from private to public to allow reuse
lib/trino-parquet/src/main/java/io/trino/parquet/predicate/TupleDomainParquetPredicate.java Introduces getShortDecimal/getLongDecimal helpers and integrates logical type checks into statistics reading
Comments suppressed due to low confidence (2)

lib/trino-parquet/src/main/java/io/trino/parquet/reader/ColumnReaderFactory.java:333

  • Consider reverting the visibility of isDecimalRescaled back to package-private (or moving it to a dedicated utility) to avoid unnecessarily expanding the public API surface.
    public static boolean isDecimalRescaled(DecimalLogicalTypeAnnotation decimalAnnotation, DecimalType trinoType)

lib/trino-parquet/src/main/java/io/trino/parquet/predicate/TupleDomainParquetPredicate.java:496

  • [nitpick] Rename the parameter columnType to something like trinoType or expectedDecimalType to clarify that it represents the target Trino decimal type, not the column descriptor’s primitive type.
    private static long getShortDecimal(Object value, DecimalType columnType, ColumnDescriptor column)

@Praveen2112 Praveen2112 merged commit af38a3c into trinodb:master Jul 18, 2025
63 checks passed
@github-actions github-actions bot added this to the 477 milestone Jul 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cla-signed correctness

Development

Successfully merging this pull request may close these issues.

2 participants