
Fix parquet predicate pushdown for INT96 timestamp values #5083

Merged
martint merged 1 commit into trinodb:master from pettyjamesm:parquet-int96-timestamp-fix
Sep 4, 2020

Conversation

@pettyjamesm
Member

Predicate pushdown for Parquet INT96 timestamp values, added in prestosql#4104, can produce incorrect results even when the statistics appear valid under a min <= max check. Parquet writers that produced INT96 statistics at all compared min and max values as BINARY, which ignores how INT96 timestamps are encoded, so the resulting statistics are unusable.
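To illustrate the failure mode, here is a minimal Python sketch (not code from this PR). It assumes the commonly used INT96 timestamp layout of 8 little-endian bytes of nanoseconds within the day followed by 4 little-endian bytes of Julian day, and it models the writer's BINARY comparison as plain lexicographic byte comparison. The helper names `encode_int96` and `decode_int96` are hypothetical.

```python
import struct
from typing import Tuple

def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    # Assumed layout: 8-byte little-endian nanos of day, then 4-byte little-endian Julian day.
    return struct.pack("<qi", nanos_of_day, julian_day)

def decode_int96(raw: bytes) -> Tuple[int, int]:
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    # Tuples ordered as (day, nanos) compare chronologically.
    return (julian_day, nanos_of_day)

# Three timestamps as (julian_day, nanos_of_day); Julian day 2459097 is 2020-09-04.
values = [(2459097, 0), (2459098, 1), (2459103, 0)]
encoded = [encode_int96(day, nanos) for day, nanos in values]

# What a writer comparing the raw bytes would record as statistics.
binary_min = decode_int96(min(encoded))
binary_max = decode_int96(max(encoded))

# The chronologically correct bounds.
true_min, true_max = min(values), max(values)

print("stats look valid:", binary_min <= binary_max)
print("stats cover all rows:", binary_min <= true_min and true_max <= binary_max)
```

In this sketch the recorded statistics pass a min <= max sanity check, yet the latest timestamp falls outside the recorded range, so a filter on that value would wrongly skip the row group.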

PARQUET-1065 removed comparison of INT96 values for statistics in all cases except min == max, which is not affected by the comparison issue or by the other byte-order issues that existed with Parquet BINARY types at the time. Any Parquet file that contains INT96 statistics with min != max must therefore have been written by an older Parquet writer that compared the values incorrectly, making those statistics unusable.

This change disables Parquet predicate pushdown on INT96 timestamps except when all rows have the same value (min == max).
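The rule can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the actual change in this PR: the function names `int96_stats_value` and `can_skip_row_group` are hypothetical, statistics are passed as raw 12-byte values (or `None` when absent), and only an equality predicate is modeled.

```python
from typing import Optional

def int96_stats_value(min_raw: Optional[bytes], max_raw: Optional[bytes]) -> Optional[bytes]:
    """Return the single value shared by all rows, or None if the stats cannot be trusted."""
    if min_raw is None or max_raw is None:
        return None  # no statistics recorded
    if min_raw != max_raw:
        return None  # min != max: ordering may be wrong, so ignore the stats
    return min_raw  # min == max: every row holds this exact value

def can_skip_row_group(min_raw: Optional[bytes], max_raw: Optional[bytes], wanted: bytes) -> bool:
    """Decide whether an equality filter on `wanted` may prune the row group."""
    single = int96_stats_value(min_raw, max_raw)
    # Prune only when the stats are trustworthy and rule out the wanted value.
    return single is not None and single != wanted
```

Comparing the raw bytes for equality is safe regardless of how a writer ordered values, which is why the min == max case remains usable while every other case falls back to reading the row group.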

@cla-bot cla-bot bot added the cla-signed label Sep 4, 2020
@pettyjamesm pettyjamesm requested a review from martint September 4, 2020 16:48
@pettyjamesm pettyjamesm force-pushed the parquet-int96-timestamp-fix branch from 32a0afe to 18aedf5 Compare September 4, 2020 16:54
@martint martint merged commit 1ffa839 into trinodb:master Sep 4, 2020
@martint martint added this to the 341 milestone Sep 4, 2020
@pettyjamesm pettyjamesm deleted the parquet-int96-timestamp-fix branch September 4, 2020 19:00