Add pushdown for parquet timestamp predicate #4104
presto-hive/src/test/java/io/prestosql/plugin/hive/TestHiveIntegrationSmokeTest.java
That's not too long, but would be slightly better to extract QueryInfo fullQueryInfo var
Updated, thanks
1e45799 to fba7bcc
fba7bcc to 80ee8a5
Merged, thanks!
Does this cover For reference there was an older PR #1999 that was adding pushdown for timestamps on
Hey Ryan, thanks for pointing that out. I think you're right, this only works for the legacy int96 timestamps. I didn't realize it was deprecated. Do you know if Presto's Parquet reader/writer supports the new format, or should we add an issue to add that in? There are probably some more changes needed to work with the parametric timestamp types that Dain has been working on too.
@alexjo2144 I'm not sure what the writer actually writes out currently (my write path isn't through Presto). For the read path I'm pretty sure it can read int64 TIMESTAMP_MILLIS, but I'm not sure about int64 logically typed as TIMESTAMP_MICROS. I thought of this because there was a post in Slack asking if TIMESTAMP_MICROS is supported on the read path here. Here's the documentation about timestamp handling in Parquet. One of the newer features is support for nanos precision stored in an int64 (with nano precision in int96 being deprecated, see here).
Predicate pushdown for parquet INT96 timestamp values, added in prestosql#4104, can produce incorrect results even when the stats appear valid under a min <= max check. Parquet writers that produced statistics at all compared min and max values as BINARY, which is oblivious to how INT96 timestamps are encoded, so those statistics are unusable. Comparison of INT96 values for statistics was removed in PARQUET-1065 for all cases except min == max, which is not affected by the comparison issue or by the other byte order issues that existed with parquet BINARY types at the time. Any parquet file that contains INT96 statistics where min != max must have been written by an older parquet writer that compared the values incorrectly. This change disables parquet predicate pushdown on INT96 timestamps except when all rows have the same value (i.e. min == max).
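To make the ordering problem concrete, here is a minimal Java sketch, not the code from this change. It assumes the common INT96 layout (8 little-endian bytes of nanoseconds within the day followed by 4 little-endian bytes of Julian day) and models the old BINARY comparison as an unsigned byte-by-byte compare. The class `Int96StatsSketch` and all of its method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Presto or Parquet.
final class Int96StatsSketch {
    static final int JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588;
    static final long MILLIS_PER_DAY = 86_400_000L;
    static final long NANOS_PER_MILLI = 1_000_000L;

    // Assumed layout: nanos-of-day (8 bytes, LE) then Julian day (4 bytes, LE).
    static byte[] encode(long epochMillis) {
        long julianDay = Math.floorDiv(epochMillis, MILLIS_PER_DAY) + JULIAN_DAY_OF_UNIX_EPOCH;
        long nanosOfDay = Math.floorMod(epochMillis, MILLIS_PER_DAY) * NANOS_PER_MILLI;
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)
                .putInt((int) julianDay)
                .array();
    }

    static long decode(byte[] int96) {
        ByteBuffer buffer = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buffer.getLong();
        int julianDay = buffer.getInt();
        return (julianDay - JULIAN_DAY_OF_UNIX_EPOCH) * MILLIS_PER_DAY + nanosOfDay / NANOS_PER_MILLI;
    }

    // Byte-wise comparison that ignores the timestamp semantics of the 12 bytes.
    static int compareAsBinary(byte[] left, byte[] right) {
        for (int i = 0; i < 12; i++) {
            int result = Integer.compare(left[i] & 0xFF, right[i] & 0xFF);
            if (result != 0) {
                return result;
            }
        }
        return 0;
    }

    // The rule described above: trust INT96 stats only when min == max.
    static boolean canUseStatistics(byte[] min, byte[] max) {
        return Arrays.equals(min, max);
    }

    public static void main(String[] args) {
        byte[] earlier = encode(1L);          // 1970-01-01T00:00:00.001Z
        byte[] later = encode(86_400_000L);   // 1970-01-02T00:00:00.000Z
        System.out.println("chronological order: " + (decode(earlier) < decode(later)));
        System.out.println("binary order reversed: " + (compareAsBinary(earlier, later) > 0));
        System.out.println("stats usable when min != max: " + canUseStatistics(earlier, later));
    }
}
```

Because the nanos-of-day bytes come first, a value late in one day sorts after a value early in the next day under the byte-wise compare, so a min/max pair computed that way can exclude rows that actually match the predicate. When min == max the ordering never comes into play, which is why that single case stays safe.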
This improvement really helped with one of our queries, but I want to understand a bit more of what's actually happening with Presto and taking advantage of the parquet metadata... What does the
So, Parquet files are separated out into