HIVE-21599: Parquet predicate pushdown on partition columns may cause wrong result if files contain partition columns #3742
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java
A note about the commit message: the summary says 'remove predicate ..', which is no longer true with this patch, so it is best to reword it. Also remove the reference to virtual columns, since the patch does not make any changes for them.
asolimando
left a comment
LGTM; please just add the missing Javadoc on the newly introduced public methods.
Force-pushed from d35f018 to c6b94cc
asolimando
left a comment
+1. For the future, avoid force-pushing or rebasing while the review is still ongoing, because it makes it impossible to review only the delta from the previous review round.
Parquet supports column pruning and this information is captured by
ReadContext#getRequestedSchema.
Creating and applying filters on columns that are not present in the
requested Parquet schema can lead to wrong results since missing columns
are populated with null values.
Align predicate push-down and column pruning optimizations to use the
same schema ("requestedSchema") to avoid evaluating predicates on nulls.
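The idea in this commit message can be sketched in plain Java, without any Parquet dependency. This is only an illustration under stated assumptions: `ColumnEquals`, `alignWithRequestedSchema`, and `accepts` are hypothetical names, not Hive or Parquet APIs, and rows are modeled as maps in which a column outside the requested schema is simply absent (so it reads as null). It also assumes that a dropped predicate is still enforced elsewhere, for example by partition pruning.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

public class RequestedSchemaFilterSketch {

  /** Hypothetical stand-in for one pushed-down predicate of the form column = value. */
  static final class ColumnEquals {
    final String column;
    final Object value;

    ColumnEquals(String column, Object value) {
      this.column = column;
      this.value = value;
    }

    boolean test(Map<String, Object> row) {
      // A column that was not requested is absent from the row and reads as null,
      // so an equality predicate on it can never match.
      return Objects.equals(row.get(column), value);
    }
  }

  /** Keeps only the predicates whose column is part of the requested (pruned) schema. */
  static List<ColumnEquals> alignWithRequestedSchema(
      List<ColumnEquals> predicates, Set<String> requestedColumns) {
    return predicates.stream()
        .filter(p -> requestedColumns.contains(p.column))
        .collect(Collectors.toList());
  }

  /** A row passes the filter only if every predicate matches. */
  static boolean accepts(List<ColumnEquals> predicates, Map<String, Object> row) {
    return predicates.stream().allMatch(p -> p.test(row));
  }

  public static void main(String[] args) {
    Set<String> requested = Set.of("id"); // "part" was pruned from the read
    Map<String, Object> row = new HashMap<>();
    row.put("id", 1);

    List<ColumnEquals> pushed =
        List.of(new ColumnEquals("id", 1), new ColumnEquals("part", "a"));

    // Filter built from all predicates: the row is wrongly rejected.
    System.out.println(accepts(pushed, row)); // false
    // Filter aligned with the requested schema: the row is kept.
    System.out.println(accepts(alignWithRequestedSchema(pushed, requested), row)); // true
  }
}
```

The point of the sketch is that both optimizations consult the same set of columns, so a predicate is never evaluated against a value that was filled in as null only because the column was pruned.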
zabetak
left a comment
While reviewing this PR, I got the impression that the solution may be simpler and more general. I left some comments under the JIRA ticket and pushed an alternative fix here.
@soumyakanti3578 @amansinha100 @asolimando let me know your thoughts.
This reverts commit 7e16714. The approach caused various failures, especially in tests with schema evolution, so, as explained in the JIRA, it cannot be used.
This reverts commit d1908ce.
Various existing APIs: setColumnNameList, setColumnTypeList, getColumnNames
FetchTask#initFetch already sets the partition columns, among other things. Column names, types, etc. are not set in the constructor, so setting partitions here seems a bit out of place.
Kudos, SonarCloud Quality Gate passed!
@zabetak LGTM. Thanks for cleaning this up!
amansinha100
left a comment
Updated changes look good to me. +1.
…ntain partition column (Soumyakanti Das reviewed by Stamatis Zampetakis, Aman Sinha, Alessandro Solimando) Closes apache#3742
…ntain partition column (Soumyakanti Das reviewed by Stamatis Zampetakis, Aman Sinha, Alessandro Solimando) Closes apache#3742 (cherry picked from commit eb57ac9)

What changes were proposed in this pull request?
Partition columns are removed from the Parquet metadata (schema).
Why are the changes needed?
When a Parquet data file contains partition columns, and the query filters on those partition columns, we can get wrong results. By removing the partition columns from the schema, we avoid creating Filter predicates on those columns.
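A minimal sketch of the approach described here, assuming the schema is just a list of column names: the partition columns are stripped from the list before any filter is built, so no predicate can refer to them. `PartitionColumnStripSketch` and `withoutPartitionColumns` are hypothetical names used for illustration only and do not correspond to the actual patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PartitionColumnStripSketch {

  /** Returns the file's columns in their original order, minus any partition columns. */
  static List<String> withoutPartitionColumns(
      List<String> fileColumns, Set<String> partitionColumns) {
    List<String> kept = new ArrayList<>();
    for (String column : fileColumns) {
      if (!partitionColumns.contains(column)) {
        kept.add(column);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Hypothetical file layout: the data file also stores the partition column "part".
    List<String> fileColumns = List.of("id", "name", "part");
    Set<String> partitionColumns = Set.of("part");

    // Filters would then be created only against the remaining columns.
    System.out.println(withoutPartitionColumns(fileColumns, partitionColumns)); // [id, name]
  }
}
```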
Does this PR introduce any user-facing change?
No
How was this patch tested?
mvn test -Dtest=TestMiniLlapLocalCliDriver -Dtest.output.overwrite=true -Dqfile=parquet_partition_col.q
This test returns correct results.