HIVE-21599: Parquet predicate pushdown on partition columns may cause wrong result if files contain partition columns #3742

soumyakanti3578 · 2022-11-09T00:15:38Z

What changes were proposed in this pull request?

Partition columns are removed from the Parquet schema (file metadata) before filter predicates are created.

Why are the changes needed?

When a Parquet data file contains partition columns and the query filters on those partition columns, the query can return wrong results. Removing the partition columns from the schema avoids creating filter predicates on those columns.
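The idea can be illustrated with a minimal sketch. It is not the code from this patch: plain strings stand in for Parquet schema fields, and the class and method names (`PartitionColumnPruning`, `withoutPartitionColumns`, `canPushDown`) are hypothetical. Case-insensitive name matching is also an assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: column names stand in for Parquet schema fields.
final class PartitionColumnPruning {

  // Returns the file's columns with every partition column removed.
  // Case-insensitive matching is an assumption of this sketch.
  static List<String> withoutPartitionColumns(List<String> fileColumns,
                                              List<String> partitionColumns) {
    Set<String> partitions = new HashSet<>();
    for (String p : partitionColumns) {
      partitions.add(p.toLowerCase(Locale.ROOT));
    }
    List<String> kept = new ArrayList<>();
    for (String c : fileColumns) {
      if (!partitions.contains(c.toLowerCase(Locale.ROOT))) {
        kept.add(c);
      }
    }
    return kept;
  }

  // A predicate becomes a file-level filter only when its column
  // is still present in the pruned schema.
  static boolean canPushDown(String predicateColumn, List<String> schemaColumns) {
    for (String c : schemaColumns) {
      if (c.equalsIgnoreCase(predicateColumn)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // The data file physically contains the partition column "part".
    List<String> schema = withoutPartitionColumns(
        List.of("id", "name", "part"), List.of("part"));
    System.out.println(schema);                       // [id, name]
    System.out.println(canPushDown("part", schema));  // false
    System.out.println(canPushDown("id", schema));    // true
  }
}
```

With the partition column gone from the schema, a predicate on `part` never reaches the Parquet reader and is left to the engine's normal partition handling.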

Does this PR introduce any user-facing change?

No

How was this patch tested?

mvn test -Dtest=TestMiniLlapLocalCliDriver -Dtest.output.overwrite=true -Dqfile=parquet_partition_col.q
This test returns correct results.

ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java

ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java

amansinha100 · 2022-11-10T04:34:47Z

A note about the commit message: the summary says 'remove predicate ..', which is no longer true with this patch, so it is best to reword it. Also remove the reference to virtual columns, since the patch makes no changes there.

asolimando

LGTM; please just add the missing Javadoc on the newly introduced public methods.

asolimando

+1. In the future, please avoid force-pushing or rebasing while the review is still ongoing, because it makes it impossible to review only the delta from the previous review round.

Parquet supports column pruning and this information is captured by ReadContext#getRequestedSchema. Creating and applying filters on columns that are not present in the requested Parquet schema can lead to wrong results since missing columns are populated with null values. Align predicate push-down and column pruning optimizations to use the same schema ("requestedSchema") to avoid evaluating predicates on nulls.
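A small simulation may make the failure mode concrete. This is a sketch under stated assumptions, not Hive or Parquet code: rows are modelled as maps, `readPruned` and `countMatches` are hypothetical helpers, and a predicate that is skipped at the reader is assumed to be evaluated elsewhere by the engine.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: maps stand in for Parquet records.
final class PrunedSchemaFilterDemo {

  // Simulates column pruning: columns outside the requested schema read as null.
  static Map<String, Object> readPruned(Map<String, Object> stored,
                                        Set<String> requestedSchema) {
    Map<String, Object> row = new HashMap<>();
    for (Map.Entry<String, Object> e : stored.entrySet()) {
      boolean requested = requestedSchema.contains(e.getKey());
      row.put(e.getKey(), requested ? e.getValue() : null);
    }
    return row;
  }

  // Counts rows that survive the equality filter "filterColumn = expected".
  // When alignWithRequestedSchema is true, the filter is applied only if its
  // column is part of the requested schema.
  static long countMatches(List<Map<String, Object>> storedRows,
                           Set<String> requestedSchema,
                           String filterColumn,
                           Object expected,
                           boolean alignWithRequestedSchema) {
    boolean applyFilter =
        !alignWithRequestedSchema || requestedSchema.contains(filterColumn);
    long matches = 0;
    for (Map<String, Object> stored : storedRows) {
      Map<String, Object> row = readPruned(stored, requestedSchema);
      if (!applyFilter || expected.equals(row.get(filterColumn))) {
        matches++;
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> rows = List.of(
        Map.<String, Object>of("id", 1, "part", "a"),
        Map.<String, Object>of("id", 2, "part", "a"));
    Set<String> requested = Set.of("id"); // "part" has been pruned

    // Filter evaluated against nulls: every row is dropped.
    System.out.println(countMatches(rows, requested, "part", "a", false)); // 0
    // Filter skipped because "part" is not in the requested schema.
    System.out.println(countMatches(rows, requested, "part", "a", true));  // 2
  }
}
```

The first call shows the wrong result described above: both rows satisfy `part = 'a'` on disk, yet none survive because the pruned column reads as null.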

zabetak

While reviewing this PR, I got the impression that the solution may be simpler and more general. I left some comments under the JIRA ticket and pushed an alternative fix here.

@soumyakanti3578 @amansinha100 @asolimando let me know your thoughts.

sonarqubecloud · 2022-11-21T17:33:24Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

soumyakanti3578 · 2022-11-21T21:36:41Z

@zabetak LGTM. Thanks for cleaning this up!

amansinha100

Updated changes look good to me. +1 .

…ntain partition column (Soumyakanti Das reviewed by Stamatis Zampetakis, Aman Sinha, Alessandro Solimando) Closes apache#3742

…ntain partition column (Soumyakanti Das reviewed by Stamatis Zampetakis, Aman Sinha, Alessandro Solimando) Closes apache#3742 (cherry picked from commit eb57ac9)

kgyrtkirk added tests pending tests unstable and removed tests pending tests unstable labels Nov 9, 2022

soumyakanti3578 marked this pull request as ready for review November 9, 2022 22:48

kgyrtkirk added tests passed and removed tests pending labels Nov 9, 2022

amansinha100 reviewed Nov 10, 2022

asolimando reviewed Nov 10, 2022

ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java (outdated)

asolimando reviewed Nov 10, 2022

ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java (outdated)

asolimando requested changes Nov 10, 2022

soumyakanti3578 added 4 commits November 10, 2022 10:26

Remove partition columns from the parquet schema

f431f72

Add PARTITION_COLUMNS to IOConstants, to add partition column info to the conf from TableScanOp

22ecffb

Check nulls for TableMetadata, resolve List.size() code smell

5a901ca

Address review comments

c6b94cc

soumyakanti3578 force-pushed the HIVE-21599-1 branch from d35f018 to c6b94cc Compare November 10, 2022 18:28

asolimando approved these changes Nov 11, 2022

zabetak added 2 commits November 14, 2022 15:48

Revert fix based on new config properties

d1908ce

kgyrtkirk added tests pending and removed tests passed labels Nov 14, 2022

zabetak reviewed Nov 14, 2022

kgyrtkirk added tests unstable and removed tests pending labels Nov 14, 2022

zabetak added 8 commits November 17, 2022 15:48

Revert "Do not create filters on pruned schema columns"

8ea38f5

This reverts commit 7e16714. The approach caused various failures especially to tests with schema evolutions so as explained in the JIRA cannot be used.

Revert "Revert fix based on new config properties"

736d5cd

This reverts commit d1908ce.

Set/Get "partition.column" using existing variable in serdeConstants

45d02f8

Rename to setPartitionColumnNames to align with existing APIs

880dbf5

Various existing APIs: setColumnNameList setColumnTypeList getColumnNames

Drop seemingly redundant call from FetchOperator

c310060

FetchTask#initFetch already sets the partition columns among other things. Column name, types, etc, are not set in the constructor so setting partitions here seems a bit out of place.

Remove explains from tests since the changes are not targeting the query plan

2cb2d99

Add some comments in the qtest repro

79e88ee

Remove unused imports from ParquetRecordReaderBase

b8f94fb

kgyrtkirk added tests pending and removed tests unstable labels Nov 21, 2022

kgyrtkirk added tests passed and removed tests pending labels Nov 21, 2022

amansinha100 approved these changes Nov 22, 2022

zabetak closed this in eb57ac9 Nov 22, 2022

xicm mentioned this pull request Dec 15, 2022

[HUDI-5308] Hive3 query returns null when the where clause has a partition field apache/hudi#7355

Merged
