Skip to content

Conversation

@soumyakanti3578
Copy link
Contributor

@soumyakanti3578 soumyakanti3578 commented Nov 9, 2022

What changes were proposed in this pull request?

Partition columns are getting removed from Parquet metadata (schema).

Why are the changes needed?

When a Parquet data file contains partition columns, and the query filters on those partition columns, we can get wrong results. By removing the partition columns from the schema, we avoid creating Filter predicates on those columns.

Does this PR introduce any user-facing change?

No

How was this patch tested?

mvn test -Dtest=TestMiniLlapLocalCliDriver -Dtest.output.overwrite=true -Dqfile=parquet_partition_col.q
This test returns correct results.

@amansinha100
Copy link

A note about the commit message: the summary says 'remove predicate ..' which is not true any more with this patch. So best to reword it. Also remove reference to virtual columns since the patch is not making any changes for that.

Copy link
Member

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only please add the missing javadoc on newly introduced public methods

Copy link
Member

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, for the future avoid force pushing/rebasing while the review process is still on-going because this kills the possibility to review only the delta from the previous review round

Parquet supports column pruning and this information is captured by
ReadContext#getRequestedSchema.

Creating and applying filters on columns that are not present in the
requested Parquet schema can lead to wrong results since missing columns
are populated with null values.

Align predicate push-down and column pruning optimizations to use the
same schema ("requestedSchema") to avoid evaluating predicates on nulls.
Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While reviewing this PR, I got the impression that the solution may be simpler and more general. I left some comments under the JIRA ticket and pushed an alternative fix here.

@soumyakanti3578 @amansinha100 @asolimando let me know your thoughts.

This reverts commit 7e16714.

The approach caused various failures especially to tests with schema
evolutions so as explained in the JIRA cannot be used.
Various existing APIs:
setColumnNameList
setColumnTypeList
getColumnNames
FetchTask#initFetch already sets the partition columns among other
things.

Column name, types, etc, are not set in the constructor so setting
partitions here seems a bit out of place.
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@soumyakanti3578
Copy link
Contributor Author

@zabetak LGTM. Thanks for cleaning this up!

Copy link

@amansinha100 amansinha100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated changes look good to me. +1 .

@zabetak zabetak closed this in eb57ac9 Nov 22, 2022
dengzhhu653 pushed a commit to dengzhhu653/hive that referenced this pull request Dec 15, 2022
…ntain partition column (Soumyakanti Das reviewed by Stamatis Zampetakis, Aman Sinha, Alessandro Solimando)

Closes apache#3742
DongWei-4 pushed a commit to DongWei-4/hive that referenced this pull request Dec 29, 2022
…ntain partition column (Soumyakanti Das reviewed by Stamatis Zampetakis, Aman Sinha, Alessandro Solimando)

Closes apache#3742

(cherry picked from commit eb57ac9)
yeahyung pushed a commit to yeahyung/hive that referenced this pull request Jul 20, 2023
…ntain partition column (Soumyakanti Das reviewed by Stamatis Zampetakis, Aman Sinha, Alessandro Solimando)

Closes apache#3742
tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Dec 19, 2023
…ntain partition column (Soumyakanti Das reviewed by Stamatis Zampetakis, Aman Sinha, Alessandro Solimando)

Closes apache#3742
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants