Skip to content

Conversation

@groupcache4321
Copy link
Contributor

@groupcache4321 groupcache4321 commented Nov 23, 2022

Description

From https://trino.io/blog/2020/08/14/dereference-pushdown.html: "Another future improvement will be the pushdown of predicates on subfields for data stored in Parquet format. Although the pruning of nested fields occurs with Parquet, the predicates are not yet pushed down into the reader."

This PR enables Parquet page source to use statistics for nested fields.

Additional context and related issues

Related ORC commit: 5069a55
#9928

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Hive
* Improve performance of queries with selective filters on primitive fields in ROW type columns. ({issue}`15163`)

@cla-bot cla-bot bot added the cla-signed label Nov 23, 2022
@groupcache4321 groupcache4321 marked this pull request as ready for review November 23, 2022 08:53
@groupcache4321
Copy link
Contributor Author

@raunaqmorarka and @phd3 Hi, can you help to take a look?

@groupcache4321
Copy link
Contributor Author

I had waited 10 days before getting code reviewed..

@groupcache4321
Copy link
Contributor Author

hi, the commit message has been reorganized, please take a further look.

@findepi
Copy link
Member

findepi commented Dec 13, 2022

@phd3 does this add nested field predicate pushdown for Iceberg (potentially answering #8759 a better way)?

cc @alexjo2144

@raunaqmorarka
Copy link
Member

@leetcode-1533 PTAL at #15388

@phd3
Copy link
Member

phd3 commented Dec 15, 2022

@findepi this makes use of unenforced predicates on "subfields" which are still primitive - on the workers for file level pruning. IIUC #8759 is about struct/map/list columns which seems orthogonal.

@findepi
Copy link
Member

findepi commented Dec 19, 2022

IIUC #8759 is about struct/map/list columns which seems orthogonal.

orthogonal -- in some sense, yes
maybe we don't need #8759 at all; we surely need this one

@raunaqmorarka
Copy link
Member

@leetcode-1533 can you please rebase this on latest master

@groupcache4321
Copy link
Contributor Author

hi, I re-based to * [2023-02-09] [11942a4f91] | Support REFRESH MATERIALIZED VIEW when storage differs in schema {{Piotr Findeisen}} (origin/master, origin/HEAD)

And have changed the unit test in #15388

@github-actions github-actions bot added the hive Hive connector label Mar 18, 2023
@groupcache4321
Copy link
Contributor Author

hi, it has been rebased with the latest master branch

@groupcache4321
Copy link
Contributor Author

For dereferenceSubFieldTypes(), I am actually leaning toward @alexjo2144's suggestion, i.e. respecting useColumnName option also when dereferencing subtypes(though it is not a 'column' name, but rather a subtype's name) Would like know community's opinion on this.

@groupcache4321
Copy link
Contributor Author

Current status:
When dereferencing top column name, respect “useColumnName” option(when calling getBaseColumnParquetType())
When dereferencing nested field name, only use columnName, implemented at dereferenceSubFieldTypes().

After chatting with @raunaqmorarka , I did not implement: “respect “useColumnName” option when dereferencing nested columns”, since we agreed that: "this change is not necessary for the nested columns predicate pushdown, Unless it is a pre-requisite for your change, I would prefer to avoid changing that”.

dereferenceSubFieldTypes() is extracted from getColumnType() method currently nothing has changed. My change just built on top of the dereferenceSubFieldTypes().

To highlight that the variable is after dereferencing.
@groupcache4321 groupcache4321 force-pushed the dereferenceparquet branch 2 times, most recently from f669a2a to dabc54d Compare April 21, 2023 18:34
Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % minor comment

@groupcache4321 groupcache4321 force-pushed the dereferenceparquet branch 2 times, most recently from 350dc56 to ba5bffe Compare April 24, 2023 18:52
@groupcache4321
Copy link
Contributor Author

hello, can you help to merge in this.. So I could rebase the iceberg related changes..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

6 participants