@groupcache4321
Contributor

@groupcache4321 groupcache4321 commented Apr 20, 2023

Description

From https://trino.io/blog/2020/08/14/dereference-pushdown.html: "Another future improvement will be the pushdown of predicates on subfields for data stored in Parquet format. Although the pruning of nested fields occurs with Parquet, the predicates are not yet pushed down into the reader."

This PR enables the Parquet page source to use statistics for nested fields in the Iceberg connector.
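To illustrate the idea for readers new to it, here is a minimal, language-neutral sketch of statistics-based pruning on a nested field. It is not the code in this PR: the names `ColumnStats`, `RowGroup`, `may_contain`, and `prune_row_groups`, the dotted path `address.zip`, and the sample values are all hypothetical, and real Parquet readers work with typed column chunk metadata rather than plain dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnStats:
    """Min/max statistics for one leaf column inside a row group (hypothetical model)."""
    min_value: int
    max_value: int


@dataclass(frozen=True)
class RowGroup:
    """A row group whose leaf statistics are keyed by dotted path, e.g. "address.zip"."""
    name: str
    stats: Dict[str, ColumnStats]


def may_contain(stats: ColumnStats, low: int, high: int) -> bool:
    """Return True if the [min, max] range overlaps the predicate range [low, high]."""
    return stats.max_value >= low and stats.min_value <= high


def prune_row_groups(row_groups: List[RowGroup], path: str, low: int, high: int) -> List[str]:
    """Keep only the row groups that might hold rows with low <= path <= high."""
    kept = []
    for row_group in row_groups:
        stats = row_group.stats.get(path)
        # Without statistics nothing can be proven, so the row group must be read.
        if stats is None or may_contain(stats, low, high):
            kept.append(row_group.name)
    return kept


row_groups = [
    RowGroup("rg0", {"address.zip": ColumnStats(10000, 19999)}),
    RowGroup("rg1", {"address.zip": ColumnStats(20000, 29999)}),
    RowGroup("rg2", {}),  # no statistics recorded for the nested field
]

# Equivalent to a filter such as: WHERE address.zip = 20000
print(prune_row_groups(row_groups, "address.zip", 20000, 20000))
```

In this sketch `rg0` is skipped because its range cannot match, while `rg2` is kept because missing statistics never justify skipping data. Pruning of this kind is only possible when the predicate on the subfield reaches the reader, which is what this change is about.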

Additional context and related issues

Related ORC commit: 5069a55
Fixes #9928
Hive change PR: #15163

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Improve performance of queries with filters on fields in ROW type columns stored in Parquet files.

@cla-bot cla-bot bot added the cla-signed label Apr 20, 2023
@github-actions github-actions bot added hive Hive connector iceberg Iceberg connector tests:hive labels Apr 20, 2023
@groupcache4321 groupcache4321 force-pushed the iceberg_dereferenceparquet_2 branch from d296d47 to b9ae609 Compare April 28, 2023 01:43
@groupcache4321 groupcache4321 changed the title Iceberg dereferenceparquet 2 Implement predicate push down for parquet dereference column in Iceberg Apr 28, 2023
@groupcache4321 groupcache4321 force-pushed the iceberg_dereferenceparquet_2 branch from b9ae609 to bf77929 Compare April 28, 2023 01:50
@groupcache4321 groupcache4321 marked this pull request as ready for review April 28, 2023 01:50
@groupcache4321
Contributor Author

Fixing product test: "2023-04-28T05:52:04.9369644Z tests | 2023-04-28 11:37:04 INFO: FAILURE / io.trino.tests.product.iceberg.TestIcebergSparkCompatibility.testIdBasedFieldMapping [PARQUET, 2] (Groups: iceberg_jdbc, profile_specific_tests, iceberg, iceberg_rest) took 3.7 seconds"

@groupcache4321 groupcache4321 force-pushed the iceberg_dereferenceparquet_2 branch from bf77929 to 2323605 Compare April 30, 2023 06:52
@groupcache4321
Contributor Author

I fixed the bug and resubmitted the PR for another check.

@groupcache4321 groupcache4321 force-pushed the iceberg_dereferenceparquet_2 branch 2 times, most recently from 1194880 to 234a076 Compare April 30, 2023 23:02
Contributor

@findinpath findinpath May 2, 2023

assertNoDataRead is currently a bit misleading because it builds on processedInputDataSize rather than physicalInputDataSize.

However, this change relieves the engine of additional computation once the data leaves the Parquet reader. Very good catch @leetcode-1533

No change requested in this PR.

@github-actions

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Jan 16, 2024
@mosabua
Member

mosabua commented Jan 16, 2024

👋 @leetcode-1533 @findinpath @findepi - this PR has become inactive. If you're still interested in working on it, please let us know.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

Contributor

Why 5?

@raunaqmorarka raunaqmorarka force-pushed the iceberg_dereferenceparquet_2 branch from 77e5049 to 38c58a3 Compare January 18, 2024 12:24
@raunaqmorarka raunaqmorarka merged commit 3a67a0a into trinodb:master Jan 18, 2024
@github-actions github-actions bot added this to the 437 milestone Jan 18, 2024
@findinpath
Contributor

Thank you @leetcode-1533 for this contribution.

@mosabua
Member

mosabua commented Jan 18, 2024

Thanks @raunaqmorarka and @findinpath for finishing it up.

Labels

cla-signed hive Hive connector iceberg Iceberg connector stale

Development

Successfully merging this pull request may close these issues.

Predicate pushdown for nested fields in Parquet reader

4 participants