
Data skipping for Hudi connector#17899

Closed
codope wants to merge 7 commits into trinodb:master from codope:data-skipping

Conversation

@codope
Contributor

@codope codope commented Jun 14, 2023

Description

Use Hudi column stats to skip data and improve query latency.
Stacked on top of #16034
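
As an illustration of the general idea only (not the code in this PR), file-level pruning with column statistics can be sketched as below. It assumes each data file carries a min/max range per column; `ColumnRange`, `files_to_scan`, the file names, and the `stars` column are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnRange:
    """Min and max of one column within one data file (hypothetical structure)."""
    min_value: int
    max_value: int


# file name -> column name -> value range observed in that file
FileStats = Dict[str, Dict[str, ColumnRange]]


def files_to_scan(stats: FileStats, column: str, low: int, high: int) -> List[str]:
    """Return the files that may contain rows with low <= column <= high."""
    selected = []
    for file_name, columns in stats.items():
        col_range = columns.get(column)
        # A file without stats for this column cannot be ruled out, so keep it.
        if col_range is None:
            selected.append(file_name)
        # Keep the file only if its [min, max] overlaps the predicate range.
        elif col_range.min_value <= high and col_range.max_value >= low:
            selected.append(file_name)
    return selected


stats: FileStats = {
    "file_a.parquet": {"stars": ColumnRange(0, 99)},
    "file_b.parquet": {"stars": ColumnRange(100, 499)},
    "file_c.parquet": {"stars": ColumnRange(500, 9000)},
    "file_d.parquet": {},  # no column stats recorded for this file
}

# Predicate: stars BETWEEN 150 AND 300
print(files_to_scan(stats, "stars", 150, 300))
```

Pruning of this kind helps most when the per-file ranges overlap little, because each predicate then rules out more files.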

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Section
* Support for data skipping with column statistics in Hudi connector.

@cla-bot cla-bot bot added the cla-signed label Jun 14, 2023
@codope codope added the hudi Hudi connector label Jun 14, 2023
@tooptoop4
Contributor

any benchmark?

@codope
Contributor Author

codope commented Jun 15, 2023

@tooptoop4 We benchmarked with 6-month GitHub archive data (220 GB, 450 million records) and observed significant improvements with column stats and clustering enabled.
[Image: Screenshot 2023-06-15 at 11 10 30 PM]
The design is very similar to prestodb/presto#18606. @xiarixiaoyao also benchmarked with a bigger dataset, SSB (1.5 TB, 12 billion records), and the results were quite impressive, as you can see in that PR.

@dertodestod

First of all a big thanks from me for the work on the connector and on hudi in general 👍. Is there a rough ETA on when this will be merged and be available in trino? Thanks again.

@mosabua
Member

mosabua commented Sep 9, 2023

@codope do you plan to rebase and update this PR soon?

@songpcmusic

Is there any plan to support the use case where an operation such as element_at(map_column, 'key') =|>|< 'value' is required?

(Note: The =|>|< symbol is not a standard operator in most query languages, so for an accurate translation, it would be helpful to provide additional context or clarification on what the intended operation is.)

@codope
Contributor Author

codope commented Jan 26, 2024

> Is there any plan to support the use case where an operation such as element_at(map_column, 'key') =|>|< 'value' is required?
>
> (Note: The =|>|< symbol is not a standard operator in most query languages, so for an accurate translation, it would be helpful to provide additional context or clarification on what the intended operation is.)

The current PR does not support that. In Hudi 1.0, we are adding functional indexes, which can support skipping data based on a function or expression over one or more columns.
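
To make the idea concrete, here is a conceptual sketch of expression-based skipping. It is not Hudi's functional index implementation: it only assumes that min/max statistics are kept for the result of an expression (here a map lookup, loosely mirroring `element_at(map_column, 'key')`) rather than for a raw column. `build_expression_stats`, `files_matching_equals`, and the sample data are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Dict[str, str]]
Range = Optional[Tuple[str, str]]


def build_expression_stats(
    files: Dict[str, List[Row]], expr: Callable[[Row], Optional[str]]
) -> Dict[str, Range]:
    """Record min/max of expr(row) per file, ignoring rows where it is None."""
    stats: Dict[str, Range] = {}
    for name, rows in files.items():
        values = [v for v in (expr(row) for row in rows) if v is not None]
        stats[name] = (min(values), max(values)) if values else None
    return stats


def files_matching_equals(stats: Dict[str, Range], value: str) -> List[str]:
    """Files that may contain a row where the expression equals `value`."""
    # A file whose expression is None for every row can never match an equality.
    return [
        name
        for name, rng in stats.items()
        if rng is not None and rng[0] <= value <= rng[1]
    ]


files = {
    "f1": [{"labels": {"lang": "go"}}, {"labels": {"lang": "java"}}],
    "f2": [{"labels": {"lang": "python"}}, {"labels": {"lang": "rust"}}],
    "f3": [{"labels": {}}],  # the key is absent in every row
}

# Stand-in for element_at(labels, 'lang'); returns None when the key is missing.
stats = build_expression_stats(files, lambda row: row["labels"].get("lang"))
print(files_matching_equals(stats, "java"))
```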

@github-actions

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Feb 16, 2024
@mosabua
Member

mosabua commented Feb 16, 2024

👋 @codope .. I assume you are continuing this work at some stage and will leave the PR open.

@codope
Contributor Author

codope commented Feb 17, 2024

I am closing this PR. Once we upgrade the Hudi version in Trino (with the updated Hadoop-independent abstraction), we'll revive or create a new PR for data skipping.

@codope codope closed this Feb 17, 2024

5 participants