Data skipping for Hudi connector by codope · Pull Request #17899 · trinodb/trino

codope · 2023-06-14T13:38:19Z

Description

Use Hudi column stats to skip data and improve query latency.
Stacked on top of #16034
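
For readers unfamiliar with the idea, here is a minimal sketch of min/max-based data skipping. It is not the code in this PR: `ColumnStatsSkippingSketch`, `FileColumnStats`, `Range`, and the file names are hypothetical names made up for illustration, and it assumes the column stats expose a per-file minimum and maximum for a single numeric column.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: these types are hypothetical, not Trino or Hudi classes.
final class ColumnStatsSkippingSketch {
    // Per-file min/max of one column, as a column stats index might record them.
    static final class FileColumnStats {
        final String fileName;
        final long minValue;
        final long maxValue;

        FileColumnStats(String fileName, long minValue, long maxValue) {
            this.fileName = fileName;
            this.minValue = minValue;
            this.maxValue = maxValue;
        }
    }

    // Closed range [low, high], e.g. taken from "col BETWEEN low AND high".
    static final class Range {
        final long low;
        final long high;

        Range(long low, long high) {
            this.low = low;
            this.high = high;
        }
    }

    // A file can hold matching rows only if its [min, max] overlaps the predicate range.
    static boolean mayContainMatches(FileColumnStats stats, Range predicate) {
        return stats.maxValue >= predicate.low && stats.minValue <= predicate.high;
    }

    // Keep only the files that cannot be ruled out by their statistics.
    static List<String> filesToScan(List<FileColumnStats> files, Range predicate) {
        return files.stream()
                .filter(stats -> mayContainMatches(stats, predicate))
                .map(stats -> stats.fileName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<FileColumnStats> files = List.of(
                new FileColumnStats("file-a.parquet", 0, 99),
                new FileColumnStats("file-b.parquet", 100, 199),
                new FileColumnStats("file-c.parquet", 200, 299));
        // WHERE col BETWEEN 150 AND 160 overlaps only file-b's range.
        System.out.println(filesToScan(files, new Range(150, 160))); // prints [file-b.parquet]
    }
}
```

The check is conservative: a file whose range overlaps the predicate may still contain no matching rows, but a file whose range does not overlap can be skipped safely.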

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Section
* Support for data skipping with column statistics in Hudi connector.

tooptoop4 · 2023-06-15T10:11:45Z

any benchmark?

codope · 2023-06-15T17:43:30Z

@tooptoop4 We benchmarked with 6-month GitHub archive data (220 GB, 450 million records) and observed significant improvements with column stats and clustering enabled.

The design is very similar to prestodb/presto#18606. @xiarixiaoyao also benchmarked that with a larger dataset, SSB (1.5 TB, 12 billion records), and the results were quite impressive, as you can see in that PR.

dertodestod · 2023-08-11T13:45:09Z

First of all a big thanks from me for the work on the connector and on hudi in general 👍. Is there a rough ETA on when this will be merged and be available in trino? Thanks again.

mosabua · 2023-09-09T03:22:10Z

@codope do you plan to rebase and update this PR soon?

songpcmusic · 2024-01-11T07:58:18Z

Is there any plan to support the use case where an operation such as element_at(map_column, 'key') =|>|< 'value' is required?

codope · 2024-01-26T04:40:55Z

> Is there any plan to support the use case where an operation such as element_at(map_column, 'key') =|>|< 'value' is required?

The current PR does not support that. In Hudi 1.0, we are adding functional indexes, which can support skipping data based on a function or expression over one or more columns.
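
To make the distinction concrete, here is a rough sketch of skipping on an expression rather than on a raw column. It is not how Hudi implements functional indexes; `ExpressionStatsSketch`, `ExpressionStats`, `Operator`, and the sample rows are hypothetical, and it assumes that min/max statistics are collected per file over the result of the expression (here, a lookup of one map key, standing in for `element_at(map_column, 'key')`).

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustration only: hypothetical types, not Hudi's functional index API.
final class ExpressionStatsSketch {
    // Min/max of an expression's non-null results over all rows of one file.
    static final class ExpressionStats {
        final String fileName;
        final String min;
        final String max;

        ExpressionStats(String fileName, String min, String max) {
            this.fileName = fileName;
            this.min = min;
            this.max = max;
        }
    }

    enum Operator { EQUAL, GREATER_THAN, LESS_THAN }

    // Evaluate the expression on every row and track the smallest and largest result.
    static ExpressionStats collect(
            String fileName,
            List<Map<String, String>> rows,
            Function<Map<String, String>, String> expression) {
        String min = null;
        String max = null;
        for (Map<String, String> row : rows) {
            String value = expression.apply(row);
            if (value == null) {
                continue; // a missing key yields null, which never satisfies a comparison
            }
            if (min == null || value.compareTo(min) < 0) {
                min = value;
            }
            if (max == null || value.compareTo(max) > 0) {
                max = value;
            }
        }
        return new ExpressionStats(fileName, min, max);
    }

    // Conservative check: false only when no row in the file can satisfy the comparison.
    static boolean mayContainMatches(ExpressionStats stats, Operator operator, String literal) {
        if (stats.min == null) {
            return false; // the expression was null for every row
        }
        switch (operator) {
            case EQUAL:
                return stats.min.compareTo(literal) <= 0 && stats.max.compareTo(literal) >= 0;
            case GREATER_THAN:
                return stats.max.compareTo(literal) > 0;
            case LESS_THAN:
                return stats.min.compareTo(literal) < 0;
            default:
                return true; // unknown operator: do not skip
        }
    }

    public static void main(String[] args) {
        Function<Map<String, String>, String> envOf = row -> row.get("env");
        ExpressionStats first = collect("file-1",
                List.of(Map.of("env", "dev"), Map.of("env", "prod")), envOf);
        ExpressionStats second = collect("file-2",
                List.of(Map.of("env", "test"), Map.of("region", "us")), envOf);
        // Predicate: element_at(map_column, 'env') = 'prod'
        System.out.println(mayContainMatches(first, Operator.EQUAL, "prod"));  // prints true
        System.out.println(mayContainMatches(second, Operator.EQUAL, "prod")); // prints false
    }
}
```

Plain column stats on the map column itself give no usable range for a single key, which is why the predicate cannot be used for skipping without statistics on the expression.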

github-actions · 2024-02-16T17:14:16Z

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

mosabua · 2024-02-16T19:12:22Z

👋 @codope .. I assume you are continuing this work at some stage and will leave the PR open.

codope · 2024-02-17T04:44:04Z

I am closing this PR. Once we upgrade the Hudi version in Trino (with the updated Hadoop-independent abstraction), we'll revive or create a new PR for data skipping.

codope added 7 commits February 6, 2023 19:44

Add async split processing for Hudi connector

4b6be98

Add split size estimation and test/bug fixes

99bb1c1

Add support for MoR snapshot query

7294763

Set column index in jobconf

d79a7c2

Add support for data skipping in Hudi connector

1817af7

Minor refactoring

833c977

Use trino bundle and fix partition name

1461847

cla-bot bot added the cla-signed label Jun 14, 2023

codope added the hudi Hudi connector label Jun 14, 2023

github-actions bot added the stale label Feb 16, 2024

codope closed this Feb 17, 2024

codope wants to merge 7 commits into trinodb:master from codope:data-skipping

5 participants
