Support data skipping for Hudi connector #24784

Open
codope wants to merge 2 commits into prestodb:master from codope:upgrade-hudi-io

codope (Contributor) commented Mar 23, 2025

Description

  • Upgrade Hudi version to 0.15.0.
  • Support data skipping using column stats for Hudi connector.
  • Integrate with more efficient HoodieStorage and HoodieStorageConfiguration APIs.

Motivation and Context

Hudi maintains a metadata table that supports efficient file listing, column stats, and other indexes. Until now, only the files index was integrated into the Hudi connector. This PR adds support for the column_stats index as well.

Impact

Queries become more efficient because data skipping based on column stats is applied on top of partition pruning.
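
For readers unfamiliar with the idea, here is a minimal sketch of how min/max column stats can be used to skip files after partition pruning. It is illustrative only and is not the connector's code: `ColumnRange`, `DataFile`, `may_match`, and `plan_files` are hypothetical names, and the in-memory `FILES` list stands in for stats that would really come from the Hudi metadata table.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnRange:
    """Min and max of one column within one data file (hypothetical structure)."""
    min_value: int
    max_value: int


@dataclass
class DataFile:
    """A data file with its partition and per-column stats (hypothetical structure)."""
    path: str
    partition: str
    stats: dict = field(default_factory=dict)  # column name -> ColumnRange


def may_match(data_file, column, lower, upper):
    """Return False only when the stats prove no row satisfies lower <= column <= upper."""
    column_range = data_file.stats.get(column)
    if column_range is None:
        # No stats recorded for this column: the file cannot be skipped safely.
        return True
    return column_range.max_value >= lower and column_range.min_value <= upper


def plan_files(files, partition, column, lower, upper):
    """Apply partition pruning first, then data skipping on the surviving files."""
    in_partition = [f for f in files if f.partition == partition]
    return [f.path for f in in_partition if may_match(f, column, lower, upper)]


# Made-up example data, assumed for illustration only.
FILES = [
    DataFile("f1.parquet", "2025-03", {"price": ColumnRange(1, 100)}),
    DataFile("f2.parquet", "2025-03", {"price": ColumnRange(90, 200)}),
    DataFile("f3.parquet", "2025-03"),  # no column stats available
    DataFile("f4.parquet", "2025-02", {"price": ColumnRange(150, 160)}),
]

# Predicate: partition = '2025-03' AND price BETWEEN 150 AND 180
selected = plan_files(FILES, "2025-03", "price", 150, 180)
print(selected)  # ['f2.parquet', 'f3.parquet']
```

In this sketch `f4.parquet` is removed by partition pruning, `f1.parquet` is skipped because its maximum `price` is below the lower bound, and `f3.parquet` is kept because a file without stats cannot be ruled out.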

Test Plan

Added tests that validate data skipping, using a Hudi table as the test artifact. We previously ran this change through the SSB benchmark on a cluster; see the results in #18606.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Upgrade Hudi version to 0.15.0.

Hudi Connector Changes
* Support data skipping with column_stats in Hudi connector.

@codope codope requested review from a team, 7c00 and vinothchandar as code owners March 23, 2025 15:16
@codope codope requested a review from jaystarshot March 23, 2025 15:16
codope (Contributor, Author) commented Mar 23, 2025

@tdcmeehan @xiarixiaoyao @pratyakshsharma I have rewritten the data skipping support from #18606 on top of the upgraded Hudi version. The design remains the same as described in the original PR. However, Hudi 0.15.0 introduced HoodieStorage and HoodieStorageConfiguration, and this patch works with those APIs. Please take a look.

steveburnett (Contributor) commented

Rebasing might help the tests that failed earlier to pass.

Co-authored-by: xiarixiaoyao <mengtao0326@qq.com>
steveburnett (Contributor) commented

Thanks for the release note! Suggested changes:

== RELEASE NOTES ==

General Changes
* Upgrade Hudi version to 0.15.0.

Hudi Connector Changes
* Support data skipping with column_stats in Hudi connector. 

steveburnett (Contributor) commented

Please resolve the file conflict.

ethanyzhang (Contributor) commented

Hi @codope, could you help look into the test failures? We can give this another round of review.

steveburnett (Contributor) commented

This PR includes "Upgrade Hudi version to 0.15.0." #25055 and #22855 also address this point.
