Improve Hudi connector performance #16034
@findinpath and @electrum should be the reviewers here.
There are a lot of changes in this PR and (besides
@findinpath @ksoullpwk Thanks for reviewing the PR. I will address your comments this week.
Force-pushed from 99bb1c1 to 6a7d5ef.
@findinpath Can you please review again?
🦗
@findinpath @electrum gentle ping to review the PR.
Force-pushed from e576682 to 3ba028d.
Force-pushed from 3ba028d to 49b1807.
@ksoullpwk Could you please review again, or could one of the Trino maintainers chime in? We need to get this in so that Hudi users can migrate from the Hive connector to the Hudi connector.
ksoullpwk left a comment:
As far as I can see, some configs have changed in this PR. Don't forget to update the documentation: https://github.com/trinodb/trino/blob/master/docs/src/main/sphinx/connector/hudi.rst.
Apologies for the delay on this. Can you rebase? We completed the removal of Hadoop from the Hudi connector and can now start merging these other changes.
Force-pushed from c81328a to e044a9b.
@electrum Thanks for removing Hadoop from the Hudi connector. We are also working on exposing APIs in Hudi so that Hadoop is not required; after that, there will be no need to duplicate classes in the hudi-trino module. Meanwhile, I have rebased, and this PR is ready for another review.
@electrum gentle ping
Force-pushed from e044a9b to d0d520a.
@electrum I've rebased and addressed all comments. Please take a pass again.
@trinodb/maintainers Please review this PR.
Force-pushed from d0d520a to aadcade.
electrum left a comment:
A few comments, otherwise looks good.
Thanks @electrum @ksoullpwk for the quick feedback. I've updated the docs and addressed the other comments. The PR is ready to merge.
This PR uses the Hive metastore's getPartition function when loading partition info, which takes a long time if there are many partitions.
I optimized it in version 406, which brings a 40x to 50x performance improvement in my environment.
@yx-keith This is great! Do you want to raise a PR so that we can port your changes to the latest Trino version? I can help with the review. If you have benchmarks on a large dataset (possibly thousands of partitions), feel free to share them as well. Once you have the PR, I can also help with benchmarking against the TPC-DS dataset, which has about 2k partitions (good enough to bring out the perf difference).
@codope |
@yx-keith I would suggest rebasing your changes on top of the latest master and rerunning your test with the 744-partition dataset that you have. Ideally, the metastore RPC happens only for matching partitions. It was fixed in Also, check this comment where we discussed the additional metastore calls - #16034 (comment)
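To illustrate the point about issuing metastore calls only for matching partitions, here is a minimal sketch. All names in it (`MatchingPartitionsSketch`, `loadMatching`, `metastoreLookup`) are hypothetical and stand in for the real Trino and Hive metastore APIs, which are not reproduced here. It only shows the idea of filtering by partition name before doing the expensive lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative only: hypothetical names, not the Trino or Hive metastore API.
public class MatchingPartitionsSketch
{
    // Applies the predicate to partition names first, so the (expensive)
    // lookup is issued only for partitions the query can actually read.
    public static <T> List<T> loadMatching(
            List<String> partitionNames,
            Predicate<String> matches,
            Function<String, T> metastoreLookup)
    {
        List<T> result = new ArrayList<>();
        for (String name : partitionNames) {
            if (matches.test(name)) {
                result.add(metastoreLookup.apply(name));
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<String> names = List.of("dt=2023-01-01", "dt=2023-01-02", "dt=2023-02-01");
        List<String> lookups = new ArrayList<>();
        loadMatching(names, name -> name.startsWith("dt=2023-01"), name -> {
            lookups.add(name); // records each simulated metastore call
            return name;
        });
        System.out.println(lookups.size() + " lookups for " + names.size() + " partitions");
    }
}
```

In this toy run only the two January partitions trigger a lookup; the February partition is skipped before any call is made.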
Description
Previously, query execution waited for all split generation to complete, and splits were loaded in a single thread. With this PR, split generation and split processing can happen asynchronously.
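The general pattern can be sketched as follows. This is not the connector's code: the class `AsyncSplitSketch`, the `loadSplits` method, the `DONE` sentinel, and the string "splits" are all assumptions made for illustration. Background loader threads push splits onto a queue while the consumer starts processing as soon as the first split arrives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative only: these names are hypothetical and are not the connector's classes.
public class AsyncSplitSketch
{
    // Sentinel put on the queue by each loader task once it has produced all its splits.
    private static final String DONE = "__DONE__";

    // Generates splits for the given partitions on background threads and
    // consumes them as they arrive, instead of waiting for the full list.
    public static List<String> loadSplits(List<String> partitions, int loaderThreads)
            throws InterruptedException
    {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ExecutorService executor = Executors.newFixedThreadPool(loaderThreads);
        for (String partition : partitions) {
            executor.submit(() -> {
                // Stand-in for listing the files of one partition.
                for (int file = 0; file < 2; file++) {
                    queue.add(partition + "/file-" + file);
                }
                queue.add(DONE);
            });
        }
        executor.shutdown();

        List<String> consumed = new ArrayList<>();
        int finished = 0;
        while (finished < partitions.size()) {
            String split = queue.take(); // blocks only until the next item is ready
            if (split.equals(DONE)) {
                finished++;
            }
            else {
                consumed.add(split);
            }
        }
        executor.awaitTermination(1, TimeUnit.SECONDS);
        return consumed;
    }

    public static void main(String[] args) throws InterruptedException
    {
        List<String> splits = loadSplits(List.of("p=1", "p=2", "p=3"), 2);
        System.out.println(splits.size() + " splits consumed");
    }
}
```

The order in which splits arrive depends on thread scheduling, so a consumer written this way should not rely on it.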
Additional context and related issues
Fixes apache/hudi#7643
Fixes #15564
Release notes
(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: