Skip to content

Improve Hudi connector performance#16034

Merged
electrum merged 3 commits intotrinodb:masterfrom
codope:fix-perf-aync-split
Jun 14, 2023
Merged

Improve Hudi connector performance#16034
electrum merged 3 commits intotrinodb:masterfrom
codope:fix-perf-aync-split

Conversation

@codope
Copy link
Copy Markdown
Contributor

@codope codope commented Feb 9, 2023

Description

Previously, the query execution would wait for all the split generation to complete and splits were loaded in a single thread. With this PR, split generation and processing can happen asynchronously.

Additional context and related issues

Fixes apache/hudi#7643
Fixes #15564

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Feb 9, 2023
@codope codope added the hudi Hudi connector label Feb 9, 2023
@martint martint requested review from findepi and hashhar February 9, 2023 18:57
@findepi
Copy link
Copy Markdown
Member

findepi commented Feb 27, 2023

martint requested review from findepi and hashhar 2 weeks ago

@findinpath and @electrum should be the reviewers here.

@findepi findepi requested review from electrum and removed request for findepi and hashhar February 27, 2023 11:14
@findinpath
Copy link
Copy Markdown
Contributor

There are a lot of changes in this PR and (besides TestHudiConfig) no tests to accompanying them. Can you consider adding more test coverage for your changes?

@codope
Copy link
Copy Markdown
Contributor Author

codope commented Mar 21, 2023

@findinpath @ksoullpwk Thanks for reviewing the PR. I will address your comments this week.

@codope codope force-pushed the fix-perf-aync-split branch from 99bb1c1 to 6a7d5ef Compare March 23, 2023 14:13
@codope
Copy link
Copy Markdown
Contributor Author

codope commented Mar 23, 2023

There are a lot of changes in this PR and (besides TestHudiConfig) no tests to accompanying them. Can you consider adding more test coverage for your changes?

@findinpath Can you please review again?
Would appreciate if you can help with what kind of tests we can add here. There is no new functionality. This PR is a performance enhancement. Sure, there is some change in the behavior but existing tests will go through that path. So functional correctness is already tested.

@tooptoop4
Copy link
Copy Markdown
Contributor

🦗

@codope
Copy link
Copy Markdown
Contributor Author

codope commented Apr 11, 2023

@findinpath @electrum gentle ping to review the PR.

@codope codope force-pushed the fix-perf-aync-split branch 2 times, most recently from e576682 to 3ba028d Compare April 17, 2023 14:35
@vinothchandar
Copy link
Copy Markdown

vinothchandar commented May 2, 2023

cc @umehrot2 from Athena/EMR and @bvaradar from Robinhood, who were asking about this.

@codope codope force-pushed the fix-perf-aync-split branch from 3ba028d to 49b1807 Compare May 15, 2023 08:09
@vinothchandar
Copy link
Copy Markdown

@ksoullpwk Could you please review again or one of the trino maintainers chime in? We need to get this in, so Hudi users can migrate to the hudi connector away from hive connector.

Copy link
Copy Markdown

@ksoullpwk ksoullpwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see, some configs have changed to this PR. Don't forget to update the document https://github.com/trinodb/trino/blob/master/docs/src/main/sphinx/connector/hudi.rst.

@electrum
Copy link
Copy Markdown
Member

electrum commented Jun 3, 2023

Apologies for the delay on this. Can you rebase? We completed the removal of Hadoop from the Hudi connector and can now start merging these other changes.

@codope codope force-pushed the fix-perf-aync-split branch 2 times, most recently from c81328a to e044a9b Compare June 5, 2023 09:24
@codope
Copy link
Copy Markdown
Contributor Author

codope commented Jun 5, 2023

@electrum Thanks for removing Hadoop from Hudi connector. We are also working on exposing APIs in Hudi such that Hadoop is not required and then there won't be need to duplicate classes in hudi-trino module. Meanwhile, I have rebased and this PR is ready for another review.

@codope
Copy link
Copy Markdown
Contributor Author

codope commented Jun 7, 2023

@electrum gentle ping

@codope codope force-pushed the fix-perf-aync-split branch from e044a9b to d0d520a Compare June 9, 2023 12:40
@codope
Copy link
Copy Markdown
Contributor Author

codope commented Jun 9, 2023

@electrum I've rebased and addressed all comments. Please take a pass again.

@codope codope requested a review from a team June 10, 2023 13:02
@codope
Copy link
Copy Markdown
Contributor Author

codope commented Jun 10, 2023

@trinodb/maintainers Please review this PR.

@codope codope force-pushed the fix-perf-aync-split branch from d0d520a to aadcade Compare June 12, 2023 14:57
Copy link
Copy Markdown
Member

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, otherwise looks good

@codope
Copy link
Copy Markdown
Contributor Author

codope commented Jun 13, 2023

Thanks @electrum @ksoullpwk for quick feedback. I've updated the docs and addressed other comments. PR is ready to merge.

@github-actions github-actions bot added the docs label Jun 13, 2023
@electrum electrum merged commit e14075a into trinodb:master Jun 14, 2023
@github-actions github-actions bot added this to the 420 milestone Jun 15, 2023
@yx-keith
Copy link
Copy Markdown

yx-keith commented Aug 3, 2023

this pr use hive metastore's getPartition function when load parition info, it will take a long time if there are many partitions.

@yx-keith
Copy link
Copy Markdown

yx-keith commented Aug 3, 2023

I optimized it in 406 version, brings 40x to 50x performance improment in my environment

743c414

@codope
Copy link
Copy Markdown
Contributor Author

codope commented Aug 3, 2023

I optimized it in 406 version, brings 40x to 50x performance improment in my environment

743c414

@yx-keith This is great! Do you want to raise a PR so that we can port over your changes to the latest Trino version? I can help with the review. If you have some benchmarks on a large dataset (possibly thousands of partitions), feel free to share it as well. Once you have the PR, I can also help with benchmarking against TPC-DS dataset which has about ~2k partitions (good enough to bring out the perf diff).

@yx-keith
Copy link
Copy Markdown

yx-keith commented Aug 3, 2023

I optimized it in 406 version, brings 40x to 50x performance improment in my environment
743c414

@yx-keith This is great! Do you want to raise a PR so that we can port over your changes to the latest Trino version? I can help with the review. If you have some benchmarks on a large dataset (possibly thousands of partitions), feel free to share it as well. Once you have the PR, I can also help with benchmarking against TPC-DS dataset which has about ~2k partitions (good enough to bring out the perf diff).

@codope
yes, thank you, Is it necessary to merge my code into master branch and then raise a PR?
I use table and dataset in this issue:
apache/hudi#7643
there are 744 partitions

@codope
Copy link
Copy Markdown
Contributor Author

codope commented Aug 3, 2023

@yx-keith I would suggest to rebase your changes on top of latest master code and run your test with the dataset (744 partitions) that you have again. Ideally, metastore rpc happens for only matching partitions. It was fixed in

if (hudiPartitionInfo.doesMatchPredicates() || partitionName.equals("")) {
List<HivePartitionKey> partitionKeys = hudiPartitionInfo.getHivePartitionKeys();

Also, check this comment where we discussed the additional metastore calls - #16034 (comment)

@yx-keith
Copy link
Copy Markdown

yx-keith commented Aug 3, 2023

> @yx-keith I would suggest to rebase you changes on top of latest master code and run your test with the dataset (744 partitions) that you have again. Ideally, metastore rpc happens for only matching partitions. It was fixed in

if (hudiPartitionInfo.doesMatchPredicates() || partitionName.equals("")) {
List<HivePartitionKey> partitionKeys = hudiPartitionInfo.getHivePartitionKeys();

Also, check this comment where we discussed the additional metastore calls - #16034 (comment)

ok, this is my pr
#18511
@codope

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

8 participants