Skip to content

Improve listing performance of Hudi tables#17244

Merged
arunthirupathi merged 1 commit intoprestodb:masterfrom
codope:presto-hudi-metadata
Mar 4, 2022
Merged

Improve listing performance of Hudi tables#17244
arunthirupathi merged 1 commit intoprestodb:masterfrom
codope:presto-hudi-metadata

Conversation

@codope
Copy link
Copy Markdown
Contributor

@codope codope commented Feb 1, 2022

  • Integrate metadata-based listing for Hudi tables. This is
    enabled by a session property.
  • Implement a new DirectoryLister for Hudi that uses
    HoodieTableFileSystemView to fetch data files.
  • Bump Hudi version to 0.10.1 to use above features.
  • Replace hudi-common, hudi-hadoop-mr by
    hudi-presto-bundle.

Test plan - (Please fill in how you tested your changes)

  • Synced a Hudi table to Hive and queried through Presto.
  • Add unit test for HudiDirectoryLister.
== RELEASE NOTES ==

Hive Changes
* Upgrade Hudi version to 0.10.1.
* Support metadata-based listing and bootstrap for Hudi tables.

@codope
Copy link
Copy Markdown
Contributor Author

codope commented Feb 1, 2022

@arunthirupathi this is rework of #17084 with thinned presto bundle. Could you please review?
cc @vinothchandar

@arunthirupathi
Copy link
Copy Markdown

Can you post the mvn dependency difference before this change and after this change ?

What is the size of the shaded jar ?

I made a pass on this PR and it looks good. I need to sync up to this PR and do an internal build to ensure it does not regress things.

@arunthirupathi
Copy link
Copy Markdown

I will update by this Wednesday on whether this PR works with our internal dependencies correctly.

pom.xml Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are their still excludes if the hudi-presto-bundle is shaded ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed unnecessary exclusions. There are some hudi modules which we need to exclude and then there is protobuf, which if not excluded, was causing the build to fail due to some conflicting resource files.

@codope codope force-pushed the presto-hudi-metadata branch 3 times, most recently from 1334c99 to 9f2225a Compare February 17, 2022 13:31
@codope
Copy link
Copy Markdown
Contributor Author

codope commented Feb 17, 2022

Can you post the mvn dependency difference before this change and after this change ?

What is the size of the shaded jar ?

I made a pass on this PR and it looks good. I need to sync up to this PR and do an internal build to ensure it does not regress things.

The gist contains the full dependency tree for master and this branch. Below is the diff:

sagars@Sagars-MacBook-Pro presto % git diff --no-index presto_hive_master_dep.txt presto_hive_hudi_dep.txt
diff --git a/presto_hive_master_dep.txt b/presto_hive_hudi_dep.txt
index 83b59e2092..dd99c891e9 100644
--- a/presto_hive_master_dep.txt
+++ b/presto_hive_hudi_dep.txt
@@ -19,8 +19,7 @@
 [INFO] +- com.facebook.presto:presto-expressions:jar:0.271-SNAPSHOT:compile
 [INFO] +- com.facebook.presto:presto-cache:jar:0.271-SNAPSHOT:compile
 [INFO] |  \- org.alluxio:alluxio-shaded-client:jar:2.7.0:compile
-[INFO] +- org.apache.hudi:hudi-common:jar:0.9.0:compile
-[INFO] +- org.apache.hudi:hudi-hadoop-mr:jar:0.9.0:compile
+[INFO] +- org.apache.hudi:hudi-presto-bundle:jar:0.10.1:compile
 [INFO] +- com.facebook.presto:presto-memory-context:jar:0.271-SNAPSHOT:compile
 [INFO] +- com.facebook.presto:presto-hive-common:jar:0.271-SNAPSHOT:compile
 [INFO] +- com.facebook.presto:presto-hive-metastore:jar:0.271-SNAPSHOT:compile
@@ -248,6 +247,6 @@
 [INFO] ------------------------------------------------------------------------
 [INFO] BUILD SUCCESS
 [INFO] ------------------------------------------------------------------------
-[INFO] Total time:  1.788 s
-[INFO] Finished at: 2022-02-17T18:41:17+05:30
+[INFO] Total time:  2.052 s
+[INFO] Finished at: 2022-02-17T18:56:17+05:30
 [INFO] ------------------------------------------------------------------------

The size of the shaded jar is 16MB.

@codope
Copy link
Copy Markdown
Contributor Author

codope commented Feb 18, 2022

@arunthirupathi Could you please take a look again? I have removed unnecessary excludes. Also, was there any issue you fuond out running this internally?

@arunthirupathi
Copy link
Copy Markdown

arunthirupathi commented Feb 18, 2022

There was a minor change required due to PathFilter to Optional < PathFilter > , but the build passed. I am going to kick off integration tests to see if it causes any issues.

@arunthirupathi
Copy link
Copy Markdown

the integration tests succeeded, so this is good. I will take final pass on Tuesday and merge this in.

@arunthirupathi arunthirupathi self-requested a review February 19, 2022 04:11
@codope
Copy link
Copy Markdown
Contributor Author

codope commented Feb 21, 2022

the integration tests succeeded, so this is good. I will take final pass on Tuesday and merge this in.

Glad to hear this. Thanks for the update.

@arunthirupathi
Copy link
Copy Markdown

Can you please rebase and push ? There is a breaking change in our internal repo, I have submitted a PR for it. Once that is approved, I will merge this change and the other change together.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second thought it will be easier, if you can undo this change to this file.

Where you need the filter to always work, pass in path filter where it always returns true. () -> true

This will make it easier for me to merge this PR.

Copy link
Copy Markdown
Contributor Author

@codope codope Feb 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PathFilter was added to DirectoryLister due to Hudi. After this patch, we don't really need it. I was planning to remove that as a follow-up to this PR if that's ok with you? Let me know if you think it's a blocker then i can make that change with this PR itself.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us do it with this PR itself.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it.. will update this PR in a day or two.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the PR

@codope codope force-pushed the presto-hudi-metadata branch from 9f2225a to cd36a24 Compare February 28, 2022 11:47
- Integrate metadata-based listing for Hudi tables. This is
enabled by a session property.
- Implement a new DirectoryLister for Hudi that uses
HoodieTableFileSystemView to fetch data files.
- Bump Hudi version to 0.10.1 to use above features.
- Add unit tests for HudiDirectoryLister.
- Replace hudi-common, hudi-hadoop-mr by hudi-presto-bundle.
- Remove PathFilter from DirectoryLister interface.
@codope codope force-pushed the presto-hudi-metadata branch from cd36a24 to d4f9f0b Compare March 1, 2022 05:15
@codope
Copy link
Copy Markdown
Contributor Author

codope commented Mar 1, 2022

@arunthirupathi This PR is ready now. I have removed the path filter form DirectoryLister interface and tested locally.

@arunthirupathi
Copy link
Copy Markdown

This change looks good, since this breaks API, I need to have a PR for our internal changes get it approved and then merge this in at the same time.

@arunthirupathi arunthirupathi merged commit ef1fd25 into prestodb:master Mar 4, 2022
@codope
Copy link
Copy Markdown
Contributor Author

codope commented Mar 5, 2022

Thanks @arunthirupathi for shepherding and landing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants