[HUDI-5411] Avoid virtual key info for COW table in the input format #7527
// NOTE: Fetching virtual key info is a costly operation as it needs to load the commit metadata.
// This is only needed for MOR realtime splits. Hence, for COW tables, this can be avoided.
Option<HoodieVirtualKeyInfo> virtualKeyInfoOpt = tableMetaClient.getTableType().equals(COPY_ON_WRITE) ? Option.empty() : getHoodieVirtualKeyInfo(tableMetaClient);
This is the main change. Earlier it used to be simply
Option<HoodieVirtualKeyInfo> virtualKeyInfoOpt = getHoodieVirtualKeyInfo(tableMetaClient);.
Great find @codope! Glad we're able to identify the root-cause of that slow-down
I'd suggest we take this fix one step further: instead of fetching the virtual key info here, let's push it inside createFileStatusUnchecked by passing the meta-client in there
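One way to picture this suggestion is the minimal sketch below. It is not Hudi's actual code: `MetaClient`, `CowFormat`, `MorFormat`, `createFileStatus` and `loadVirtualKeyInfo` are hypothetical stand-ins, and `java.util.Optional` replaces Hudi's `Option`. The listing loop hands the meta client down to the per-file method, so only the MOR override ever pays for the lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class PushDownFetchSketch {
    // Hypothetical stand-in for the table meta client; counts simulated costly lookups.
    static class MetaClient {
        int lookups = 0;

        Optional<String> loadVirtualKeyInfo() {
            lookups++; // in the real system this would read commit metadata or a data file
            return Optional.of("record-key-field");
        }
    }

    // COW-style format: receives the meta client but never asks it for virtual key info.
    static class CowFormat {
        String createFileStatus(String path, MetaClient metaClient) {
            return "base:" + path;
        }

        // Shared listing loop: it only forwards the meta client, it does not fetch anything.
        List<String> listStatus(List<String> paths, MetaClient metaClient) {
            List<String> statuses = new ArrayList<>();
            for (String path : paths) {
                statuses.add(createFileStatus(path, metaClient));
            }
            return statuses;
        }
    }

    // MOR-style format: the override is the only place that triggers the lookup.
    // This sketch looks up once per file; any caching is left out for brevity.
    static class MorFormat extends CowFormat {
        @Override
        String createFileStatus(String path, MetaClient metaClient) {
            Optional<String> keyInfo = metaClient.loadVirtualKeyInfo();
            return "realtime:" + path + ":" + keyInfo.orElse("none");
        }
    }

    public static void main(String[] args) {
        MetaClient cowClient = new MetaClient();
        new CowFormat().listStatus(Arrays.asList("p1", "p2"), cowClient);
        System.out.println("COW lookups: " + cowClient.lookups);

        MetaClient morClient = new MetaClient();
        new MorFormat().listStatus(Arrays.asList("p1", "p2"), morClient);
        System.out.println("MOR lookups: " + morClient.lookups);
    }
}
```

With this shape the caller no longer needs a table-type check at all, since the COW path simply has no code that touches the virtual key info.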
ed2f76f to f25fdb4
Push virtual key fetch inside createFileStatusUnchecked for MOR input format
f25fdb4 to bd71935
…pache#7527) Push virtual key fetch inside createFileStatusUnchecked for MOR input format
Change Logs
Fetching the virtual key info involves reading from the commit metadata or a data file (TableSchemaResolver), which is a costly operation. It is only needed for schema projection on MOR tables (realtime splits), so we can avoid it for COW tables.
NOTE: This is stacked on top of #7526
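The idea can be sketched in isolation as follows. This is an illustrative stand-in rather than the Hudi implementation: `TableType`, `fetchVirtualKeyInfo` and `virtualKeyInfoFor` are hypothetical names, `java.util.Optional` replaces Hudi's `Option`, and a counter simulates the cost of the lookup.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualKeySkipSketch {
    // Stand-in for the table type; not the real Hudi enum.
    enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

    // Counts how many times the simulated costly lookup runs.
    static final AtomicInteger expensiveLookups = new AtomicInteger();

    // Hypothetical placeholder for the costly lookup (commit metadata or data file read).
    static Optional<String> fetchVirtualKeyInfo() {
        expensiveLookups.incrementAndGet();
        return Optional.of("record-key-field");
    }

    // COW tables short-circuit to an empty result, so the lookup is never invoked for them.
    static Optional<String> virtualKeyInfoFor(TableType type) {
        return type == TableType.COPY_ON_WRITE ? Optional.empty() : fetchVirtualKeyInfo();
    }

    public static void main(String[] args) {
        virtualKeyInfoFor(TableType.COPY_ON_WRITE);
        System.out.println("lookups after COW: " + expensiveLookups.get());
        virtualKeyInfoFor(TableType.MERGE_ON_READ);
        System.out.println("lookups after MOR: " + expensiveLookups.get());
    }
}
```

Because the conditional operator evaluates only the selected branch, the COW case returns without ever entering the lookup method.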
Impact
Improves the performance of Hive-compatible query engines that depend on Hudi's input format implementation, e.g. the trino-hive connector. Tested listing on a TPC-DS table with 1824 partitions.
Without this change: 1.5 minutes
With this change: 18 seconds
Risk level (write none, low medium or high below)
medium