@codope (Member) commented Dec 21, 2022

Change Logs

Fetching the virtual key info involves reading from commit metadata or a data file (via TableSchemaResolver), which is a costly operation. It is only needed for schema projection on MOR tables (realtime splits), so we can skip it for COW tables.

NOTE: This is stacked on top of #7526

Impact

Improves the performance of Hive-compatible query engines that depend on the input format implementation in Hudi, e.g. the trino-hive connector. Tested listing on a TPC-DS table with 1824 partitions.

Without this change (1.5 minutes):

trino:default> select count(*) from store_sales;
  _col0
---------
 2750311
(1 row)

Query 20221221_054403_00003_t63mx, FINISHED, 1 node
Splits: 1,832 total, 1,832 done (100.00%)
1:29 [2.75M rows, 28.5MB] [30.8K rows/s, 327KB/s]

With this change (18 seconds):

trino:default> select count(*) from store_sales;
  _col0
---------
 2750311
(1 row)

Query 20221221_055625_00002_knx5g, FINISHED, 1 node
Splits: 1,832 total, 1,832 done (100.00%)
17.30 [2.75M rows, 28.5MB] [169K rows/s, 1.75MB/s]
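
For reference, the speedup implied by the two runs above can be worked out as below. This is only a sketch of the arithmetic; it assumes the Trino elapsed times read as 1 minute 29 seconds and 17.30 seconds, and the class and method names are made up for illustration.

```java
final class SpeedupSketch {
    /** Converts a minutes:seconds wall-clock reading into seconds. */
    public static double toSeconds(int minutes, double seconds) {
        return minutes * 60 + seconds;
    }

    /** Ratio of the elapsed time before the change to the elapsed time after it. */
    public static double speedup(double beforeSeconds, double afterSeconds) {
        return beforeSeconds / afterSeconds;
    }

    public static void main(String[] args) {
        double before = toSeconds(1, 29.0); // "1:29" from the first run (assumed m:ss)
        double after = toSeconds(0, 17.30); // "17.30" from the second run (assumed seconds)
        System.out.printf("speedup: %.1fx%n", speedup(before, after));
    }
}
```

Under those assumptions the query is roughly 5x faster end to end, with the same number of splits and rows in both runs.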

Risk level (write none, low, medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@codope added the priority:blocker (Production down; release blocker) and area:query-engine (Query engine integrations) labels Dec 21, 2022
Comment on lines 254 to 256
// NOTE: Fetching virtual key info is a costly operation as it needs to load the commit metadata.
// This is only needed for MOR realtime splits. Hence, for COW tables, this can be avoided.
Option<HoodieVirtualKeyInfo> virtualKeyInfoOpt = tableMetaClient.getTableType().equals(COPY_ON_WRITE) ? Option.empty() : getHoodieVirtualKeyInfo(tableMetaClient);
Member Author

This is the main change. Previously, it was simply:
Option<HoodieVirtualKeyInfo> virtualKeyInfoOpt = getHoodieVirtualKeyInfo(tableMetaClient);
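
To illustrate why the conditional helps, here is a minimal, self-contained sketch. It does not use the real Hudi classes: `TableType`, `fetchVirtualKeyInfo`, `resolveBefore` and `resolveAfter` are hypothetical stand-ins, `java.util.Optional` replaces Hudi's `Option`, and a counter simulates the cost of loading commit metadata.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

final class VirtualKeySkipSketch {
    /** Hypothetical stand-in for the table type enum. */
    public enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

    /** Counts how often the simulated expensive lookup runs. */
    public static final AtomicInteger expensiveCalls = new AtomicInteger();

    /** Stand-in for getHoodieVirtualKeyInfo(tableMetaClient); the value is made up. */
    public static Optional<String> fetchVirtualKeyInfo() {
        expensiveCalls.incrementAndGet();
        return Optional.of("hypothetical_record_key_field");
    }

    /** Old behavior: the lookup runs regardless of table type. */
    public static Optional<String> resolveBefore(TableType type) {
        return fetchVirtualKeyInfo();
    }

    /** New behavior: the ternary evaluates only the selected branch, so COW skips the lookup. */
    public static Optional<String> resolveAfter(TableType type) {
        return type == TableType.COPY_ON_WRITE ? Optional.empty() : fetchVirtualKeyInfo();
    }

    public static void main(String[] args) {
        resolveAfter(TableType.COPY_ON_WRITE);
        System.out.println("lookups after COW call: " + expensiveCalls.get());
        resolveAfter(TableType.MERGE_ON_READ);
        System.out.println("lookups after MOR call: " + expensiveCalls.get());
    }
}
```

In this sketch the COW path never increments the counter, which mirrors the intent of the change: the cost is paid only where realtime splits need the key info.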

Contributor

Great find @codope! Glad we were able to identify the root cause of that slowdown.

Contributor

I'd suggest we take this fix one step further: instead of fetching the virtual key info here, let's push it inside createFileStatusUnchecked by passing the meta client in there.
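
A rough sketch of what that suggestion could look like, under stated assumptions: the classes below are hypothetical stand-ins rather than the actual Hudi input formats, the `createFileStatusUnchecked` signature shown here is invented for illustration, and file slices and statuses are reduced to plain strings.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

final class PushDownSketch {
    /** Hypothetical stand-in for the table meta client; counts simulated schema reads. */
    public static class MetaClient {
        public final AtomicInteger schemaReads = new AtomicInteger();

        public Optional<String> readVirtualKeyInfo() {
            schemaReads.incrementAndGet();
            return Optional.of("hypothetical_record_key_field");
        }
    }

    /** COW-style input format: receives the meta client but never reads key info. */
    public static class CowInputFormat {
        public String createFileStatusUnchecked(String fileSlice, MetaClient metaClient) {
            return "base-file:" + fileSlice;
        }
    }

    /** MOR-style input format: the fetch lives inside the override that needs it. */
    public static class MorInputFormat extends CowInputFormat {
        @Override
        public String createFileStatusUnchecked(String fileSlice, MetaClient metaClient) {
            Optional<String> keyInfo = metaClient.readVirtualKeyInfo();
            return "realtime-file:" + fileSlice + ":" + keyInfo.orElse("none");
        }
    }

    public static void main(String[] args) {
        MetaClient metaClient = new MetaClient();
        System.out.println(new CowInputFormat().createFileStatusUnchecked("slice-1", metaClient));
        System.out.println(new MorInputFormat().createFileStatusUnchecked("slice-1", metaClient));
        System.out.println("schema reads: " + metaClient.schemaReads.get());
    }
}
```

With this shape the caller no longer needs a table-type check, since only the MOR path ever asks for the key info. One thing to watch in a real implementation is that the fetch would then run once per file slice unless the result is cached.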

@codope force-pushed the HUDI-5411-no-schema-reading branch from ed2f76f to f25fdb4 on December 23, 2022 04:53
Push virtual key fetch inside createFileStatusUnchecked for MOR input format
@codope force-pushed the HUDI-5411-no-schema-reading branch from f25fdb4 to bd71935 on December 24, 2022 03:50
@codope merged commit a882f44 into apache:master on Dec 24, 2022
nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 22, 2023
…pache#7527)

Push virtual key fetch inside createFileStatusUnchecked for MOR input format
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…pache#7527)

Push virtual key fetch inside createFileStatusUnchecked for MOR input format