[HUDI-4790][RFC-68] a more effective HoodieMergeHandler for COW table with parquet #6612
Conversation
Sync hudi master
Can you please fill in the PR description and template?
Overall an interesting idea; let's put the details in the document.
@nsivabalan @danny0405 Thanks for the review. I updated the comment.
Hi loukey-lj, excited to hear a fantastic idea.
@loukey-lj: can you respond to @guanziyue's comment above? I will review this patch this week.
Yes, this optimization is applicable to other frameworks. For Hudi, its advantage is that it can obtain the row group numbers and store them in the index while the index is being updated. For schema evolution, we currently only support adding fields. Different row groups in a Parquet file can have different schemas, but this is unknown to the query side. If schema changes are not considered, I can submit a small demo.
Thanks for your reply. I agree that this idea can improve performance a lot in theory. What worries me is that the current parquet implementation or interface may not fully support this idea. Looking forward to this RFC!
I don't know whether I can fully support schema evolution; I hope to improve this feature with the help of the community. I will write a small demo as soon as possible.
From this class, you can get a general understanding of the parquet partial-update implementation.
Wow! This code shows your idea clearly. Thanks for your clarification. I see that the parquet internal API is used in this code. I believe the schema evolution problem I mentioned can be resolved this way. Looking forward to this RFC!
nsivabalan left a comment
I definitely see a good benefit for partial-update use cases. I have left 2 minor comments. Please loop me in once you have the impl.
> * In the current version of Hudi, a complex de/serialization and de/compression happens every time long-tail data is upserted on COW, which causes a giant CPU/IO cost.
> * The purpose of the current RFC is to decrease the cost of de/serialization and de/compression when upserting. Consider: if we know which row groups need to be updated, and even which columns need to be updated within those row groups, we can skip de/serialization and de/compression for much of the data. That brings a giant improvement.
So, this could be effective only in the case of partial updates? In other words, for the most commonly used payloads like OverwriteWithLatestAvroPayload, DefaultHoodieRecordPayload, etc., this might cause unnecessary overhead, right?
It has nothing to do with which payload is used. What matters is knowing which columns need to be updated and which do not. If we know which columns need to be updated, then even if OverwriteWithLatestAvroPayload is used, the record can be partially updated. The copying of row groups is applicable to all payloads. My current scenario is based on MERGE INTO: the updated columns come from the SQL syntax parsing and are then set in the conf.
I get it. My point was: in the case of OverwriteWithLatestAvroPayload, the new record is going to contain every column, and unless we read the old record from disk and deserialize it, we never know which column is being updated. In fact, we have an optimization here where we don't even deserialize the old record from storage in the case of OverwriteWithLatestAvroPayload, because we are going to override the entire record anyway.
Yeah, SQL MERGE INTO uses ExpressionPayload, and hence I definitely see a real benefit there. But for other payloads it is very much impl-dependent, as I have explained above.
> 2. Using HoodieRecordPayload#getInsertValue to deserialize the upserting data, then invoking HoodieRecordPayload#combineAndGetUpdateValue to combine the updating rows.
> 3. Converting the combined data into a column structure, just like `[{"name":"zs","age":10},{"name":"ls","age":20}] ==> {"name":["zs","ls"],"age":[10,20]}`
I would assume that with the impl, we will decide whether to take this path depending on the payload impl used. We don't want to incur additional overhead for the ones where it may not be effective (e.g. OverwriteWithLatestAvroPayload, DefaultHoodieRecordPayload).
@loukey-lj still interested in driving this? It's a great idea.
Of course, hopefully the community will merge this RFC first.
Just trying to understand the expected gains out of the box. Could you please grab RFC-66 (the next number) and update the table with the list of RFCs as well? We can land this RFC after that. RFC-58 is now taken.
> ## Abstract
> To provide a more effective HoodieMergeHandler for COW tables with parquet. Hudi rewrites the whole parquet file on every COW update, which costs a lot in de/serialization and de/compression. To decrease this cost, a 'surgery' is introduced which rebuilds a new parquet file from the old one, copying unchanged row groups as-is and rewriting only the changed row groups when updating parquet files.
Two questions:
a) Is there a way to copy over unchanged columns as well within each row group, or to do this at the page level?
b) IIUC, this helps in cases where the parquet file has multiple row groups and only a few of them are changed? Would you expect to see any performance improvement with the default 120 MB file size and 120 MB block size, i.e. with just one row group in the parquet file?
a) If a column is not updated, then its pages do not need to be decompressed; if the data in a page is updated, the page needs to be deserialized and read out record by record.
b) Our row group size is 30 MB; if the parquet file has only one row group, it will not benefit from row-group skipping.
@loukey-lj I updated the RFC number for you.
Is this RFC only valid for SQL update scenarios, because it can parse out which columns have been updated from the SQL statement? In other scenarios, such as the "mysql -> debezium -> kafka -> hudi" pipeline, we have no way of knowing which columns are updated unless additional computation is spent, so it can't be applied immediately, right?
This applies not only to partial field-update scenarios, but also to entire-row updates.
Hi @loukey-lj, thanks for putting up the RFC and the great ideas on improving the write performance in Hudi! I'll merge this RFC now.
@loukey-lj @yihua Hi, any progress on this improvement? Really looking forward to it.
Change Logs
In the COW scenario, updating even a single record in a file requires rewriting the entire file, which is very inefficient. In reality, many update datasets are long-tail datasets, so it is necessary to improve the efficiency of COW updates. This PR is a solution for speeding up COW updates; it requires extending the index structure. www.sf-tech.com.cn combined the record-level index with this solution in production last year and achieved a good performance improvement.
How partial update works
a. Add one more member variable (Integer rowGroupId) to the class HoodieRecordLocation (see the sketch after this list).
b. The row group numbers of a Parquet file start from 0 and increase continuously, a new row group starting whenever the block size reaches hoodie.parquet.block.size. Since every record in a parquet file belongs to a row group, we can simply use the parquet API to locate the row group number of each record that needs to be written into the corresponding parquet file, and then record that row group number in the HoodieRecordLocation of each HoodieRecord. The HoodieRecordLocations are collected into the WriteStatus, which is written back to the index in batch.
c. At the index tagging phase, the row group numbers are queried out so that they can be used to accelerate updating files.
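As a rough illustration of step (a), here is a minimal sketch of how HoodieRecordLocation could carry the row group ordinal. The rowGroupId field and its accessor are assumptions for illustration, not the actual Hudi change:

```java
import java.io.Serializable;

// Illustrative sketch only: HoodieRecordLocation extended with a row group id.
// The rowGroupId field and accessor are hypothetical, not committed Hudi API.
public class HoodieRecordLocation implements Serializable {
  protected String instantTime;
  protected String fileId;
  // New: 0-based ordinal of the row group (within the parquet file) holding this record.
  protected Integer rowGroupId;

  public HoodieRecordLocation(String instantTime, String fileId, Integer rowGroupId) {
    this.instantTime = instantTime;
    this.fileId = fileId;
    this.rowGroupId = rowGroupId;
  }

  public Integer getRowGroupId() {
    return rowGroupId;
  }
}
```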
The concrete flow of upserting is shown below:
[figure: concrete flow of upserting]
steps of writing a parquet file on COW
(upserting) data preparing
At the index tagging phase, find the HoodieRecord.currentLocation.rowGroupNum of each updating record. If the row group number is empty, the record implicitly does not exist yet, which means the current operation is an INSERT; otherwise it is a DELETE or UPDATE. Next, the updating records are grouped by row group number so as to collect all the row groups that should be updated (a grouping sketch is shown below).
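A minimal sketch of that grouping, assuming tagged records expose the row group number through their current location; the TaggedRecord interface and its method names are illustrative only:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical sketch: records whose location carries no row group number are inserts;
// the rest are updates/deletes, grouped by the row group they currently live in.
public class RowGroupGrouping {

  /** Illustrative stand-in for a HoodieRecord whose location has been tagged by the index. */
  public interface TaggedRecord {
    Optional<Integer> getRowGroupNum();
  }

  public static Map<Integer, List<TaggedRecord>> groupUpdatesByRowGroup(List<TaggedRecord> tagged) {
    return tagged.stream()
        .filter(r -> r.getRowGroupNum().isPresent())           // empty => INSERT, handled separately
        .collect(Collectors.groupingBy(r -> r.getRowGroupNum().get()));
  }
}
```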
rowgroup updating
The process of updating a row group is divided into 5 steps; a sketch of the surrounding file-level row group copy follows the list.
1. Decompressing and deserializing the columns that need to be combined, and assembling them into a `List<Pair<rowKey, Pair<offset, record>>>` structure, where offset is the record's row number within the row group (every row group's row numbers start at zero).
2. Using HoodieRecordPayload#getInsertValue to deserialize the upserting data, then invoking HoodieRecordPayload#combineAndGetUpdateValue to combine the updating rows.
3. Converting the combined data into a column structure, just like `[{"name":"zs","age":10},{"name":"ls","age":20}] ==> {"name":["zs","ls"],"age":[10,20]}`
4. Iterating over the row group's columns: if a column does not need to be updated, writing its data pages as-is, without decompression or deserialization.
5. If a column needs to be updated, writing its values one by one.
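To make the "copy unchanged row groups, rewrite changed ones" flow concrete, here is a minimal sketch built on the parquet-hadoop append API (ParquetFileWriter#appendRowGroup). It is only an assumption of how this could be wired up, not the implementation in this PR; the class name, sizes, and the changed-row-group branch are placeholders:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.io.SeekableInputStream;
import org.apache.parquet.schema.MessageType;

// Hypothetical sketch of the "surgery": copy untouched row groups byte-for-byte and
// only rewrite the row groups that contain updated records (steps 1-5 above).
public class RowGroupSurgerySketch {

  public static void rewrite(Configuration conf, Path oldFile, Path newFile,
                             Set<Integer> changedRowGroups) throws IOException {
    HadoopInputFile input = HadoopInputFile.fromPath(oldFile, conf);
    try (ParquetFileReader reader = ParquetFileReader.open(input);
         SeekableInputStream rawStream = input.newStream()) {
      MessageType schema = reader.getFooter().getFileMetaData().getSchema();
      ParquetFileWriter writer = new ParquetFileWriter(
          HadoopOutputFile.fromPath(newFile, conf), schema,
          ParquetFileWriter.Mode.CREATE, 128 * 1024 * 1024, 8 * 1024 * 1024);
      writer.start();

      int rowGroupNum = 0;
      for (BlockMetaData block : reader.getRowGroups()) {
        if (!changedRowGroups.contains(rowGroupNum)) {
          // Unchanged row group: copied over without decompression or deserialization.
          writer.appendRowGroup(rawStream, block, false);
        } else {
          // Changed row group: read, combine with incoming records and re-write
          // column by column as described in steps 1-5; omitted in this sketch.
        }
        rowGroupNum++;
      }
      writer.end(new HashMap<>());
    }
  }
}
```

The gain comes entirely from the first branch: the more row groups stay untouched (typical for long-tail updates), the more bytes are copied without ever being decompressed.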

insert handling

Impact
NA
Risk level
NA
Documentation Update
RFC
Contributor's checklist